Self-Attention-Based Text Encoder for Enhancing DMGAN Text-to-Image Generation

Bibliographic Details
Main Authors: Remya Gopalakrishnan, Sambangi Naveen, Shaeen Kalathil, P. V. Sudeep
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11079408/
Description
Summary: Generating images that align with textual input using text-to-image (TTI) generation models is a challenging task. Generative adversarial network (GAN) based TTI models can produce realistic and semantically consistent images. A bidirectional long short-term memory (LSTM) network is commonly employed as the text encoder in GAN-based TTI models to extract text features. However, because of the inherently sequential nature of the LSTM, the text encoder risks losing information when the input text is long or when keywords are removed. A text attention mechanism, which captures the relevant textual information regardless of position, is one way to tackle this challenge. In this paper, we propose a text encoding approach with a text self-attention mechanism to improve TTI output quality. For this purpose, we modified and trained the dynamic memory GAN (DMGAN) TTI model. In our experiments, we trained and tested the TTI model on the CUB and MS-COCO datasets. Our results show that the modified DMGAN model generates realistic images and outperforms the base TTI model. We analyzed the TTI models qualitatively and quantitatively in terms of FID, IS, R-precision, and CLIP score.
ISSN: 2169-3536
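
The abstract's key idea, augmenting the bidirectional LSTM text encoder with self-attention so that every word can attend to every other word, can be illustrated with a short sketch. The paper's exact architecture and hyperparameters are not reproduced here; the module name, layer sizes, and pooling choice below are illustrative assumptions only (PyTorch):

```python
import torch
import torch.nn as nn

class SelfAttentionTextEncoder(nn.Module):
    """Minimal sketch of a BiLSTM text encoder followed by multi-head
    self-attention over the word features. All hyperparameters are
    illustrative assumptions, not values from the paper."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM: each time step outputs 2 * hidden_dim features.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, tokens, pad_mask=None):
        # tokens: (B, T) integer ids; pad_mask: (B, T), True at padding.
        x = self.embedding(tokens)
        h, _ = self.lstm(x)                      # (B, T, 2 * hidden_dim)
        # Self-attention lets distant words interact directly, reducing the
        # information loss of a purely sequential LSTM on long captions.
        w, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        if pad_mask is not None:
            keep = (~pad_mask).unsqueeze(-1).float()
            sent = (w * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        else:
            sent = w.mean(dim=1)                 # sentence-level feature
        return w, sent                           # word- and sentence-level

# Smoke test with random token ids.
enc = SelfAttentionTextEncoder(vocab_size=5000)
words, sent = enc(torch.randint(1, 5000, (2, 12)))
print(words.shape, sent.shape)  # torch.Size([2, 12, 256]) torch.Size([2, 256])
```

DMGAN's generator consumes both a sentence-level feature (for the initial image stage) and word-level features (in the dynamic memory module), which is why the sketch returns both.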
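
For the quantitative evaluation, FID, IS, and CLIP score have off-the-shelf implementations; a sketch using the torchmetrics package (with its image and multimodal extras installed) is shown below. The random tensors are placeholders for real and generated batches, and this reflects common practice rather than the authors' evaluation code; R-precision is omitted since it is typically computed with task-specific pretrained text/image encoders.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.multimodal.clip_score import CLIPScore

# Placeholder batches: (N, 3, H, W) uint8 images in [0, 255].
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
captions = ["a small bird with a red head"] * 16  # one caption per image

# FID: distance between Inception feature statistics of real vs. generated sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

# IS: confidence and diversity of Inception predictions on generated images.
inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())

# CLIP score: scaled cosine similarity between CLIP image and text embeddings.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip(fake_images, captions).item())
```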