Self-Attention-Based Text Encoder for Enhancing DMGAN Text-to-Image Generation
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11079408/ |
| Summary: | Generating images that align with textual input using text-to-image (TTI) generation models is a challenging task. Generative adversarial network (GAN) based TTI models can produce realistic and semantically consistent images. A bidirectional long short-term memory (LSTM) network is commonly employed in the text encoder of GAN-based TTI models to extract text features. However, because of the inherently sequential nature of the LSTM, the text encoder risks information loss with longer input text or when keywords are removed. A text attention mechanism that captures relevant textual information offers one way to address this challenge. In this paper, we propose a text encoding approach with a text self-attention mechanism to produce superior TTI output quality. For this purpose, we modified and trained the dynamic memory GAN (DMGAN) TTI model. In our experiments, we trained and tested the TTI model on the CUB and MS-COCO datasets. Our results show that the modified DMGAN TTI model generates realistic images and outperforms the base TTI model. We analyzed the TTI models qualitatively and quantitatively in terms of FID, IS, R-precision, and CLIP score values. |
|---|---|
| ISSN: | 2169-3536 |
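The abstract contrasts a sequential bidirectional LSTM text encoder with a self-attention encoder, which lets every word feature attend to all other words regardless of distance. The following is a minimal NumPy sketch of that idea, not the paper's actual architecture: a single-head scaled dot-product self-attention layer with randomly initialized (hypothetical) projection matrices `Wq`, `Wk`, `Wv`, producing per-word features plus a mean-pooled sentence feature, the two kinds of text features a DMGAN-style text encoder supplies.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_encode(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sentence.

    tokens : (T, d) token embeddings for T words.
    Returns (T, d_model) word features and a (d_model,) sentence
    feature obtained by mean-pooling the word features.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d_k = K.shape[-1]
    # (T, T) attention: each word attends to every word in the sentence,
    # so long-range dependencies do not have to survive a recurrence.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    word_features = attn @ V
    sentence_feature = word_features.mean(axis=0)
    return word_features, sentence_feature

# Toy dimensions (assumed for illustration only).
rng = np.random.default_rng(0)
T, d, d_model = 5, 16, 32
tokens = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_model)) for _ in range(3))
word_features, sentence_feature = self_attention_encode(tokens, Wq, Wk, Wv)
```

In a full encoder these projections would be learned jointly with the GAN, and the word features would feed the attention modules of the image generator while the sentence feature conditions the initial image stage.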