Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning
Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks.
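The abstract describes two attention mechanisms: cross-modal attention that fuses AMR-graph embeddings with object-region features, and masked multi-head attention that feeds AMR-like graph embeddings into the Transformer decoder. The paper's code is not part of this record, so the following is a minimal single-head NumPy sketch of those two operations; all dimensions, the residual fusion, and the variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention (a simplification of
    the multi-head mechanisms described in the abstract)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores) @ v

# Hypothetical shapes: 36 object regions, 20 AMR-graph nodes, 8 caption
# tokens, model width 64 (the paper does not state these values here).
d = 64
regions = rng.standard_normal((36, d))  # object-region features
amr = rng.standard_normal((20, d))      # AMR / AMR-like graph node embeddings
tokens = rng.standard_normal((8, d))    # partially generated caption

# 1) Cross-modal attention: region features (queries) attend over AMR
#    embeddings, fusing abstract semantics into the visual stream.
fused = regions + attention(regions, amr, amr)  # residual fusion

# 2) Masked self-attention in the decoder: a causal (lower-triangular)
#    mask keeps each token from attending to future positions; the
#    decoder can then cross-attend to the fused graph-aware features.
causal = np.tril(np.ones((8, 8), dtype=bool))
decoded = attention(tokens, tokens, tokens, mask=causal)

print(fused.shape, decoded.shape)  # (36, 64) (8, 64)
```

With the causal mask, the first token attends only to itself, so its output row equals its input embedding, which is a quick sanity check on the masking.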
Saved in:
| Main Authors: | Nguyen Van Thinh, Tran Lang, Van The Thanh |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| Online Access: | https://ieeexplore.ieee.org/document/11058972/ |
| _version_ | 1850086430279204864 |
|---|---|
| author | Nguyen Van Thinh; Tran Lang; Van The Thanh |
| author_facet | Nguyen Van Thinh; Tran Lang; Van The Thanh |
| author_sort | Nguyen Van Thinh |
| collection | DOAJ |
| description | Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks. However, exploiting AMR in multimodal contexts, particularly for image captioning, remains largely unexplored. To address these limitations, this paper proposes a novel image captioning model within an encoder-decoder framework that leverages the abstract semantics of images through AMR. Specifically, AMR is incorporated into the model in two ways: 1) extracting AMR from ground-truth captions and 2) converting the image’s relational graph into an AMR-like graph to enrich abstract semantics. These AMR embeddings are fused with object-region features and relational-graph embeddings via a cross-modal attention mechanism. Additionally, embeddings from the AMR-like graph are integrated into the Transformer decoder using a masked multi-head attention mechanism to enhance semantic coherence during caption generation. Experimental results on the MS COCO and Flickr30k datasets demonstrate that the proposed model achieves superior captioning accuracy compared to recent state-of-the-art methods, confirming the effectiveness of incorporating AMR in image captioning tasks. |
| format | Article |
| id | doaj-art-7dfc91fd5a824b4c969bd295bd49da1c |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-7dfc91fd5a824b4c969bd295bd49da1c (updated 2025-08-20T02:43:29Z); English; IEEE; IEEE Access, ISSN 2169-3536; 2025-01-01; vol. 13, pp. 112528–112551; DOI 10.1109/ACCESS.2025.3584128; IEEE Xplore article 11058972; Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning; Nguyen Van Thinh (https://orcid.org/0000-0002-7543-5207), Vietnam Academy of Science and Technology (VAST), Graduate University of Science and Technology, Hanoi, Vietnam; Tran Lang (https://orcid.org/0000-0002-8925-5549), Journal Editorial Department, Ho Chi Minh City University of Foreign Languages and Information Technology (HUFLIT), Ho Chi Minh City, Vietnam; Van The Thanh (https://orcid.org/0000-0001-8408-2004), Faculty of Information Technology, Ho Chi Minh City University of Education (HCMUE), Ho Chi Minh City, Vietnam |
| spellingShingle | Nguyen Van Thinh; Tran Lang; Van The Thanh; Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning; IEEE Access; Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| title | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_full | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_fullStr | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_full_unstemmed | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_short | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_sort | integrating abstract meaning representation to enhance transformer based image captioning |
| topic | Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| url | https://ieeexplore.ieee.org/document/11058972/ |
| work_keys_str_mv | AT nguyenvanthinh integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning AT tranlang integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning AT vanthethanh integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning |