Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning
Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks.
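The abstract describes two attention mechanisms: cross-modal attention that fuses AMR-graph embeddings with object-region features, and masked multi-head attention that feeds AMR-like graph embeddings into the Transformer decoder. The paper's code is not part of this record, so the following is a minimal single-head NumPy sketch of those two operations; all dimensions, the residual fusion, and the variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention (a simplification of
    the multi-head mechanisms described in the abstract)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores) @ v

# Hypothetical shapes: 36 object regions, 20 AMR-graph nodes, 8 caption
# tokens, model width 64 (the paper does not state these values here).
d = 64
regions = rng.standard_normal((36, d))  # object-region features
amr = rng.standard_normal((20, d))      # AMR / AMR-like graph node embeddings
tokens = rng.standard_normal((8, d))    # partially generated caption

# 1) Cross-modal attention: region features (queries) attend over AMR
#    embeddings, fusing abstract semantics into the visual stream.
fused = regions + attention(regions, amr, amr)  # residual fusion

# 2) Masked self-attention in the decoder: a causal (lower-triangular)
#    mask keeps each token from attending to future positions; the
#    decoder can then cross-attend to the fused graph-aware features.
causal = np.tril(np.ones((8, 8), dtype=bool))
decoded = attention(tokens, tokens, tokens, mask=causal)

print(fused.shape, decoded.shape)  # (36, 64) (8, 64)
```

With the causal mask, the first token attends only to itself, so its output row equals its input embedding, which is a quick sanity check on the masking.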
Saved in:
| Main Authors: | Nguyen Van Thinh, Tran Lang, Van The Thanh |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| Online Access: | https://ieeexplore.ieee.org/document/11058972/ |
| _version_ | 1850086430279204864 |
|---|---|
| author | Nguyen Van Thinh; Tran Lang; Van The Thanh |
| author_facet | Nguyen Van Thinh; Tran Lang; Van The Thanh |
| author_sort | Nguyen Van Thinh |
| collection | DOAJ |
| description | Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks. However, exploiting AMR in multimodal contexts, particularly for image captioning, remains largely unexplored. To address these limitations, this paper proposes a novel image captioning model within an encoder-decoder framework that leverages the abstract semantics of images through AMR. Specifically, AMR is incorporated into the model in two ways: 1) extracting AMR from ground-truth captions and 2) converting the image’s relational graph into an AMR-like graph to enrich abstract semantics. These AMR embeddings are fused with object-region features and relational-graph embeddings via a cross-modal attention mechanism. Additionally, embeddings from the AMR-like graph are integrated into the Transformer decoder using a masked multi-head attention mechanism to enhance semantic coherence during caption generation. Experimental results on the MS COCO and Flickr30k datasets demonstrate that the proposed model achieves superior captioning accuracy compared to recent state-of-the-art methods, confirming the effectiveness of incorporating AMR in image captioning tasks. |
| format | Article |
| id | doaj-art-7dfc91fd5a824b4c969bd295bd49da1c |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-7dfc91fd5a824b4c969bd295bd49da1c (updated 2025-08-20T02:43:29Z); English; IEEE; IEEE Access, ISSN 2169-3536; 2025-01-01; vol. 13, pp. 112528–112551; DOI 10.1109/ACCESS.2025.3584128; IEEE Xplore article 11058972; Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning; Nguyen Van Thinh (https://orcid.org/0000-0002-7543-5207), Vietnam Academy of Science and Technology (VAST), Graduate University of Science and Technology, Hanoi, Vietnam; Tran Lang (https://orcid.org/0000-0002-8925-5549), Journal Editorial Department, Ho Chi Minh City University of Foreign Languages and Information Technology (HUFLIT), Ho Chi Minh City, Vietnam; Van The Thanh (https://orcid.org/0000-0001-8408-2004), Faculty of Information Technology, Ho Chi Minh City University of Education (HCMUE), Ho Chi Minh City, Vietnam |
| spellingShingle | Nguyen Van Thinh; Tran Lang; Van The Thanh; Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning; IEEE Access; Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| title | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_full | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_fullStr | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_full_unstemmed | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_short | Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning |
| title_sort | integrating abstract meaning representation to enhance transformer based image captioning |
| topic | Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network |
| url | https://ieeexplore.ieee.org/document/11058972/ |
| work_keys_str_mv | AT nguyenvanthinh integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning AT tranlang integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning AT vanthethanh integratingabstractmeaningrepresentationtoenhancetransformerbasedimagecaptioning |