Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning

Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks. However, exploiting AMR in multimodal contexts, particularly for image captioning, remains largely unexplored. To address these limitations, this paper proposes a novel image captioning model within an encoder-decoder framework that leverages the abstract semantics of images through AMR. Specifically, AMR is incorporated into the model in two ways: 1) extracting AMR from ground-truth captions and 2) converting the image’s relational graph into an AMR-like graph to enrich abstract semantics. These AMR embeddings are fused with object-region features and relational-graph embeddings via a cross-modal attention mechanism. Additionally, embeddings from the AMR-like graph are integrated into the Transformer decoder using a masked multi-head attention mechanism to enhance semantic coherence during caption generation. Experimental results on the MS COCO and Flickr30k datasets demonstrate that the proposed model achieves superior captioning accuracy compared to recent state-of-the-art methods, confirming the effectiveness of incorporating AMR in image captioning tasks.

Bibliographic Details
Main Authors: Nguyen Van Thinh, Tran Lang, Van The Thanh
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Image captioning; abstract meaning representation; relationship graph; transformer; deep neural network
Online Access:https://ieeexplore.ieee.org/document/11058972/
author Nguyen Van Thinh
Tran Lang
Van The Thanh
collection DOAJ
description Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks. However, exploiting AMR in multimodal contexts, particularly for image captioning, remains largely unexplored. To address these limitations, this paper proposes a novel image captioning model within an encoder-decoder framework that leverages the abstract semantics of images through AMR. Specifically, AMR is incorporated into the model in two ways: 1) extracting AMR from ground-truth captions and 2) converting the image’s relational graph into an AMR-like graph to enrich abstract semantics. These AMR embeddings are fused with object-region features and relational-graph embeddings via a cross-modal attention mechanism. Additionally, embeddings from the AMR-like graph are integrated into the Transformer decoder using a masked multi-head attention mechanism to enhance semantic coherence during caption generation. Experimental results on the MS COCO and Flickr30k datasets demonstrate that the proposed model achieves superior captioning accuracy compared to recent state-of-the-art methods, confirming the effectiveness of incorporating AMR in image captioning tasks.
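The abstract describes fusing AMR-graph embeddings with object-region features through a cross-modal attention mechanism. The article record contains no code, so the following is only a rough single-head sketch of generic scaled dot-product cross-attention in NumPy: region features act as queries and AMR node embeddings as keys/values. All names, shapes, and dimensions here are invented for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(region_feats, amr_embeds):
    """Attend from object-region features (queries) to AMR node
    embeddings (keys/values); returns one fused vector per region."""
    d = region_feats.shape[-1]
    scores = region_feats @ amr_embeds.T / np.sqrt(d)  # (n_regions, n_nodes)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ amr_embeds                        # (n_regions, d)

# toy example: 4 detected regions, 6 AMR graph nodes, feature dim 8
rng = np.random.default_rng(0)
regions = rng.standard_normal((4, 8))
amr = rng.standard_normal((6, 8))
fused = cross_modal_attention(regions, amr)
```

A real implementation would use learned query/key/value projections and multiple heads (e.g. `torch.nn.MultiheadAttention`); this sketch keeps only the attention core to show how the two modalities are mixed.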
format Article
id doaj-art-7dfc91fd5a824b4c969bd295bd49da1c
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-7dfc91fd5a824b4c969bd295bd49da1c (updated 2025-08-20T02:43:29Z)
volume 13
pages 112528-112551
doi 10.1109/ACCESS.2025.3584128
ieee document 11058972
Nguyen Van Thinh (https://orcid.org/0000-0002-7543-5207), Vietnam Academy of Science and Technology (VAST), Graduate University of Science and Technology, Hanoi, Vietnam
Tran Lang (https://orcid.org/0000-0002-8925-5549), Journal Editorial Department, Ho Chi Minh City University of Foreign Languages and Information Technology (HUFLIT), Ho Chi Minh City, Vietnam
Van The Thanh (https://orcid.org/0000-0001-8408-2004), Faculty of Information Technology, Ho Chi Minh City University of Education (HCMUE), Ho Chi Minh City, Vietnam
title Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning
topic Image captioning
abstract meaning representation
relationship graph
transformer
deep neural network
url https://ieeexplore.ieee.org/document/11058972/