Improving Visual Question Answering by Image Captioning

Bibliographic Details
Main Authors: Xiangjun Shao, Hongsong Dong, Guangsheng Wu
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Deep learning; image captioning; multimodal learning; visual question answering
Online Access:https://ieeexplore.ieee.org/document/10918635/
author Xiangjun Shao
Hongsong Dong
Guangsheng Wu
collection DOAJ
description Visual Question Answering (VQA) is a challenging task that bridges the computer vision and natural language processing communities: it requires producing natural language answers to questions about an associated image. Most existing VQA methods focus on fusing and reasoning over visual features and the textual question. However, visual features often lack the semantic information required to answer questions accurately. To address this limitation, we propose a novel approach called Question-Guided Parallel Attention (QGPA), which leverages the semantic information provided by an embedded image captioning model to answer related questions. First, we introduce an Attention-Aware (AA) mechanism that extends the traditional attention mechanism and helps filter out incorrect or irrelevant information during answer prediction. Second, QGPA incorporates AA to exploit, in parallel, both visual features and the semantic information produced by the embedded image captioning model. Experimental results show that the proposed model achieves “Overall” accuracies of 72.57% and 72.55% on the test-dev and test-std splits of the VQA-v2.0 dataset, respectively, outperforming most existing VQA methods. Ablation studies further confirm the contribution of each proposed component.
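
The description above outlines QGPA only at a high level; the precise form of the AA mechanism and of the fusion step is not given in this record. The following is a minimal PyTorch-style sketch of the general idea: two question-guided attention branches, one over region-level visual features and one over embeddings of a generated caption, fused for answer classification. Every module name, dimension, and the multiplicative fusion choice is an assumption made for illustration, not the authors' implementation.

# Hedged sketch of question-guided parallel attention (QGPA-style).
# Not the authors' code: all names, dimensions, and the fusion scheme
# below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedParallelAttention(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, c_dim=512, hid=512, n_answers=3129):
        super().__init__()
        # Two parallel attention branches, both guided by the question:
        # one over region-level visual features, one over caption-word embeddings.
        self.v_score = nn.Linear(q_dim + v_dim, 1)   # visual-branch attention scores
        self.c_score = nn.Linear(q_dim + c_dim, 1)   # caption (semantic) branch scores
        self.v_proj = nn.Linear(v_dim, hid)
        self.c_proj = nn.Linear(c_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.classifier = nn.Linear(hid, n_answers)

    def _attend(self, q, feats, scorer):
        # q: (B, q_dim); feats: (B, N, f_dim) -> attended summary (B, f_dim)
        qs = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = scorer(torch.cat([qs, feats], dim=-1)).squeeze(-1)  # (B, N)
        w = F.softmax(logits, dim=-1)
        return torch.bmm(w.unsqueeze(1), feats).squeeze(1)

    def forward(self, q, v, c):
        # q: question encoding; v: (B, Nv, v_dim) image-region features;
        # c: (B, Nc, c_dim) embeddings of the generated caption's words.
        v_hat = self._attend(q, v, self.v_score)
        c_hat = self._attend(q, c, self.c_score)
        # Multiplicative fusion of the two attended branches with the question
        # (one plausible choice; the paper's fusion step is unspecified here).
        fused = self.q_proj(q) * (self.v_proj(v_hat) + self.c_proj(c_hat))
        return self.classifier(fused)

For example, with a batch of 2 questions (q of shape (2, 512)), 36 region features (v of shape (2, 36, 2048)), and 20 caption tokens (c of shape (2, 20, 512)), model(q, v, c) returns answer logits of shape (2, 3129); 3129 is an answer-vocabulary size commonly used for VQA-v2, assumed here for concreteness.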
format Article
id doaj-art-e4668c2298b04e33a1b617526d430b6e
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doi:10.1109/ACCESS.2025.3549478 | IEEE Access, vol. 13, pp. 46299-46311, 2025-01-01 (ISSN 2169-3536) | IEEE document 10918635 | record last updated 2025-08-20T02:42:05Z
Xiangjun Shao (https://orcid.org/0000-0002-1401-621X), School of Computer and Electrical Engineering, Hunan University of Arts and Science, Changde, China
Hongsong Dong (https://orcid.org/0000-0002-5355-1995), Department of Computer Science, Lyuliang University, Lüliang, China
Guangsheng Wu (https://orcid.org/0000-0002-7739-9422), School of Mathematics and Computer Science, Xinyu University, Xinyu, China
title Improving Visual Question Answering by Image Captioning
topic Deep learning
image captioning
multimodal learning
visual question answering
url https://ieeexplore.ieee.org/document/10918635/