Improving Visual Question Answering by Image Captioning

Bibliographic Details
Main Authors: Xiangjun Shao, Hongsong Dong, Guangsheng Wu
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Deep learning; image captioning; multimodal learning; visual question answering
Online Access:https://ieeexplore.ieee.org/document/10918635/
author Xiangjun Shao
Hongsong Dong
Guangsheng Wu
collection DOAJ
description Visual Question Answering (VQA) is a challenging task that bridges the computer vision and natural language processing communities: it requires producing natural language answers to questions about an associated image. Most existing VQA methods focus on fusing and reasoning over visual features and the textual question. However, visual features often lack the semantic information required to answer questions accurately. To address this limitation, we propose a novel approach called Question-Guided Parallel Attention (QGPA), which leverages the semantic information provided by an embedded image captioning model to answer related questions. First, we introduce an Attention-Aware (AA) mechanism that extends the traditional attention mechanism and helps filter out incorrect or irrelevant information during answer prediction. Second, QGPA incorporates AA to exploit, in parallel, both visual features and the semantic information produced by the embedded image captioning model. Experimental results show that the proposed model achieves “Overall” accuracies of 72.57% and 72.55% on the test-dev and test-std splits of the VQA-v2.0 dataset, respectively, outperforming most existing VQA methods. Ablation studies further confirm the contribution of each proposed component.
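
The description above outlines QGPA only at a high level; the precise form of the AA mechanism and of the fusion step is not given in this record. The following is a minimal PyTorch-style sketch of the general idea: two question-guided attention branches, one over region-level visual features and one over embeddings of a generated caption, fused for answer classification. Every module name, dimension, and the multiplicative fusion choice is an assumption made for illustration, not the authors' implementation.

# Hedged sketch of question-guided parallel attention (QGPA-style).
# Not the authors' code: all names, dimensions, and the fusion scheme
# below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedParallelAttention(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, c_dim=512, hid=512, n_answers=3129):
        super().__init__()
        # Two parallel attention branches, both guided by the question:
        # one over region-level visual features, one over caption-word embeddings.
        self.v_score = nn.Linear(q_dim + v_dim, 1)   # visual-branch attention scores
        self.c_score = nn.Linear(q_dim + c_dim, 1)   # caption (semantic) branch scores
        self.v_proj = nn.Linear(v_dim, hid)
        self.c_proj = nn.Linear(c_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.classifier = nn.Linear(hid, n_answers)

    def _attend(self, q, feats, scorer):
        # q: (B, q_dim); feats: (B, N, f_dim) -> attended summary (B, f_dim)
        qs = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = scorer(torch.cat([qs, feats], dim=-1)).squeeze(-1)  # (B, N)
        w = F.softmax(logits, dim=-1)
        return torch.bmm(w.unsqueeze(1), feats).squeeze(1)

    def forward(self, q, v, c):
        # q: question encoding; v: (B, Nv, v_dim) image-region features;
        # c: (B, Nc, c_dim) embeddings of the generated caption's words.
        v_hat = self._attend(q, v, self.v_score)
        c_hat = self._attend(q, c, self.c_score)
        # Multiplicative fusion of the two attended branches with the question
        # (one plausible choice; the paper's fusion step is unspecified here).
        fused = self.q_proj(q) * (self.v_proj(v_hat) + self.c_proj(c_hat))
        return self.classifier(fused)

For example, with a batch of 2 questions (q of shape (2, 512)), 36 region features (v of shape (2, 36, 2048)), and 20 caption tokens (c of shape (2, 20, 512)), model(q, v, c) returns answer logits of shape (2, 3129); 3129 is an answer-vocabulary size commonly used for VQA-v2, assumed here for concreteness.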
format Article
id doaj-art-e4668c2298b04e33a1b617526d430b6e
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doi:10.1109/ACCESS.2025.3549478 | IEEE Access, vol. 13, pp. 46299-46311, 2025-01-01 (ISSN 2169-3536) | IEEE document 10918635 | record last updated 2025-08-20T02:42:05Z
Xiangjun Shao (https://orcid.org/0000-0002-1401-621X), School of Computer and Electrical Engineering, Hunan University of Arts and Science, Changde, China
Hongsong Dong (https://orcid.org/0000-0002-5355-1995), Department of Computer Science, Lyuliang University, Lüliang, China
Guangsheng Wu (https://orcid.org/0000-0002-7739-9422), School of Mathematics and Computer Science, Xinyu University, Xinyu, China
title Improving Visual Question Answering by Image Captioning
topic Deep learning
image captioning
multimodal learning
visual question answering
url https://ieeexplore.ieee.org/document/10918635/