Improving Visual Question Answering by Image Captioning

Bibliographic Details
Main Authors: Xiangjun Shao, Hongsong Dong, Guangsheng Wu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10918635/
Description
Summary: Visual Question Answering (VQA) is a challenging task that bridges the computer vision and natural language processing communities. It provides natural language answers to questions about an associated image. Most existing VQA methods focus on fusing and reasoning over visual features together with the textual question. However, visual features often lack the semantic information required to answer questions accurately. To address this limitation, we propose a novel approach called Question-Guided Parallel Attention (QGPA), which leverages the semantic information provided by an embedded image captioning model to answer related questions. First, we introduce an Attention-Aware (AA) mechanism that extends the traditional attention mechanism, helping to filter out incorrect or irrelevant information during answer prediction. Second, QGPA incorporates AA to exploit visual features and the semantic information from the embedded image captioning model simultaneously when answering questions. Experimental results show that our proposed model achieves an “Overall” accuracy of 72.57% and 72.55% on the test-dev and test-std splits of the VQA-v2.0 dataset, respectively, outperforming most existing VQA methods. Ablation studies further confirm the contribution of each component.
ISSN: 2169-3536
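
The summary above describes two question-guided attention streams running in parallel, one over visual region features and one over the embedded captioning model's output. The paper's exact formulation of QGPA and the AA mechanism is not reproduced in this record, so the following PyTorch sketch is only one plausible reading of that description; all dimensions, layer names, the additive attention scoring, and the concatenation fusion are assumptions, not the authors' published architecture.

# Hypothetical sketch of question-guided parallel attention for VQA.
# Dimensions and fusion choices are assumptions, not the QGPA paper's
# actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedParallelAttention(nn.Module):
    def __init__(self, q_dim=1024, v_dim=2048, c_dim=300,
                 h_dim=512, num_answers=3129):
        super().__init__()
        # Project each modality into a shared hidden space before scoring.
        self.q_proj = nn.Linear(q_dim, h_dim)
        self.v_proj = nn.Linear(v_dim, h_dim)
        self.c_proj = nn.Linear(c_dim, h_dim)
        self.v_score = nn.Linear(h_dim, 1)
        self.c_score = nn.Linear(h_dim, 1)
        self.classifier = nn.Linear(2 * h_dim, num_answers)

    def attend(self, q, feats, proj, score):
        # Additive attention: score each feature against the question,
        # then take the attention-weighted sum.
        keys = proj(feats)                                 # (B, N, h)
        logits = score(torch.tanh(keys + q.unsqueeze(1)))  # (B, N, 1)
        weights = F.softmax(logits, dim=1)                 # weights over N items
        return (weights * keys).sum(dim=1)                 # (B, h)

    def forward(self, q_feat, v_feats, c_feats):
        # q_feat:  (B, q_dim)      question encoding
        # v_feats: (B, Nv, v_dim)  region features from an object detector
        # c_feats: (B, Nc, c_dim)  token embeddings from the caption model
        q = self.q_proj(q_feat)
        v_att = self.attend(q, v_feats, self.v_proj, self.v_score)
        c_att = self.attend(q, c_feats, self.c_proj, self.c_score)
        # Fuse the two attended streams and predict an answer class.
        return self.classifier(torch.cat([v_att, c_att], dim=-1))

# Example shapes: batch of 8, 36 regions, 20 caption tokens.
# model = QuestionGuidedParallelAttention()
# logits = model(torch.randn(8, 1024),
#                torch.randn(8, 36, 2048),
#                torch.randn(8, 20, 300))   # -> (8, 3129)

The two attention branches share the same question projection, which is one simple way to realize "question-guided" parallel attention; the actual AA filtering step described in the abstract would sit on top of these attention weights.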