Improving Visual Question Answering by Image Captioning
Visual Question Answering (VQA) is a challenging task that bridges the computer vision and natural language processing communities. It provides natural language answers to questions related to an associated image. Most existing VQA methods focus on the fusion and inference of visual features with the...
| Main Authors: | Xiangjun Shao, Hongsong Dong, Guangsheng Wu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Deep learning, image captioning, multimodal learning, visual question answering |
| Online Access: | https://ieeexplore.ieee.org/document/10918635/ |
| _version_ | 1850092614467977216 |
|---|---|
| author | Xiangjun Shao; Hongsong Dong; Guangsheng Wu |
| author_facet | Xiangjun Shao; Hongsong Dong; Guangsheng Wu |
| author_sort | Xiangjun Shao |
| collection | DOAJ |
| description | Visual Question Answering (VQA) is a challenging task that bridges the computer vision and natural language processing communities. It provides natural language answers to questions related to an associated image. Most existing VQA methods focus on the fusion and inference of visual features with the textual question. However, visual features often lack the semantic information required to answer questions accurately. To address this limitation, we propose a novel approach called Question-Guided Parallel Attention (QGPA), which leverages the semantic information provided by an embedded image captioning model to answer related questions. First, we introduce an Attention-Aware (AA) mechanism that extends the traditional attention mechanism, helping to filter out incorrect or irrelevant information during answer prediction. Second, QGPA incorporates AA and simultaneously utilizes visual features and semantic information from the embedded image captioning model to answer questions. Experimental results show that our model achieves an “Overall” accuracy of 72.57% and 72.55% on the test-dev and test-std splits of the VQA-v2.0 dataset, respectively, outperforming most existing VQA methods. Ablation studies further confirm the effectiveness of the proposed components. (An illustrative sketch of the QGPA idea follows the record below.) |
| format | Article |
| id | doaj-art-e4668c2298b04e33a1b617526d430b6e |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-e4668c2298b04e33a1b617526d430b6e; 2025-08-20T02:42:05Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; Vol. 13, pp. 46299-46311; DOI 10.1109/ACCESS.2025.3549478; IEEE document 10918635; Improving Visual Question Answering by Image Captioning; Xiangjun Shao (https://orcid.org/0000-0002-1401-621X), School of Computer and Electrical Engineering, Hunan University of Arts and Science, Changde, China; Hongsong Dong (https://orcid.org/0000-0002-5355-1995), Department of Computer Science, Lyuliang University, Lüliang, China; Guangsheng Wu (https://orcid.org/0000-0002-7739-9422), School of Mathematics and Computer Science, Xinyu University, Xinyu, China; abstract (as in the description field above); https://ieeexplore.ieee.org/document/10918635/; Deep learning; image captioning; multimodal learning; visual question answering |
| spellingShingle | Xiangjun Shao; Hongsong Dong; Guangsheng Wu; Improving Visual Question Answering by Image Captioning; IEEE Access; Deep learning; image captioning; multimodal learning; visual question answering |
| title | Improving Visual Question Answering by Image Captioning |
| title_full | Improving Visual Question Answering by Image Captioning |
| title_fullStr | Improving Visual Question Answering by Image Captioning |
| title_full_unstemmed | Improving Visual Question Answering by Image Captioning |
| title_short | Improving Visual Question Answering by Image Captioning |
| title_sort | improving visual question answering by image captioning |
| topic | Deep learning; image captioning; multimodal learning; visual question answering |
| url | https://ieeexplore.ieee.org/document/10918635/ |
| work_keys_str_mv | AT xiangjunshao improvingvisualquestionansweringbyimagecaptioning AT hongsongdong improvingvisualquestionansweringbyimagecaptioning AT guangshengwu improvingvisualquestionansweringbyimagecaptioning |
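
The abstract describes QGPA only at a high level, so the following is a minimal, hypothetical sketch of the idea it names, not the authors' implementation: a question vector guides two attention branches in parallel, one over visual region features and one over caption-token embeddings from an embedded captioning model. The sigmoid gating used here as a stand-in for the Attention-Aware (AA) filtering step, the feature dimensions, and the 3129-way answer vocabulary (a common VQA-v2 setting) are all assumptions.

```python
# Hypothetical sketch of Question-Guided Parallel Attention (QGPA).
# Dimensions, module names, and the AA-style gating are assumptions,
# not details taken from the paper.
import torch
import torch.nn as nn


class QGPASketch(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, c_dim=512, hid=512, n_answers=3129):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid)  # question features -> shared space
        self.v_proj = nn.Linear(v_dim, hid)  # visual region features -> shared space
        self.c_proj = nn.Linear(c_dim, hid)  # caption-token embeddings -> shared space
        # Question-conditioned sigmoid gates: our guess at how AA might
        # "filter out incorrect or irrelevant information".
        self.gate_v = nn.Sequential(nn.Linear(2 * hid, hid), nn.Sigmoid())
        self.gate_c = nn.Sequential(nn.Linear(2 * hid, hid), nn.Sigmoid())
        self.classifier = nn.Linear(2 * hid, n_answers)

    @staticmethod
    def attend(q, keys):
        # Soft attention of the question vector q (B, H) over keys (B, N, H).
        scores = torch.bmm(keys, q.unsqueeze(2)).squeeze(2)    # (B, N)
        alpha = torch.softmax(scores, dim=1)                   # attention weights
        return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)  # (B, H) summary

    def forward(self, q_feat, v_feats, cap_feats):
        q = self.q_proj(q_feat)     # (B, H)
        v = self.v_proj(v_feats)    # (B, Nv, H)
        c = self.c_proj(cap_feats)  # (B, Nc, H)
        # Two branches run in parallel, both guided by the question.
        v_att = self.attend(q, v)   # visual branch
        c_att = self.attend(q, c)   # semantic (caption) branch
        # Gate each branch before fusion, then classify over the answer set.
        v_att = v_att * self.gate_v(torch.cat([q, v_att], dim=1))
        c_att = c_att * self.gate_c(torch.cat([q, c_att], dim=1))
        return self.classifier(torch.cat([v_att, c_att], dim=1))
```

A quick shape check with random inputs (36 detected regions and 20 caption tokens per image are illustrative choices):

```python
model = QGPASketch()
logits = model(torch.randn(2, 512),       # question features
               torch.randn(2, 36, 2048),  # visual region features
               torch.randn(2, 20, 512))   # caption-token embeddings
print(logits.shape)  # torch.Size([2, 3129])
```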