A Semantic Weight Adaptive Model Based on Visual Question Answering

Bibliographic Details
Main Authors: Li Huimin, Li Xuan, Chen Yan
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10633287/
_version_ 1849764240679763968
author Li Huimin
Li Xuan
Chen Yan
author_facet Li Huimin
Li Xuan
Chen Yan
author_sort Li Huimin
collection DOAJ
description Visual Question Answering (VQA) is an advanced artificial intelligence task that combines computer vision and natural language processing technologies. Its core objective is to enable computers to accurately answer natural language questions posed by users about image content, with these questions being either open-ended or closed-ended. For instance, the system must address closed-ended questions such as “Does the image contain 11 goats?” and open-ended ones like “Where was this photo taken?” To accomplish this task, computers must not only deeply analyze image content but also precisely comprehend and respond to complex natural language expressions. However, current VQA models often struggle when dealing with questions requiring deep semantic analysis due to their inability to fully capture the semantic information within the questions. This limitation significantly hinders the models’ capacity to decipher complex relationships between objects in images and perform high-level semantic reasoning. To address this challenge, and recognizing the differing natures of open-ended and closed-ended tasks, we propose a conditional reasoning model called the Semantic Weight Adaptive Model Network (SWAMN). The crux of this model lies in its ability to automatically extract task-relevant information from questions to dynamically guide the fusion process of multimodal features. This means that SWAMN can more intelligently integrate image and language information to provide more accurate answers to user questions. To validate the effectiveness of the SWAMN model, we conducted extensive ablation studies on the benchmark dataset VQA V2.0. Through both qualitative and quantitative evaluations, we not only delved into the fundamental reasons for the model’s outstanding performance but also demonstrated that SWAMN achieved an overall accuracy of 70.82% on test-dev, significantly surpassing current state-of-the-art models and setting a new milestone in the field of VQA.
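The abstract describes SWAMN only at a high level: a gate derived from the question dynamically weights how image and language features are fused. The record does not give the actual architecture, so the sketch below is a generic illustration of question-conditioned weighted fusion, not the paper's method; the function names, the softmax gating formulation, and the toy parameters are all assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def semantic_weighted_fusion(img_feat, q_feat, gate_weights):
    """Toy question-conditioned fusion (hypothetical, not SWAMN's exact design):
    two gate logits are computed from the question embedding alone, then a
    softmax turns them into convex weights for the image and question features."""
    logits = [sum(w * q for w, q in zip(row, q_feat)) for row in gate_weights]
    g_img, g_q = softmax(logits)
    fused = [g_img * v + g_q * q for v, q in zip(img_feat, q_feat)]
    return fused, (g_img, g_q)

# Illustrative 4-dimensional features and made-up gate parameters.
img = [0.2, 0.8, 0.5, 0.1]
q = [0.9, 0.1, 0.4, 0.7]
gate_w = [[0.5, -0.2, 0.1, 0.3],   # logit row for the image modality
          [-0.1, 0.4, 0.2, 0.6]]  # logit row for the question modality
fused, (g_img, g_q) = semantic_weighted_fusion(img, q, gate_w)
print(round(g_img + g_q, 6))  # softmax gates form a convex combination: 1.0
```

Because the gate depends only on the question, a counting question and a location question can weight the same image features differently, which is the intuition the abstract attributes to semantic-weight adaptation.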
format Article
id doaj-art-ec3fae65bd8a4ccb8147eeb45b0a394c
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-ec3fae65bd8a4ccb8147eeb45b0a394c (indexed 2025-08-20T03:05:11Z)
Language: eng; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536
Published: 2025-01-01; Volume 13, pp. 59170-59176
DOI: 10.1109/ACCESS.2024.3442129; IEEE Xplore document: 10633287
Title: A Semantic Weight Adaptive Model Based on Visual Question Answering
Authors: Li Huimin (https://orcid.org/0000-0001-6434-4476), Li Xuan, Chen Yan, all of The Third Research Institute, Ministry of Public Security, Shanghai, China
Abstract: see the description field above.
Online Access: https://ieeexplore.ieee.org/document/10633287/
Keywords: Visual questions answering (VQA); conditional reasoning; open-ended questions; closed-ended questions; multi-modal feature fusion
spellingShingle Li Huimin
Li Xuan
Chen Yan
A Semantic Weight Adaptive Model Based on Visual Question Answering
IEEE Access
Visual questions answering (VQA)
conditional reasoning
open-ended questions
closed-ended questions
multi-modal feature fusion
title A Semantic Weight Adaptive Model Based on Visual Question Answering
title_full A Semantic Weight Adaptive Model Based on Visual Question Answering
title_fullStr A Semantic Weight Adaptive Model Based on Visual Question Answering
title_full_unstemmed A Semantic Weight Adaptive Model Based on Visual Question Answering
title_short A Semantic Weight Adaptive Model Based on Visual Question Answering
title_sort semantic weight adaptive model based on visual question answering
topic Visual questions answering (VQA)
conditional reasoning
open-ended questions
closed-ended questions
multi-modal feature fusion
url https://ieeexplore.ieee.org/document/10633287/
work_keys_str_mv AT lihuimin asemanticweightadaptivemodelbasedonvisualquestionanswering
AT lixuan asemanticweightadaptivemodelbasedonvisualquestionanswering
AT chenyan asemanticweightadaptivemodelbasedonvisualquestionanswering
AT lihuimin semanticweightadaptivemodelbasedonvisualquestionanswering
AT lixuan semanticweightadaptivemodelbasedonvisualquestionanswering
AT chenyan semanticweightadaptivemodelbasedonvisualquestionanswering