A Semantic Weight Adaptive Model Based on Visual Question Answering

Bibliographic Details
Main Authors: Li Huimin, Li Xuan, Chen Yan
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10633287/
Description
Summary: Visual Question Answering (VQA) is an advanced artificial intelligence task that combines computer vision and natural language processing technologies. Its core objective is to enable computers to accurately answer natural language questions posed by users about image content, with these questions being either open-ended or closed-ended. For instance, the system must address closed-ended questions such as “Does the image contain 11 goats?” and open-ended ones like “Where was this photo taken?” To accomplish this task, computers must not only deeply analyze image content but also precisely comprehend and respond to complex natural language expressions.

However, current VQA models often struggle with questions requiring deep semantic analysis, because they fail to fully capture the semantic information within the questions. This limitation significantly hinders their capacity to decipher complex relationships between objects in images and to perform high-level semantic reasoning.

To address this challenge, and recognizing the differing natures of open-ended and closed-ended tasks, we propose a conditional reasoning model called the Semantic Weight Adaptive Model Network (SWAMN). The crux of this model lies in its ability to automatically extract task-relevant information from questions and use it to dynamically guide the fusion of multimodal features. This allows SWAMN to integrate image and language information more intelligently and thus answer user questions more accurately.

To validate the effectiveness of SWAMN, we conducted extensive ablation studies on the benchmark dataset VQA V2.0. Through both qualitative and quantitative evaluations, we not only examined the fundamental reasons for the model’s strong performance but also demonstrated that SWAMN achieves an overall accuracy of 70.82% on test-dev, significantly surpassing current state-of-the-art models and setting a new milestone in the field of VQA.
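The abstract describes a mechanism in which semantic information extracted from the question dynamically weights the fusion of image and language features. The record does not give SWAMN’s actual architecture, so the following is only a minimal illustrative sketch of question-conditioned gated fusion under that reading; all names (`adaptive_fusion`, `W_gate`, `b_gate`) are hypothetical, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(q_feat, v_feat, W_gate, b_gate):
    """Fuse question and image features with a question-derived gate.

    The gate g lies in (0, 1) and is predicted from the question features
    alone, so the question's semantics decide, per dimension, how much the
    visual versus the linguistic signal contributes to the fused vector.
    """
    g = sigmoid(W_gate @ q_feat + b_gate)   # per-dimension fusion weights
    return g * v_feat + (1.0 - g) * q_feat  # element-wise convex combination

# Toy example with 4-dimensional features and random gate parameters.
rng = np.random.default_rng(0)
q = rng.standard_normal(4)        # question embedding (hypothetical)
v = rng.standard_normal(4)        # image embedding (hypothetical)
W = rng.standard_normal((4, 4))
b = np.zeros(4)
fused = adaptive_fusion(q, v, W, b)
print(fused.shape)  # (4,)
```

Because the gate is a convex combination, each fused component stays between the corresponding question and image components; in a trained model the gate parameters would be learned so that, e.g., counting questions lean on visual features while scene-level questions lean on language.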
ISSN: 2169-3536