BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering

Bibliographic Details
Main Authors: Md. Shalha Mucha Bhuyan, Eftekhar Hossain, Khaleda Akhter Sathi, Md. Azad Hossain, M. Ali Akber Dewan
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10878995/
author Md. Shalha Mucha Bhuyan
Eftekhar Hossain
Khaleda Akhter Sathi
Md. Azad Hossain
M. Ali Akber Dewan
collection DOAJ
description Visual Question Answering (VQA) is a challenging problem in Artificial Intelligence (AI) that requires an understanding of both natural language and computer vision to answer questions about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and enabling content-based image retrieval systems. While there has been extensive research on VQA, most of it has focused on English, often overlooking the complexity associated with low-resource languages, especially Bengali. To facilitate research in this area, we developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). The BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complexity of their architectures and their adaptations to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Our method uses a multi-head attention mechanism to generate three distinct vision-language representations and fuses them with a gating mechanism to answer the query about an image. Extensive experiments on the BVQA dataset show that the proposed method outperforms the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA.
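As a rough illustration of the kind of cross-attention and gated fusion the description mentions, here is a minimal, hypothetical PyTorch sketch. It is not the authors' MCRAN implementation: the specific choice of the three representations (text-attends-to-image, image-attends-to-text, and their element-wise product), the mean pooling, the gating layer, and all dimensions are assumptions made only to show the general pattern.

```python
# Hypothetical sketch of multi-head cross-attention with gated fusion for VQA.
# All design choices below are assumptions, not the published MCRAN architecture.
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_answers: int = 1000):
        super().__init__()
        # Multi-head attention applied in two directions over the encoder outputs.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Soft gate over the three pooled representations.
        self.gate = nn.Linear(3 * dim, 3)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (B, T, dim) tokens from a pretrained text transformer
        # image_feats: (B, P, dim) patch features from a pretrained vision transformer
        t2i, _ = self.txt2img(text_feats, image_feats, image_feats)  # text attends to image
        i2t, _ = self.img2txt(image_feats, text_feats, text_feats)  # image attends to text
        # Pool each attended sequence into a single vector.
        r1 = t2i.mean(dim=1)
        r2 = i2t.mean(dim=1)
        r3 = r1 * r2  # a third, joint representation (assumed)
        # Gating weights decide how much each representation contributes to the fusion.
        weights = torch.softmax(self.gate(torch.cat([r1, r2, r3], dim=-1)), dim=-1)
        fused = weights[:, 0:1] * r1 + weights[:, 1:2] * r2 + weights[:, 2:3] * r3
        return self.classifier(fused)  # logits over the open-ended answer vocabulary
```

In use, a Bengali question encoded to (batch, tokens, dim) and an image encoded to (batch, patches, dim) would be passed to the module, and the output logits scored against the answer vocabulary; the gating step lets the model weight whichever vision-language view is most informative for a given question.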
format Article
id doaj-art-36a75d04fc6f4edbae3b790979c3a34c
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-36a75d04fc6f4edbae3b790979c3a34c (indexed 2025-08-20T03:13:08Z)
IEEE Access, vol. 13, pp. 27570-27586, 2025-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2025.3540388. IEEE document 10878995.
Md. Shalha Mucha Bhuyan (https://orcid.org/0009-0002-3875-2526), Department of Electronics and Telecommunication Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
Eftekhar Hossain (https://orcid.org/0000-0003-4575-0596), Department of Electronics and Telecommunication Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
Khaleda Akhter Sathi (https://orcid.org/0000-0003-0031-9284), Department of Electronics and Telecommunication Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
Md. Azad Hossain (https://orcid.org/0000-0002-8251-5168), Department of Electronics and Telecommunication Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
M. Ali Akber Dewan (https://orcid.org/0000-0001-6347-7509), School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Athabasca, AB, Canada
title BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
topic Visual question answering
multimodal deep learning
large language model
natural language processing
multi-head attention mechanism
url https://ieeexplore.ieee.org/document/10878995/