BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
Visual Question Answering (VQA) is a challenging problem in Artificial Intelligence (AI) that requires an understanding of both natural language and computer vision to respond to inquiries about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and facilitating content-based image retrieval. While there has been extensive research on VQA, most of it has focused predominantly on English, often overlooking the complexity associated with low-resource languages, especially Bengali. To facilitate research in this area, we developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). The BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complex nature of their architectures and their adaptations to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Furthermore, our method utilizes a multi-head attention mechanism to generate three distinct vision-language representations and fuses them using a gating mechanism to answer the query regarding an image. Extensive experiments on the BVQA dataset show that the proposed method outperforms the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA.
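The abstract describes MCRAN only at a high level: pretrained transformer encoders, a multi-head attention mechanism that yields three distinct vision-language representations, and a gating mechanism that fuses them. The sketch below is one illustrative reading of that description, not the authors' released implementation (see the linked repository for that); the specific three attention directions, the mean pooling, the softmax-style gate, the module names, and all dimensions are assumptions.

```python
# Hedged sketch of a cross-attention + gated-fusion head in the spirit of the
# abstract's description of MCRAN. Every design choice below (three attention
# views, pooling, softmax gate, answer-classification head) is an assumption.
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_answers=1000):
        super().__init__()
        # One multi-head attention block per assumed vision-language view.
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.joint = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate that weighs how much each fused view contributes.
        self.gate = nn.Linear(3 * dim, 3)
        # Answer vocabulary size is a placeholder.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim) from a pretrained text transformer
        # image_feats: (B, P, dim) from a pretrained vision transformer
        r1, _ = self.txt2img(text_feats, image_feats, image_feats)  # text attends to image
        r2, _ = self.img2txt(image_feats, text_feats, text_feats)   # image attends to text
        joint = torch.cat([text_feats, image_feats], dim=1)
        r3, _ = self.joint(joint, joint, joint)                     # joint self-attention
        # Pool each representation to a single vector per example.
        v1, v2, v3 = r1.mean(dim=1), r2.mean(dim=1), r3.mean(dim=1)
        # Gating: softmax weights over the three vision-language views.
        weights = torch.softmax(self.gate(torch.cat([v1, v2, v3], dim=-1)), dim=-1)
        fused = weights[:, 0:1] * v1 + weights[:, 1:2] * v2 + weights[:, 2:3] * v3
        return self.classifier(fused)  # logits over candidate answers


# Example with dummy features standing in for pretrained encoder outputs.
model = GatedCrossAttentionFusion(dim=768, num_heads=8, num_answers=1000)
text = torch.randn(2, 20, 768)    # (batch, question tokens, dim)
image = torch.randn(2, 49, 768)   # (batch, image patches, dim)
logits = model(text, image)       # shape: (2, 1000)
```

Treating answer prediction as classification over a fixed answer vocabulary is likewise an assumption; the abstract does not state how answers are decoded.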
| Main Authors: | Md. Shalha Mucha Bhuyan, Eftekhar Hossain, Khaleda Akhter Sathi, Md. Azad Hossain, M. Ali Akber Dewan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Visual question answering; multimodal deep learning; large language model; natural language processing; multi-head attention mechanism |
| Online Access: | https://ieeexplore.ieee.org/document/10878995/ |
| _version_ | 1849716080557162496 |
|---|---|
| author | Md. Shalha Mucha Bhuyan; Eftekhar Hossain; Khaleda Akhter Sathi; Md. Azad Hossain; M. Ali Akber Dewan |
| author_facet | Md. Shalha Mucha Bhuyan; Eftekhar Hossain; Khaleda Akhter Sathi; Md. Azad Hossain; M. Ali Akber Dewan |
| author_sort | Md. Shalha Mucha Bhuyan |
| collection | DOAJ |
| description | Visual Question Answering (VQA) is a challenging problem in Artificial Intelligence (AI) that requires an understanding of both natural language and computer vision to respond to inquiries about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and facilitating content-based image retrieval. While there has been extensive research on VQA, most of it has focused predominantly on English, often overlooking the complexity associated with low-resource languages, especially Bengali. To facilitate research in this area, we developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). The BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complex nature of their architectures and their adaptations to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Furthermore, our method utilizes a multi-head attention mechanism to generate three distinct vision-language representations and fuses them using a gating mechanism to answer the query regarding an image. Extensive experiments on the BVQA dataset show that the proposed method outperforms the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA. |
| format | Article |
| id | doaj-art-36a75d04fc6f4edbae3b790979c3a34c |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-36a75d04fc6f4edbae3b790979c3a34c; 2025-08-20T03:13:08Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 27570-27586; DOI 10.1109/ACCESS.2025.3540388; article no. 10878995; BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering; Md. Shalha Mucha Bhuyan (https://orcid.org/0009-0002-3875-2526), Eftekhar Hossain (https://orcid.org/0000-0003-4575-0596), Khaleda Akhter Sathi (https://orcid.org/0000-0003-0031-9284), and Md. Azad Hossain (https://orcid.org/0000-0002-8251-5168), Department of Electronics and Telecommunication Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh; M. Ali Akber Dewan (https://orcid.org/0000-0001-6347-7509), School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Athabasca, AB, Canada; https://ieeexplore.ieee.org/document/10878995/ |
| spellingShingle | Md. Shalha Mucha Bhuyan; Eftekhar Hossain; Khaleda Akhter Sathi; Md. Azad Hossain; M. Ali Akber Dewan; BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering; IEEE Access; Visual question answering; multimodal deep learning; large language model; natural language processing; multi-head attention mechanism |
| title | BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering |
| title_full | BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering |
| title_fullStr | BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering |
| title_full_unstemmed | BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering |
| title_short | BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering |
| title_sort | bvqa connecting language and vision through multimodal attention for open ended question answering |
| topic | Visual question answering; multimodal deep learning; large language model; natural language processing; multi-head attention mechanism |
| url | https://ieeexplore.ieee.org/document/10878995/ |
| work_keys_str_mv | AT mdshalhamuchabhuyan bvqaconnectinglanguageandvisionthroughmultimodalattentionforopenendedquestionanswering AT eftekharhossain bvqaconnectinglanguageandvisionthroughmultimodalattentionforopenendedquestionanswering AT khaledaakhtersathi bvqaconnectinglanguageandvisionthroughmultimodalattentionforopenendedquestionanswering AT mdazadhossain bvqaconnectinglanguageandvisionthroughmultimodalattentionforopenendedquestionanswering AT maliakberdewan bvqaconnectinglanguageandvisionthroughmultimodalattentionforopenendedquestionanswering |