A Multi-Modal Attentive Framework That Can Interpret Text (MMAT)

Deep learning algorithms have demonstrated exceptional performance on various computer vision and natural language processing tasks. However, for machines to learn from information signals, they must understand the scene and have enough reasoning power to answer general questions about the linguistic features present in images. Questions such as “What temperature is my oven set to?” require a model to visually understand the objects in an image and then spatially identify the text associated with them. Existing Visual Question Answering models fail to recognize linguistic features present in images, a capability that is crucial for assisting the visually impaired. This paper addresses the task of a visual question answering system that can reason jointly over text, optical character recognition (OCR), and visual modalities. The proposed Visual Question Answering model focuses on the most relevant parts of the image with an attention mechanism and passes all features to a fusion encoder after computing pairwise attention, where the model is inclined toward the OCR-linguistic features. Instead of a fixed-answer classifier, the model uses a dynamic pointer network for iterative answer prediction, trained with a focal loss function to overcome the class-imbalance problem. The proposed model obtains an accuracy of 46.8% on the TextVQA dataset and an average of 55.21% on the ST-VQA dataset. The results indicate the effectiveness of the proposed approach: a Multi-Modal Attentive Framework (MMAT) that learns individual text, object, and OCR features and then predicts answers based on the text in the image.


Bibliographic Details
Main Authors: Vijay Kumari, Sarthak Gupta, Yashvardhan Sharma, Lavika Goel
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Visual question answering system (VQA); text visual question answering system (Text-VQA); optical character recognition (OCR); attention mechanism; natural language processing (NLP)
Online Access:https://ieeexplore.ieee.org/document/11072709/
DOAJ record: doaj-art-0afdfbcc2b744ed9ab840ba2cf8d41f4 (indexed 2025-08-20T02:40:11Z)
Journal: IEEE Access, vol. 13, pp. 121955-121969, published 2025-01-01 by IEEE
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3586900 (IEEE document 11072709)
Authors and affiliations:
Vijay Kumari (https://orcid.org/0000-0001-5279-3346), Department of CSIS, Birla Institute of Technology and Science Pilani, Pilani Campus, Jhunjhunu, Rajasthan, India
Sarthak Gupta, Department of CSIS, Birla Institute of Technology and Science Pilani, Pilani Campus, Jhunjhunu, Rajasthan, India
Yashvardhan Sharma (https://orcid.org/0000-0002-6024-1872), Department of CSIS, Birla Institute of Technology and Science Pilani, Pilani Campus, Jhunjhunu, Rajasthan, India
Lavika Goel, Department of CSE, Malaviya National Institute of Technology, Jaipur, Rajasthan, India
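
The abstract above mentions training the answer predictor with a focal loss to counter class imbalance. As a point of reference only, the short PyTorch sketch below shows one common multi-class focal-loss formulation; the function name, the gamma and alpha values, and the single-label setting are illustrative assumptions, not details taken from the paper.

# Illustrative multi-class focal loss (a common formulation); the paper's exact
# loss variant and hyperparameters are not reproduced here, so these are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    logits:  (batch, num_classes) raw scores
    targets: (batch,) integer class indices
    """
    log_probs = F.log_softmax(logits, dim=-1)                        # log p for every class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)    # log p_t of the true class
    pt = log_pt.exp()
    loss = -alpha * (1.0 - pt) ** gamma * log_pt                     # down-weights easy (high p_t) examples
    return loss.mean()

The (1 - p_t)^gamma factor is what addresses class imbalance: confidently classified (frequent, easy) answers contribute little to the gradient, so rare answer classes are not drowned out.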
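
The abstract also describes attending to the most relevant image regions and passing question, object, and OCR features through a fusion encoder after pairwise attention. The sketch below illustrates that general pattern (pairwise cross-modal attention followed by a Transformer fusion encoder), assuming all modalities are already projected to a common dimension; the module names, dimensions, and layer counts are assumptions for illustration, not the paper's architecture.

# Minimal sketch of pairwise cross-modal attention feeding a fusion encoder.
# Shapes, module choices, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PairwiseAttentionFusion(nn.Module):
    def __init__(self, dim=768, heads=8, layers=4):
        super().__init__()
        # One cross-attention block per illustrated modality pair:
        # question -> OCR tokens and question -> object regions.
        self.q_to_ocr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, q_feats, ocr_feats, obj_feats):
        # Question tokens attend to OCR tokens and to object regions, so the
        # fused sequence carries OCR-linguistic evidence alongside visual cues.
        q_ocr, _ = self.q_to_ocr(q_feats, ocr_feats, ocr_feats)
        q_obj, _ = self.q_to_obj(q_feats, obj_feats, obj_feats)
        fused_in = torch.cat([q_feats + q_ocr + q_obj, ocr_feats, obj_feats], dim=1)
        return self.fusion(fused_in)

# Example usage with random features: 20 question tokens, 50 OCR tokens, 36 object regions.
q = torch.randn(2, 20, 768)
ocr = torch.randn(2, 50, 768)
obj = torch.randn(2, 36, 768)
out = PairwiseAttentionFusion()(q, ocr, obj)   # shape (2, 106, 768)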