Weight-based multi-stream model for Multi-Modal Video Question Answering

Bibliographic Details
Main Authors: Mohith Rajesh, Sanjiv Sridhar, Chinmay Kulkarni, Aaditya Shah, Natarajan S
Format: Article
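Language: English
Published: LibraryPress@UF 2023-05-01
Series: Proceedings of the International Florida Artificial Intelligence Research Society Conference
Subjects: video question answering; attention mechanism; computer vision; natural language processing; neural networks; pretrained models; transfer learning; weight-based multi-stream model; TVQA dataset; CLIP; vision transformers; DeBERTa; multimedia; multi-modal
Online Access: https://journals.flvc.org/FLAIRS/article/view/133306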
author Mohith Rajesh
Sanjiv Sridhar
Chinmay Kulkarni
Aaditya Shah
Natarajan S
collection DOAJ
description The individual fields of Computer Vision, Natural Language Processing, and Knowledge Representation have each seen tremendous success. Videos are a rich source of information that blends several modalities: images, audio, and optionally subtitles. Current research combines these fields, which has given rise to tasks such as image captioning, visual question answering, and video question answering. Video Question Answering is a task that draws on object detection and recognition, temporal information processing, visual attention, and natural language processing. In this paper, we propose a model with an attention mechanism for Video Question Answering that assigns varying weights to the many pieces of information a video contains. The model combines the question with three streams, i.e., the video's frames, subtitles, and objects, to arrive at the most probable answer. Because it is trained and tested on the TVQA dataset, the model also receives the set of answer candidates as input and predicts one of them as the most probable answer.
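The weighting idea in the description can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes each of the three streams (frames, subtitles, objects) has already produced one score per answer candidate, and it mixes those scores with softmax-normalized stream weights. The function names, the importance values, and all numbers are hypothetical toy values chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_scores, stream_logits):
    """Mix per-stream candidate scores with softmax-normalized stream weights.

    stream_scores: stream name -> list of scores, one per answer candidate
    stream_logits: stream name -> unnormalized importance (hypothetical input;
                   a trained model would produce these from the question)
    """
    names = sorted(stream_scores)
    weights = dict(zip(names, softmax([stream_logits[n] for n in names])))
    n_candidates = len(stream_scores[names[0]])
    fused = [
        sum(weights[n] * stream_scores[n][i] for n in names)
        for i in range(n_candidates)
    ]
    return fused, weights


def predict(stream_scores, stream_logits):
    """Return (index of the most probable candidate, probabilities, weights)."""
    fused, weights = fuse_streams(stream_scores, stream_logits)
    probs = softmax(fused)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs, weights


# Toy scores for five answer candidates (made-up numbers, not TVQA results).
toy_scores = {
    "frames":    [0.2, 1.5, 0.1, 0.3, 0.0],
    "subtitles": [0.1, 0.4, 2.0, 0.2, 0.1],
    "objects":   [0.3, 1.0, 0.2, 0.1, 0.4],
}

# Case A: the visual streams are weighted more heavily than the subtitles.
visual_logits = {"frames": 1.0, "subtitles": 0.0, "objects": 1.0}
# Case B: the subtitles dominate.
subtitle_logits = {"frames": 0.0, "subtitles": 3.0, "objects": 0.0}

best_visual, probs_visual, weights_visual = predict(toy_scores, visual_logits)
best_subtitle, probs_subtitle, weights_subtitle = predict(toy_scores, subtitle_logits)

print("visual-heavy weights pick candidate", best_visual)
print("subtitle-heavy weights pick candidate", best_subtitle)
```

With these toy numbers the two weightings select different candidates, which is the point of the sketch: the same per-stream evidence can lead to different answers depending on how much each stream is trusted.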
format Article
id doaj-art-97d7849cc6bf4d75abe041e48e96dcda
institution OA Journals
issn 2334-0754
2334-0762
language English
publishDate 2023-05-01
publisher LibraryPress@UF
record_format Article
series Proceedings of the International Florida Artificial Intelligence Research Society Conference
spelling doaj-art-97d7849cc6bf4d75abe041e48e96dcda
2025-08-20T01:52:18Z
eng
LibraryPress@UF
Proceedings of the International Florida Artificial Intelligence Research Society Conference
2334-0754
2334-0762
2023-05-01
36
10.32473/flairs.36.133306
69612
Weight-based multi-stream model for Multi-Modal Video Question Answering
Mohith Rajesh, PES University, https://orcid.org/0000-0002-3621-4946
Sanjiv Sridhar, PES University, https://orcid.org/0000-0002-4080-5191
Chinmay Kulkarni, PES University, https://orcid.org/0000-0003-2935-9861
Aaditya Shah, PES University
Natarajan S, PES University, https://orcid.org/0000-0002-8689-5137
https://journals.flvc.org/FLAIRS/article/view/133306
title Weight-based multi-stream model for Multi-Modal Video Question Answering
topic video question answering
attention mechanism
computer vision
natural language processing
neural networks
pretrained models
transfer learning
weight-based multi-stream model
TVQA dataset
CLIP
vision transformers
DeBERTa
multimedia
multi-modal
url https://journals.flvc.org/FLAIRS/article/view/133306