MSAM: Video Question Answering Based on Multi-Stage Attention Model

The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for attention-based VideoQA methods to fully understand and accurately locate the video information related to the question...

Full description

Bibliographic Details
Main Authors: LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia
Format: Article
Language: zho (Chinese)
Published: Harbin University of Science and Technology Publications 2022-08-01
Series:Journal of Harbin University of Science and Technology
Subjects:
Online Access: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123
_version_ 1849713292609585152
author LIANG Li-li
LIU Xin-yu
SUN Guang-lu
ZHU Su-xia
author_facet LIANG Li-li
LIU Xin-yu
SUN Guang-lu
ZHU Su-xia
author_sort LIANG Li-li
collection DOAJ
description The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for attention-based VideoQA methods to fully understand and accurately locate the video information related to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. The network extracts multi-modal features from video, audio, and text, and feeds them into the multi-stage attention model (MSAM), which accurately locates the relevant video information through a stage-by-stage localization method. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to compute the correlation between features of different modalities. The network is evaluated on the ZJL dataset and achieves an average accuracy of 54.3%, nearly 15% higher than the traditional method and nearly 7% higher than the existing method.
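This record does not include the TCCB formulation itself, only its purpose (fusing three modal features via a compact bilinear operation). As a rough, hypothetical illustration of the general technique family such methods build on — compact bilinear fusion approximated with count sketches and element-wise products in the FFT domain (MCB-style pooling), not the paper's actual TCCB algorithm — a minimal NumPy sketch might look like this; all names, dimensions, and feature choices below are illustrative assumptions:

```python
import numpy as np

def count_sketch(x, h, s, d):
    # Project feature vector x into a d-dimensional sketch:
    # input index i is hashed to bucket h[i] with random sign s[i].
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear_fusion(features, d=1024, seed=0):
    # Count-sketch each modality, then multiply the sketches
    # element-wise in the FFT domain; this approximates the
    # (higher-order) outer product of the input features.
    rng = np.random.default_rng(seed)
    fused = np.ones(d, dtype=complex)
    for x in features:
        h = rng.integers(0, d, size=x.shape[0])       # hash buckets
        s = rng.choice([-1.0, 1.0], size=x.shape[0])  # random signs
        fused *= np.fft.fft(count_sketch(x, h, s, d))
    return np.real(np.fft.ifft(fused))

# Hypothetical per-modality feature vectors (dimensions assumed).
video = np.random.randn(2048)  # e.g. CNN frame feature
audio = np.random.randn(128)   # e.g. audio embedding
text = np.random.randn(300)    # e.g. question embedding
z = compact_bilinear_fusion([video, audio, text], d=1024)
```

The appeal of this family of methods is that the fused vector stays at a fixed dimension d regardless of the input dimensions, avoiding the explosion of an explicit triple outer product (2048 × 128 × 300 entries here).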
format Article
id doaj-art-56ff8cac676b4e11bc465a4afd92add8
institution DOAJ
issn 1007-2683
language zho
publishDate 2022-08-01
publisher Harbin University of Science and Technology Publications
record_format Article
series Journal of Harbin University of Science and Technology
spelling doaj-art-56ff8cac676b4e11bc465a4afd92add8 (last updated 2025-08-20T03:13:59Z)
Language: zho
Publisher: Harbin University of Science and Technology Publications
Series: Journal of Harbin University of Science and Technology
ISSN: 1007-2683
Published: 2022-08-01, Vol. 27, No. 4, pp. 107-117
DOI: 10.15938/j.jhust.2022.04.014
Title: MSAM: Video Question Answering Based on Multi-Stage Attention Model
Authors: LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia (all: School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)
Abstract: The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for attention-based VideoQA methods to fully understand and accurately locate the video information related to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. The network extracts multi-modal features from video, audio, and text, and feeds them into the multi-stage attention model (MSAM), which accurately locates the relevant video information through a stage-by-stage localization method. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to compute the correlation between features of different modalities. The network is evaluated on the ZJL dataset and achieves an average accuracy of 54.3%, nearly 15% higher than the traditional method and nearly 7% higher than the existing method.
URL: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123
Keywords: video question answering; multi-stage attention model; multi-modal feature fusion
spellingShingle LIANG Li-li
LIU Xin-yu
SUN Guang-lu
ZHU Su-xia
MSAM: Video Question Answering Based on Multi-Stage Attention Model
Journal of Harbin University of Science and Technology
video question answering
multi-stage attention model
multi-modal feature fusion
title MSAM: Video Question Answering Based on Multi-Stage Attention Model
title_full MSAM: Video Question Answering Based on Multi-Stage Attention Model
title_fullStr MSAM: Video Question Answering Based on Multi-Stage Attention Model
title_full_unstemmed MSAM: Video Question Answering Based on Multi-Stage Attention Model
title_short MSAM: Video Question Answering Based on Multi-Stage Attention Model
title_sort msam video question answering based on multi stage attention model
topic video question answering
multi-stage attention model
multi-modal feature fusion
url https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123
work_keys_str_mv AT lianglili msamvideoquestionansweringbasedonmultistageattentionmodel
AT liuxinyu msamvideoquestionansweringbasedonmultistageattentionmodel
AT sunguanglu msamvideoquestionansweringbasedonmultistageattentionmodel
AT zhusuxia msamvideoquestionansweringbasedonmultistageattentionmodel