MSAM: Video Question Answering Based on Multi-Stage Attention Model
The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for VideoQA methods based on attention models to fully understand and accurately locate the video information related to the question.
| Main Authors: | LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Harbin University of Science and Technology Publications, 2022-08-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | video question answering; multi-stage attention model; multi-modal feature fusion |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123 |
| _version_ | 1849713292609585152 |
|---|---|
| author | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia |
| author_facet | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia |
| author_sort | LIANG Li-li |
| collection | DOAJ |
| description | The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for VideoQA methods based on attention models to fully understand and accurately locate the video information related to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. This network extracts multi-modal features such as video, audio, and text, and feeds these features into the multi-stage attention model (MSAM), which accurately locates the relevant video information through a stage-by-stage localization method. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between different modal features. The network is tested on the ZJL dataset. The average accuracy is 54.3%, which is nearly 15% higher than traditional methods and nearly 7% higher than existing methods. |
| format | Article |
| id | doaj-art-56ff8cac676b4e11bc465a4afd92add8 |
| institution | DOAJ |
| issn | 1007-2683 |
| language | zho |
| publishDate | 2022-08-01 |
| publisher | Harbin University of Science and Technology Publications |
| record_format | Article |
| series | Journal of Harbin University of Science and Technology |
| spelling | doaj-art-56ff8cac676b4e11bc465a4afd92add82025-08-20T03:13:59ZzhoHarbin University of Science and Technology PublicationsJournal of Harbin University of Science and Technology1007-26832022-08-01270410711710.15938/j.jhust.2022.04.014MSAM:Video Question Answering Based on Multi-Stage Attention ModelLIANG Li-li0LIU Xin-yu1SUN Guang-lu2ZHU Su-xia3School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaThe video question answering (VideoQA) task requires understanding of semantic information of both the video and question to generate the answer.At present, it is difficult for VideoQA methods that are based on attention model to fully understand and accurately locate video information related to the question.To solve this problem, a multi-stage attention model network (MSAMN) is proposed. This network extracts multi-modal features such as video, audio and text and feeds these features into the multi-stage attention model (MSAM), which is able to accurately locate the video information through a stage-by-stage localization method.In order to improve the effectiveness of feature fusion, a triplemodal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between different modal features.This network is tested on the ZJL dataset.The average accuracy rate is 54.3%, which is nearly 15% higher than the traditional method and nearly 7% higher than the exist method.https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123video question answeringmulti-stage attention modelmulti-modal feature fusion |
| spellingShingle | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia MSAM:Video Question Answering Based on Multi-Stage Attention Model Journal of Harbin University of Science and Technology video question answering multi-stage attention model multi-modal feature fusion |
| title | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_full | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_fullStr | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_full_unstemmed | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_short | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_sort | msam video question answering based on multi stage attention model |
| topic | video question answering multi-stage attention model multi-modal feature fusion |
| url | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123 |
| work_keys_str_mv | AT lianglili msamvideoquestionansweringbasedonmultistageattentionmodel AT liuxinyu msamvideoquestionansweringbasedonmultistageattentionmodel AT sunguanglu msamvideoquestionansweringbasedonmultistageattentionmodel AT zhusuxia msamvideoquestionansweringbasedonmultistageattentionmodel |
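The abstract names a triple-modal compact concat bilinear (TCCB) fusion step but the record gives no implementation details. As an illustration only, the sketch below shows compact bilinear pooling in the style of MCB (count-sketch projection followed by elementwise products in the FFT domain) extended to three modalities; the names `tccb_fuse` and `count_sketch`, and the chosen dimensions, are assumptions for the sketch, not the paper's actual algorithm.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dims with a Count Sketch: index hash h, sign hash s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed buckets
    return y

rng = np.random.default_rng(0)
d_in, d_out = 512, 1024
# One (index, sign) hash pair per modality: video, audio, text.
hashes = [(rng.integers(0, d_out, size=d_in), rng.choice([-1.0, 1.0], size=d_in))
          for _ in range(3)]

def tccb_fuse(video, audio, text):
    """Fuse three modality vectors by multiplying their sketches' FFTs,
    i.e. circular convolution of the three sketches (MCB-style)."""
    prod = np.ones(d_out, dtype=complex)
    for x, (h, s) in zip((video, audio, text), hashes):
        prod *= np.fft.fft(count_sketch(x, h, s, d_out))
    return np.real(np.fft.ifft(prod))

v, a, t = (rng.standard_normal(d_in) for _ in range(3))
fused = tccb_fuse(v, a, t)
print(fused.shape)  # (1024,)
```

The FFT-domain product is what makes the pooling "compact": it approximates the outer product of the modality features in `d_out` dimensions instead of materializing a `d_in**3` tensor.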