MSAM: Video Question Answering Based on Multi-Stage Attention Model

Bibliographic Details
Main Authors: LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia
Format: Article
Language: Chinese (zho)
Published: Harbin University of Science and Technology Publications 2022-08-01
Series:Journal of Harbin University of Science and Technology
Subjects:
Online Access:https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123
Description
Summary:The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question in order to generate an answer. At present, it is difficult for attention-based VideoQA methods to fully understand and accurately locate the video information relevant to the question. To address this problem, a multi-stage attention model network (MSAMN) is proposed. The network extracts multi-modal features from video, audio, and text, and feeds them into the multi-stage attention model (MSAM), which accurately locates the relevant video information through stage-by-stage localization. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between features of different modalities. The network is evaluated on the ZJL dataset, achieving an average accuracy of 54.3%, nearly 15% higher than traditional methods and nearly 7% higher than existing methods.
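The abstract does not detail how TCCB fuses the three modal features. As a rough illustration only, the sketch below assumes TCCB generalizes compact bilinear pooling (count sketch followed by FFT-domain element-wise products) from two modalities to three; the function names, dimensions, and this interpretation are all assumptions, not the authors' published method.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dims via count sketch with hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # accumulate signed entries into hashed buckets
    return y

def triple_compact_bilinear(video, audio, text, d=256, seed=0):
    """Approximate the triple outer product of three feature vectors in d dims:
    multiply the FFTs of their count sketches element-wise, then invert the FFT.
    (Hypothetical reading of TCCB; not taken from the paper.)"""
    rng = np.random.default_rng(seed)
    out = np.ones(d, dtype=complex)
    for v in (video, audio, text):
        h = rng.integers(0, d, size=v.shape[0])        # random hash indices
        s = rng.choice([-1.0, 1.0], size=v.shape[0])   # random signs
        out *= np.fft.fft(count_sketch(v, h, s, d))
    return np.real(np.fft.ifft(out))

# Fuse hypothetical 512-dim video, audio, and text features into one 256-dim vector.
rng = np.random.default_rng(1)
video_feat = rng.standard_normal(512)
audio_feat = rng.standard_normal(512)
text_feat = rng.standard_normal(512)
fused = triple_compact_bilinear(video_feat, audio_feat, text_feat, d=256)
```

The FFT trick keeps the fused representation compact: an exact triple outer product would have 512³ entries, while the sketch stays at d dimensions.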
ISSN:1007-2683