Deep Memory Fusion Model for Long Video Question Answering
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Harbin University of Science and Technology Publications, 2021-02-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911 |
| Summary: | Long video question answering involves rich multimodal semantic and inference information. Current video question answering models based on recurrent neural networks struggle to fully retain important memory information, to ignore irrelevant redundant information, and to fuse memory information efficiently. To address this, a deep memory fusion model based on the memory network is proposed. The model uses the memory component of the memory network to retain the fused features of video clips and subtitles, and a multimodal similarity matching method is proposed to filter out redundant memory information. After a first fusion based on a convolutional network and a secondary fusion based on an attention mechanism, a context representation of the entire video is built for answer generation. Evaluated on the MovieQA dataset, the model achieves an average accuracy of 39.78%, nearly 10 percentage points higher than traditional methods and nearly 5 points higher than the state-of-the-art method, with good generalization performance. |
|---|---|
| ISSN: | 1007-2683 |
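The pipeline the summary describes (retain clip/subtitle memory features, filter them by similarity to the question, then fuse the survivors with attention into a context vector) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity filter stands in for the multimodal similarity matching method, the threshold value is arbitrary, and the convolutional first-fusion stage is assumed to have already produced the memory matrix.

```python
import numpy as np

def cosine_sim(query, memories):
    # Cosine similarity between the question vector and each memory slot.
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memories / (np.linalg.norm(memories, axis=1, keepdims=True) + 1e-8)
    return m @ q

def filter_memories(memories, query, threshold=0.2):
    """Drop memory slots dissimilar to the question -- a stand-in for
    the paper's multimodal similarity matching (threshold is a guess)."""
    sims = cosine_sim(query, memories)
    keep = sims > threshold
    if not keep.any():
        keep = sims == sims.max()  # always retain at least the best slot
    return memories[keep]

def attention_fusion(memories, query):
    """Secondary fusion: softmax attention over the retained slots,
    yielding one context vector for answer generation."""
    scores = memories @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memories

rng = np.random.default_rng(0)
memories = rng.standard_normal((6, 8))  # 6 fused clip+subtitle features
query = rng.standard_normal(8)          # encoded question

kept = filter_memories(memories, query)
context = attention_fusion(kept, query)
print(context.shape)  # (8,)
```

In the paper's setting the memory rows would come from the convolutional fusion of video-clip and subtitle features rather than random data, and the context vector would feed the answer-generation head.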