Deep Memory Fusion Model for Long Video Question Answering

Long video question answering involves rich multimodal semantic and inference information. At present, it is difficult for video question answering models based on recurrent neural networks to fully retain important memory information, to ignore irrelevant redundant information, and to achieve efficient fusion of memory information. To solve this problem, a deep memory fusion model based on the memory network is proposed. The model uses the memory component of the memory network to effectively retain the fused features of video clips and subtitles. A multimodal similarity matching method is proposed to filter out redundant memory information. After a first fusion based on a convolutional network and a secondary fusion based on an attention mechanism, a context representation of the entire video is built for answer generation. The model is evaluated on the MovieQA dataset and achieves an average accuracy of 39.78%, nearly 10% higher than traditional methods and nearly 5% higher than the state-of-the-art method. The accuracy is significantly improved, and the generalization performance is good.
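
The abstract outlines a four-stage pipeline: fuse clip and subtitle features into memory slots, filter redundant slots by multimodal similarity to the question, fuse once with a convolutional network, and fuse again with question-guided attention before scoring candidate answers. The sketch below illustrates one way such a pipeline could be wired up in PyTorch; every module name, dimension, and threshold here is an assumption for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch of the pipeline described in the abstract.
# All layers, dimensions, and the similarity threshold are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepMemoryFusionSketch(nn.Module):
    def __init__(self, dim=512, sim_threshold=0.2):
        super().__init__()
        self.fuse_clip_sub = nn.Linear(2 * dim, dim)          # clip + subtitle -> memory slot
        self.conv_fuse = nn.Conv1d(dim, dim, kernel_size=3, padding=1)           # first fusion
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)    # secondary fusion
        self.classifier = nn.Linear(2 * dim, 1)               # score each candidate answer
        self.sim_threshold = sim_threshold

    def forward(self, clip_feats, sub_feats, question, answers):
        # clip_feats, sub_feats: (B, T, dim); question: (B, dim); answers: (B, A, dim)
        memory = torch.tanh(self.fuse_clip_sub(torch.cat([clip_feats, sub_feats], dim=-1)))

        # Multimodal similarity matching: suppress memory slots weakly related to the question.
        sim = F.cosine_similarity(memory, question.unsqueeze(1), dim=-1)          # (B, T)
        memory = memory * (sim > self.sim_threshold).float().unsqueeze(-1)

        # First fusion: temporal convolution over the retained memory slots.
        fused = self.conv_fuse(memory.transpose(1, 2)).transpose(1, 2)            # (B, T, dim)

        # Secondary fusion: question-guided attention builds the video context representation.
        context, _ = self.attn(question.unsqueeze(1), fused, fused)               # (B, 1, dim)
        context = context.squeeze(1)

        # Score each candidate answer against the context (multiple-choice QA).
        pair = torch.cat([context.unsqueeze(1).expand_as(answers), answers], dim=-1)
        return self.classifier(pair).squeeze(-1)                                  # (B, A)
```

For MovieQA-style multiple-choice evaluation, such a model would typically be trained with a cross-entropy loss over the candidate-answer scores.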

Bibliographic Details
Main Authors: SUN Guanglu, WU Meng, QIU Jing, LIANG Lili (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)
Format: Article
Language: Chinese (zho)
Published: Harbin University of Science and Technology Publications, 2021-02-01
Series: Journal of Harbin University of Science and Technology, Vol. 26, No. 1, pp. 1-8
ISSN: 1007-2683
DOI: 10.15938/j.jhust.2021.01.001
Subjects: video question answering; long video understanding; memory network; attention mechanism; multimodal fusion
Online Access: https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911
Collection: DOAJ