Deep Memory Fusion Model for Long Video Question Answering
Long video question answering contains rich multimodal semantic information and inference information. At present, it is difficult for video question answering models based on recurrent neural networks to fully retain important memory information, to ignore irrelevant redundant information and to ac...
Saved in:
| Main Authors: | SUN Guanglu, WU Meng, QIU Jing, LIANG Lili |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Harbin University of Science and Technology Publications, 2021-02-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | video question answering; long video understanding; memory network; attention mechanism; multimodal fusion |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911 |
| Tags: | No Tags |
| _version_ | 1850044104844509184 |
|---|---|
| author | SUN Guanglu WU Meng QIU Jing LIANG Lili |
| author_facet | SUN Guanglu WU Meng QIU Jing LIANG Lili |
| author_sort | SUN Guanglu |
| collection | DOAJ |
| description | Long video question answering involves rich multimodal semantic and inference information. At present, video question answering models based on recurrent neural networks have difficulty fully retaining important memory information, ignoring irrelevant redundant information, and achieving efficient fusion of memory information. To solve this problem, a deep memory fusion model is proposed based on the memory network. This model uses the memory component of the memory network to effectively retain the fused features of video clips and subtitles. A multimodal similarity matching method is proposed to filter redundant memory information. After a first fusion based on a convolutional network and a secondary fusion based on an attention mechanism, a context representation of the entire video is built for answer generation. The model is tested on the MovieQA dataset. The average accuracy is 39.78%, which is nearly 10% higher than traditional methods and nearly 5% higher than the state-of-the-art method. The accuracy is significantly improved, and the generalization performance is good. |
| format | Article |
| id | doaj-art-c6ffd43332a44872b575a009894d80f9 |
| institution | DOAJ |
| issn | 1007-2683 |
| language | zho |
| publishDate | 2021-02-01 |
| publisher | Harbin University of Science and Technology Publications |
| record_format | Article |
| series | Journal of Harbin University of Science and Technology |
| spelling | doaj-art-c6ffd43332a44872b575a009894d80f9; 2025-08-20T02:55:03Z; zho; Harbin University of Science and Technology Publications; Journal of Harbin University of Science and Technology; ISSN 1007-2683; 2021-02-01; Vol. 26, No. 01, pp. 1-8; DOI 10.15938/j.jhust.2021.01.001; Deep Memory Fusion Model for Long Video Question Answering; SUN Guanglu, WU Meng, QIU Jing, LIANG Lili (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China); https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911; video question answering; long video understanding; memory network; attention mechanism; multimodal fusion |
| spellingShingle | SUN Guanglu WU Meng QIU Jing LIANG Lili Deep Memory Fusion Model for Long Video Question Answering Journal of Harbin University of Science and Technology video question answering long video understanding memory network attention mechanism multimodal fusion |
| title | Deep Memory Fusion Model for Long Video Question Answering |
| title_full | Deep Memory Fusion Model for Long Video Question Answering |
| title_fullStr | Deep Memory Fusion Model for Long Video Question Answering |
| title_full_unstemmed | Deep Memory Fusion Model for Long Video Question Answering |
| title_short | Deep Memory Fusion Model for Long Video Question Answering |
| title_sort | deep memory fusion model for long video question answering |
| topic | video question answering; long video understanding; memory network; attention mechanism; multimodal fusion |
| url | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911 |
| work_keys_str_mv | AT sunguanglu deepmemoryfusionmodelforlongvideoquestionanswering AT wumeng deepmemoryfusionmodelforlongvideoquestionanswering AT qiujing deepmemoryfusionmodelforlongvideoquestionanswering AT lianglili deepmemoryfusionmodelforlongvideoquestionanswering |
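The abstract's pipeline (question-guided filtering of fused clip+subtitle memories, followed by attention-based fusion into a video context used to score candidate answers) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the function `answer_question`, the cosine-similarity threshold, and the single-head attention are all hypothetical simplifications of the model described above.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length for cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_question(memories, question, answers, sim_threshold=0.1):
    """Hypothetical sketch of memory filtering + attention fusion.

    memories: (N, d) fused clip+subtitle features, one slot per segment
    question: (d,)   question embedding
    answers:  (K, d) candidate answer embeddings
    Returns the index of the best-scoring candidate answer.
    """
    # 1) Multimodal similarity matching: drop memory slots whose cosine
    #    similarity to the question falls below a threshold (redundancy filter).
    sims = l2norm(memories) @ l2norm(question)
    kept = memories[sims >= sim_threshold]
    if kept.shape[0] == 0:
        kept = memories  # fall back to all memories if everything is filtered

    # 2) Attention fusion: question-guided weighted sum of the remaining
    #    memories builds the context representation of the whole video.
    attn = softmax(l2norm(kept) @ l2norm(question))
    context = attn @ kept

    # 3) Score each candidate answer against question + context.
    scores = l2norm(answers) @ l2norm(question + context)
    return int(np.argmax(scores))
```

In this sketch, segments unrelated to the question (low cosine similarity) never enter the attention step, which mirrors the paper's stated goal of ignoring irrelevant redundant memory before fusing the rest into a context vector.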