Deep Memory Fusion Model for Long Video Question Answering
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Harbin University of Science and Technology Publications, 2021-02-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=1911 |
| Summary: | Long video question answering involves rich multimodal semantic and inference information. Current video question answering models based on recurrent neural networks struggle to fully retain important memory information, to ignore irrelevant redundant information, and to fuse memory information efficiently. To address this, a deep memory fusion model based on the memory network is proposed. The model uses the memory component of the memory network to retain the fused features of video clips and subtitles, and a multimodal similarity matching method is proposed to filter out redundant memory information. After a first fusion based on a convolutional network and a secondary fusion based on an attention mechanism, a context representation of the entire video is built for answer generation. Evaluated on the MovieQA dataset, the model achieves an average accuracy of 39.78%, nearly 10 percentage points higher than traditional methods and nearly 5 points higher than the state-of-the-art method, with good generalization performance. |
|---|---|
| ISSN: | 1007-2683 |
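The pipeline the summary describes (retain clip/subtitle memory features, filter them by similarity to the question, then fuse the survivors with attention into a context vector) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity filter stands in for the multimodal similarity matching method, the threshold value is arbitrary, and the convolutional first-fusion stage is assumed to have already produced the memory matrix.

```python
import numpy as np

def cosine_sim(query, memories):
    # Cosine similarity between the question vector and each memory slot.
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memories / (np.linalg.norm(memories, axis=1, keepdims=True) + 1e-8)
    return m @ q

def filter_memories(memories, query, threshold=0.2):
    """Drop memory slots dissimilar to the question -- a stand-in for
    the paper's multimodal similarity matching (threshold is a guess)."""
    sims = cosine_sim(query, memories)
    keep = sims > threshold
    if not keep.any():
        keep = sims == sims.max()  # always retain at least the best slot
    return memories[keep]

def attention_fusion(memories, query):
    """Secondary fusion: softmax attention over the retained slots,
    yielding one context vector for answer generation."""
    scores = memories @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memories

rng = np.random.default_rng(0)
memories = rng.standard_normal((6, 8))  # 6 fused clip+subtitle features
query = rng.standard_normal(8)          # encoded question

kept = filter_memories(memories, query)
context = attention_fusion(kept, query)
print(context.shape)  # (8,)
```

In the paper's setting the memory rows would come from the convolutional fusion of video-clip and subtitle features rather than random data, and the context vector would feed the answer-generation head.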