MSAM: Video Question Answering Based on a Multi-Stage Attention Model
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Harbin University of Science and Technology Publications, 2022-08-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123 |
| Summary: | The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, VideoQA methods based on attention models struggle to fully understand and accurately locate the video information relevant to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. The network extracts multi-modal features (video, audio, and text) and feeds them into the multi-stage attention model (MSAM), which accurately locates the relevant video information through stage-by-stage localization. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to compute correlations between features of different modalities. The network is evaluated on the ZJL dataset, achieving an average accuracy of 54.3%, nearly 15% higher than traditional methods and nearly 7% higher than existing methods. |
|---|---|
| ISSN: | 1007-2683 |
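The abstract does not specify how the triple-modal compact concat bilinear (TCCB) fusion is computed. As a rough illustration only, a triple-modal extension of the standard compact bilinear pooling recipe (count-sketch projection, FFT, and elementwise product in the frequency domain) might look like the sketch below; all function names, dimensions, and the three-way spectral product are assumptions, not the paper's actual algorithm.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into a d-dim count sketch using hash h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed bins
    return y

def triple_compact_bilinear(video, audio, text, d=512, seed=0):
    """Hypothetical three-modality compact bilinear fusion (MCB-style sketch).

    The outer product of the three feature vectors is approximated by
    multiplying their count-sketch spectra elementwise, avoiding the
    O(n^3) explicit tensor.
    """
    rng = np.random.default_rng(seed)
    spectra = []
    for x in (video, audio, text):
        h = rng.integers(0, d, size=x.shape[0])        # random bin per entry
        s = rng.choice([-1.0, 1.0], size=x.shape[0])   # random sign per entry
        spectra.append(np.fft.rfft(count_sketch(x, h, s, d)))
    # Elementwise product in frequency space ~ circular convolution of sketches
    fused = np.fft.irfft(spectra[0] * spectra[1] * spectra[2], n=d)
    return fused

# Usage: fuse feature vectors of different lengths into one d-dim vector.
fused = triple_compact_bilinear(np.ones(128), np.ones(64), np.ones(300))
```

The count-sketch/FFT trick keeps the fused representation compact regardless of the input dimensionalities, which is the usual motivation for compact bilinear methods in multi-modal QA.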