MSAM: Video Question Answering Based on Multi-Stage Attention Model
The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for VideoQA methods based on attention models to fully understand and accurately locate the video information related to the question.
| Main Authors: | LIANG Li-li, LIU Xin-yu, SUN Guang-lu, ZHU Su-xia |
|---|---|
| Format: | Article |
| Language: | zho |
| Published: | Harbin University of Science and Technology Publications, 2022-08-01 |
| Series: | Journal of Harbin University of Science and Technology |
| Subjects: | video question answering; multi-stage attention model; multi-modal feature fusion |
| Online Access: | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123 |
| _version_ | 1849713292609585152 |
|---|---|
| author | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia |
| author_facet | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia |
| author_sort | LIANG Li-li |
| collection | DOAJ |
| description | The video question answering (VideoQA) task requires understanding the semantic information of both the video and the question to generate an answer. At present, it is difficult for VideoQA methods based on attention models to fully understand and accurately locate the video information related to the question. To solve this problem, a multi-stage attention model network (MSAMN) is proposed. This network extracts multi-modal features such as video, audio, and text, and feeds these features into the multi-stage attention model (MSAM), which accurately locates the relevant video information through a stage-by-stage localization method. To improve the effectiveness of feature fusion, a triple-modal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between different modal features. The network is tested on the ZJL dataset. The average accuracy is 54.3%, which is nearly 15% higher than traditional methods and nearly 7% higher than existing methods. |
| format | Article |
| id | doaj-art-56ff8cac676b4e11bc465a4afd92add8 |
| institution | DOAJ |
| issn | 1007-2683 |
| language | zho |
| publishDate | 2022-08-01 |
| publisher | Harbin University of Science and Technology Publications |
| record_format | Article |
| series | Journal of Harbin University of Science and Technology |
| spelling | doaj-art-56ff8cac676b4e11bc465a4afd92add82025-08-20T03:13:59ZzhoHarbin University of Science and Technology PublicationsJournal of Harbin University of Science and Technology1007-26832022-08-01270410711710.15938/j.jhust.2022.04.014MSAM:Video Question Answering Based on Multi-Stage Attention ModelLIANG Li-li0LIU Xin-yu1SUN Guang-lu2ZHU Su-xia3School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaSchool of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, ChinaThe video question answering (VideoQA) task requires understanding of semantic information of both the video and question to generate the answer.At present, it is difficult for VideoQA methods that are based on attention model to fully understand and accurately locate video information related to the question.To solve this problem, a multi-stage attention model network (MSAMN) is proposed. This network extracts multi-modal features such as video, audio and text and feeds these features into the multi-stage attention model (MSAM), which is able to accurately locate the video information through a stage-by-stage localization method.In order to improve the effectiveness of feature fusion, a triplemodal compact concat bilinear (TCCB) algorithm is proposed to calculate the correlation between different modal features.This network is tested on the ZJL dataset.The average accuracy rate is 54.3%, which is nearly 15% higher than the traditional method and nearly 7% higher than the exist method.https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123video question answeringmulti-stage attention modelmulti-modal feature fusion |
| spellingShingle | LIANG Li-li LIU Xin-yu SUN Guang-lu ZHU Su-xia MSAM:Video Question Answering Based on Multi-Stage Attention Model Journal of Harbin University of Science and Technology video question answering multi-stage attention model multi-modal feature fusion |
| title | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_full | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_fullStr | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_full_unstemmed | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_short | MSAM:Video Question Answering Based on Multi-Stage Attention Model |
| title_sort | msam video question answering based on multi stage attention model |
| topic | video question answering multi-stage attention model multi-modal feature fusion |
| url | https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2123 |
| work_keys_str_mv | AT lianglili msamvideoquestionansweringbasedonmultistageattentionmodel AT liuxinyu msamvideoquestionansweringbasedonmultistageattentionmodel AT sunguanglu msamvideoquestionansweringbasedonmultistageattentionmodel AT zhusuxia msamvideoquestionansweringbasedonmultistageattentionmodel |
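The abstract names a triple-modal compact concat bilinear (TCCB) fusion step but the record gives no implementation details. As an illustration only, the sketch below shows compact bilinear pooling in the style of MCB (count-sketch projection followed by elementwise products in the FFT domain) extended to three modalities; the names `tccb_fuse` and `count_sketch`, and the chosen dimensions, are assumptions for the sketch, not the paper's actual algorithm.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dims with a Count Sketch: index hash h, sign hash s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed buckets
    return y

rng = np.random.default_rng(0)
d_in, d_out = 512, 1024
# One (index, sign) hash pair per modality: video, audio, text.
hashes = [(rng.integers(0, d_out, size=d_in), rng.choice([-1.0, 1.0], size=d_in))
          for _ in range(3)]

def tccb_fuse(video, audio, text):
    """Fuse three modality vectors by multiplying their sketches' FFTs,
    i.e. circular convolution of the three sketches (MCB-style)."""
    prod = np.ones(d_out, dtype=complex)
    for x, (h, s) in zip((video, audio, text), hashes):
        prod *= np.fft.fft(count_sketch(x, h, s, d_out))
    return np.real(np.fft.ifft(prod))

v, a, t = (rng.standard_normal(d_in) for _ in range(3))
fused = tccb_fuse(v, a, t)
print(fused.shape)  # (1024,)
```

The FFT-domain product is what makes the pooling "compact": it approximates the outer product of the modality features in `d_out` dimensions instead of materializing a `d_in**3` tensor.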