Global context-aware attention model for weakly-supervised temporal action localization
Temporal action localization (TAL) is a significant and challenging task in video understanding: it aims to locate the start and end timestamps of the actions in a video and recognize their categories. However, accurate action localization typically requires extensive, precise annotations. Researchers therefore proposed weakly-supervised temporal action localization (WTAL), which locates action instances using only video-level annotations. Existing WTAL methods cannot effectively distinguish action context information, including pre-action and post-action scenes, which blurs action boundaries and leads to inaccurate localization. To solve these problems, this paper proposes a global context-aware attention model (GCAM). First, GCAM designs a mask attention module (MAM) that restricts the model's receptive field so that it focuses on localized features related to the action context; this sharpens the model's ability to distinguish context information and to locate the start and end timestamps of actions precisely. Second, GCAM introduces a context broadcasting module (CBM) that supplements global context information to keep the features intact along the temporal dimension, counteracting the tendency of MAM to overemphasize localized features. Extensive experiments on the THUMOS14 and ActivityNet1.2 datasets demonstrate the effectiveness of GCAM: it achieves an average mean average precision (mAP) of 49.5% on THUMOS14 and 27.2% on ActivityNet1.2, improvements of 2.2% and 0.3%, respectively, over existing WTAL methods. These results highlight the superior performance of GCAM in accurately localizing actions in videos.
| Main Authors: | Weina Fu, Wenxiang Zhang, Jing Long, Gautam Srivastava, Shuai Liu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Alexandria Engineering Journal, Vol. 127, pp. 43–55 |
| ISSN: | 1110-0168 |
| DOI: | 10.1016/j.aej.2025.05.006 |
| Affiliations: | Hunan Normal University (Changsha, China); Brandon University (Brandon, Canada); China Medical University (Taichung, Taiwan); Chitkara University (Rajpura, India) |
| Subjects: | Temporal action localization; Weakly-supervised learning; Global context-aware; Mask attention; Action context |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1110016825006179 |
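The abstract describes two mechanisms: a mask attention module (MAM) that limits each temporal position's receptive field to nearby snippets, and a context broadcasting module (CBM) that re-injects global context. As a minimal illustration of these two ideas, the PyTorch sketch below applies a banded attention mask over snippet features and then adds the temporal mean back to every position. The class names, window size, and feature dimensions are assumptions made for the example; the paper's actual MAM and CBM designs are specified in the article itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLocalAttention(nn.Module):
    """Illustrative stand-in for a mask attention module: self-attention
    over T snippet features, masked so each snippet attends only to a
    local temporal window (hypothetical design, not the paper's exact MAM)."""
    def __init__(self, dim: int, window: int = 7):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = q @ k.transpose(-2, -1) / D ** 0.5         # (B, T, T) scores
        # Band mask: position i may attend only to positions j with
        # |i - j| <= window, restricting the receptive field.
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window
        attn = attn.masked_fill(~band, float("-inf"))
        return self.proj(F.softmax(attn, dim=-1) @ v)

class ContextBroadcast(nn.Module):
    """Illustrative context broadcasting: add the temporal mean feature
    back to every snippet so global context survives the local masking."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        return x + x.mean(dim=1, keepdim=True)

# Usage sketch: 320 snippets of 2048-d features (e.g. I3D), batch of 2.
snippets = torch.randn(2, 320, 2048)
local = MaskedLocalAttention(dim=2048, window=7)(snippets)
fused = ContextBroadcast()(local)                         # (2, 320, 2048)
```

The sketch captures the division of labor the abstract argues for: the attention mask sharpens sensitivity to local action boundaries, while the broadcast step prevents the masked model from losing the video-level context it needs for classification.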