Manet: motion-aware network for video action recognition
Abstract Video action recognition is a fundamental task in video understanding. Actions in videos may vary at different speeds or scales, and it is difficult to cope with a wide variety of actions by relying on a single spatio-temporal scale to extract features. To address this problem, we propose a...
Saved in:
Main Authors: | Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Chen Zhang |
Format: | Article |
Language: | English |
Published: | Springer, 2025-02-01 |
Series: | Complex & Intelligent Systems |
Subjects: | Video action recognition; Multiple temporal aggregation; Feature excitation; Local motion encoding |
Online Access: | https://doi.org/10.1007/s40747-024-01774-9 |
_version_ | 1823861511226916864 |
author | Xiaoyang Li; Wenzhu Yang; Kanglin Wang; Tiebiao Wang; Chen Zhang |
author_facet | Xiaoyang Li; Wenzhu Yang; Kanglin Wang; Tiebiao Wang; Chen Zhang |
author_sort | Xiaoyang Li |
collection | DOAJ |
description | Abstract Video action recognition is a fundamental task in video understanding. Actions in videos may vary at different speeds or scales, and it is difficult to cope with a wide variety of actions by relying on a single spatio-temporal scale to extract features. To address this problem, we propose a Motion-Aware Network (MANet), which includes three key modules: (1) Local Motion Encoding Module (LMEM) for capturing local motion features, (2) Spatio-Temporal Excitation Module (STEM) for extracting multi-granular motion information, and (3) Multiple Temporal Aggregation Module (MTAM) for modeling multi-scale temporal information. The MANet, equipped with these modules, can capture multi-granularity spatio-temporal cues. We conducted extensive experiments on five mainstream datasets, Something-Something V1 & V2, Jester, Diving48, and UCF-101, to validate the effectiveness of MANet. The MANet achieves competitive performance on Something-Something V1 (52.5%), Something-Something V2 (63.6%), Jester (95.9%), Diving48 (81.8%) and UCF-101 (86.2%). In addition, we visualize the feature representation of the MANet using Grad-CAM to validate its effectiveness. |
format | Article |
id | doaj-art-dffdeef3ae174bb1bf4aef6ef336a795 |
institution | Kabale University |
issn | 2199-4536; 2198-6053 |
language | English |
publishDate | 2025-02-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | doaj-art-dffdeef3ae174bb1bf4aef6ef336a795; 2025-02-09T13:00:54Z; eng; Springer; Complex & Intelligent Systems; 2199-4536; 2198-6053; 2025-02-01; 11; 3; 1; 18; 10.1007/s40747-024-01774-9; Manet: motion-aware network for video action recognition; Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Chen Zhang (School of Cyber Security and Computer, Hebei University); https://doi.org/10.1007/s40747-024-01774-9; Video action recognition; Multiple temporal aggregation; Feature excitation; Local motion encoding |
spellingShingle | Xiaoyang Li; Wenzhu Yang; Kanglin Wang; Tiebiao Wang; Chen Zhang; Manet: motion-aware network for video action recognition; Complex & Intelligent Systems; Video action recognition; Multiple temporal aggregation; Feature excitation; Local motion encoding |
title | Manet: motion-aware network for video action recognition |
title_full | Manet: motion-aware network for video action recognition |
title_fullStr | Manet: motion-aware network for video action recognition |
title_full_unstemmed | Manet: motion-aware network for video action recognition |
title_short | Manet: motion-aware network for video action recognition |
title_sort | manet motion aware network for video action recognition |
topic | Video action recognition; Multiple temporal aggregation; Feature excitation; Local motion encoding |
url | https://doi.org/10.1007/s40747-024-01774-9 |
work_keys_str_mv | AT xiaoyangli manetmotionawarenetworkforvideoactionrecognition AT wenzhuyang manetmotionawarenetworkforvideoactionrecognition AT kanglinwang manetmotionawarenetworkforvideoactionrecognition AT tiebiaowang manetmotionawarenetworkforvideoactionrecognition AT chenzhang manetmotionawarenetworkforvideoactionrecognition |
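The abstract names three modules (LMEM, STEM, MTAM) but this record carries no implementation details. As a rough illustration of the general idea behind a spatio-temporal excitation block of the kind the abstract describes, here is a minimal PyTorch sketch; the class name, the `(N*T, C, H, W)` tensor layout, and all internals (temporal-difference features, squeeze-and-expand channel gating) are assumptions modeled on common motion-excitation designs, not the authors' code.

```python
import torch
import torch.nn as nn


class SpatioTemporalExcitation(nn.Module):
    """Hypothetical sketch of a motion-excitation block (not the paper's code).

    Input:  x of shape (N*T, C, H, W), frames of T-frame clips stacked along batch.
    Output: same shape, with channels re-weighted by motion-derived attention.
    """

    def __init__(self, channels: int, n_frames: int, reduction: int = 16):
        super().__init__()
        self.n_frames = n_frames
        hidden = max(channels // reduction, 1)
        # Squeeze channels before temporal reasoning, then restore them.
        self.squeeze = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.expand = nn.Conv2d(hidden, channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        t = self.n_frames
        n = nt // t
        s = self.squeeze(x)                          # (N*T, C/r, H, W)
        s = s.view(n, t, -1, h, w)
        # Forward temporal difference approximates local motion;
        # pad the last step with zeros to keep T steps.
        diff = s[:, 1:] - s[:, :-1]                  # (N, T-1, C/r, H, W)
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        diff = diff.view(nt, -1, h, w)
        # Spatial average pool -> per-channel motion descriptor.
        attn = diff.mean(dim=(2, 3), keepdim=True)   # (N*T, C/r, 1, 1)
        attn = self.sigmoid(self.expand(attn))       # (N*T, C, 1, 1)
        return x * attn


# Usage sketch: excite one backbone stage's features for 8-frame clips.
if __name__ == "__main__":
    x = torch.randn(2 * 8, 64, 56, 56)               # 2 clips x 8 frames
    block = SpatioTemporalExcitation(channels=64, n_frames=8)
    print(block(x).shape)                            # torch.Size([16, 64, 56, 56])
```

Blocks like this are typically inserted into each residual stage of a 2D-CNN backbone so that nearly static channels are suppressed and motion-sensitive channels are amplified; whether MANet's STEM follows this exact pattern cannot be confirmed from the record alone.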