Manet: motion-aware network for video action recognition

Bibliographic Details
Main Authors: Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Chen Zhang
Format: Article
Language: English
Published: Springer 2025-02-01
Series: Complex & Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1007/s40747-024-01774-9
_version_ 1823861511226916864
author Xiaoyang Li
Wenzhu Yang
Kanglin Wang
Tiebiao Wang
Chen Zhang
author_facet Xiaoyang Li
Wenzhu Yang
Kanglin Wang
Tiebiao Wang
Chen Zhang
author_sort Xiaoyang Li
collection DOAJ
description Abstract Video action recognition is a fundamental task in video understanding. Actions in videos may vary at different speeds or scales, and it is difficult to cope with a wide variety of actions by relying on a single spatio-temporal scale to extract features. To address this problem, we propose a Motion-Aware Network (MANet), which includes three key modules: (1) Local Motion Encoding Module (LMEM) for capturing local motion features, (2) Spatio-Temporal Excitation Module (STEM) for extracting multi-granular motion information, and (3) Multiple Temporal Aggregation Module (MTAM) for modeling multi-scale temporal information. The MANet, equipped with these modules, can capture multi-granularity spatio-temporal cues. We conducted extensive experiments on five mainstream datasets, Something-Something V1 & V2, Jester, Diving48, and UCF-101, to validate the effectiveness of MANet. The MANet achieves competitive performance on Something-Something V1 (52.5%), Something-Something V2 (63.6%), Jester (95.9%), Diving48 (81.8%) and UCF-101 (86.2%). In addition, we visualize the feature representation of the MANet using Grad-CAM to validate its effectiveness.
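The abstract above attributes MANet's multi-scale temporal modeling to its Multiple Temporal Aggregation Module (MTAM). The paper's actual implementation is not reproduced in this record; purely as a hedged illustration of the generic aggregate-at-several-scales-then-fuse pattern, here is a toy sketch in plain Python (function names, window sizes, and the fusion-by-averaging choice are all assumptions, not details from the paper):

```python
# Toy sketch (hypothetical, not the paper's code): smooth a per-frame
# feature sequence with moving averages at several temporal window sizes,
# then fuse the scales by element-wise averaging.

def moving_average(seq, window):
    """Average each frame's value over a centered temporal window,
    clamping the window at the sequence boundaries."""
    out = []
    half = window // 2
    for t in range(len(seq)):
        lo = max(0, t - half)
        hi = min(len(seq), t + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def multi_scale_aggregate(seq, windows=(1, 3, 5)):
    """One smoothed copy of the sequence per window size (i.e. per
    temporal scale), fused by averaging across scales."""
    scales = [moving_average(seq, w) for w in windows]
    return [sum(vals) / len(vals) for vals in zip(*scales)]

features = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # made-up per-frame features
fused = multi_scale_aggregate(features)
```

In the real network each scale would typically be a learned temporal convolution over feature maps rather than a fixed moving average; the sketch only shows the shared pattern of computing temporal context at several scales and combining them.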
format Article
id doaj-art-dffdeef3ae174bb1bf4aef6ef336a795
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-02-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-dffdeef3ae174bb1bf4aef6ef336a7952025-02-09T13:00:54ZengSpringerComplex & Intelligent Systems2199-45362198-60532025-02-0111311810.1007/s40747-024-01774-9Manet: motion-aware network for video action recognitionXiaoyang Li0Wenzhu Yang1Kanglin Wang2Tiebiao Wang3Chen Zhang4School of Cyber Security and Computer, Hebei UniversitySchool of Cyber Security and Computer, Hebei UniversitySchool of Cyber Security and Computer, Hebei UniversitySchool of Cyber Security and Computer, Hebei UniversitySchool of Cyber Security and Computer, Hebei UniversityAbstract Video action recognition is a fundamental task in video understanding. Actions in videos may vary at different speeds or scales, and it is difficult to cope with a wide variety of actions by relying on a single spatio-temporal scale to extract features. To address this problem, we propose a Motion-Aware Network (MANet), which includes three key modules: (1) Local Motion Encoding Module (LMEM) for capturing local motion features, (2) Spatio-Temporal Excitation Module (STEM) for extracting multi-granular motion information, and (3) Multiple Temporal Aggregation Module (MTAM) for modeling multi-scale temporal information. The MANet, equipped with these modules, can capture multi-granularity spatio-temporal cues. We conducted extensive experiments on five mainstream datasets, Something-Something V1 & V2, Jester, Diving48, and UCF-101, to validate the effectiveness of MANet. The MANet achieves competitive performance on Something-Something V1 (52.5%), Something-Something V2 (63.6%), Jester (95.9%), Diving48 (81.8%) and UCF-101 (86.2%). In addition, we visualize the feature representation of the MANet using Grad-CAM to validate its effectiveness.https://doi.org/10.1007/s40747-024-01774-9Video action recognitionMultiple temporal aggregationFeature excitationLocal motion encoding
spellingShingle Xiaoyang Li
Wenzhu Yang
Kanglin Wang
Tiebiao Wang
Chen Zhang
Manet: motion-aware network for video action recognition
Complex & Intelligent Systems
Video action recognition
Multiple temporal aggregation
Feature excitation
Local motion encoding
title Manet: motion-aware network for video action recognition
title_full Manet: motion-aware network for video action recognition
title_fullStr Manet: motion-aware network for video action recognition
title_full_unstemmed Manet: motion-aware network for video action recognition
title_short Manet: motion-aware network for video action recognition
title_sort manet motion aware network for video action recognition
topic Video action recognition
Multiple temporal aggregation
Feature excitation
Local motion encoding
url https://doi.org/10.1007/s40747-024-01774-9
work_keys_str_mv AT xiaoyangli manetmotionawarenetworkforvideoactionrecognition
AT wenzhuyang manetmotionawarenetworkforvideoactionrecognition
AT kanglinwang manetmotionawarenetworkforvideoactionrecognition
AT tiebiaowang manetmotionawarenetworkforvideoactionrecognition
AT chenzhang manetmotionawarenetworkforvideoactionrecognition