YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences
Automated action recognition has become essential in surveillance, healthcare, and multimedia retrieval owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that extends the object detection capabilities of YOLOv8 to efficiently handle complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. By adaptively selecting three keyframes that represent the beginning, middle, and end of an action, the model captures the essential temporal dynamics without the computational overhead of processing every frame. In experiments on the AVA dataset, YOLO-Act outperforms state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), reaching a mean average precision (mAP) of 73.28, a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly fewer FLOPs, demonstrating its computational efficiency. The results highlight the benefits of combining precise tracking, effective spatial detection, and temporal consistency to address the challenges of video-based action detection.
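The abstract outlines a three-stage pipeline: pick three keyframes per action clip, detect actors in each keyframe, link the detections across keyframes, and fuse per-track class scores. The sketch below is one plausible, minimal reading of that pipeline, not the authors' implementation: the `Detection` type, `keyframe_indices`, `iou`, and `fuse_tracks` names are hypothetical placeholders, and the greedy IoU matcher and score averaging stand in for whatever tracker and class-fusion rule YOLO-Act actually uses.

```python
# Hedged sketch of the pipeline the abstract describes. All names here are
# illustrative placeholders, not the authors' code.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) pixel coordinates
    scores: dict  # action class name -> confidence in [0, 1]

def keyframe_indices(n_frames: int) -> list:
    """Select the beginning, middle, and end frames of an action clip."""
    return [0, n_frames // 2, n_frames - 1]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def fuse_tracks(per_keyframe_dets, iou_thr=0.5):
    """Greedily link detections across keyframes by IoU, then average each
    track's class scores (one simple form of class fusion)."""
    tracks = [[d] for d in per_keyframe_dets[0]]
    for dets in per_keyframe_dets[1:]:
        for d in dets:
            best = max(tracks, key=lambda t: iou(t[-1].box, d.box), default=None)
            if best is not None and iou(best[-1].box, d.box) >= iou_thr:
                best.append(d)      # same actor carried forward
            else:
                tracks.append([d])  # new actor enters the scene
    fused = []
    for track in tracks:
        acc = defaultdict(float)
        for d in track:
            for cls, s in d.scores.items():
                acc[cls] += s / len(track)  # mean confidence per class
        top = max(acc, key=acc.get)
        fused.append((track[-1].box, top, acc[top]))
    return fused

if __name__ == "__main__":
    # Toy example: one actor seen in all three keyframes of a 90-frame clip.
    print(keyframe_indices(90))  # -> [0, 45, 89]
    dets = [
        [Detection((10, 10, 50, 120), {"stand": 0.7, "walk": 0.2})],
        [Detection((12, 11, 52, 121), {"walk": 0.6, "stand": 0.3})],
        [Detection((15, 12, 55, 122), {"walk": 0.8})],
    ]
    print(fuse_tracks(dets))  # one track; "walk" wins after averaging
```

Averaging scores over the track is only one possible fusion rule; the paper's class-fusion step and its adaptive keyframe selection may well differ from this fixed start/middle/end choice.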
| Main Authors: | Nada Alzahrani, Ouiem Bchir, Mohamed Maher Ben Ismail |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Sensors |
| Subjects: | action detection; keyframe extraction; fusion technique; spatiotemporal information; you only look once (YOLO) |
| Online Access: | https://www.mdpi.com/1424-8220/25/10/3013 |
| id | doaj-art-4de1d8d2e2af42c3a1233c8259109176 |
|---|---|
| collection | DOAJ |
| institution | Kabale University |
| issn | 1424-8220 |
| doi | 10.3390/s25103013 |
| citation | Sensors, vol. 25, no. 10, art. 3013 (2025) |
| affiliation | Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia |