YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences

Automated action recognition has become essential in surveillance, healthcare, and multimedia retrieval owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that extends the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. By adaptively selecting three keyframes representing the beginning, middle, and end of an action, the model captures essential temporal dynamics without the computational overhead of continuous frame processing. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its computational efficiency. The results highlight the advantages of combining precise tracking, effective spatial detection, and temporal consistency to address the challenges of video-based action detection.
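The abstract describes a three-keyframe scheme (beginning, middle, and end of an action) followed by a class-fusion step. The sketch below illustrates those two stages in isolation; it is a minimal illustration, not the paper's implementation. The fusion rule is not specified in the record, so uniform averaging of per-keyframe class scores is assumed, and all function names are hypothetical.

```python
import numpy as np

def select_keyframes(frame_indices):
    """Pick three keyframes (start, middle, end) of an action segment,
    following the three-keyframe scheme described in the abstract."""
    start = frame_indices[0]
    middle = frame_indices[len(frame_indices) // 2]
    end = frame_indices[-1]
    return [start, middle, end]

def fuse_class_scores(per_keyframe_scores):
    """Fuse per-keyframe class score vectors into one action prediction.
    Averaging is an assumption here; the record does not name the rule."""
    fused = np.mean(np.stack(per_keyframe_scores), axis=0)
    return int(np.argmax(fused)), fused

# Example: an action spanning frames 40..79, scored over 3 action classes
keys = select_keyframes(list(range(40, 80)))
scores = [np.array([0.2, 0.7, 0.1]),   # scores at the start keyframe
          np.array([0.1, 0.8, 0.1]),   # scores at the middle keyframe
          np.array([0.3, 0.5, 0.2])]   # scores at the end keyframe
label, fused = fuse_class_scores(scores)
print(keys, label)
```

In a full pipeline each keyframe would be passed through the detector, with tracking associating the resulting boxes across keyframes before fusion.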

Bibliographic Details
Main Authors: Nada Alzahrani, Ouiem Bchir, Mohamed Maher Ben Ismail
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Sensors
Subjects: action detection; keyframe extraction; fusion technique; spatiotemporal information; you only look once (YOLO)
Online Access: https://www.mdpi.com/1424-8220/25/10/3013
author Nada Alzahrani
Ouiem Bchir
Mohamed Maher Ben Ismail
collection DOAJ
description Automated action recognition has become essential in the surveillance, healthcare, and multimedia retrieval industries owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that enhances the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. The model depicts essential temporal dynamics without the computational overhead of continuous frame processing by leveraging the adaptive selection of three keyframes representing the beginning, middle, and end of the actions. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, resulting in a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its efficiency in computational resource utilization. The results highlight the advantages of incorporating precise tracking, effective spatial detection, and temporal consistency to address the challenges associated with video-based action detection.
format Article
id doaj-art-4de1d8d2e2af42c3a1233c8259109176
institution Kabale University
issn 1424-8220
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-4de1d8d2e2af42c3a1233c8259109176 (2025-08-20T03:47:57Z). eng. MDPI AG. Sensors, ISSN 1424-8220, 2025-05-01, vol. 25, no. 10, art. 3013. DOI: 10.3390/s25103013. YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences. Nada Alzahrani, Ouiem Bchir, Mohamed Maher Ben Ismail: Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia. Online access: https://www.mdpi.com/1424-8220/25/10/3013. Keywords: action detection; keyframe extraction; fusion technique; spatiotemporal information; you only look once (YOLO).
topic action detection
keyframe extraction
fusion technique
spatiotemporal information
you only look once (YOLO)
url https://www.mdpi.com/1424-8220/25/10/3013