InterAcT: A generic keypoints-based lightweight transformer model for recognition of human solo actions and interactions in aerial videos.

Human action recognition is an important component of many aerial security and surveillance applications, and numerous efforts have been made to solve the problem effectively and efficiently. Existing methods, however, are generally designed to recognize either solo actions or interactions, which restricts their use to specific scenarios. In addition, lightweight and computationally efficient models are still needed to make such methods deployable in real-world applications. To this end, this paper presents a generic, lightweight, and computationally efficient Transformer-based model, referred to as InterAcT, that relies on bodily keypoints extracted with YOLO v8 to recognize both human solo actions and interactions in aerial videos. Its lightweight architecture, with 0.0709M parameters and 0.0389G FLOPs, distinguishes it from the AcT models. An extensive performance evaluation has been carried out on two publicly available datasets, Drone Action and UT-Interaction, comprising a total of 18 classes covering both solo actions and interactions. The model is optimized and trained on an 80% training set and a 10% validation set, and its performance is evaluated on the remaining 10% test set, where it achieves highly encouraging results on multiple benchmarks and outperforms several state-of-the-art methods. With an accuracy of 0.9923, the model outperforms the AcT variants (micro: 0.9353, small: 0.9893, base: 0.9907, large: 0.9558), 2P-GCN (0.9337), LSTM (0.9774), 3D-ResNet (0.9921), and 3D CNN (0.9920). It is able to recognize a large number of solo-action and two-person interaction classes both in aerial videos and in footage from ground-level cameras (grayscale and RGB).
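To make the described pipeline concrete, the following Python sketch illustrates the general approach the abstract outlines: per-frame bodily keypoints obtained from a pretrained YOLOv8 pose model, stacked into a sequence, and classified by a small Transformer encoder. This is an illustrative assumption, not the authors' released code; the model dimensions, the 18-class head, and the single-person handling are placeholders chosen for clarity.

    # Minimal sketch (assumed, not the authors' code) of a keypoints-based action
    # recognition pipeline: YOLOv8 pose estimation per frame, then a small
    # Transformer encoder over the keypoint sequence.
    import torch
    import torch.nn as nn
    from ultralytics import YOLO  # pip install ultralytics

    pose_model = YOLO("yolov8n-pose.pt")  # pretrained pose model, 17 COCO keypoints

    def extract_keypoint_sequence(frames):
        """Return a (T, 34) tensor of normalized (x, y) keypoints, one row per frame."""
        rows = []
        for frame in frames:
            result = pose_model(frame, verbose=False)[0]
            kps = result.keypoints
            if kps is None or kps.xyn.shape[0] == 0:
                rows.append(torch.zeros(34))                 # no person detected: zero padding
            else:
                rows.append(kps.xyn[0].reshape(-1).float())  # first detected person
        return torch.stack(rows)

    class TinyActionTransformer(nn.Module):
        """Small Transformer encoder over a keypoint sequence with a classification head."""

        def __init__(self, in_dim=34, d_model=64, n_heads=2, n_layers=2, n_classes=18):
            super().__init__()
            self.proj = nn.Linear(in_dim, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=128, batch_first=True
            )
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, x):                                # x: (batch, T, in_dim)
            z = self.encoder(self.proj(x))
            return self.head(z.mean(dim=1))                  # temporal average pooling

The actual InterAcT architecture (keypoint tokenization, positional encoding, and two-person handling for interactions) is described in the article available at the DOI below.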

Bibliographic Details
Main Authors: Mubashir Shah, Tahir Nawaz, Rab Nawaz, Nasir Rashid, Muhammad Osama Ali
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0323314