Video Swin-CLSTM Transformer: Enhancing human action recognition with optical flow and long-term dependencies.
As video data volumes soar exponentially, the significance of video content analysis, particularly Human Action Recognition (HAR), has become increasingly prominent in fields such as intelligent surveillance, sports analytics, medical rehabilitation, and virtual reality. However, current deep learning-based HAR methods encounter challenges in recognizing subtle actions within complex backgrounds, comprehending long-term semantics, and maintaining computational efficiency. To address these challenges, we introduce the Video Swin-CLSTM Transformer. Built on the Video Swin Transformer backbone, our model incorporates optical flow information at the input stage, employing a sparse sampling strategy, to effectively counteract background interference. Combined with the backbone's 3D Patch Partition and Patch Merging techniques, it efficiently extracts and fuses multi-level features from both optical flow and raw RGB inputs, thereby enhancing the model's ability to capture motion characteristics in complex backgrounds. Additionally, by embedding Convolutional Long Short-Term Memory (ConvLSTM) units, the model's capacity to capture and understand long-term dependencies among key actions in videos is further enhanced. Experiments on the UCF-101 dataset demonstrate that our model achieves mean Top-1/Top-5 accuracies of 92.8% and 99.4%, which are 3.2% and 2.0% higher than those of the baseline model, while the computational cost at peak performance is reduced by an average of 3.3% compared to models without optical flow. Ablation studies further validate the effectiveness of the model's crucial components, with the integration of optical flow and the embedding of ConvLSTM modules yielding maximum improvements in mean Top-1 accuracy of 2.6% and 1.9%, respectively. Notably, employing our custom ImageNet-1K-LSTM pre-trained model results in a maximum increase of 2.7% in mean Top-1 accuracy compared to the traditional ImageNet-1K pre-trained model. These experimental results indicate that our model offers certain advantages over other Swin Transformer-based methods for video HAR tasks.
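The abstract describes the architecture only in prose, so the following is a minimal, hedged PyTorch-style sketch of the idea as stated: RGB frames and optical-flow fields are fused at the input stage, embedded through a 3D patch partition, passed through a placeholder spatiotemporal backbone, and summarized over time by a ConvLSTM cell before classification. Every module name, channel size, and the exact fusion point here are assumptions made for illustration; this is not the authors' released implementation.

```python
# Illustrative sketch only: early fusion of RGB + optical flow, a 3D patch
# embedding, a stand-in for the hierarchical Swin backbone, and a ConvLSTM
# cell for long-term temporal aggregation. Shapes and names are assumed.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: all four gates come from one shared convolution."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class VideoSwinCLSTMSketch(nn.Module):
    def __init__(self, num_classes=101, embed_dim=96):
        super().__init__()
        # Input-stage fusion: 3 RGB channels + 2 optical-flow channels (u, v).
        # The 3D patch partition is approximated with a strided 3D convolution
        # (temporal patch 2, spatial patch 4), as in Video Swin-style embeddings.
        self.patch_embed = nn.Conv3d(3 + 2, embed_dim, kernel_size=(2, 4, 4),
                                     stride=(2, 4, 4))
        # Placeholder for the hierarchical (Swin) backbone stages.
        self.backbone = nn.Sequential(
            nn.Conv3d(embed_dim, embed_dim, 3, padding=1), nn.GELU())
        self.convlstm = ConvLSTMCell(embed_dim, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, rgb, flow):
        # rgb: (B, 3, T, H, W); flow: (B, 2, T, H, W), temporally aligned.
        x = self.patch_embed(torch.cat([rgb, flow], dim=1))  # (B, C, T', H', W')
        x = self.backbone(x)
        b, c, t, h, w = x.shape
        state = (x.new_zeros(b, c, h, w), x.new_zeros(b, c, h, w))
        for step in range(t):                                 # scan over time
            state = self.convlstm(x[:, :, step], state)
        pooled = state[0].mean(dim=(2, 3))                    # global avg pool
        return self.head(pooled)


if __name__ == "__main__":
    model = VideoSwinCLSTMSketch()
    rgb = torch.randn(2, 3, 16, 224, 224)    # 16 sparsely sampled frames
    flow = torch.randn(2, 2, 16, 224, 224)   # matching optical-flow fields
    print(model(rgb, flow).shape)            # torch.Size([2, 101])
```

A companion sketch of the sparse frame sampling and the Top-1/Top-5 evaluation mentioned in the abstract appears after the full record fields at the end of this page.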
| Main Authors: | Jun Qin, Shenwei Chen, Zheng Ye, Jing Liu, Zhou Liu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0327717 |
|---|---|
| author | Jun Qin; Shenwei Chen; Zheng Ye; Jing Liu; Zhou Liu |
| author_sort | Jun Qin |
| collection | DOAJ |
| description | As video data volumes soar exponentially, the significance of video content analysis, particularly Human Action Recognition (HAR), has become increasingly prominent in fields such as intelligent surveillance, sports analytics, medical rehabilitation, and virtual reality. However, current deep learning-based HAR methods encounter challenges in recognizing subtle actions within complex backgrounds, comprehending long-term semantics, and maintaining computational efficiency. To address these challenges, we introduce the Video Swin-CLSTM Transformer. Built on the Video Swin Transformer backbone, our model incorporates optical flow information at the input stage, employing a sparse sampling strategy, to effectively counteract background interference. Combined with the backbone's 3D Patch Partition and Patch Merging techniques, it efficiently extracts and fuses multi-level features from both optical flow and raw RGB inputs, thereby enhancing the model's ability to capture motion characteristics in complex backgrounds. Additionally, by embedding Convolutional Long Short-Term Memory (ConvLSTM) units, the model's capacity to capture and understand long-term dependencies among key actions in videos is further enhanced. Experiments on the UCF-101 dataset demonstrate that our model achieves mean Top-1/Top-5 accuracies of 92.8% and 99.4%, which are 3.2% and 2.0% higher than those of the baseline model, while the computational cost at peak performance is reduced by an average of 3.3% compared to models without optical flow. Ablation studies further validate the effectiveness of the model's crucial components, with the integration of optical flow and the embedding of ConvLSTM modules yielding maximum improvements in mean Top-1 accuracy of 2.6% and 1.9%, respectively. Notably, employing our custom ImageNet-1K-LSTM pre-trained model results in a maximum increase of 2.7% in mean Top-1 accuracy compared to the traditional ImageNet-1K pre-trained model. These experimental results indicate that our model offers certain advantages over other Swin Transformer-based methods for video HAR tasks. |
| format | Article |
| id | doaj-art-9d5cbd9eea5844269985819eabefb783 |
| institution | Kabale University |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | PLoS ONE, vol. 20, no. 7, e0327717 (2025-01-01). https://doi.org/10.1371/journal.pone.0327717 |
| title | Video Swin-CLSTM Transformer: Enhancing human action recognition with optical flow and long-term dependencies. |
| url | https://doi.org/10.1371/journal.pone.0327717 |
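As referenced above, this second sketch illustrates two evaluation-side details from the abstract: a segment-based sparse sampling strategy (an assumption about what "sparse sampling strategy" means here, in the spirit of TSN-style samplers) and the Top-1/Top-5 accuracy metric reported on UCF-101. Function names and defaults are illustrative, not taken from the paper.

```python
# Hedged helpers: segment-based sparse frame sampling and Top-k accuracy.
import random
import torch


def sparse_sample_indices(num_frames: int, num_segments: int = 16,
                          train: bool = True) -> list[int]:
    """Divide the clip into equal segments and take one frame index per segment."""
    seg_len = num_frames / num_segments
    if train:
        # Random offset inside each segment during training.
        return [int(i * seg_len) + random.randrange(max(1, int(seg_len)))
                for i in range(num_segments)]
    # Deterministic (center of each segment) for evaluation.
    return [int((i + 0.5) * seg_len) for i in range(num_segments)]


def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor,
                  ks=(1, 5)) -> dict[int, float]:
    """Compute Top-k accuracy for each k from raw class scores."""
    _, pred = logits.topk(max(ks), dim=1)       # (B, max_k) predicted class ids
    correct = pred.eq(labels.unsqueeze(1))      # (B, max_k) hit mask
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}


if __name__ == "__main__":
    print(sparse_sample_indices(300, 16, train=False))
    logits = torch.randn(8, 101)                # fake scores, 101 UCF-101 classes
    labels = torch.randint(0, 101, (8,))
    print(topk_accuracy(logits, labels))
```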