Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition

Bibliographic Details
Main Authors:	Jinhyeok Park, Seoung Bum Kim
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Skeleton-based human action recognition; self-supervised learning; skeleton-specific transformation; graph representation learning; non-contrastive learning
Online Access:	https://ieeexplore.ieee.org/document/10945847/

Description
Summary:	Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to growing interest in HAR research using self-supervised learning (SSL). While existing SSL studies have focused on extracting global information from skeleton sequences, they often overlook local information that captures the relationships between joints and their subtle movements over time. In this study, we propose an SSL-based HAR framework called coarse-to-fine spatiotemporal representation masking (CFSEM) that effectively learns global, local, and temporal information within skeletal data. CFSEM captures not only global information in the skeleton using body- and part-level masking but also fine-grained movements using hand masking. In addition, temporal-axis shuffling is introduced into the proposed framework to account for temporal patterns inherent in skeleton sequences. To further enhance the learning process, the loss function is redefined using a cross-correlation matrix, yielding a non-contrastive SSL approach. Experiments on various datasets were conducted to evaluate the proposed framework against baseline methods. The results showed that CFSEM outperformed the baselines and highlighted the possibility of training HAR models with less labeled data, which could make HAR models easier to develop for various industries.
ISSN:	2169-3536
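
Illustrative sketch (not from the article): the summary names three ingredients, namely joint masking at several granularities, temporal-axis shuffling, and a loss built on a cross-correlation matrix. The record gives no implementation details, so everything below is an assumption made for illustration: the 6-joint skeleton, the joint groups `ARM_PART` and `HAND_JOINTS`, zero-filling as the masking rule, and the Barlow Twins-style form of the loss with its `off_diag_weight` parameter. It is a minimal pure-Python sketch of the general idea, not the authors' method.

```python
import random

# A sequence is a list of T frames; each frame is a list of J joints;
# each joint is an (x, y, z) tuple. The joint groups below are hypothetical
# indices for a toy 6-joint skeleton, not the layout used in the article.
ARM_PART = [2, 3, 4, 5]   # assumed "part-level" group
HAND_JOINTS = [4, 5]      # assumed "hand" group


def mask_joints(sequence, joint_ids):
    """Return a copy of the sequence with the given joints zeroed in every frame."""
    masked = set(joint_ids)
    return [
        [(0.0, 0.0, 0.0) if j in masked else joint for j, joint in enumerate(frame)]
        for frame in sequence
    ]


def shuffle_time(sequence, rng):
    """Return a copy of the sequence with its frame order randomly permuted."""
    frames = list(sequence)
    rng.shuffle(frames)
    return frames


def _standardize_columns(z, eps):
    """Standardize each embedding dimension over the batch; returns D columns of length N."""
    n, d = len(z), len(z[0])
    columns = []
    for k in range(d):
        col = [row[k] for row in z]
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        columns.append([(v - mean) / (std + eps) for v in col])
    return columns


def cross_correlation_loss(z_a, z_b, off_diag_weight=0.005, eps=1e-8):
    """Barlow Twins-style loss (an assumed form) on two batches of embeddings.

    z_a and z_b are N x D lists of embeddings of two views of the same samples.
    Diagonal entries of the cross-correlation matrix are pushed toward 1 and
    off-diagonal entries toward 0, so no negative pairs are required.
    """
    n = len(z_a)
    a = _standardize_columns(z_a, eps)
    b = _standardize_columns(z_b, eps)
    loss = 0.0
    for i, col_a in enumerate(a):
        for j, col_b in enumerate(b):
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss
```

In a training loop under these assumptions, two differently transformed copies of each skeleton sequence (for example, one with `HAND_JOINTS` masked and one passed through `shuffle_time`) would be fed through an encoder, and the resulting embedding batches passed to `cross_correlation_loss`. The encoder itself is outside the scope of this sketch.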

Similar Items