Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10945847/ |
| Summary: | Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to growing interest in self-supervised learning (SSL) for HAR. Existing SSL studies have focused on extracting global information from skeleton sequences, but they often overlook local information that captures the relationships between joints and their subtle movements over time. In this study, we propose an SSL-based HAR framework called coarse-to-fine spatiotemporal representation masking (CFSEM) that learns global, local, and temporal information in skeleton sequences. CFSEM captures global information in the skeleton using body- and part-level masking, and fine-grained movements using hand masking. In addition, temporal-axis shuffling is introduced into the framework to account for the temporal patterns inherent in skeleton sequences. To further enhance the learning process, the loss function is redefined using a cross-correlation matrix, which yields a non-contrastive SSL approach. Experiments on various datasets were conducted to evaluate the proposed framework against baseline methods. The results showed that CFSEM outperformed the baselines and highlighted the possibility of training HAR models with less labeled data, which could make HAR models easier to develop for various industries. |
|---|---|
| ISSN: | 2169-3536 |
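
The summary names three ingredients: joint masking, temporal-axis shuffling, and a loss built on a cross-correlation matrix. The sketch below illustrates the general idea in plain Python. It is not the authors' implementation: the function names, the zero-fill masking, the `off_diag_weight` value, and the Barlow Twins-style form of the loss are all assumptions made for illustration, since the record does not specify them.

```python
import random
from statistics import fmean, pstdev


def mask_joints(sequence, joints, fill=0.0):
    """Copy a (frames x joints x coords) sequence with the chosen joints filled.

    Which joints form the body, part, or hand groups is an assumption left
    to the caller; the record does not list them.
    """
    masked = set(joints)
    return [
        [[fill] * len(coords) if j in masked else list(coords)
         for j, coords in enumerate(frame)]
        for frame in sequence
    ]


def shuffle_frames(sequence, rng):
    """Return the frames of a sequence in a random temporal order."""
    frames = list(sequence)
    rng.shuffle(frames)
    return frames


def standardize(embeddings):
    """Scale each embedding dimension to zero mean and unit variance over the batch."""
    columns = list(zip(*embeddings))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in embeddings]


def cross_correlation_loss(z_a, z_b, off_diag_weight=5e-3):
    """Assumed Barlow Twins-style loss over two batches of embeddings.

    Diagonal entries of the cross-correlation matrix are pulled toward 1,
    off-diagonal entries toward 0. No negative pairs are used.
    """
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sequence: 4 frames, 3 joints, 2 coordinates per joint.
    seq = [[[float(t + j), float(t - j)] for j in range(3)] for t in range(4)]
    view_a = mask_joints(seq, joints=[2])   # e.g. a fine-grained mask
    view_b = shuffle_frames(seq, rng)       # temporal-axis shuffle
    print(len(view_a), len(view_b))
```

In a full pipeline the two augmented views would pass through an encoder, and the resulting embedding batches would be given to `cross_correlation_loss`. The encoder is omitted here because the record does not describe it.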