Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition

Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to the growing interest in HAR research using self-superv...

Bibliographic Details
Main Authors: Jinhyeok Park, Seoung Bum Kim
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
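Subjects: Skeleton-based human action recognition; self-supervised learning; skeleton-specific transformation; graph representation learning; non-contrastive learning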
Online Access: https://ieeexplore.ieee.org/document/10945847/
_version_ 1850152177772789760
author Jinhyeok Park
Seoung Bum Kim
author_facet Jinhyeok Park
Seoung Bum Kim
author_sort Jinhyeok Park
collection DOAJ
description Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to growing interest in self-supervised learning (SSL) for HAR. While existing SSL studies have focused on extracting global information from skeleton sequences, they often overlook local information that captures the relationships between joints and their subtle movements over time. In this study, we propose an SSL-based HAR framework called coarse-to-fine spatiotemporal representation masking (CFSEM) that learns global, local, and temporal information within skeleton sequences. CFSEM captures global information in the skeleton using body- and part-level masking, and fine-grained movements using hand masking. In addition, temporal-axis shuffling is introduced into the framework to account for the temporal patterns inherent in skeleton sequences. To further enhance the learning process, the loss function is redefined using a cross-correlation matrix, which yields a non-contrastive SSL approach. Experiments on various datasets were conducted to evaluate the proposed framework against baseline methods. The results showed that CFSEM outperformed the baselines and highlighted the possibility of training HAR models with less labeled data, offering the potential to develop HAR models effectively for various industries.
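The three ingredients named in the description (joint masking, temporal-axis shuffling, and a loss built on a cross-correlation matrix) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the record does not give the exact masking scheme or loss, so the hand-joint indices, the function names, the zero-fill masking, and the Barlow Twins-style form of the loss with its `off_diag_weight` value are all assumptions made for the example.

```python
import math
import random

# Hypothetical hand-joint indices for a 25-joint skeleton; these are
# illustrative assumptions, not values taken from the paper.
HAND_JOINTS = [7, 11, 21, 22, 23, 24]


def mask_joints(sequence, joint_ids):
    """Zero out the given joints in every frame.

    sequence: list of frames; each frame is a list of [x, y, z] coordinates.
    """
    masked = set(joint_ids)
    return [
        [[0.0, 0.0, 0.0] if j in masked else list(coords)
         for j, coords in enumerate(frame)]
        for frame in sequence
    ]


def shuffle_frames(sequence, rng):
    """Return a copy of the sequence with its frames in random order."""
    frames = list(sequence)
    rng.shuffle(frames)
    return frames


def standardize_columns(batch):
    """Standardize each embedding dimension to zero mean and unit variance."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for k in range(d):
        col = [row[k] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 when a dimension is constant (std == 0).
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][k] = (col[i] - mean) / std
    return out


def cross_correlation_loss(z_a, z_b, off_diag_weight=0.005):
    """Assumed Barlow Twins-style loss on two N x D batches of embeddings.

    Diagonal entries of the cross-correlation matrix are pushed toward 1
    (the two views agree); off-diagonal entries are pushed toward 0
    (embedding dimensions are decorrelated). No negative pairs are used.
    """
    a, b = standardize_columns(z_a), standardize_columns(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[s][i] * b[s][j] for s in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sequence: 4 frames, 25 joints, 3 coordinates per joint.
    seq = [[[float(t), float(j), 0.0] for j in range(25)] for t in range(4)]
    view_a = mask_joints(seq, HAND_JOINTS)   # fine-grained (hand) masking
    view_b = shuffle_frames(seq, rng)        # temporal-axis shuffling
    print(len(view_a), len(view_b))
    # Toy embeddings standing in for encoder outputs of the two views.
    z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    print(cross_correlation_loss(z, z))
```

In a real pipeline the two views would be passed through an encoder and the loss computed on the resulting embeddings; here the toy embeddings only show that identical, decorrelated views give a loss near zero, which is the behavior a non-contrastive objective of this kind rewards.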
format Article
id doaj-art-de484c51bedb411983ca03ffcfd07729
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-de484c51bedb411983ca03ffcfd07729 2025-08-20T02:26:03Z eng IEEE IEEE Access 2169-3536 2025-01-01 13 58164 58174 10.1109/ACCESS.2025.3555953 10945847
Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition
Jinhyeok Park (https://orcid.org/0000-0001-6191-0188), Department of Industrial and Management Engineering, Korea University, Seoul, South Korea
Seoung Bum Kim (https://orcid.org/0000-0002-2205-8516), Department of Industrial and Management Engineering, Korea University, Seoul, South Korea
https://ieeexplore.ieee.org/document/10945847/
Skeleton-based human action recognition; self-supervised learning; skeleton-specific transformation; graph representation learning; non-contrastive learning
title Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition
title_sort self supervised spatiotemporal representation learning for skeleton based human action recognition
topic Skeleton-based human action recognition
self-supervised learning
skeleton-specific transformation
graph representation learning
non-contrastive learning
url https://ieeexplore.ieee.org/document/10945847/