Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition

Abstract: To achieve accurate skeleton-based action recognition, most prior approaches have adopted a serial strategy that combines Graph Convolutional Networks (GCNs) with attention-based methods. This strategy, however, frequently treats the human skeleton as an isolated, self-contained structure, neglecting highly correlated yet indirectly connected skeletal parts and ultimately limiting recognition accuracy. This study proposes a novel architecture, SA-TDGFormer, that addresses this limitation through a parallel configuration of GCNs and a Transformer. The parallel structure integrates the advantages of both models, extracting local as well as global spatio-temporal features for more accurate motion encoding and improved recognition performance. The model is distinguished by its dual-stream design: a spatio-temporal GCN stream and a spatio-temporal Transformer stream. The former captures the topological structure and motion representations of human skeletons, while the latter captures motion representations built from global inter-joint relationships. Because the two streams generate distinct feature representations with limited mutual awareness, the model incorporates a late fusion strategy that merges their outputs, allowing the streams to complement each other, enriching action features, and maximizing information exchange between the two representation types. Empirical validation on three established benchmark datasets (NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton) substantiates the model's effectiveness. Compared with existing classification frameworks, the proposed method improves human action recognition accuracy by 1–5% on the NTU RGB+D 60 dataset, demonstrating its superior performance in action recognition.
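The abstract describes the architecture only at a high level. As a rough illustration of the parallel two-stream design with late fusion that it describes, the following PyTorch sketch pairs a toy spatio-temporal GCN stream (local joint neighborhoods) with a toy Transformer stream (global inter-joint attention) and averages their class scores. All layer sizes, the placeholder adjacency matrix, and the equal fusion weights are illustrative assumptions, not the paper's actual SA-TDGFormer configuration.

```python
# Illustrative sketch only: layer sizes, the adjacency matrix, and the equal
# fusion weights are assumptions; the paper's actual SA-TDGFormer details
# are not given in this record.
import torch
import torch.nn as nn


class GCNStream(nn.Module):
    """Toy spatio-temporal GCN stream: aggregates features over the skeleton
    graph (local joint neighborhoods), then convolves over time."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)              # (V, V) joint adjacency
        self.spatial = nn.Linear(in_ch, out_ch)           # per-joint transform
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):                                 # x: (N, T, V, C)
        x = torch.einsum("vw,ntwc->ntvc", self.A, x)      # mix neighboring joints
        x = self.spatial(x)                               # (N, T, V, C')
        x = x.permute(0, 3, 1, 2)                         # (N, C', T, V)
        x = self.temporal(x).relu()                       # temporal convolution
        return x.mean(dim=(2, 3))                         # pool -> (N, C')


class TransformerStream(nn.Module):
    """Toy Transformer stream: self-attention over every joint-time token,
    so any two joints can interact regardless of skeletal connectivity."""

    def __init__(self, in_ch, d_model):
        super().__init__()
        self.embed = nn.Linear(in_ch, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                                 # x: (N, T, V, C)
        n, t, v, c = x.shape
        tokens = self.embed(x.reshape(n, t * v, c))       # one token per joint-frame
        return self.encoder(tokens).mean(dim=1)           # pool -> (N, d_model)


class TwoStreamModel(nn.Module):
    """Late fusion: each stream classifies independently; scores are averaged."""

    def __init__(self, in_ch, hidden, num_classes, adjacency):
        super().__init__()
        self.gcn = GCNStream(in_ch, hidden, adjacency)
        self.transformer = TransformerStream(in_ch, hidden)
        self.head_gcn = nn.Linear(hidden, num_classes)
        self.head_tr = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return 0.5 * self.head_gcn(self.gcn(x)) + 0.5 * self.head_tr(self.transformer(x))


# Example: 25 joints (the NTU RGB+D skeleton), 64 frames, 3-D joint
# coordinates, 60 action classes; identity adjacency as a stand-in.
A = torch.eye(25)
model = TwoStreamModel(in_ch=3, hidden=64, num_classes=60, adjacency=A)
logits = model(torch.randn(2, 64, 25, 3))                 # -> (2, 60)
```

A plausible motivation for late fusion, per the abstract, is that the two streams produce feature spaces with limited mutual awareness, so combining final class scores rather than intermediate features lets each stream specialize before their evidence is merged.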

Bibliographic Details
Main Authors: Dong Chen, Mingdong Chen, Peisong Wu, Mengtao Wu, Tao Zhang, Chuanqi Li
Author Affiliations: Dong Chen (Guangxi Normal University, College of Computer Science and Engineering); Mingdong Chen, Peisong Wu, Mengtao Wu, Tao Zhang (Nanning Normal University, College of Physics and Electronic Engineering); Chuanqi Li (Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing)
Format: Article
Language: English
Published: Nature Portfolio, 2025-02-01
Series: Scientific Reports, Vol. 15, Iss. 1, pp. 1-14
ISSN: 2045-2322
Subjects: Action recognition; Graph convolutional networks; Transformer
Online Access: https://doi.org/10.1038/s41598-025-87752-8