Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition
Abstract To achieve accurate skeleton-based action recognition, most prior approaches have adopted a serial strategy that combines Graph Convolutional Networks (GCNs) with attention-based methods. However, this strategy often treats the human skeleton as an isolated, self-contained structure, neglecting highly correlated yet indirectly connected skeletal parts and ultimately limiting recognition accuracy. This study proposes a novel architecture, SA-TDGFormer, that addresses this limitation by arranging GCNs and a Transformer in parallel. The parallel structure combines the strengths of both models, extracting local as well as global spatio-temporal features for more accurate motion encoding and improved recognition performance. The model's distinguishing feature is its dual-stream design: a spatio-temporal GCN stream, which captures the topological structure and motion representations of the human skeleton, and a spatio-temporal Transformer stream, which captures motion representations built from global inter-joint relationships. Because the two streams produce distinct feature representations with limited mutual awareness, the model also incorporates a late-fusion strategy that merges their outputs, allowing the streams to complement each other, enriching the action features, and maximizing information exchange between the two representation types. Empirical validation on three established benchmark datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, substantiates the model's effectiveness: compared with existing classification frameworks, the proposed method improves human action recognition accuracy by 1–5% on NTU RGB+D 60, demonstrating its superior performance in action recognition.
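The abstract describes the architecture only at a high level. As a rough illustration, below is a minimal PyTorch sketch of the two-stream, late-fusion idea it outlines: one GCN stream aggregating over the skeleton adjacency, one Transformer stream attending over all joint-frame tokens, and a fusion step that merges the two class-score outputs. The paper's actual SA-TDGFormer layers are not reproduced in this record, so every module name, layer size, the placeholder adjacency, and the score-averaging fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stream late-fusion idea from the abstract.
# All module names, layer sizes, and the averaging fusion are assumptions.
import torch
import torch.nn as nn

class GCNStream(nn.Module):
    """Spatio-temporal GCN stream: spatial aggregation over the skeleton
    adjacency, followed by temporal convolution over frames."""
    def __init__(self, in_ch, hidden, num_class, A):
        super().__init__()
        self.register_buffer("A", A)              # (V, V) adjacency (placeholder here)
        self.spatial = nn.Linear(in_ch, hidden)   # per-joint feature transform
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=9, padding=4)
        self.head = nn.Linear(hidden, num_class)

    def forward(self, x):                                   # x: (N, T, V, C)
        x = torch.einsum("uv,ntvc->ntuc", self.A, x)        # aggregate neighbor joints
        x = torch.relu(self.spatial(x))                     # (N, T, V, H)
        N, T, V, H = x.shape
        x = x.permute(0, 2, 3, 1).reshape(N * V, H, T)
        x = torch.relu(self.temporal(x))                    # temporal conv per joint
        x = x.reshape(N, V, H, T).mean(dim=(1, 3))          # pool joints and frames
        return self.head(x)                                 # (N, num_class) logits

class TransformerStream(nn.Module):
    """Spatio-temporal Transformer stream: self-attention over all joint-frame
    tokens captures global inter-joint relationships."""
    def __init__(self, in_ch, hidden, num_class, num_joints, num_frames):
        super().__init__()
        self.embed = nn.Linear(in_ch, hidden)
        self.pos = nn.Parameter(torch.zeros(num_frames * num_joints, hidden))
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, num_class)

    def forward(self, x):                                   # x: (N, T, V, C)
        N, T, V, C = x.shape
        tok = self.embed(x).reshape(N, T * V, -1) + self.pos  # one token per joint-frame
        tok = self.encoder(tok)                             # global self-attention
        return self.head(tok.mean(dim=1))                   # pooled class logits

class TwoStreamFusion(nn.Module):
    """Late fusion: run both streams and merge their class scores."""
    def __init__(self, gcn, transformer):
        super().__init__()
        self.gcn, self.transformer = gcn, transformer

    def forward(self, x):
        return 0.5 * (self.gcn(x) + self.transformer(x))

# Toy usage: 25 joints (as in NTU RGB+D), 3D coordinates, 64 frames, 60 classes.
A = torch.eye(25)                                 # placeholder; a real model would use
                                                  # a normalized skeleton adjacency
model = TwoStreamFusion(
    GCNStream(3, 64, 60, A),
    TransformerStream(3, 64, 60, num_joints=25, num_frames=64),
)
logits = model(torch.randn(2, 64, 25, 3))         # -> (2, 60)
```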
Saved in:
| Main Authors: | Dong Chen, Mingdong Chen, Peisong Wu, Mengtao Wu, Tao Zhang, Chuanqi Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-02-01 |
| Series: | Scientific Reports |
| Subjects: | Action recognition; Graph convolutional networks; Transformer |
| Online Access: | https://doi.org/10.1038/s41598-025-87752-8 |
| author | Dong Chen; Mingdong Chen; Peisong Wu; Mengtao Wu; Tao Zhang; Chuanqi Li |
|---|---|
| affiliations | Dong Chen (Guangxi Normal University, College of Computer Science and Engineering); Mingdong Chen, Peisong Wu, Mengtao Wu, Tao Zhang (Nanning Normal University, College of Physics and Electronic Engineering); Chuanqi Li (Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing) |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | Nature Portfolio |
| series | Scientific Reports |
| citation | Scientific Reports, Vol. 15, Iss. 1, pp. 1–14 (2025) |
| doi | 10.1038/s41598-025-87752-8 |
| topic | Action recognition; Graph convolutional networks; Transformer |
| url | https://doi.org/10.1038/s41598-025-87752-8 |