Multi-Grained Temporal Clip Transformer for Skeleton-Based Human Activity Recognition


Saved in:
Bibliographic Details
Main Authors: Peiwang Zhu, Chengwu Liang, Yalong Liu, Songqi Jiang
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/9/4768
Description
Summary: Skeleton-based human activity recognition is a key research topic in deep learning and computer vision. However, existing approaches are less effective at capturing short-term sub-action information at different granularity levels and long-term motion correlations, which limits recognition accuracy. To overcome these challenges, a multi-grained temporal clip transformer (MTC-Former) is proposed. First, on top of the transformer backbone, a multi-grained temporal clip attention (MTCA) module with a multi-branch architecture is proposed to capture short-term sub-action features at multiple granularities. Second, a multi-scale spatial–temporal feature interaction module is proposed to jointly learn sub-action dependencies and facilitate skeletal motion interactions, where long-range motion patterns are embedded to enhance correlation modeling. Experiments on three datasets, NTU RGB+D, NTU RGB+D 120, and InHARD, achieved state-of-the-art Top-1 recognition accuracy, demonstrating the superiority of the proposed MTC-Former.
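The core idea of multi-grained temporal clip attention, splitting a frame sequence into clips of several sizes, attending within each clip, and fusing the branch outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the clip sizes, the identity Q/K/V projections, and the averaging fusion are all simplifying assumptions made for illustration.

```python
# Illustrative sketch of multi-grained temporal clip attention.
# Assumptions (not from the paper): Q = K = V = raw frame features,
# clip sizes (2, 4), and branch fusion by simple averaging.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def clip_self_attention(clip):
    """Scaled dot-product self-attention within one temporal clip."""
    d = len(clip[0])
    out = []
    for q in clip:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in clip]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, clip))
                    for j in range(d)])
    return out

def multi_grained_clip_attention(frames, clip_sizes=(2, 4)):
    """Run clip-wise attention at several temporal granularities
    (one branch per clip size) and average the branch outputs."""
    T, d = len(frames), len(frames[0])
    branches = []
    for g in clip_sizes:
        branch = []
        for start in range(0, T, g):
            branch.extend(clip_self_attention(frames[start:start + g]))
        branches.append(branch)
    # Fuse branches frame by frame (averaging is one of many options).
    return [[sum(b[t][j] for b in branches) / len(branches)
             for j in range(d)]
            for t in range(T)]

# Usage: 8 frames of 4-dimensional skeleton features.
frames = [[float(t + j) for j in range(4)] for t in range(8)]
fused = multi_grained_clip_attention(frames)
```

Each branch sees the same frames but a different temporal window, so short clips emphasize fine sub-action cues while longer clips capture coarser motion context; the fusion combines both views per frame.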
ISSN:2076-3417