Convolutional spatio-temporal sequential inference model for human interaction behavior recognition

IntroductionHuman action recognition is a critical task with broad applications and remains a challenging problem due to the complexity of modeling dynamic interactions between individuals. Existing methods, including skeleton sequence-based and RGB video-based models, have achieved impressive accur...

Full description

Saved in:
Bibliographic Details
Main Authors: Lizhong Jin, Rulong Fan, Xiaoling Han, Xueying Cui
Format: Article
Language:English
Published: Frontiers Media S.A. 2025-07-01
Series:Frontiers in Computer Science
Subjects:
Online Access:https://www.frontiersin.org/articles/10.3389/fcomp.2025.1576775/full
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:IntroductionHuman action recognition is a critical task with broad applications and remains a challenging problem due to the complexity of modeling dynamic interactions between individuals. Existing methods, including skeleton sequence-based and RGB video-based models, have achieved impressive accuracy but often suffer from high computational costs and limited effectiveness in modeling human interaction behaviors.MethodsTo address these limitations, we propose a lightweight Convolutional Spatiotemporal Sequence Inference Model (CSSIModel) for recognizing human interaction behaviors. The model extracts features from skeleton sequences using DINet and from RGB video frames using ResNet-18. These multi-modal features are fused and processed using a novel multiscale two-dimensional convolutional peak-valley inference module to classify interaction behaviors.ResultsCSSIModel achieves competitive results across several benchmark datasets: 87.4% accuracy on NTU RGB+D 60 (XSub), 94.1% on NTU RGB+D 60 (XView), 80.5% on NTU RGB+D 120 (XSub), and 84.9% on NTU RGB+D 120 (XSet). These results are comparable to or exceed those of state-of-the-art methods.DiscussionThe proposed method effectively balances accuracy and computational efficiency. By significantly reducing model complexity while maintaining high performance, CSSIModel is well-suited for real-time applications and provides a valuable reference for future research in multi-modal human behavior recognition.
ISSN:2624-9898