Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT

This study proposes a human action recognition approach with dynamical frame processing to meet the need for real-time action recognition. A novel architecture built on a modified convolutional long short-term memory (ModConvLSTM), with pose heatmaps as input features, is proposed to...

Bibliographic Details
Main Authors: Raden Hadapiningsyah Kusumoseniarto, Zhi-Yuan Lin, Shun-Feng Su, Pei-Jun Lee
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Deep learning, human action recognition, convolutional LSTM, BERT, computer vision
Online Access: https://ieeexplore.ieee.org/document/11045380/
author Raden Hadapiningsyah Kusumoseniarto
Zhi-Yuan Lin
Shun-Feng Su
Pei-Jun Lee
collection DOAJ
description This study proposes a human action recognition approach with dynamical frame processing to meet the need for real-time action recognition. A novel architecture built on a modified convolutional long short-term memory (ModConvLSTM), with pose heatmaps as input features, is proposed to achieve human action recognition without a fixed frame number. The effects of ModConvLSTM are verified at different depths. In the proposed architecture, global average pooling (GAP) is replaced with Bidirectional Encoder Representations from Transformers (BERT) to address the limitations of temporal processing in a two-dimensional convolutional neural network (2D-CNN). By incorporating a training-stage mask and leveraging BERT’s attention mechanism, the model gains improved contextual understanding. This enhancement enables the network to achieve accuracies of 91.46% on the NTU60 dataset and 83.06% on NTU120, improvements of 1.63% and 3.22%, respectively, with only a slight increase of 0.1 GFLOPs in computational cost. Unlike other real-time models that rely on extensive pretraining on large-scale datasets such as Kinetics, the model achieves competitive performance while being trained directly on the NTU datasets. A key contribution is dynamical frame recognition: the ModConvLSTM cell is updated and BERT is computed every four new frames, giving real-time inference in 14.2 ms on a CPU. This enables continuous action recognition without predefined action boundaries, making the model well suited to real-world applications where actions occur naturally without explicit segmentation.
format Article
id doaj-art-8ee5a5d08e204767bea68391601eb2fd
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-8ee5a5d08e204767bea68391601eb2fd 2025-08-20T03:29:34Z eng
IEEE, IEEE Access, ISSN 2169-3536, 2025-01-01, vol. 13, pp. 112368-112377
DOI 10.1109/ACCESS.2025.3581734, IEEE Xplore document 11045380
Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT
Raden Hadapiningsyah Kusumoseniarto (https://orcid.org/0009-0001-6138-0495), Electronics and Computer Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan
Zhi-Yuan Lin, Electrical Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan
Shun-Feng Su (https://orcid.org/0000-0001-9777-128X), Electrical Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan
Pei-Jun Lee (https://orcid.org/0000-0003-2010-0853), Electronics and Computer Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan
https://ieeexplore.ieee.org/document/11045380/
Deep learning; human action recognition; convolutional LSTM; BERT; computer vision
title Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT
topic Deep learning
human action recognition
convolutional LSTM
BERT
computer vision
url https://ieeexplore.ieee.org/document/11045380/
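
The description above says that the ModConvLSTM cell is updated and BERT is computed every four new frames. The following is a minimal scheduling sketch of that idea only, not the authors' implementation. The class name `ChunkedStreamRecognizer`, the callables `update_cell` and `temporal_head`, and the toy stand-in functions are all hypothetical. It also assumes that both steps run once per group of four frames, which is one reading of the abstract.

```python
from typing import Any, Callable, List, Optional


class ChunkedStreamRecognizer:
    """Hypothetical scheduler: buffer frames, do the work once per `stride` frames."""

    def __init__(
        self,
        update_cell: Callable[[Any, List[Any]], Any],
        temporal_head: Callable[[List[Any]], Any],
        stride: int = 4,  # "every four new frames" in the abstract
    ) -> None:
        self.update_cell = update_cell      # stand-in for the recurrent cell update
        self.temporal_head = temporal_head  # stand-in for the attention-based head
        self.stride = stride
        self.pending: List[Any] = []        # frames received since the last update
        self.state: Any = None              # recurrent state carried across chunks
        self.tokens: List[Any] = []         # one entry per processed chunk (unbounded here)

    def push(self, frame: Any) -> Optional[Any]:
        """Accept one frame; return a prediction only when a chunk completes."""
        self.pending.append(frame)
        if len(self.pending) < self.stride:
            return None
        # A full chunk has arrived: update the state once, then run the head.
        self.state = self.update_cell(self.state, self.pending)
        self.pending = []
        self.tokens.append(self.state)
        return self.temporal_head(self.tokens)


# Toy stand-ins (assumptions for illustration, not real models).
def toy_update(state: Any, chunk: List[int]) -> int:
    return (state or 0) + sum(chunk)


def toy_head(tokens: List[int]) -> int:
    return len(tokens)


rec = ChunkedStreamRecognizer(toy_update, toy_head)
outputs = [rec.push(f) for f in range(1, 11)]
print(outputs)  # predictions appear only on the 4th and 8th frames
```

Because the stream has no fixed length, `push` can be called indefinitely; a real system would presumably bound the token history, which this sketch does not do.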