Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT
In this study, a human action recognition approach with dynamical frame processing is proposed to fulfill the need for action recognition in a real-time manner. A novel architecture with a modified convolutional long short-term memory (ModConvLSTM) with pose heatmaps as input features is proposed to achieve human action recognition without a fixed frame number.
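For a concrete picture of the streaming scheme described in the abstract, the sketch below shows the general idea: a ConvLSTM state is updated once per incoming pose-heatmap frame, and the classification head runs only every four new frames. This is a minimal sketch, not the authors' implementation: `PlainConvLSTMCell` is an ordinary ConvLSTM cell standing in for the paper's ModConvLSTM, the BERT head is replaced by a plain linear classifier over pooled features, and all shapes, channel counts, and names are assumptions.

```python
# Minimal sketch of the streaming-inference idea from the abstract (PyTorch).
# NOT the authors' code: the paper's ModConvLSTM and BERT head are stubbed,
# and all tensor shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class PlainConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell (stand-in for the paper's ModConvLSTM)."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


def stream_demo(num_frames=16, heatmap_ch=17, hid_ch=32, hw=56, num_classes=60):
    """Feed pose heatmaps one frame at a time; run the head every 4th frame."""
    cell = PlainConvLSTMCell(heatmap_ch, hid_ch)
    # Stand-in for the BERT head; the paper replaces GAP with BERT, but this
    # stub simply pools and classifies to keep the example self-contained.
    head = nn.Linear(hid_ch, num_classes)
    h = torch.zeros(1, hid_ch, hw, hw)
    c = torch.zeros(1, hid_ch, hw, hw)
    with torch.no_grad():
        for t in range(num_frames):
            frame = torch.rand(1, heatmap_ch, hw, hw)  # pose heatmap for frame t
            h, c = cell(frame, (h, c))                 # update recurrent state every frame
            if (t + 1) % 4 == 0:                       # classify every 4 new frames
                logits = head(h.mean(dim=(2, 3)))
                print(f"frame {t + 1}: predicted class {logits.argmax(dim=1).item()}")


if __name__ == "__main__":
    stream_demo()
```

The point of the structure is that the expensive classification step is decoupled from the per-frame state update, which is what allows continuous recognition on an unsegmented stream without a fixed clip length.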
| Main Authors: | Raden Hadapiningsyah Kusumoseniarto, Zhi-Yuan Lin, Shun-Feng Su, Pei-Jun Lee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Deep learning; human action recognition; convolutional LSTM; BERT; computer vision |
| Online Access: | https://ieeexplore.ieee.org/document/11045380/ |
| _version_ | 1849426062795079680 |
|---|---|
| author | Raden Hadapiningsyah Kusumoseniarto; Zhi-Yuan Lin; Shun-Feng Su; Pei-Jun Lee |
| author_facet | Raden Hadapiningsyah Kusumoseniarto; Zhi-Yuan Lin; Shun-Feng Su; Pei-Jun Lee |
| author_sort | Raden Hadapiningsyah Kusumoseniarto |
| collection | DOAJ |
| description | In this study, a human action recognition approach with dynamical frame processing is proposed to fulfill the need for action recognition in a real-time manner. A novel architecture with a modified convolutional long short-term memory (ModConvLSTM) with pose heatmaps as input features is proposed to achieve human action recognition without a fixed frame number. The effects of ModConvLSTM are verified at different depths. In our proposed architecture, we replace global average pooling (GAP) with Bidirectional Encoder Representations from Transformers (BERT) to address the limitations of temporal processing in a two-dimensional convolutional neural network (2D-CNN). By incorporating a training-stage mask and leveraging BERT’s attention mechanism, the model gains improved contextual understanding. This enhancement enables our network to achieve accuracies of 91.46% on the NTU60 dataset and 83.06% on NTU120, marking an improvement of 1.63% and 3.22%, respectively, with only a slight increase of 0.1 GFLOPs in computational cost. Unlike other real-time models that rely on large-scale datasets such as Kinetics for extensive pretraining, our model achieves competitive performance while being trained directly on NTU datasets. A key contribution of our work is its ability to perform dynamical frame recognition, efficiently updating the ModConvLSTM cell and computing BERT every four new frames, achieving real-time inference in just 14.2ms on a CPU. This study enables continuous action recognition without requiring predefined action boundaries, making our model well-suited for real-world applications where actions occur naturally without explicit segmentation. |
| format | Article |
| id | doaj-art-8ee5a5d08e204767bea68391601eb2fd |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-8ee5a5d08e204767bea68391601eb2fd (updated 2025-08-20T03:29:34Z); English; IEEE; IEEE Access; ISSN 2169-3536; published 2025-01-01; vol. 13, pp. 112368-112377; DOI 10.1109/ACCESS.2025.3581734; IEEE document 11045380; Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT; Raden Hadapiningsyah Kusumoseniarto (https://orcid.org/0009-0001-6138-0495), Electronics and Computer Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan; Zhi-Yuan Lin, Electrical Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan; Shun-Feng Su (https://orcid.org/0000-0001-9777-128X), Electrical Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan; Pei-Jun Lee (https://orcid.org/0000-0003-2010-0853), Electronics and Computer Engineering Department, National Taiwan University of Science and Technology, Taipei, Taiwan; https://ieeexplore.ieee.org/document/11045380/; Deep learning; human action recognition; convolutional LSTM; BERT; computer vision |
| spellingShingle | Raden Hadapiningsyah Kusumoseniarto; Zhi-Yuan Lin; Shun-Feng Su; Pei-Jun Lee; Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT; IEEE Access; Deep learning; human action recognition; convolutional LSTM; BERT; computer vision |
| title | Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT |
| title_full | Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT |
| title_fullStr | Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT |
| title_full_unstemmed | Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT |
| title_short | Real-Time Human Action Recognition With Dynamical Frame Processing via Modified ConvLSTM and BERT |
| title_sort | real time human action recognition with dynamical frame processing via modified convlstm and bert |
| topic | Deep learning; human action recognition; convolutional LSTM; BERT; computer vision |
| url | https://ieeexplore.ieee.org/document/11045380/ |
| work_keys_str_mv | AT radenhadapiningsyahkusumoseniarto realtimehumanactionrecognitionwithdynamicalframeprocessingviamodifiedconvlstmandbert AT zhiyuanlin realtimehumanactionrecognitionwithdynamicalframeprocessingviamodifiedconvlstmandbert AT shunfengsu realtimehumanactionrecognitionwithdynamicalframeprocessingviamodifiedconvlstmandbert AT peijunlee realtimehumanactionrecognitionwithdynamicalframeprocessingviamodifiedconvlstmandbert |