Temporal Segment Method in Sign Word Recognition Using a Pretrained CNN-LSTM Network

Bibliographic Details
Main Authors: Seungju Lee, Irina Polyakova
Format: Article
Language: Russian
Published: The Fund for Promotion of Internet media, IT education, human development «League Internet Media», 2025-04-01
Series: Современные информационные технологии и IT-образование (Modern Information Technologies and IT-Education)
Online Access: https://sitito.cs.msu.ru/index.php/SITITO/article/view/1146
Description
Summary: Sign language recognition plays a crucial role in enhancing accessibility and inclusion for people with hearing impairments, facilitating more effective communication in social and professional environments. However, gesture classification from video data typically demands substantial computational resources, because the large number of frames to be processed significantly increases training time and memory consumption. Video data also often contain redundant frames that carry no meaningful information for gesture recognition, further complicating processing and wasting resources. This paper introduces a model for automatic recognition of sign language words based on the Temporal Segment Networks (TSN) method. TSN selects key frames from video sequences, reducing redundant information and significantly decreasing training time and memory usage without substantially compromising classification accuracy. The model is built on a CNN-LSTM architecture: a ResNet convolutional neural network extracts spatial features, and a recurrent LSTM layer models temporal dependencies across the frame sequence. The model was trained and tested on a 6-class subset of the WLASL dataset. Performance was evaluated with the Accuracy and F1-score metrics, allowing an objective comparison with alternative approaches. Experiments included a comparative analysis of pretrained ResNet backbones (ResNet18, ResNet34, ResNet50, ResNet101, ResNet152) to identify the optimal configuration. Results showed that applying TSN significantly reduced computational costs: training time was reduced by a factor of 2.173, GPU memory usage by a factor of 0.5514, and system memory usage by a factor of 1.027. Moreover, the TSN-based model achieved higher accuracy than the non-TSN variant, confirming the method's effectiveness for gesture classification. The combination of CNN-LSTM with ResNet18 and TSN thus achieves high accuracy while making efficient use of computational resources. These results can serve as a foundation for further development of automatic sign language recognition systems, including scaling to larger datasets and integration with multimodal gesture-processing systems.
ISSN: 2411-1473
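
The abstract describes the pipeline only in prose, so the minimal PyTorch sketch below illustrates how TSN-style segment sampling can feed a ResNet18 + LSTM classifier of the kind outlined above. It is a hedged illustration under stated assumptions, not the authors' implementation: the deterministic centre-frame sampling, the segment count of 8, the hidden size of 256, and the names tsn_sample_indices and CNNLSTM are all hypothetical choices.

    # Illustrative sketch only; hyperparameters and names are assumptions,
    # not taken from the paper.
    import torch
    import torch.nn as nn
    from torchvision import models


    def tsn_sample_indices(num_frames: int, num_segments: int = 8) -> torch.Tensor:
        """Split a video into equal temporal segments and pick the centre
        frame of each segment (a common deterministic TSN test-time scheme)."""
        seg_len = num_frames / num_segments
        return torch.tensor(
            [int(seg_len * i + seg_len / 2) for i in range(num_segments)]
        )


    class CNNLSTM(nn.Module):
        """ResNet18 extracts per-frame spatial features; an LSTM models the
        temporal dependencies across the sampled frames."""

        def __init__(self, num_classes: int = 6, hidden_size: int = 256):
            super().__init__()
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.feature_dim = backbone.fc.in_features      # 512 for ResNet18
            backbone.fc = nn.Identity()                     # keep pooled features
            self.cnn = backbone
            self.lstm = nn.LSTM(self.feature_dim, hidden_size, batch_first=True)
            self.classifier = nn.Linear(hidden_size, num_classes)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, segments, 3, H, W), frames chosen by tsn_sample_indices
            b, t, c, h, w = clips.shape
            feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
            _, (h_n, _) = self.lstm(feats)                  # last hidden state
            return self.classifier(h_n[-1])                 # (batch, num_classes)


    # Usage: sample 8 frames from a 64-frame dummy clip and classify the gesture.
    video = torch.randn(64, 3, 224, 224)
    idx = tsn_sample_indices(num_frames=64, num_segments=8)
    logits = CNNLSTM()(video[idx].unsqueeze(0))             # shape (1, 6)

At training time TSN implementations typically draw one random frame per segment rather than the centre frame; either way, only num_segments frames per video pass through the CNN, which is the mechanism behind the savings in training time and memory that the abstract reports.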