Enhancing Sign Language Recognition Performance Through Coverage-Based Dynamic Clip Generation

Bibliographic Details
Main Authors: Taewan Kim, Bongjae Kim
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/11/6372
Description
Summary: Sign Language Recognition (SLR) has made substantial progress through advances in deep learning and video-based action recognition. Conventional SLR systems typically segment input videos into a fixed number of clips (e.g., five clips per video), regardless of the video’s actual length, to meet the fixed-length input requirements of deep learning models. While this approach simplifies model design and training, it fails to account for the temporal variation inherent in sign language videos. Specifically, applying a fixed number of clips to videos of varying lengths can lead to significant information loss: longer videos suffer from excessive frame skipping, causing the model to miss critical gestural cues, whereas shorter videos require frame duplication, introducing temporal redundancy that distorts motion dynamics. To address these limitations, we propose a dynamic clip generation method that adaptively adjusts the number of clips during inference based on a novel coverage metric. This metric quantifies how effectively a clip selection captures the temporal information in a given video, enabling the system to maintain both temporal fidelity and computational efficiency. Experimental results on benchmark SLR datasets using multiple models, including 3D CNNs, R(2+1)D, Video Swin Transformer, and Multiscale Vision Transformers, demonstrate that our method consistently outperforms fixed clip generation methods. Notably, our approach achieves 98.67% accuracy with the Video Swin Transformer while reducing inference time by 28.57%. These findings highlight the effectiveness of coverage-based dynamic clip generation in improving both accuracy and efficiency, particularly for videos with high temporal variability.
ISSN: 2076-3417
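
The coverage-based clip selection described in the summary can be illustrated with a minimal Python sketch. The coverage definition used here (the fraction of a video's frames that fall inside at least one uniformly spaced clip), the clip length of 16 frames, the 0.9 coverage threshold, and the cap of 8 clips are illustrative assumptions, not the authors' published formulation; the sketch only shows the general idea of growing the clip count at inference time until coverage of the video is sufficient.

# Minimal sketch of coverage-based dynamic clip selection.
# Assumptions (not from the paper): coverage = fraction of frames covered by
# at least one clip; clips of 16 frames are placed at evenly spaced starts;
# the clip count grows until coverage reaches 0.9, capped at 8 clips.

def uniform_clip_starts(num_frames: int, num_clips: int, clip_len: int) -> list[int]:
    """Evenly spaced start indices for num_clips clips of clip_len frames."""
    if num_clips == 1:
        return [max(0, (num_frames - clip_len) // 2)]
    span = max(1, num_frames - clip_len)
    return [round(i * span / (num_clips - 1)) for i in range(num_clips)]

def coverage(num_frames: int, num_clips: int, clip_len: int) -> float:
    """Fraction of frames covered by at least one clip (assumed metric)."""
    covered = set()
    for start in uniform_clip_starts(num_frames, num_clips, clip_len):
        covered.update(range(start, min(start + clip_len, num_frames)))
    return len(covered) / num_frames

def choose_num_clips(num_frames: int, clip_len: int = 16,
                     threshold: float = 0.9, max_clips: int = 8) -> int:
    """Smallest clip count whose coverage reaches the threshold."""
    for n in range(1, max_clips + 1):
        if coverage(num_frames, n, clip_len) >= threshold:
            return n
    return max_clips

# Example: shorter videos need fewer clips than longer ones.
print(choose_num_clips(48))   # 3 clips fully cover a 48-frame video
print(choose_num_clips(200))  # a 200-frame video hits the cap of 8 clips

Under these assumptions, a 48-frame video is fully covered by three clips while a 200-frame video reaches the clip-count cap, mirroring the summary's point that applying the same fixed clip count to videos of very different lengths either skips gestural content or duplicates frames.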