Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2025-04-01 |
| Series: | Engineering Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1002/eng2.70139 |
| Summary: | ABSTRACT Sign languages are conveyed through hand gestures and body language, and different cultures and languages have unique sign language representations, which makes it difficult for the general population to interpret all these variations. A Bangla sign language recognition system can help mitigate this problem. In Bangladesh, approximately 13.7 million people are affected by hearing impairments, underscoring the importance of improving Bangla sign language recognition. In this study, we assembled a merged dataset consisting of 200 word classes and 7,000 videos performed by 16 signers; a unique aspect of this dataset is that it includes two modalities. Initially, we extracted frames and audio from the videos, then employed two frameworks to capture body keypoints from the frames. We also expanded the dataset by creating a custom dataset that adds more samples per class. During preprocessing, we augmented the training data, improving the model's learning capacity and reducing overfitting. We developed a system leveraging several deep learning techniques that works with the two modalities simultaneously: we proposed a custom CNN-LSTM model for pose estimation, and integrated VGGish and 2D CNN techniques to process audio in our multimodal model. The CNN-LSTM model attained 87.08% accuracy for pose estimation using the OpenPose framework, while the applied ViT model achieved the best performance, with 88.52% accuracy and an 88.39% F1 score, predicting from the body keypoints directly. The VGGish technique delivered the best result on the audio data, with 81.91% accuracy and an 80.42% F1 score. Finally, the late fusion approach combining the ViT and VGGish networks over the multimodal data performed best overall, with 94.71% accuracy and a 94.52% F1 score. |
| ISSN: | 2577-8196 |
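The late-fusion design described in the summary above, combining the decision outputs of the ViT (video) and VGGish (audio) branches, can be sketched as follows. This is a minimal illustration only: the probability-averaging fusion rule, the placeholder feature dimensions (768 for a ViT-style branch, 128 for VGGish-style embeddings), and the stand-in linear heads are assumptions for the sketch, not the paper's published architecture.

```python
# Minimal late-fusion sketch in PyTorch. Layer sizes, the fusion rule, and the
# stand-in branches are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn

NUM_CLASSES = 200  # word classes in the merged dataset


class LateFusionClassifier(nn.Module):
    """Fuses two unimodal branches at the decision level by averaging
    their per-class probabilities (late fusion)."""

    def __init__(self, video_branch: nn.Module, audio_branch: nn.Module):
        super().__init__()
        self.video_branch = video_branch  # e.g., a ViT over frames/keypoints
        self.audio_branch = audio_branch  # e.g., VGGish embeddings + head
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, video_x: torch.Tensor, audio_x: torch.Tensor) -> torch.Tensor:
        # Each branch independently produces class logits; fusion happens
        # only after both branches have finished (decision-level fusion).
        p_video = self.softmax(self.video_branch(video_x))
        p_audio = self.softmax(self.audio_branch(audio_x))
        return (p_video + p_audio) / 2  # averaged class probabilities


# Stand-in branches so the sketch runs end to end; in practice these would be
# the trained ViT and VGGish-based networks.
video_branch = nn.Sequential(nn.Flatten(), nn.Linear(768, NUM_CLASSES))
audio_branch = nn.Sequential(nn.Flatten(), nn.Linear(128, NUM_CLASSES))
model = LateFusionClassifier(video_branch, audio_branch)

video_x = torch.randn(4, 768)  # placeholder ViT-style features, batch of 4
audio_x = torch.randn(4, 128)  # placeholder 128-d VGGish-style embeddings
probs = model(video_x, audio_x)
print(probs.shape)  # torch.Size([4, 200])
```

Averaging probabilities rather than concatenating intermediate features keeps the two branches fully decoupled, which is the defining property of late fusion; a weighted average or a small learned fusion head would be natural variants.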