Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
ABSTRACT Sign languages are conveyed through hand gestures and body language; different cultures and languages have unique sign language representations. It is difficult for the general population to interpret all these variations. A Bangla sign language recognition system can help mitigate this problem. In Bangladesh, approximately 13.7 million people are affected by hearing impairments, highlighting the importance of enhancing Bangla sign language recognition systems. In this study, a merged dataset consisting of 200 word classes and 7,000 videos performed by 16 signers was used. A unique aspect of this dataset is that it includes two modalities. Initially, we extracted frames and audio from the videos. We then employed two frameworks to capture body keypoints from the frames. Additionally, we expanded the dataset by creating a custom dataset that adds more samples per class. During preprocessing, we augmented the training data, improving our model's learning capacity and reducing overfitting. We developed a system leveraging various deep learning techniques, working simultaneously with two different modalities. We proposed a custom CNN‐LSTM model for pose estimation and integrated VGGish and 2D CNN techniques to process audio in our multimodal model. The CNN‐LSTM model attained an accuracy of 87.08% for pose estimation using the OpenPose framework. The applied ViT model achieved the best performance for predicting body keypoints directly, with 88.52% accuracy and an 88.39% F1 score. The VGGish technique delivered the best result for the audio dataset, achieving 81.91% accuracy and an 80.42% F1 score. Finally, the late fusion approach combining the ViT and VGGish networks on the multimodal data achieved the best overall performance, with 94.71% accuracy and a 94.52% F1 score.
| Main Authors: | Adib Hasan, Mahmodul Hasan Jobayer, Md Abdullah Al Mahmud Pias, Tahmidul Alam, Riasat Khan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2025-04-01 |
| Series: | Engineering Reports |
| Subjects: | Bangla sign language recognition; CNN‐LSTM; Mediapipe; multimodal learning; OpenPose; VGGish |
| Online Access: | https://doi.org/10.1002/eng2.70139 |
| _version_ | 1849720033068974080 |
|---|---|
| author | Adib Hasan; Mahmodul Hasan Jobayer; Md Abdullah Al Mahmud Pias; Tahmidul Alam; Riasat Khan |
| author_sort | Adib Hasan |
| collection | DOAJ |
| description | ABSTRACT Sign languages are conveyed through hand gestures and body language; different cultures and languages have unique sign language representations. It is difficult for the general population to interpret all these variations. A Bangla sign language recognition system can help mitigate this problem. In Bangladesh, approximately 13.7 million people are affected by hearing impairments, highlighting the importance of enhancing Bangla sign language recognition systems. In this study, a merged dataset consisting of 200 word classes and 7,000 videos performed by 16 signers was used. A unique aspect of this dataset is that it includes two modalities. Initially, we extracted frames and audio from the videos. We then employed two frameworks to capture body keypoints from the frames. Additionally, we expanded the dataset by creating a custom dataset that adds more samples per class. During preprocessing, we augmented the training data, improving our model's learning capacity and reducing overfitting. We developed a system leveraging various deep learning techniques, working simultaneously with two different modalities. We proposed a custom CNN‐LSTM model for pose estimation and integrated VGGish and 2D CNN techniques to process audio in our multimodal model. The CNN‐LSTM model attained an accuracy of 87.08% for pose estimation using the OpenPose framework. The applied ViT model achieved the best performance for predicting body keypoints directly, with 88.52% accuracy and an 88.39% F1 score. The VGGish technique delivered the best result for the audio dataset, achieving 81.91% accuracy and an 80.42% F1 score. Finally, the late fusion approach combining the ViT and VGGish networks on the multimodal data achieved the best overall performance, with 94.71% accuracy and a 94.52% F1 score. |
| format | Article |
| id | doaj-art-abc2aea4569f4e7aaa5c6b2273300189 |
| institution | DOAJ |
| issn | 2577-8196 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Wiley |
| record_format | Article |
| series | Engineering Reports |
| spelling | Adib Hasan; Mahmodul Hasan Jobayer; Md Abdullah Al Mahmud Pias; Tahmidul Alam; Riasat Khan (Electrical and Computer Engineering, North South University, Dhaka, Bangladesh). "Bangla Sign Language Recognition With Multimodal Deep Learning Fusion." Engineering Reports (Wiley), 2025-04-01, ISSN 2577-8196, doi:10.1002/eng2.70139 |
| title | Bangla Sign Language Recognition With Multimodal Deep Learning Fusion |
| topic | Bangla sign language recognition CNN‐LSTM Mediapipe multimodal learning OpenPose VGGish |
| url | https://doi.org/10.1002/eng2.70139 |
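The abstract's best-performing configuration is a late fusion of a ViT pose branch and a VGGish audio branch. The following is a minimal illustrative sketch of late fusion in general — averaging per-class probabilities from two modality branches — not the authors' code; the weights, class counts, and branch outputs shown are hypothetical.

```python
import numpy as np

def late_fusion(pose_probs, audio_probs, w_pose=0.5, w_audio=0.5):
    """Weighted average of per-class probabilities from two modality branches.

    pose_probs, audio_probs: arrays of shape (n_classes,), each summing to 1
    (e.g., softmax outputs of a pose branch and an audio branch).
    Returns the fused predicted class index and the fused probability vector.
    """
    fused = w_pose * np.asarray(pose_probs) + w_audio * np.asarray(audio_probs)
    fused = fused / fused.sum()  # renormalize in case the weights do not sum to 1
    return int(np.argmax(fused)), fused

# Toy example with 3 word classes: the visual branch is uncertain,
# the audio branch is confident, so fusion resolves the prediction.
pose = [0.40, 0.35, 0.25]   # hypothetical ViT pose-branch softmax output
audio = [0.10, 0.80, 0.10]  # hypothetical VGGish audio-branch softmax output
pred, probs = late_fusion(pose, audio)
# fused probabilities: [0.25, 0.575, 0.175] → predicted class 1
```

Because fusion happens on branch outputs rather than intermediate features, each branch can be trained and evaluated independently before being combined, which matches the per-modality results reported in the abstract.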