Bangla Sign Language Recognition With Multimodal Deep Learning Fusion

Bibliographic Details
Main Authors: Adib Hasan, Mahmodul Hasan Jobayer, Md Abdullah Al Mahmud Pias, Tahmidul Alam, Riasat Khan
Format: Article
Language:English
Published: Wiley 2025-04-01
Series:Engineering Reports
Subjects:
Online Access:https://doi.org/10.1002/eng2.70139
_version_ 1849720033068974080
author Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
author_facet Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
author_sort Adib Hasan
collection DOAJ
description ABSTRACT Sign languages are conveyed through hand gestures and body language; different cultures and languages have unique sign language representations. It is difficult for the general population to interpret all these variations. A Bangla sign language recognition system can help mitigate this problem. In Bangladesh, approximately 13.7 million people are affected by hearing impairments, highlighting the importance of enhancing Bangla sign language recognition systems. In this study, a merged dataset consisting of 200 word classes and 7,000 videos performed by 16 signers was used. A unique aspect of this dataset is that it includes two modalities. Initially, we extracted frames and audio from the videos. We then employed two frameworks to capture body keypoints from the frames. Additionally, we expanded the dataset by creating a custom dataset that adds more samples per class. During preprocessing, we augmented the training data, improving our model's learning capacity and reducing overfitting. We developed a system leveraging various deep learning techniques, working simultaneously with two different modalities. We proposed a custom CNN‐LSTM model for pose estimation and integrated VGGish and 2D CNN techniques to process audio in our multimodal model. The CNN‐LSTM model attained an accuracy of 87.08% for pose estimation using the OpenPose framework. The applied ViT model achieved 88.52% accuracy and an 88.39% F1 score when predicting body keypoints directly. The VGGish technique delivered the best result on the audio dataset, with 81.91% accuracy and an 80.42% F1 score. Finally, the late fusion approach combining the ViT and VGGish networks on the multimodal data achieved the best overall performance: 94.71% accuracy and a 94.52% F1 score.
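The abstract describes late fusion of a ViT video branch and a VGGish audio branch, but this record does not state the exact fusion rule. A minimal sketch, assuming each branch outputs per-class logits and fusion takes a weighted average of the softmaxed probabilities (the weighting and toy class counts below are hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one sample's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(video_logits, audio_logits, w_video=0.5):
    # Weighted average of per-branch class probabilities;
    # w_video is an assumed hyperparameter, not from the paper.
    p_video = softmax(video_logits)
    p_audio = softmax(audio_logits)
    return [w_video * v + (1 - w_video) * a
            for v, a in zip(p_video, p_audio)]

# Toy example with 3 sign classes: both branches favour class 0.
video = [2.0, 0.5, 0.1]   # video (ViT) branch logits
audio = [1.8, 0.2, 0.0]   # audio (VGGish) branch logits
fused = late_fusion(video, audio)
pred = max(range(len(fused)), key=fused.__getitem__)  # argmax class
```

Because each branch's probabilities sum to 1 and the weights sum to 1, the fused vector is itself a valid probability distribution, which is why probability-level (rather than logit-level) averaging is a common late-fusion baseline.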
format Article
id doaj-art-abc2aea4569f4e7aaa5c6b2273300189
institution DOAJ
issn 2577-8196
language English
publishDate 2025-04-01
publisher Wiley
record_format Article
series Engineering Reports
spelling doaj-art-abc2aea4569f4e7aaa5c6b2273300189
2025-08-20T03:12:02Z eng Wiley Engineering Reports 2577-8196 2025-04-01 7 4 n/a n/a 10.1002/eng2.70139
Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
Adib Hasan; Mahmodul Hasan Jobayer; Md Abdullah Al Mahmud Pias; Tahmidul Alam; Riasat Khan (all: Electrical and Computer Engineering, North South University, Dhaka, Bangladesh)
https://doi.org/10.1002/eng2.70139
Bangla sign language recognition; CNN‐LSTM; Mediapipe; multimodal learning; OpenPose; VGGish
spellingShingle Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
Engineering Reports
Bangla sign language recognition
CNN‐LSTM
Mediapipe
multimodal learning
OpenPose
VGGish
title Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_full Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_fullStr Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_full_unstemmed Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_short Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_sort bangla sign language recognition with multimodal deep learning fusion
topic Bangla sign language recognition
CNN‐LSTM
Mediapipe
multimodal learning
OpenPose
VGGish
url https://doi.org/10.1002/eng2.70139
work_keys_str_mv AT adibhasan banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT mahmodulhasanjobayer banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT mdabdullahalmahmudpias banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT tahmidulalam banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT riasatkhan banglasignlanguagerecognitionwithmultimodaldeeplearningfusion