Bangla Sign Language Recognition With Multimodal Deep Learning Fusion

Bibliographic Details
Main Authors: Adib Hasan, Mahmodul Hasan Jobayer, Md Abdullah Al Mahmud Pias, Tahmidul Alam, Riasat Khan
Format: Article
Language:English
Published: Wiley 2025-04-01
Series:Engineering Reports
Subjects:
Online Access:https://doi.org/10.1002/eng2.70139
_version_ 1849720033068974080
author Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
author_facet Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
author_sort Adib Hasan
collection DOAJ
description ABSTRACT Sign languages are conveyed through hand gestures and body language; different cultures and languages have unique sign language representations. It is difficult for the general population to interpret all these variations. A Bangla sign language recognition system can help mitigate this problem. In Bangladesh, approximately 13.7 million people are affected by hearing impairments, highlighting the importance of enhancing Bangla sign language recognition systems. In this study, a merged dataset consisting of 200 word classes and 7,000 videos performed by 16 signers was used. A unique aspect of this dataset is that it includes two modalities. Initially, we extracted frames and audio from the videos. We then employed two frameworks to capture body keypoints from the frames. Additionally, we expanded the dataset by creating a custom dataset that adds more samples per class. During preprocessing, we augmented the training data, improving our model's learning capacity and reducing overfitting. We developed a system leveraging various deep learning techniques, working simultaneously with two different modalities. We proposed a custom CNN‐LSTM model for pose estimation and integrated VGGish and 2D CNN techniques to process audio in our multimodal model. The CNN‐LSTM model attained an accuracy of 87.08% for pose estimation using the OpenPose framework. The applied ViT model achieved 88.52% accuracy and an 88.39% F1 score when predicting body keypoints directly. The VGGish technique delivered the best result on the audio dataset, with 81.91% accuracy and an 80.42% F1 score. Finally, the late fusion approach combining the ViT and VGGish networks on the multimodal data achieved the best overall performance: 94.71% accuracy and a 94.52% F1 score.
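The abstract describes late fusion of a ViT video branch and a VGGish audio branch, but this record does not state the exact fusion rule. A minimal sketch, assuming each branch outputs per-class logits and fusion takes a weighted average of the softmaxed probabilities (the weighting and toy class counts below are hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one sample's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(video_logits, audio_logits, w_video=0.5):
    # Weighted average of per-branch class probabilities;
    # w_video is an assumed hyperparameter, not from the paper.
    p_video = softmax(video_logits)
    p_audio = softmax(audio_logits)
    return [w_video * v + (1 - w_video) * a
            for v, a in zip(p_video, p_audio)]

# Toy example with 3 sign classes: both branches favour class 0.
video = [2.0, 0.5, 0.1]   # video (ViT) branch logits
audio = [1.8, 0.2, 0.0]   # audio (VGGish) branch logits
fused = late_fusion(video, audio)
pred = max(range(len(fused)), key=fused.__getitem__)  # argmax class
```

Because each branch's probabilities sum to 1 and the weights sum to 1, the fused vector is itself a valid probability distribution, which is why probability-level (rather than logit-level) averaging is a common late-fusion baseline.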
format Article
id doaj-art-abc2aea4569f4e7aaa5c6b2273300189
institution DOAJ
issn 2577-8196
language English
publishDate 2025-04-01
publisher Wiley
record_format Article
series Engineering Reports
spelling doaj-art-abc2aea4569f4e7aaa5c6b2273300189
2025-08-20T03:12:02Z eng Wiley Engineering Reports 2577-8196 2025-04-01 7 4 n/a n/a 10.1002/eng2.70139
Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
Adib Hasan; Mahmodul Hasan Jobayer; Md Abdullah Al Mahmud Pias; Tahmidul Alam; Riasat Khan (all: Electrical and Computer Engineering, North South University, Dhaka, Bangladesh)
https://doi.org/10.1002/eng2.70139
Bangla sign language recognition; CNN‐LSTM; Mediapipe; multimodal learning; OpenPose; VGGish
spellingShingle Adib Hasan
Mahmodul Hasan Jobayer
Md Abdullah Al Mahmud Pias
Tahmidul Alam
Riasat Khan
Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
Engineering Reports
Bangla sign language recognition
CNN‐LSTM
Mediapipe
multimodal learning
OpenPose
VGGish
title Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_full Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_fullStr Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_full_unstemmed Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_short Bangla Sign Language Recognition With Multimodal Deep Learning Fusion
title_sort bangla sign language recognition with multimodal deep learning fusion
topic Bangla sign language recognition
CNN‐LSTM
Mediapipe
multimodal learning
OpenPose
VGGish
url https://doi.org/10.1002/eng2.70139
work_keys_str_mv AT adibhasan banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT mahmodulhasanjobayer banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT mdabdullahalmahmudpias banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT tahmidulalam banglasignlanguagerecognitionwithmultimodaldeeplearningfusion
AT riasatkhan banglasignlanguagerecognitionwithmultimodaldeeplearningfusion