Hybrid LSTM–Attention and CNN Model for Enhanced Speech Emotion Recognition

Emotion recognition is crucial for enhancing human–machine interactions by establishing a foundation for AI systems that integrate cognitive and emotional understanding, bridging the gap between machine functions and human emotions. Even though deep learning algorithms are actively used in this field, the study of sequence modeling that accounts for the shifts in emotions over time has not been thoroughly explored. In this research, we present a comprehensive speech emotion-recognition framework that amalgamates the ZCR, RMS, and MFCC feature sets. Our approach employs both CNN and LSTM networks, complemented by an attention model, for enhanced emotion prediction. Specifically, the LSTM model addresses the challenges of long-term dependencies, enabling the system to factor in historical emotional experiences alongside current ones. We also incorporate the psychological “peak–end rule”, suggesting that preceding emotional states significantly influence the present emotion. The CNN plays a pivotal role in restructuring input dimensions, facilitating nuanced feature processing. We rigorously evaluated the proposed model utilizing two distinct datasets, namely TESS and RAVDESS. The empirical outcomes highlighted the model’s superior performance, with accuracy rates reaching 99.8% for TESS and 95.7% for RAVDESS. These results are a notable advancement, showcasing our system’s precision and innovative contributions to emotion recognition.
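
The feature pipeline named in the abstract (ZCR, RMS, and MFCCs stacked per frame) can be illustrated with a short sketch. This is a minimal illustration using the librosa library, which the record itself does not mention; the sample rate, frame length, hop length, and MFCC count are assumed values, not parameters reported by the authors.

```python
# Hypothetical sketch: per-frame ZCR + RMS + MFCC extraction with librosa.
# All parameter values below are illustrative assumptions.
import numpy as np
import librosa

def extract_features(path, sr=22050, frame_length=2048, hop_length=512, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    # Zero-crossing rate per frame, shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)
    # Root-mean-square energy per frame, shape (1, T)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)
    # Mel-frequency cepstral coefficients, shape (n_mfcc, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    # Stack and transpose so each row is one analysis frame: (T, 2 + n_mfcc)
    return np.concatenate([zcr, rms, mfcc], axis=0).T
```

Each row of the returned matrix is one time step, so the output can feed a sequence model directly.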
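
A companion sketch shows the hybrid architecture the abstract outlines: a 1-D CNN stage that restructures the framed input, an LSTM that carries long-term dependencies, and an attention layer that weights salient frames before classification. The layer sizes, the use of Keras's built-in Attention layer as self-attention, and the eight-class output (RAVDESS distinguishes eight emotions) are assumptions for illustration, not the authors' published configuration.

```python
# Hypothetical Keras sketch of a CNN -> LSTM -> attention emotion classifier.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_frames, n_features, n_classes=8):
    inputs = layers.Input(shape=(n_frames, n_features))
    # CNN stage: local feature processing and dimension restructuring
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    # LSTM stage: model how emotional cues evolve across the utterance
    h = layers.LSTM(128, return_sequences=True)(x)
    # Attention stage: let salient frames dominate (cf. the peak-end rule)
    context = layers.Attention()([h, h])  # self-attention over the LSTM states
    pooled = layers.GlobalAveragePooling1D()(context)
    outputs = layers.Dense(n_classes, activation="softmax")(pooled)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```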

Bibliographic Details
Main Authors: Fazliddin Makhmudov, Alpamis Kutlimuratov, Young-Im Cho
Author Affiliations: Fazliddin Makhmudov (Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea); Alpamis Kutlimuratov (Department of Econometrics, Tashkent State University of Economics, Tashkent 100066, Uzbekistan); Young-Im Cho (Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea)
Format: Article
Language: English
Published: MDPI AG, 2024-12-01
Series: Applied Sciences, Vol. 14, Iss. 23, Article 11342
ISSN: 2076-3417
DOI: 10.3390/app142311342
Subjects: CNN; speech emotion recognition; LSTM; attention mechanism
Online Access: https://www.mdpi.com/2076-3417/14/23/11342