Hybrid LSTM–Attention and CNN Model for Enhanced Speech Emotion Recognition
Emotion recognition is crucial for enhancing human–machine interactions by establishing a foundation for AI systems that integrate cognitive and emotional understanding, bridging the gap between machine functions and human emotions. Even though deep learning algorithms are actively used in this field, the study of sequence modeling that accounts for the shifts in emotions over time has not been thoroughly explored. In this research, we present a comprehensive speech emotion-recognition framework that amalgamates the ZCR, RMS, and MFCC feature sets. Our approach employs both CNN and LSTM networks, complemented by an attention model, for enhanced emotion prediction. Specifically, the LSTM model addresses the challenges of long-term dependencies, enabling the system to factor in historical emotional experiences alongside current ones. We also incorporate the psychological “peak–end rule”, suggesting that preceding emotional states significantly influence the present emotion. The CNN plays a pivotal role in restructuring input dimensions, facilitating nuanced feature processing. We rigorously evaluated the proposed model utilizing two distinct datasets, namely TESS and RAVDESS. The empirical outcomes highlighted the model’s superior performance, with accuracy rates reaching 99.8% for TESS and 95.7% for RAVDESS. These results are a notable advancement, showcasing our system’s precision and innovative contributions to emotion recognition.
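The abstract names three frame-level acoustic feature sets: zero-crossing rate (ZCR), root-mean-square energy (RMS), and Mel-frequency cepstral coefficients (MFCCs). As a minimal sketch of how such features can be extracted, assuming the widely used librosa library, the snippet below stacks all three into one per-frame sequence. The sampling rate, default frame parameters, and `n_mfcc=13` are illustrative assumptions, not the configuration reported in the paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    """Return a (frames, n_mfcc + 2) sequence of ZCR, RMS, and MFCC features."""
    y, sr = librosa.load(path, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)             # shape (1, frames)
    rms = librosa.feature.rms(y=y)                          # shape (1, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    # All three use the same default frame/hop lengths, so the frame counts
    # match; stack along the feature axis and transpose to time-major order.
    return np.concatenate([zcr, rms, mfcc], axis=0).T
```

Time-major output (frames first) is the natural input layout for the sequence model that follows.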
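The CNN, LSTM, and attention stages described in the abstract can be read as one sequential model: convolution restructures the input and extracts local patterns, the LSTM captures long-term dependencies, and attention pools over time. The sketch below is one plausible Keras realization; the layer widths, kernel size, and the softmax-pooling attention variant are assumptions, since the record does not specify the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(timesteps, n_features, n_classes):
    inp = layers.Input(shape=(timesteps, n_features))
    # CNN stage: local feature extraction and dimensional restructuring.
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2)(x)
    # LSTM stage: long-term temporal dependencies across the utterance.
    x = layers.LSTM(128, return_sequences=True)(x)
    # Attention stage: score each time step, normalize with softmax, and
    # take the weighted sum of LSTM states as the utterance representation.
    scores = layers.Dense(1, activation="tanh")(x)   # (batch, T, 1)
    weights = layers.Softmax(axis=1)(scores)         # attention over time
    context = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    out = layers.Dense(n_classes, activation="softmax")(context)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Softmax pooling lets a few salient frames dominate the utterance-level decision, which loosely mirrors the "peak–end" intuition cited in the abstract: some emotional moments weigh more than others in the final judgment.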
| Main Authors: | Fazliddin Makhmudov, Alpamis Kutlimuratov, Young-Im Cho |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-12-01 |
| Series: | Applied Sciences |
| Subjects: | CNN; speech emotion recognition; LSTM; attention mechanism |
| Online Access: | https://www.mdpi.com/2076-3417/14/23/11342 |
| collection | DOAJ |
|---|---|
| id | doaj-art-df0fc99f3799492abd68f6c676e7cffc |
| issn | 2076-3417 |
| doi | 10.3390/app142311342 |
| affiliation (Fazliddin Makhmudov) | Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea |
| affiliation (Alpamis Kutlimuratov) | Department of Econometrics, Tashkent State University of Economics, Tashkent 100066, Uzbekistan |
| affiliation (Young-Im Cho) | Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea |