Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Bibliographic Details
Main Authors: Dong-Jin Min (ORCID: 0009-0007-4092-3132), Deok-Hwan Kim (ORCID: 0000-0002-6048-9392)
Affiliation: Department of Electrical and Computer Engineering, Inha University, Incheon, South Korea
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access, Vol. 12, pp. 177219-177235
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3506565
Collection: DOAJ
Subjects: Emotion recognition; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning
Online Access: https://ieeexplore.ieee.org/document/10767710/
Description: Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. In the realm of natural human-computer interaction, speech emotion recognition plays a crucial role. This process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech emotion recognition datasets is a complex and costly process, leading to limited data volumes and uneven emotional distributions. This scarcity and imbalance pose significant challenges, affecting the accuracy and reliability of emotion recognition. To address these issues, this study introduces a novel model that is more robust and adaptive. We employ the Ranking Magnitude Method (RMM) based on sparse learning. We use Root Mean Square (RMS) energy and the Zero Crossing Rate (ZCR) as temporal features to measure the speech's overall volume and noise intensity. Mel Frequency Cepstral Coefficients (MFCCs) are used to extract critical speech features, which are then fed into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN) model. We analyze utterance-level spatial features using the log-Mel spectrogram, processing these patterns through a 2D Convolutional Neural Network Squeeze-and-Excitation Network (CNN-SEN) model. The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining underperforming nodes. This dynamic adjustment of learning priorities significantly enhances the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods across various datasets, achieving accuracy rates of 97.18%, 97.92%, 99.31%, and 96.89% on the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
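The abstract names four acoustic inputs: RMS energy and the zero crossing rate as frame-level temporal features, MFCCs for the multivariate LSTM-FCN branch, and a log-Mel spectrogram for the 2D CNN-SEN branch. The sketch below is a minimal illustration of that feature-extraction stage, not the authors' code: librosa, the 16 kHz sample rate, and the 40-MFCC/128-Mel sizes are all assumptions.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40, n_mels=128):
    """Compute the four feature types named in the abstract for one clip."""
    y, _ = librosa.load(path, sr=sr)

    # Temporal features: frame-wise overall volume and noise intensity.
    rms = librosa.feature.rms(y=y)                     # shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y=y)      # shape (1, T)

    # Cepstral features for the LSTM-FCN branch.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)

    # Spatial features for the 2D CNN-SEN branch: log-Mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)           # (n_mels, T)

    # Stack the frame-wise features into one multivariate time series.
    temporal = np.concatenate([rms, zcr, mfcc], axis=0).T    # (T, n_mfcc + 2)
    return temporal, log_mel
```

With librosa's default 2048-sample frames and 512-sample hop, all four features share the same frame count T, so they can be stacked directly into the multivariate sequence the LSTM-FCN branch consumes.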
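The record gives only the names of the two classifier branches (a multivariate LSTM-FCN over the temporal features and a 2D CNN with squeeze-and-excitation over the log-Mel spectrogram) and says they are combined by late fusion. The PyTorch sketch below shows one conventional reading of those names; every layer size, the SE reduction ratio, and the fusion weight `alpha` are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
    def forward(self, x):                        # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                             # excite: channel-wise rescale

class CNNSENBranch(nn.Module):
    """2D CNN over the log-Mel spectrogram, with an SE block per stage."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            SEBlock(32), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            SEBlock(64), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_classes)
    def forward(self, spec):                     # spec: (B, 1, n_mels, T)
        return self.head(self.features(spec).flatten(1))

class LSTMFCNBranch(nn.Module):
    """Multivariate LSTM-FCN: an LSTM and a 1D-conv stack read the same sequence."""
    def __init__(self, n_features=42, n_classes=7):   # 40 MFCCs + RMS + ZCR
        super().__init__()
        self.lstm = nn.LSTM(n_features, 64, batch_first=True)
        self.fcn = nn.Sequential(
            nn.Conv1d(n_features, 128, 8, padding='same'), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, 5, padding='same'), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64 + 128, n_classes)
    def forward(self, seq):                      # seq: (B, T, n_features)
        h, _ = self.lstm(seq)
        conv = self.fcn(seq.transpose(1, 2)).squeeze(-1)
        return self.head(torch.cat([h[:, -1], conv], dim=1))

def late_fusion(logits_a, logits_b, alpha=0.5):
    """Late fusion: blend per-branch class posteriors; alpha is an assumed weight."""
    return alpha * logits_a.softmax(dim=1) + (1 - alpha) * logits_b.softmax(dim=1)
```

At inference each branch scores the utterance from its own view, and `late_fusion` blends the posteriors; per the abstract, the paper's SLBF goes beyond such a fixed blend by retraining parts of the fused model, as sketched next.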
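The Sparse Learning-Based Fusion Model itself retrains underperforming nodes selected by the Ranking Magnitude Method, but the record does not define either procedure. The fragment below is only a guess at the general idea, ranking a layer's output nodes by weight magnitude and letting gradients flow only to the weakest ones; the L1 scoring rule, the keep ratio, and the gradient-mask mechanism are all assumptions.

```python
import torch

def split_nodes_by_magnitude(linear: torch.nn.Linear, keep_ratio: float = 0.7):
    """Score each output node by the L1 norm of its incoming weights and split
    nodes into a well-trained ('strong') set and an underperforming ('weak')
    set. The scoring rule and the 0.7 ratio are illustrative assumptions."""
    scores = linear.weight.detach().abs().sum(dim=1)   # one score per output node
    order = torch.argsort(scores, descending=True)
    n_strong = int(keep_ratio * scores.numel())
    return order[:n_strong], order[n_strong:]          # (strong_idx, weak_idx)

def retrain_only_weak(linear: torch.nn.Linear, weak_idx: torch.Tensor):
    """Freeze strong nodes by zeroing their weight gradients during backward,
    so further training updates only the weak rows (bias left unmasked)."""
    mask = torch.zeros_like(linear.weight)
    mask[weak_idx] = 1.0
    linear.weight.register_hook(lambda grad: grad * mask)
```

Applied to a fusion layer between training rounds, this would correspond, at the level of a single linear layer, to the abstract's "selectively retraining the underperforming nodes"; the paper's actual RMM criterion may differ.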