Speech emotion recognition with light weight deep neural ensemble model using hand crafted features


Bibliographic Details
Main Authors: Jaher Hassan Chowdhury, Sheela Ramanna, Ketan Kotecha
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects: Speech emotion recognition; Averaging ensemble; Convolutional neural network; Bi-directional LSTM; Audio signal processing
Online Access: https://doi.org/10.1038/s41598-025-95734-z
author Jaher Hassan Chowdhury
Sheela Ramanna
Ketan Kotecha
author_sort Jaher Hassan Chowdhury
collection DOAJ
description Abstract Automatic emotion detection has become crucial in various domains, such as healthcare, neuroscience, smart home technologies, and human-computer interaction (HCI). Speech Emotion Recognition (SER) has attracted considerable attention because of its potential to improve conversational robotics and HCI systems. Despite its promise, SER research faces challenges such as data scarcity, the subjective nature of emotions, and complex feature extraction methods. In this paper, we investigate whether a lightweight deep neural ensemble model (CNN and CNN_Bi-LSTM) using well-known hand-crafted features such as ZCR, RMSE, Chroma STFT, and MFCC can outperform models that use automatic feature extraction techniques (e.g., spectrogram-based methods) on benchmark datasets. The focus of this paper is the effectiveness of carefully fine-tuning the neural models with learning rate (LR) schedulers and applying regularization techniques. Our proposed ensemble model is validated on five publicly available datasets: RAVDESS, TESS, SAVEE, CREMA-D, and EmoDB. Accuracy, AUC-ROC, AUC-PRC, and F1-score were used as performance metrics, and LIME (Local Interpretable Model-agnostic Explanations) was used to interpret the results of the proposed ensemble model. Results indicate that the ensemble model consistently outperforms the individual models, as well as several compared models, including spectrogram-based ones, on the above datasets across the evaluation metrics.
format Article
id doaj-art-efecc0b38d654728b60a1619d48a7397
institution DOAJ
issn 2045-2322
language English
publishDate 2025-04-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-efecc0b38d654728b60a1619d48a7397 2025-08-20T03:06:49Z eng
Nature Portfolio, Scientific Reports, ISSN 2045-2322, 2025-04-01, vol. 15, iss. 1, pp. 1-14, doi:10.1038/s41598-025-95734-z
Speech emotion recognition with light weight deep neural ensemble model using hand crafted features
Jaher Hassan Chowdhury (The University of Winnipeg), Sheela Ramanna (The University of Winnipeg), Ketan Kotecha (Symbiosis International (Deemed University))
https://doi.org/10.1038/s41598-025-95734-z
title Speech emotion recognition with light weight deep neural ensemble model using hand crafted features
topic Speech emotion recognition
Averaging ensemble
Convolutional neural network
Bi-directional LSTM
Audio signal processing
url https://doi.org/10.1038/s41598-025-95734-z