Enhancing Embedded Space with Low–Level Features for Speech Emotion Recognition

Bibliographic Details
Main Authors: Lukasz Smietanka, Tomasz Maka
Format: Article
Language: English
Published: MDPI AG 2025-02-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/5/2598
Description
Summary: This work proposes an approach that builds a feature space by combining the representation obtained in an unsupervised learning process with manually selected features describing the prosody of the utterances. In the experiments, we used two time-frequency representations (Mel and CQT spectrograms) and the EmoDB and RAVDESS databases. As the results show, the proposed system improved the classification accuracy of both representations compared to a typical CNN architecture: by 1.29% for CQT and 3.75% for the Mel spectrogram on the EmoDB dataset, and by 3.02% for CQT and 0.63% for the Mel spectrogram in the case of RAVDESS. Additionally, the results show a significant increase in classification performance for the best models trained on EmoDB: around 14% for the happiness and disgust emotions using Mel spectrograms, and around 20% for the same emotions using CQT. On the other hand, for the models that achieved the highest results on the RAVDESS database, the most significant improvement was observed in the classification of the neutral state, around 16%, using the Mel spectrogram; for the CQT representation, the most significant improvement occurred for fear and surprise, around 9%. Additionally, the average results for all prepared models showed a positive impact of the method on the classification quality of most emotional states. For the EmoDB database, the highest average improvement was observed for happiness (14.6%); for the other emotions it ranged from 1.2% to 8.7%. The only exception was sadness, for which the classification quality decreased by 1% on average when using the Mel spectrogram. In turn, for the RAVDESS database, the most significant improvement also occurred for happiness (7.5%), while for the other emotions it ranged from 0.2% to 7.1%, except for disgust and calm, whose classification deteriorated for the Mel spectrogram and the CQT representation, respectively.
ISSN: 2076-3417
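The abstract describes fusing a learned (unsupervised) embedding with hand-crafted prosodic features into a single feature space. A minimal sketch of such feature-level fusion is shown below; the dimensions, feature names, and normalization are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def fuse_features(embedding, prosodic):
    """Concatenate a learned embedding with hand-crafted prosodic features,
    z-scoring each block so neither part dominates the fused space."""
    def zscore(x):
        return (x - x.mean()) / (x.std() + 1e-8)
    return np.concatenate([zscore(embedding), zscore(prosodic)])

# Hypothetical dimensions: a 128-d embedding from an unsupervised model
# plus 6 prosodic statistics (e.g. F0 mean/std, energy mean/std, speaking
# rate, pitch range) computed per utterance.
rng = np.random.default_rng(0)
embedding = rng.normal(size=128)
prosodic = np.array([182.4, 24.1, 0.071, 0.012, 4.3, 96.0])
fused = fuse_features(embedding, prosodic)
print(fused.shape)  # (134,)
```

The fused vector would then feed a downstream classifier; per-block normalization is one common way to keep the low-dimensional prosodic part from being swamped by the embedding.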