Small Language Models for Speech Emotion Recognition in Text and Audio Modalities

Bibliographic Details
Main Authors: José L. Gómez-Sirvent, Francisco López de la Rosa, Daniel Sánchez-Reolid, Roberto Sánchez-Reolid, Antonio Fernández-Caballero
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/14/7730
Description
Summary: Speech emotion recognition has become increasingly important in a wide range of applications, driven by the development of large transformer-based natural language processing models. However, the large size of these architectures limits their usability, which has led to a growing interest in smaller models. In this paper, we evaluate nineteen of the most popular small language models for speech emotion recognition in the text and audio modalities on the IEMOCAP dataset. Based on their cross-validation accuracy, the best architectures were selected to build ensemble models, which were used to evaluate the effect of combining audio and text and of incorporating contextual information on model performance. The experiments showed a significant increase in accuracy with the inclusion of contextual information and the combination of modalities. The proposed ensemble model achieved a highly competitive accuracy of 82.12% on the IEMOCAP dataset, outperforming several recent approaches. These results demonstrate the effectiveness of ensemble methods for improving speech emotion recognition performance and highlight the feasibility of training multiple small language models on consumer-grade computers.
ISSN: 2076-3417