Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
| Main Authors: | , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/8/4340 |
| Summary: | Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving F1 scores of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal that the best emotional representations are extracted at early layers in 24-layer models and at middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish. |
| ISSN: | 2076-3417 |
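The Leave-One-Speaker-Out (LOSO) validation mentioned in the summary can be sketched with scikit-learn's `LeaveOneGroupOut`, treating each speaker ID as a group so every fold holds out exactly one speaker. This is a minimal illustration with synthetic data, not the paper's actual pipeline; the array shapes and speaker layout are assumptions.

```python
# Sketch of Leave-One-Speaker-Out validation (toy data, not the paper's setup).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))           # 12 utterances x 8 embedding dims (toy)
y = rng.integers(0, 3, size=12)        # 3 hypothetical emotion classes
speakers = np.repeat([0, 1, 2, 3], 3)  # 4 speakers, 3 utterances each

logo = LeaveOneGroupOut()
folds = list(logo.split(X, y, groups=speakers))

# Each fold's test set contains utterances from exactly one speaker,
# and that speaker never appears in the corresponding training set.
for train_idx, test_idx in folds:
    held_out = set(speakers[test_idx])
    assert len(held_out) == 1
    assert held_out.isdisjoint(speakers[train_idx])
```

With four speakers this yields four folds, so reported metrics average performance on speakers the model has never seen, which is what makes TRILLsson's generalization claim meaningful.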