Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper,...
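| Main Authors: | Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado, Alfonso Dominguez-Chavez |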
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
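| Subjects: | Spanish speech emotion recognition; pre-trained models; Wav2Vec 2.0; TRILLsson; Spanish emotional speech databases; leave one speaker out |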
| Online Access: | https://www.mdpi.com/2076-3417/15/8/4340 |
| _version_ | 1849712186806501376 |
|---|---|
| author | Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado, Alfonso Dominguez-Chavez |
| author_sort | Alex Mares |
| collection | DOAJ |
| description | Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models (Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP) across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving scores on metrics such as F1 of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal that emotional representations are best extracted at early layers in 24-layer models and at middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish. |
| format | Article |
| id | doaj-art-64bff34e92fe47ea8aa80778800a6bf9 |
| institution | DOAJ |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-64bff34e92fe47ea8aa80778800a6bf9; 2025-08-20T03:14:21Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2025-04-01; vol. 15, no. 8, art. 4340; DOI 10.3390/app15084340; Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models; Authors: Alex Mares (0), Gerardo Diaz-Arango (1), Jorge Perez-Jacome-Friscione (2), Hector Vazquez-Leal (3), Luis Hernandez-Martinez (4), Jesus Huerta-Chua (5), Andres Felipe Jaramillo-Alvarado (6), Alfonso Dominguez-Chavez (7); Affiliations: (0), (1), (2), (3), (7) Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico; (4), (6) Instituto Tecnologico Superior de Poza Rica, Tecnologico Nacional de Mexico, Luis Donaldo Colosio Murrieta S/N, Arroyo del Maiz, Poza Rica 93230, Mexico; (5) Electronics Department, National Institute for Astrophysics, Optics and Electronics, Sta. María Tonantzintla, Puebla 72840, Mexico; https://www.mdpi.com/2076-3417/15/8/4340; Keywords: Spanish speech emotion recognition, pre-trained models, Wav2Vec 2.0, TRILLsson, Spanish emotional speech databases, leave one speaker out |
| title | Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models |
| topic | Spanish speech emotion recognition; pre-trained models; Wav2Vec 2.0; TRILLsson; Spanish emotional speech databases; leave one speaker out |
| url | https://www.mdpi.com/2076-3417/15/8/4340 |
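
The abstract in this record mentions Leave-One-Speaker-Out validation. As a rough illustration of that general idea only (not the authors' implementation), the sketch below splits utterances so that each fold holds out every utterance from one speaker. The function name `leave_one_speaker_out` and the toy speaker labels are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_speaker_out(
    speaker_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    # Group utterance indices by the speaker who produced them.
    by_speaker: Dict[Hashable, List[int]] = defaultdict(list)
    for index, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(index)

    # Each fold tests on one speaker and trains on all remaining speakers.
    for speaker, test_indices in by_speaker.items():
        held_out = set(test_indices)
        train_indices = [i for i in range(len(speaker_ids)) if i not in held_out]
        yield speaker, train_indices, test_indices


# Hypothetical toy data: one speaker label per utterance.
speakers = ["spk1", "spk1", "spk2", "spk3", "spk3", "spk3"]
folds = list(leave_one_speaker_out(speakers))
for speaker, train_idx, test_idx in folds:
    print(speaker, train_idx, test_idx)
```

Because no speaker appears in both the training and test indices of a fold, a classifier evaluated this way is scored only on voices it has not seen, which is what makes the evaluation speaker-independent.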