Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models

Bibliographic Details
Main Authors: Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado, Alfonso Dominguez-Chavez
Format: Article
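Language: English
Published: MDPI AG 2025-04-01
Series: Applied Sciences
Subjects: Spanish speech emotion recognition; pre-trained models; Wav2Vec 2.0; TRILLsson; Spanish emotional speech databases; leave one speaker out
Online Access: https://www.mdpi.com/2076-3417/15/8/4340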
collection DOAJ
description Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving scores on metrics such as F1 on EmoMatchSpanishDB (88.32%), INTER1SP (99.83%), and MEACorpus (92.53%). Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.
format Article
id doaj-art-64bff34e92fe47ea8aa80778800a6bf9
institution DOAJ
issn 2076-3417
spelling doaj-art-64bff34e92fe47ea8aa80778800a6bf9
2025-08-20T03:14:21Z
eng
MDPI AG
Applied Sciences
ISSN 2076-3417
2025-04-01
Volume 15, Issue 8, Article 4340
DOI 10.3390/app15084340
Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
Authors and affiliations (numbered as in the record):
0 Alex Mares: Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
1 Gerardo Diaz-Arango: Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
2 Jorge Perez-Jacome-Friscione: Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
3 Hector Vazquez-Leal: Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
4 Luis Hernandez-Martinez: Instituto Tecnologico Superior de Poza Rica, Tecnologico Nacional de Mexico, Luis Donaldo Colosio Murrieta S/N, Arroyo del Maiz, Poza Rica 93230, Mexico
5 Jesus Huerta-Chua: Electronics Department, National Institute for Astrophysics, Optics and Electronics, Sta. María Tonantzintla, Puebla 72840, Mexico
6 Andres Felipe Jaramillo-Alvarado: Instituto Tecnologico Superior de Poza Rica, Tecnologico Nacional de Mexico, Luis Donaldo Colosio Murrieta S/N, Arroyo del Maiz, Poza Rica 93230, Mexico
7 Alfonso Dominguez-Chavez: Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
topic Spanish speech emotion recognition
pre-trained models
Wav2Vec 2.0
TRILLsson
Spanish emotional speech databases
leave one speaker out
url https://www.mdpi.com/2076-3417/15/8/4340
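
The Leave-One-Speaker-Out validation named in the description can be illustrated with a minimal sketch. It is not the authors' code: the nearest-centroid classifier, the toy 2-D "embeddings", the speaker IDs, the pooling of predictions across folds, and the macro-averaged F1 are all assumptions chosen only to keep the example self-contained. In the study, the features would be embeddings taken from a chosen layer of a pre-trained model.

```python
from collections import defaultdict


def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def fit_centroids(features, labels):
    """Stand-in classifier (assumption): one mean vector per emotion label."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in total] for y, total in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def loso_macro_f1(features, labels, speakers):
    """Pool predictions over all speaker folds, then score once (assumption)."""
    y_true, y_pred = [], []
    for _, train, test in leave_one_speaker_out(speakers):
        centroids = fit_centroids([features[i] for i in train],
                                  [labels[i] for i in train])
        for i in test:
            y_true.append(labels[i])
            y_pred.append(predict(centroids, features[i]))
    return macro_f1(y_true, y_pred)


# Hypothetical toy data: three speakers, two emotions, 2-D features.
features = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0], [0.0, 0.0], [1.0, 1.0]]
labels = ["sad", "happy", "sad", "happy", "sad", "happy"]
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]

print(loso_macro_f1(features, labels, speakers))
```

Because every utterance from the held-out speaker is excluded from training, the score reflects performance on unseen voices, which is the speaker-independent setting the description refers to.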