MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations

Abstract Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward utilizing self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, making them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features.
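The abstract reports results in weighted accuracy (WA) and unweighted accuracy (UA), the two standard SER metrics. A minimal sketch of how these are conventionally computed (standard definitions, not code from the paper): WA is overall utterance-level accuracy, while UA averages per-class recall so that rare emotion classes count equally.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WA: fraction of all utterances classified correctly (overall accuracy)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    # UA: recall computed per emotion class, then averaged,
    # so infrequent classes weigh as much as frequent ones
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# toy example with an imbalanced label set: three of class 0, one of class 1
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # (2/3 + 1/1) / 2 ≈ 0.8333
```

On balanced datasets the two metrics coincide; the gap between them on imbalanced corpora such as IEMOCAP is why SER papers customarily report both.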

Bibliographic Details
Main Authors: Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang, Thomas Hain, Linlin Shan
Format: Article
Language: English
Published: Nature Portfolio, 2025-07-01
Series: Scientific Reports
Online Access:https://doi.org/10.1038/s41598-025-94727-2
Collection: DOAJ
Institution: Kabale University
ISSN: 2045-2322
Affiliations: Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang (College of Computer and Information Engineering, Tianjin Normal University); Thomas Hain (School of Computer Science, The University of Sheffield); Linlin Shan (College of Fine Arts and Design, Tianjin Normal University)