MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations

Abstract Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward utilizing self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, making them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features.
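The abstract reports results in weighted accuracy (WA) and unweighted accuracy (UA), the two standard SER metrics. A minimal sketch of how these are conventionally computed (standard definitions, not code from the paper): WA is overall utterance-level accuracy, while UA averages per-class recall so that rare emotion classes count equally.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WA: fraction of all utterances classified correctly (overall accuracy)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    # UA: recall computed per emotion class, then averaged,
    # so infrequent classes weigh as much as frequent ones
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# toy example with an imbalanced label set: three of class 0, one of class 1
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # (2/3 + 1/1) / 2 ≈ 0.8333
```

On balanced datasets the two metrics coincide; the gap between them on imbalanced corpora such as IEMOCAP is why SER papers customarily report both.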

Bibliographic Details
Main Authors: Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang, Thomas Hain, Linlin Shan
Format: Article
Language: English
Published: Nature Portfolio, 2025-07-01
Series: Scientific Reports
Online Access:https://doi.org/10.1038/s41598-025-94727-2
Collection: DOAJ
Institution: Kabale University
ISSN: 2045-2322
Affiliations: Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang (College of Computer and Information Engineering, Tianjin Normal University); Thomas Hain (School of Computer Science, The University of Sheffield); Linlin Shan (College of Fine Arts and Design, Tianjin Normal University)