MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations
Abstract Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward utilizing self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, making them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features.
| Main Authors: | Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang, Thomas Hain, Linlin Shan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Online Access: | https://doi.org/10.1038/s41598-025-94727-2 |
| _version_ | 1849238612722319360 |
|---|---|
| author | Hongchen Song; Long Zhang; Meixian Gao; Hengyuan Zhang; Thomas Hain; Linlin Shan |
| author_facet | Hongchen Song; Long Zhang; Meixian Gao; Hengyuan Zhang; Thomas Hain; Linlin Shan |
| author_sort | Hongchen Song |
| collection | DOAJ |
| description | Abstract Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward utilizing self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, making them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features. |
| format | Article |
| id | doaj-art-50322282292a478c81347a9052ddeefb |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-50322282292a478c81347a9052ddeefb; 2025-08-20T04:01:34Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-07-01; 151113; 10.1038/s41598-025-94727-2; MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations; Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang (College of Computer and Information Engineering, Tianjin Normal University); Thomas Hain (School of Computer Science, The University of Sheffield); Linlin Shan (College of Fine Arts and Design, Tianjin Normal University); https://doi.org/10.1038/s41598-025-94727-2 |
| spellingShingle | Hongchen Song; Long Zhang; Meixian Gao; Hengyuan Zhang; Thomas Hain; Linlin Shan; MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations; Scientific Reports |
| title | MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations |
| title_full | MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations |
| title_fullStr | MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations |
| title_full_unstemmed | MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations |
| title_short | MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations |
| title_sort | ms emoboost a novel strategy for enhancing self supervised speech emotion representations |
| url | https://doi.org/10.1038/s41598-025-94727-2 |
| work_keys_str_mv | AT hongchensong msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations AT longzhang msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations AT meixiangao msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations AT hengyuanzhang msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations AT thomashain msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations AT linlinshan msemoboostanovelstrategyforenhancingselfsupervisedspeechemotionrepresentations |
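The abstract describes MS-EmoBoost as guiding self-supervised features with deep emotional information drawn from MFCCs and spectrograms. As a rough, self-contained illustration of what those two guidance inputs look like (this is not the paper's pipeline; the frame length, hop size, filter counts, and sample rate below are hypothetical choices), here is a NumPy-only sketch that computes a log-power spectrogram and MFCCs from a raw waveform:

```python
import numpy as np

def spectrogram(wave, n_fft=512, hop=160):
    """Log-power spectrogram: frame the signal, window, FFT, square, log."""
    frames = [wave[i:i + n_fft] for i in range(0, len(wave) - n_fft + 1, hop)]
    window = np.hanning(n_fft)
    power = np.abs(np.fft.rfft(np.array(frames) * window, axis=1)) ** 2
    return np.log(power + 1e-10)          # shape: (num_frames, n_fft // 2 + 1)

def mel_filterbank(n_mels=26, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(wave, n_mels=26, n_ceps=13, n_fft=512, sr=16000, hop=160):
    """MFCCs: mel-filtered log energies followed by a type-II DCT."""
    power = np.exp(spectrogram(wave, n_fft, hop))            # back to power
    mel_energy = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    n = mel_energy.shape[1]
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n) + 0.5) / n)
    return mel_energy @ dct.T             # shape: (num_frames, n_ceps)

# 1 second of noise at 16 kHz stands in for a real utterance.
wave = np.random.default_rng(0).standard_normal(16000)
print(spectrogram(wave).shape)  # (97, 257)
print(mfcc(wave).shape)         # (97, 13)
```

In an SSL-enhancement setting such as the one the abstract sketches, frame-level features like these would be produced alongside wav2vec 2.0 Base representations; how the paper actually fuses or distills them is described in the full text, not here.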