Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features
Abstract: The World Health Organization predicts that by 2030, depression will be the most common mental disorder, significantly affecting individuals, families, and society. Speech, as a sensitive indicator, reveals noticeable acoustic changes linked to physiological and cognitive variations, making...
Saved in:
Main Authors: | Weitong Guo, Qian He, Ziyu Lin, Xiaolong Bu, Ziyang Wang, Dong Li, Hongwu Yang |
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-02-01 |
Series: | Scientific Reports |
Online Access: | https://doi.org/10.1038/s41598-025-88313-9 |
_version_ | 1823862139626979328 |
author | Weitong Guo; Qian He; Ziyu Lin; Xiaolong Bu; Ziyang Wang; Dong Li; Hongwu Yang
author_facet | Weitong Guo; Qian He; Ziyu Lin; Xiaolong Bu; Ziyang Wang; Dong Li; Hongwu Yang
author_sort | Weitong Guo |
collection | DOAJ |
description | The World Health Organization predicts that by 2030, depression will be the most common mental disorder, significantly affecting individuals, families, and society. Speech, as a sensitive indicator, reveals noticeable acoustic changes linked to physiological and cognitive variations, making it a crucial behavioral marker for detecting depression. However, existing studies often overlook the separation of speaker-related and emotion-related features in speech when recognizing depression. To tackle this challenge, we propose a Mixture-of-Experts (MoE) method that integrates speaker-related and emotion-related features for depression recognition. Our approach begins with a Time Delay Neural Network to pre-train a speaker-related feature extractor on a large-scale speaker recognition dataset, while simultaneously pre-training an emotion-related feature extractor on a speech emotion dataset. We then apply transfer learning to extract both feature types from a depression dataset and fuse them, and a multi-domain adaptation algorithm trains the MoE model for depression recognition. Experimental results demonstrate that our method achieves 74.3% accuracy on a self-built localized Chinese depression dataset and an MAE of 6.32 on the AVEC2014 dataset, outperforming state-of-the-art deep learning methods that use speech features. Our approach also performs strongly across Chinese and English speech datasets, highlighting its effectiveness in addressing cultural variations.
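The description above outlines a pipeline of pre-trained speaker-related and emotion-related extractors whose outputs are fused and fed to a gated Mixture-of-Experts head. The following is a minimal illustrative sketch of that fusion-plus-gating stage only, written in PyTorch; the class names, embedding dimensions, expert count, and soft-gating design are assumptions made here for illustration, not the authors' released implementation (the TDNN pre-training and multi-domain adaptation steps are not shown).

```python
# Sketch of a gated Mixture-of-Experts head over fused speaker + emotion
# embeddings. Dimensions, expert count, and gating design are assumptions.
import torch
import torch.nn as nn


class Expert(nn.Module):
    """One expert: a small MLP over the fused embedding."""
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEDepressionHead(nn.Module):
    """Soft-gated mixture of experts over concatenated embeddings."""
    def __init__(self, spk_dim=512, emo_dim=256, n_experts=4, n_classes=2):
        super().__init__()
        fused = spk_dim + emo_dim
        self.experts = nn.ModuleList(
            [Expert(fused, 256, n_classes) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(fused, n_experts)  # learned gating network

    def forward(self, spk_emb: torch.Tensor, emo_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([spk_emb, emo_emb], dim=-1)            # feature fusion
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, n_classes)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # gate-weighted sum


if __name__ == "__main__":
    head = MoEDepressionHead()
    spk = torch.randn(8, 512)  # e.g. TDNN/x-vector-style speaker embeddings
    emo = torch.randn(8, 256)  # e.g. embeddings from an emotion extractor
    logits = head(spk, emo)
    print(logits.shape)        # torch.Size([8, 2])
```

With soft gating of this kind, every expert contributes a weighted share of the prediction and the gate learns which expert to trust for a given fused input; this is one common way to let experts specialize across heterogeneous feature sources, consistent with (though not necessarily identical to) the multi-domain MoE training the abstract describes.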
format | Article |
id | doaj-art-3115de41bef340f0bbf4ba8ea8ccdf06 |
institution | Kabale University |
issn | 2045-2322 |
language | English |
publishDate | 2025-02-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj-art-3115de41bef340f0bbf4ba8ea8ccdf06 | 2025-02-09T12:37:49Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-02-01 | 15 1 1 15 | 10.1038/s41598-025-88313-9 | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features | Weitong Guo, Qian He, Ziyu Lin, Xiaolong Bu, Ziyang Wang, Hongwu Yang (School of Educational Technology, Northwest Normal University); Dong Li (Faculty of Artificial Intelligence in Education, Central China Normal University) | https://doi.org/10.1038/s41598-025-88313-9
spellingShingle | Weitong Guo; Qian He; Ziyu Lin; Xiaolong Bu; Ziyang Wang; Dong Li; Hongwu Yang; Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features; Scientific Reports
title | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features |
title_full | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features |
title_fullStr | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features |
title_full_unstemmed | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features |
title_short | Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features |
title_sort | enhancing depression recognition through a mixed expert model by integrating speaker related and emotion related features |
url | https://doi.org/10.1038/s41598-025-88313-9 |
work_keys_str_mv | AT weitongguo enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT qianhe enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT ziyulin enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT xiaolongbu enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT ziyangwang enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT dongli enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures AT hongwuyang enhancingdepressionrecognitionthroughamixedexpertmodelbyintegratingspeakerrelatedandemotionrelatedfeatures |