Enhancing depression recognition through a mixed expert model by integrating speaker-related and emotion-related features
Main Authors:
Format: Article
Language: English
Published: Nature Portfolio, 2025-02-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-88313-9
Summary: The World Health Organization predicts that by 2030, depression will be the most common mental disorder, significantly affecting individuals, families, and society. Speech, as a sensitive indicator, reveals noticeable acoustic changes linked to physiological and cognitive variations, making it a crucial behavioral marker for detecting depression. However, existing studies often overlook the separation of speaker-related and emotion-related features in speech when recognizing depression. To tackle this challenge, we propose a Mixture-of-Experts (MoE) method that integrates speaker-related and emotion-related features for depression recognition. Our approach first uses a Time Delay Neural Network to pre-train a speaker-related feature extractor on a large-scale speaker recognition dataset, while an emotion-related feature extractor is pre-trained on a speech emotion dataset. We then apply transfer learning to extract both feature types from a depression dataset and fuse them. A multi-domain adaptation algorithm trains the MoE model for depression recognition. Experimental results show that our method achieves 74.3% accuracy on a self-built, localized Chinese depression dataset and an MAE of 6.32 on the AVEC2014 dataset, outperforming state-of-the-art deep learning methods based on speech features. The approach also performs well across both Chinese and English speech datasets, highlighting its effectiveness in addressing cultural variation.
ISSN: 2045-2322
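The summary describes a pipeline in which two pre-trained feature extractors produce embeddings that are fused and fed to a Mixture-of-Experts classifier. Below is a minimal PyTorch sketch of that fusion-plus-MoE step only; the module name, embedding dimensions, concatenation fusion, and softmax gating are all assumptions on our part, and the paper's TDNN extractors and multi-domain adaptation training are not reproduced here.

```python
# Hypothetical sketch of the fusion + Mixture-of-Experts head suggested by the
# abstract. All names and dimensions are illustrative assumptions, not the
# authors' published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEDepressionHead(nn.Module):
    def __init__(self, spk_dim=192, emo_dim=128, num_experts=4, num_classes=2):
        super().__init__()
        fused = spk_dim + emo_dim  # simple concatenation fusion (assumed)
        # Each expert is a small classifier over the fused embedding.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(fused, 256), nn.ReLU(), nn.Linear(256, num_classes))
            for _ in range(num_experts)
        )
        # Gating network assigns a softmax weight to each expert.
        self.gate = nn.Linear(fused, num_experts)

    def forward(self, spk_emb, emo_emb):
        x = torch.cat([spk_emb, emo_emb], dim=-1)           # fuse the two embeddings
        weights = F.softmax(self.gate(x), dim=-1)           # (batch, num_experts)
        logits = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, classes)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)  # weighted mixture of experts


# Usage with pre-extracted embeddings from frozen speaker/emotion extractors:
spk = torch.randn(8, 192)  # e.g. x-vector-style speaker embeddings (assumed size)
emo = torch.randn(8, 128)  # e.g. speech-emotion embeddings (assumed size)
out = MoEDepressionHead()(spk, emo)  # (8, 2) depression logits
```

The soft softmax gate lets every expert contribute a weighted vote per utterance; the paper may instead use hard routing or a different expert structure, which this sketch does not attempt to capture.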