DepITCM: an audio-visual method for detecting depression

Bibliographic Details
Main Authors: Lishan Zhang, Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Psychiatry
Subjects: depression detection; multimodal; feature extraction; multi-task learning; DepITCM
Online Access: https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full
_version_ 1832590823578927104
author Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
author_facet Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
author_sort Lishan Zhang
collection DOAJ
description Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, currently proposed deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies attend simultaneously to all three dimensions of information (time, channel, and space). In addition, it remains challenging to exploit auxiliary tasks to enhance prediction accuracy. Resolving these issues is crucial for constructing depression detection models. Methods: We propose DepITCM, a multi-task representation learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature extraction strategy that transitions from global to local features, capturing global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, the model strengthens the primary task of depression classification by incorporating a secondary regression task to improve overall performance. Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks, and that multi-task learning effectively improves depression detection performance. Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks; our work therefore combines multimodal and multi-task learning to improve the accuracy of depression detection. Previous studies have also mostly focused on extracting global features while overlooking the importance of local features. Building on these gaps, we make the corresponding improvements to provide a more comprehensive and effective solution for depression detection.
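
The multi-task design described in the abstract (a shared audio-visual encoder feeding a primary depression-classification head and an auxiliary severity-regression head) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the encoder stub merely stands in for the ITCM Encoder, and the feature sizes, the loss weight `alpha`, and the PHQ-style 0-24 severity range are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderStub(nn.Module):
    """Stand-in for the ITCM Encoder; assumes fused audio-visual
    features arrive as one vector per sample."""
    def __init__(self, in_dim=256, hid_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, hid_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class MultiTaskDepressionModel(nn.Module):
    def __init__(self, in_dim=256, hid_dim=128):
        super().__init__()
        self.encoder = SharedEncoderStub(in_dim, hid_dim)
        self.cls_head = nn.Linear(hid_dim, 2)  # primary task: depressed / not depressed
        self.reg_head = nn.Linear(hid_dim, 1)  # auxiliary task: severity score

    def forward(self, x):
        z = self.encoder(x)  # shared representation used by both heads
        return self.cls_head(z), self.reg_head(z).squeeze(-1)

def multitask_loss(logits, score_pred, labels, scores, alpha=0.5):
    # Primary classification loss plus a weighted auxiliary regression
    # loss; the weight alpha is an assumed hyperparameter.
    return F.cross_entropy(logits, labels) + alpha * F.mse_loss(score_pred, scores)

# Toy forward/backward pass on random "fused" features.
model = MultiTaskDepressionModel()
x = torch.randn(4, 256)             # batch of 4 samples
labels = torch.randint(0, 2, (4,))  # binary depression labels
scores = torch.rand(4) * 24         # assumed 0-24 severity range
logits, score_pred = model(x)
multitask_loss(logits, score_pred, labels, scores).backward()
```

The point of the auxiliary head is that gradients from the regression loss also shape the shared encoder, which is the mechanism by which, per the abstract, the secondary task improves the primary classification task.
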
format Article
id doaj-art-7cf2db35175f428889ed7f032d5a4061
institution Kabale University
issn 1664-0640
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Psychiatry
spelling doaj-art-7cf2db35175f428889ed7f032d5a4061 | 2025-01-23T06:56:38Z | eng | Frontiers Media S.A. | Frontiers in Psychiatry | 1664-0640 | 2025-01-01 | 15 | 10.3389/fpsyt.2024.1466507 | 1466507 | DepITCM: an audio-visual method for detecting depression
Authors: Lishan Zhang [0][1]; Zhenhua Liu [2]; Yumei Wan [3]; Yunli Fan [4]; Diancai Chen [5]; Qingxiang Wang [6]; Kaihong Zhang [7]; Yunshao Zheng [8]
Affiliations: [0] Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; [1] Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China; [2]-[8] Shandong Mental Health Center, Shandong University, Jinan, China
Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, currently proposed deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies attend simultaneously to all three dimensions of information (time, channel, and space). In addition, it remains challenging to exploit auxiliary tasks to enhance prediction accuracy. Resolving these issues is crucial for constructing depression detection models. Methods: We propose DepITCM, a multi-task representation learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature extraction strategy that transitions from global to local features, capturing global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, the model strengthens the primary task of depression classification by incorporating a secondary regression task to improve overall performance. Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks, and that multi-task learning effectively improves depression detection performance. Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks; our work therefore combines multimodal and multi-task learning to improve the accuracy of depression detection. Previous studies have also mostly focused on extracting global features while overlooking the importance of local features. Building on these gaps, we make the corresponding improvements to provide a more comprehensive and effective solution for depression detection.
https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full | depression detection; multimodal; feature extraction; multi-task learning; DepITCM
spellingShingle Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
DepITCM: an audio-visual method for detecting depression
Frontiers in Psychiatry
depression detection
multimodal
feature extraction
multi-task learning
DepITCM
title DepITCM: an audio-visual method for detecting depression
title_full DepITCM: an audio-visual method for detecting depression
title_fullStr DepITCM: an audio-visual method for detecting depression
title_full_unstemmed DepITCM: an audio-visual method for detecting depression
title_short DepITCM: an audio-visual method for detecting depression
title_sort depitcm an audio visual method for detecting depression
topic depression detection
multimodal
feature extraction
multi-task learning
DepITCM
url https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full
work_keys_str_mv AT lishanzhang depitcmanaudiovisualmethodfordetectingdepression
AT lishanzhang depitcmanaudiovisualmethodfordetectingdepression
AT zhenhualiu depitcmanaudiovisualmethodfordetectingdepression
AT yumeiwan depitcmanaudiovisualmethodfordetectingdepression
AT yunlifan depitcmanaudiovisualmethodfordetectingdepression
AT diancaichen depitcmanaudiovisualmethodfordetectingdepression
AT qingxiangwang depitcmanaudiovisualmethodfordetectingdepression
AT kaihongzhang depitcmanaudiovisualmethodfordetectingdepression
AT yunshaozheng depitcmanaudiovisualmethodfordetectingdepression