DepITCM: an audio-visual method for detecting depression

Bibliographic Details
Main Authors: Lishan Zhang, Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Psychiatry
Subjects: depression detection; multimodal; feature extraction; multi-task learning; DepITCM
Online Access: https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full
_version_ 1832590823578927104
author Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
author_facet Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
author_sort Lishan Zhang
collection DOAJ
description Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, currently proposed deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies attend simultaneously to all three dimensions of information (time, channel, and space). In addition, it remains challenging to exploit auxiliary tasks to enhance prediction accuracy. Resolving these issues is crucial for constructing depression detection models. Methods: We propose DepITCM, a multi-task representation learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature extraction strategy that transitions from global to local features, capturing global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, the model strengthens the primary task of depression classification by incorporating a secondary regression task to improve overall performance. Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks, and that multi-task learning effectively improves depression detection performance. Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks; our work therefore combines multimodal and multi-task learning to improve the accuracy of depression detection. Previous studies have also mostly focused on extracting global features while overlooking the importance of local features. Building on these gaps, we make the corresponding improvements to provide a more comprehensive and effective solution for depression detection.
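
The multi-task design described in the abstract (a shared audio-visual encoder feeding a primary depression-classification head and an auxiliary severity-regression head) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: the encoder stub merely stands in for the ITCM Encoder, and the feature sizes, the loss weight `alpha`, and the PHQ-style 0-24 severity range are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderStub(nn.Module):
    """Stand-in for the ITCM Encoder; assumes fused audio-visual
    features arrive as one vector per sample."""
    def __init__(self, in_dim=256, hid_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, hid_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class MultiTaskDepressionModel(nn.Module):
    def __init__(self, in_dim=256, hid_dim=128):
        super().__init__()
        self.encoder = SharedEncoderStub(in_dim, hid_dim)
        self.cls_head = nn.Linear(hid_dim, 2)  # primary task: depressed / not depressed
        self.reg_head = nn.Linear(hid_dim, 1)  # auxiliary task: severity score

    def forward(self, x):
        z = self.encoder(x)  # shared representation used by both heads
        return self.cls_head(z), self.reg_head(z).squeeze(-1)

def multitask_loss(logits, score_pred, labels, scores, alpha=0.5):
    # Primary classification loss plus a weighted auxiliary regression
    # loss; the weight alpha is an assumed hyperparameter.
    return F.cross_entropy(logits, labels) + alpha * F.mse_loss(score_pred, scores)

# Toy forward/backward pass on random "fused" features.
model = MultiTaskDepressionModel()
x = torch.randn(4, 256)             # batch of 4 samples
labels = torch.randint(0, 2, (4,))  # binary depression labels
scores = torch.rand(4) * 24         # assumed 0-24 severity range
logits, score_pred = model(x)
multitask_loss(logits, score_pred, labels, scores).backward()
```

The point of the auxiliary head is that gradients from the regression loss also shape the shared encoder, which is the mechanism by which, per the abstract, the secondary task improves the primary classification task.
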
format Article
id doaj-art-7cf2db35175f428889ed7f032d5a4061
institution Kabale University
issn 1664-0640
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Psychiatry
spelling doaj-art-7cf2db35175f428889ed7f032d5a4061 | 2025-01-23T06:56:38Z | eng | Frontiers Media S.A. | Frontiers in Psychiatry | 1664-0640 | 2025-01-01 | 15 | 10.3389/fpsyt.2024.1466507 | 1466507 | DepITCM: an audio-visual method for detecting depression
Authors: Lishan Zhang [0][1]; Zhenhua Liu [2]; Yumei Wan [3]; Yunli Fan [4]; Diancai Chen [5]; Qingxiang Wang [6]; Kaihong Zhang [7]; Yunshao Zheng [8]
Affiliations: [0] Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; [1] Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China; [2]-[8] Shandong Mental Health Center, Shandong University, Jinan, China
Introduction: Depression is a prevalent mental disorder, and early screening and treatment are crucial. However, currently proposed deep models based on audio-video data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few depression-detection studies attend simultaneously to all three dimensions of information (time, channel, and space). In addition, it remains challenging to exploit auxiliary tasks to enhance prediction accuracy. Resolving these issues is crucial for constructing depression detection models. Methods: We propose DepITCM, a multi-task representation learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature extraction strategy that transitions from global to local features, capturing global features while emphasizing the finer-grained fusion of temporal, channel, and spatial information. Furthermore, inspired by multi-task learning strategies, the model strengthens the primary task of depression classification by incorporating a secondary regression task to improve overall performance. Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks, and that multi-task learning effectively improves depression detection performance. Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can additionally exploit the complementary information between tasks; our work therefore combines multimodal and multi-task learning to improve the accuracy of depression detection. Previous studies have also mostly focused on extracting global features while overlooking the importance of local features. Building on these gaps, we make the corresponding improvements to provide a more comprehensive and effective solution for depression detection.
https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full | depression detection; multimodal; feature extraction; multi-task learning; DepITCM
spellingShingle Lishan Zhang
Lishan Zhang
Zhenhua Liu
Yumei Wan
Yunli Fan
Diancai Chen
Qingxiang Wang
Kaihong Zhang
Yunshao Zheng
DepITCM: an audio-visual method for detecting depression
Frontiers in Psychiatry
depression detection
multimodal
feature extraction
multi-task learning
DepITCM
title DepITCM: an audio-visual method for detecting depression
title_full DepITCM: an audio-visual method for detecting depression
title_fullStr DepITCM: an audio-visual method for detecting depression
title_full_unstemmed DepITCM: an audio-visual method for detecting depression
title_short DepITCM: an audio-visual method for detecting depression
title_sort depitcm an audio visual method for detecting depression
topic depression detection
multimodal
feature extraction
multi-task learning
DepITCM
url https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full
work_keys_str_mv AT lishanzhang depitcmanaudiovisualmethodfordetectingdepression
AT lishanzhang depitcmanaudiovisualmethodfordetectingdepression
AT zhenhualiu depitcmanaudiovisualmethodfordetectingdepression
AT yumeiwan depitcmanaudiovisualmethodfordetectingdepression
AT yunlifan depitcmanaudiovisualmethodfordetectingdepression
AT diancaichen depitcmanaudiovisualmethodfordetectingdepression
AT qingxiangwang depitcmanaudiovisualmethodfordetectingdepression
AT kaihongzhang depitcmanaudiovisualmethodfordetectingdepression
AT yunshaozheng depitcmanaudiovisualmethodfordetectingdepression