DepITCM: an audio-visual method for detecting depression
Main Authors: | Lishan Zhang, Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-01-01 |
Series: | Frontiers in Psychiatry |
Subjects: | depression detection; multimodal; feature extraction; multi-task learning; DepITCM |
Online Access: | https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full |
_version_ | 1832590823578927104 |
---|---|
author | Lishan Zhang, Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng |
author_facet | Lishan Zhang, Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng |
author_sort | Lishan Zhang |
collection | DOAJ |
description | Introduction: Depression is a prevalent mental disorder, and early screening and timely treatment are crucial. However, current deep models based on audio-visual data still have limitations: it is difficult to effectively extract and select useful multimodal information and features from audio-video data, and very few studies in depression detection attend to all three dimensions of information (time, channel, and space) at once. In addition, it remains challenging to exploit auxiliary tasks to improve prediction accuracy. Resolving these issues is crucial for building depression-detection models. Methods: In this paper, we propose DepITCM, a multi-task representation-learning model for depression detection based on vision and audio. The model comprises three main modules: a data preprocessing module, the Inception-Temporal-Channel Principal Component Analysis module (ITCM Encoder), and a multi-task learning module. To efficiently extract rich feature representations from audio and video data, the ITCM Encoder employs a staged feature-extraction strategy that transitions from global to local features, capturing global features while fusing temporal, channel, and spatial information in finer detail. Furthermore, inspired by multi-task learning strategies, we enhance the primary task of depression classification with a secondary regression task to improve overall performance (a minimal sketch of this multi-task setup appears after the record fields below). Results: We conducted experiments on the AVEC2017 and AVEC2019 datasets. In the classification task, our method achieved an F1 score of 0.823 and a classification accuracy of 0.823 on AVEC2017, and an F1 score of 0.816 and a classification accuracy of 0.810 on AVEC2019. In the regression task, the RMSE was 6.10 (AVEC2017) and 4.89 (AVEC2019). These results demonstrate that our method outperforms most existing methods in both classification and regression tasks, and that multi-task learning effectively improves depression-detection performance. Discussion: Although multimodal depression detection has shown good results in previous studies, multi-task learning can exploit the complementary information between different tasks; our work therefore combines multimodal and multi-task learning to improve the accuracy of depression detection. Previous studies have also mostly focused on extracting global features while neglecting the importance of local features. Addressing these gaps, we provide a more comprehensive and effective solution for depression detection. |
format | Article |
id | doaj-art-7cf2db35175f428889ed7f032d5a4061 |
institution | Kabale University |
issn | 1664-0640 |
language | English |
publishDate | 2025-01-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Psychiatry |
spelling | doaj-art-7cf2db35175f428889ed7f032d5a4061 2025-01-23T06:56:38Z eng Frontiers Media S.A. Frontiers in Psychiatry ISSN 1664-0640 2025-01-01 vol. 15 doi: 10.3389/fpsyt.2024.1466507 article 1466507 "DepITCM: an audio-visual method for detecting depression"
Lishan Zhang (Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China); Zhenhua Liu, Yumei Wan, Yunli Fan, Diancai Chen, Qingxiang Wang, Kaihong Zhang, Yunshao Zheng (Shandong Mental Health Center, Shandong University, Jinan, China)
https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full
depression detection; multimodal; feature extraction; multi-task learning; DepITCM |
spellingShingle | Lishan Zhang; Zhenhua Liu; Yumei Wan; Yunli Fan; Diancai Chen; Qingxiang Wang; Kaihong Zhang; Yunshao Zheng; DepITCM: an audio-visual method for detecting depression; Frontiers in Psychiatry; depression detection; multimodal; feature extraction; multi-task learning; DepITCM |
title | DepITCM: an audio-visual method for detecting depression |
title_full | DepITCM: an audio-visual method for detecting depression |
title_fullStr | DepITCM: an audio-visual method for detecting depression |
title_full_unstemmed | DepITCM: an audio-visual method for detecting depression |
title_short | DepITCM: an audio-visual method for detecting depression |
title_sort | depitcm an audio visual method for detecting depression |
topic | depression detection; multimodal; feature extraction; multi-task learning; DepITCM |
url | https://www.frontiersin.org/articles/10.3389/fpsyt.2024.1466507/full |
work_keys_str_mv | AT lishanzhang depitcmanaudiovisualmethodfordetectingdepression AT zhenhualiu depitcmanaudiovisualmethodfordetectingdepression AT yumeiwan depitcmanaudiovisualmethodfordetectingdepression AT yunlifan depitcmanaudiovisualmethodfordetectingdepression AT diancaichen depitcmanaudiovisualmethodfordetectingdepression AT qingxiangwang depitcmanaudiovisualmethodfordetectingdepression AT kaihongzhang depitcmanaudiovisualmethodfordetectingdepression AT yunshaozheng depitcmanaudiovisualmethodfordetectingdepression |
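The record does not include the authors' code, so the following is only a minimal PyTorch sketch of the multi-task setup the abstract describes: a shared audio-visual encoder feeding a primary depression-classification head and a secondary severity-regression head trained with a weighted joint loss. The layer sizes, the mean-pooling fusion, the severity range, and the loss weight `aux_weight` are illustrative assumptions, not the paper's ITCM Encoder.

```python
# Minimal sketch (not the authors' implementation) of multi-task
# depression detection: shared encoder, classification + regression heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDepressionModel(nn.Module):
    def __init__(self, audio_dim=128, video_dim=256, hidden_dim=256, aux_weight=0.5):
        super().__init__()
        self.aux_weight = aux_weight  # weight of the secondary regression loss (assumed)
        # Stand-in for the ITCM Encoder: project each modality, then fuse.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        # Primary task: binary depressed / not-depressed classification.
        self.cls_head = nn.Linear(hidden_dim, 2)
        # Secondary task: severity-score regression.
        self.reg_head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, video_feats):
        # Mean-pool each modality over time before fusion (an assumption;
        # the paper fuses temporal, channel, and spatial information).
        a = self.audio_proj(audio_feats).mean(dim=1)
        v = self.video_proj(video_feats).mean(dim=1)
        shared = self.fusion(torch.cat([a, v], dim=-1))
        return self.cls_head(shared), self.reg_head(shared).squeeze(-1)

    def loss(self, cls_logits, reg_pred, labels, scores):
        # Joint objective: classification loss plus weighted regression loss.
        return (F.cross_entropy(cls_logits, labels)
                + self.aux_weight * F.mse_loss(reg_pred, scores))

# Toy usage: batch of 4 clips, 50 audio frames and 30 video frames each.
model = MultiTaskDepressionModel()
audio = torch.randn(4, 50, 128)   # stand-in for preprocessed audio features
video = torch.randn(4, 30, 256)   # stand-in for preprocessed video features
logits, severity = model(audio, video)
loss = model.loss(logits, severity,
                  torch.tensor([0, 1, 1, 0]),  # class labels
                  torch.rand(4) * 24)          # severity targets, e.g. PHQ-8 range
loss.backward()
```

On AVEC-style data the regression target would be a clinical severity score (e.g. PHQ-8 on a 0-24 scale); here random tensors stand in for the preprocessed audio and video features.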