Evaluation of large language models on mental health: from knowledge test to illness diagnosis
Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025...
| Main Authors: | Yijun Xu, Zhaoxi Fang, Weinan Lin, Yue Jiang, Wen Jin, Prasanalakshmi Balaji, Jiangda Wang, Ting Xia |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-08-01 |
| Series: | Frontiers in Psychiatry |
| Subjects: | large language models; model evaluation; mental health; knowledge test; illness diagnosis |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1646974/full |
| _version_ | 1850036879851782144 |
|---|---|
| author | Yijun Xu, Zhaoxi Fang, Weinan Lin, Yue Jiang, Wen Jin, Prasanalakshmi Balaji, Jiangda Wang, Ting Xia |
| collection | DOAJ |
| description | Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama 4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks: mental health knowledge testing and mental illness diagnosis in the Chinese context. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in Chinese mental health scenarios and provide clear guidance for selecting and improving models in this sensitive domain. |
| format | Article |
| id | doaj-art-e580b007dcda4a3ca23b109839ef9cd3 |
| institution | DOAJ |
| issn | 1664-0640 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Psychiatry |
| spelling | doaj-art-e580b007dcda4a3ca23b109839ef9cd3 / 2025-08-20T02:57:01Z / eng / Frontiers Media S.A. / Frontiers in Psychiatry / ISSN 1664-0640 / 2025-08-01 / vol. 16 / DOI 10.3389/fpsyt.2025.1646974 / article 1646974 / Evaluation of large language models on mental health: from knowledge test to illness diagnosis / Authors and affiliations: Yijun Xu (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China); Zhaoxi Fang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China; Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China); Weinan Lin (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China); Yue Jiang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China); Wen Jin (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China); Prasanalakshmi Balaji (Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia); Jiangda Wang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China; Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China); Ting Xia (School of Life and Environmental Sciences, Shaoxing University, Shaoxing, China) / https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1646974/full / Keywords: large language models; model evaluation; mental health; knowledge test; illness diagnosis |
| title | Evaluation of large language models on mental health: from knowledge test to illness diagnosis |
| topic | large language models; model evaluation; mental health; knowledge test; illness diagnosis |
| url | https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1646974/full |