Evaluation of large language models on mental health: from knowledge test to illness diagnosis

Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025...

Bibliographic Details
Main Authors: Yijun Xu, Zhaoxi Fang, Weinan Lin, Yue Jiang, Wen Jin, Prasanalakshmi Balaji, Jiangda Wang, Ting Xia
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-08-01
Series: Frontiers in Psychiatry
Subjects: large language models; model evaluation; mental health; knowledge test; illness diagnosis
Online Access: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1646974/full
collection DOAJ
description Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks: mental health knowledge testing and mental illness diagnosis in the Chinese context. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in Chinese mental health scenarios and provide clear guidance for selecting and improving models in this sensitive domain.
format Article
id doaj-art-e580b007dcda4a3ca23b109839ef9cd3
institution DOAJ
issn 1664-0640
language English
publishDate 2025-08-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Psychiatry
spelling doaj-art-e580b007dcda4a3ca23b109839ef9cd3; 2025-08-20T02:57:01Z; eng; Frontiers Media S.A.; Frontiers in Psychiatry; ISSN 1664-0640; 2025-08-01; vol. 16; doi:10.3389/fpsyt.2025.1646974; article 1646974
Evaluation of large language models on mental health: from knowledge test to illness diagnosis
Yijun Xu (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China)
Zhaoxi Fang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China; Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China)
Weinan Lin (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China)
Yue Jiang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China)
Wen Jin (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China)
Prasanalakshmi Balaji (Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia)
Jiangda Wang (Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China; Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China)
Ting Xia (School of Life and Environmental Sciences, Shaoxing University, Shaoxing, China)
title Evaluation of large language models on mental health: from knowledge test to illness diagnosis
topic large language models
model evaluation
mental health
knowledge test
illness diagnosis
url https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1646974/full