Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study
| Main Authors: | Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, Hao Yu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-03-01 |
| Series: | Journal of Medical Internet Research |
| Online Access: | https://www.jmir.org/2025/1/e67891 |
| author | Ryan K McBain Jonathan H Cantor Li Ang Zhang Olesya Baker Fang Zhang Alyssa Halbisen Aaron Kofner Joshua Breslau Bradley Stein Ateev Mehrotra Hao Yu |
| collection | DOAJ |
| description |
Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.
Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.
Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, the SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, each followed by 2 clinician responses. Clinician responses were scored from –3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <–1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
Results: All 3 LLMs rated responses as more appropriate than the expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9/48) of ChatGPT responses, 11% (5/48) of Claude responses, and 36% (17/48) of Gemini responses were outliers relative to the expert suicidologists. ChatGPT produced a final SIRI-2 score of 45.7, roughly equivalent to that of master's-level counselors in prior studies. Claude produced a SIRI-2 score of 36.7, exceeding the prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to that of untrained K-12 school staff.
Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed equivalently to or better than mental health professionals. |
| id | doaj-art-4f2d5462cc7c47b3a49b7c65c5147a0e |
| issn | 1438-8871 |
| doi | 10.2196/67891 |
| volume | 27 (article e67891) |
| orcid | Ryan K McBain: 0000-0003-0073-0348; Jonathan H Cantor: 0000-0003-4468-833X; Li Ang Zhang: 0000-0001-9468-2513; Olesya Baker: 0000-0001-9125-2761; Fang Zhang: 0000-0002-8282-8738; Alyssa Halbisen: 0009-0000-0482-729X; Aaron Kofner: 0000-0001-6980-1218; Joshua Breslau: 0000-0002-1194-4643; Bradley Stein: 0000-0003-1544-458X; Ateev Mehrotra: 0000-0003-2223-1582; Hao Yu: 0000-0001-6169-4243 |
| title | Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study |
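The outlier criterion described in the Methods, converting each LLM rating to a z score against the expert suicidologists' ratings and flagging |z| > 1.96 (P < .05), can be sketched as follows. This is a minimal illustration only: the function name and all ratings below are hypothetical, since the study's item-level data are not reproduced in this record.

```python
def flag_outliers(llm_ratings, expert_means, expert_sds, threshold=1.96):
    """Flag LLM ratings that are outliers relative to expert ratings.

    For each SIRI-2 item, the LLM's rating is converted to a z score
    against the experts' mean and standard deviation for that item;
    |z| > 1.96 corresponds to a two-sided P < .05.
    """
    flags = []
    for llm, mu, sd in zip(llm_ratings, expert_means, expert_sds):
        z = (llm - mu) / sd
        flags.append(abs(z) > threshold)
    return flags

# Hypothetical ratings on the SIRI-2 scale of -3 (highly inappropriate)
# to +3 (highly appropriate): only the third item deviates enough from
# the expert consensus (z = 3.125) to be flagged.
llm = [2.0, -1.0, 3.0]
mu = [1.5, -1.2, 0.5]
sd = [0.5, 0.4, 0.8]
print(flag_outliers(llm, mu, sd))  # → [False, False, True]
```

Applied to the study's 48 items (24 scenarios × 2 clinician responses each), a count of `True` flags over 48 yields the outlier percentages reported in the Results.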