Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

Bibliographic Details
Main Authors: Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, Hao Yu
Format: Article
Language:English
Published: JMIR Publications 2025-03-01
Series:Journal of Medical Internet Research
Online Access:https://www.jmir.org/2025/1/e67891
Collection: DOAJ
Description:
Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.
Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.
Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, the SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, each followed by two clinician responses. Clinician responses were scored from –3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate the clinician responses. We compared LLM ratings to those of expert suicidologists, conducting linear regression analyses and converting LLM ratings to z scores to identify outliers (z score >1.96 or <–1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
Results: All 3 LLMs rated responses as more appropriate than expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT ratings, 11% (5 of 48) of Claude ratings, and 36% (17 of 48) of Gemini ratings were outliers relative to expert suicidologists. ChatGPT produced a final SIRI-2 score of 45.7, roughly equivalent to that of master's-level counselors in prior studies. Claude produced a SIRI-2 score of 36.7, exceeding the prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to that of untrained K-12 school staff.
Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed equivalently to or better than mental health professionals.
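The outlier criterion described in the Methods (a z score above 1.96 or below –1.96, the two-tailed P<.05 cutoff) can be sketched as follows. This is an illustrative reconstruction only: the function name and the rating values are invented for the example and are not the study's data or code.

```python
from statistics import mean, stdev

def flag_outliers(expert_ratings, llm_ratings, z_crit=1.96):
    """For each item, convert the LLM's rating to a z score relative to
    the expert ratings for that item and flag |z| > z_crit (P < .05,
    two-tailed)."""
    flags = []
    for experts, llm in zip(expert_ratings, llm_ratings):
        mu, sd = mean(experts), stdev(experts)
        z = (llm - mu) / sd
        flags.append(abs(z) > z_crit)
    return flags

# Illustrative values only: three SIRI-2 items, each with several expert
# ratings on the -3..+3 appropriateness scale and one LLM rating.
experts = [[-2.0, -1.5, -2.5, -2.0], [1.0, 1.5, 0.5, 1.0], [-3.0, -2.5, -2.5, -3.0]]
llm = [-1.8, 2.9, 0.5]
print(flag_outliers(experts, llm))  # -> [False, True, True]
```

Only the second and third LLM ratings fall outside the ±1.96 band around the (hypothetical) expert distribution; the first is well within it.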
ISSN: 1438-8871
DOI: 10.2196/67891