Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study
| Main Authors: | Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, Hao Yu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-03-01 |
| Series: | Journal of Medical Internet Research |
| Online Access: | https://www.jmir.org/2025/1/e67891 |
| author | Ryan K McBain Jonathan H Cantor Li Ang Zhang Olesya Baker Fang Zhang Alyssa Halbisen Aaron Kofner Joshua Breslau Bradley Stein Ateev Mehrotra Hao Yu |
| collection | DOAJ |
| description |
Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support.
Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation.
Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, the SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, each followed by 2 clinician responses. Clinician responses were scored from –3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <–1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies.
Results: All 3 LLMs rated responses as more appropriate than the expert suicidologists did. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9/48) of ChatGPT responses, 11% (5/48) of Claude responses, and 36% (17/48) of Gemini responses were outliers relative to the expert suicidologists. ChatGPT produced a final SIRI-2 score of 45.7, roughly equivalent to that of master's-level counselors in prior studies. Claude produced a SIRI-2 score of 36.7, exceeding the prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to that of untrained K-12 school staff.
Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed equivalently to or better than mental health professionals. |
| id | doaj-art-4f2d5462cc7c47b3a49b7c65c5147a0e |
| issn | 1438-8871 |
| doi | 10.2196/67891 |
| volume | 27 (article e67891) |
| orcid | Ryan K McBain: 0000-0003-0073-0348; Jonathan H Cantor: 0000-0003-4468-833X; Li Ang Zhang: 0000-0001-9468-2513; Olesya Baker: 0000-0001-9125-2761; Fang Zhang: 0000-0002-8282-8738; Alyssa Halbisen: 0009-0000-0482-729X; Aaron Kofner: 0000-0001-6980-1218; Joshua Breslau: 0000-0002-1194-4643; Bradley Stein: 0000-0003-1544-458X; Ateev Mehrotra: 0000-0003-2223-1582; Hao Yu: 0000-0001-6169-4243 |
| title | Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study |
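The outlier criterion described in the Methods, converting each LLM rating to a z score against the expert suicidologists' ratings and flagging |z| > 1.96 (P < .05), can be sketched as follows. This is a minimal illustration only: the function name and all ratings below are hypothetical, since the study's item-level data are not reproduced in this record.

```python
def flag_outliers(llm_ratings, expert_means, expert_sds, threshold=1.96):
    """Flag LLM ratings that are outliers relative to expert ratings.

    For each SIRI-2 item, the LLM's rating is converted to a z score
    against the experts' mean and standard deviation for that item;
    |z| > 1.96 corresponds to a two-sided P < .05.
    """
    flags = []
    for llm, mu, sd in zip(llm_ratings, expert_means, expert_sds):
        z = (llm - mu) / sd
        flags.append(abs(z) > threshold)
    return flags

# Hypothetical ratings on the SIRI-2 scale of -3 (highly inappropriate)
# to +3 (highly appropriate): only the third item deviates enough from
# the expert consensus (z = 3.125) to be flagged.
llm = [2.0, -1.0, 3.0]
mu = [1.5, -1.2, 0.5]
sd = [0.5, 0.4, 0.8]
print(flag_outliers(llm, mu, sd))  # → [False, False, True]
```

Applied to the study's 48 items (24 scenarios × 2 clinician responses each), a count of `True` flags over 48 yields the outlier percentages reported in the Results.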