Reasoning language models for more transparent prediction of suicide risk


Bibliographic Details
Main Authors: Roy H Perlis, Thomas H McCoy
Format: Article
Language: English
Published: BMJ Publishing Group 2025-05-01
Series: BMJ Mental Health
Online Access:https://mentalhealth.bmj.com/content/28/1/e301654.full
Description
Summary:

Background: We previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.

Objective: With the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.

Methods: From 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident and matched each with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined the aspects of model reasoning, that is, the topics in the chain of thought, associated with correct or incorrect predictions.

Findings: The cohort included 1995 individuals who died by suicide or accidental death and 9975 matched control individuals (5:1), totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Gray regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58–6.04)). The corresponding c-statistic was 0.64 (0.63–0.65), modestly poorer than that of the GPT-4o model (0.67 (0.66–0.68)). In chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect predictions.

Conclusions: Applying a reasoning model on local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.

Clinical implications: Smaller models can yield more secure, scalable and transparent risk prediction.
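The c-statistic reported in the abstract (0.64 for the local model vs 0.67 for GPT-4o) is a measure of discrimination: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A minimal sketch of the computation, using invented scores and labels purely for illustration (not the study's data):

```python
def c_statistic(scores, labels):
    """Pairwise concordance: fraction of (case, control) pairs in which
    the case receives the higher risk score; ties count as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    pairs = 0
    concordant = 0.0
    for c in cases:
        for k in controls:
            pairs += 1
            if c > k:
                concordant += 1.0
            elif c == k:
                concordant += 0.5
    return concordant / pairs

# Illustrative scores for 3 cases (label 1) and 3 controls (label 0)
print(c_statistic([0.9, 0.6, 0.4, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0, 0]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the reported 0.64 vs 0.67 gap is described as modest.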
ISSN:2755-9734