Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings

Bibliographic Details
Main Authors: Jiheum Park, Jason Patterson, Jose M. Acitores Cortina, Tian Gu, Chin Hur, Nicholas Tatonetti
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01869-8
Description
Summary: Pancreatic cancer (PC) is often diagnosed late, as early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Leveraging longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed a predictive model using large language model (LLM)-derived embeddings of medical condition names to enhance learning from EHR data. Across two sites (Columbia University Medical Center and Cedars-Sinai Medical Center), LLM embeddings improved 6–12 month prediction AUROCs from 0.60 to 0.67 and 0.82 to 0.86, respectively. Excluding data from 0–3 months before diagnosis further improved AUROCs to 0.82 and 0.89. Our model achieved a higher positive predictive value (0.141) than using traditional risk factors (0.004), and identified many PC patients without these risk factors or known genetic variants. These findings suggest that the EHR-based model may serve as an independent approach for identifying high-risk individuals.
ISSN: 2398-6352
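
Note on the metrics: the summary reports results as AUROC and positive predictive value (PPV). The following is a minimal sketch of how these two quantities are commonly computed from binary outcomes and model risk scores. It is not the authors' code; the function names, the toy labels and scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
def ppv(labels, scores, threshold):
    """Positive predictive value: true positives / all cases flagged at or above threshold."""
    flagged = [y for y, s in zip(labels, scores) if s >= threshold]
    if not flagged:
        return float("nan")  # nobody flagged, so PPV is undefined
    return sum(flagged) / len(flagged)


def auroc(labels, scores):
    """AUROC as the probability that a random positive case outscores a random negative case.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study): 1 = case, 0 = control.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.05, 0.9]

print(f"AUROC: {auroc(labels, scores):.3f}")
print(f"PPV at threshold 0.5: {ppv(labels, scores, 0.5):.3f}")
```

AUROC does not depend on a threshold or on how common the disease is, whereas PPV depends on both. For a rare outcome such as PC, PPV can therefore remain low in absolute terms even when AUROC is high.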