Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings

Bibliographic Details
Main Authors: Jiheum Park, Jason Patterson, Jose M. Acitores Cortina, Tian Gu, Chin Hur, Nicholas Tatonetti
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01869-8
_version_ 1849341747492028416
author Jiheum Park
Jason Patterson
Jose M. Acitores Cortina
Tian Gu
Chin Hur
Nicholas Tatonetti
author_sort Jiheum Park
collection DOAJ
description Abstract: Pancreatic cancer (PC) is often diagnosed late because early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed a predictive model using large language model (LLM)-derived embeddings of medical condition names to enhance learning from EHR data. At two sites, Columbia University Medical Center and Cedars-Sinai Medical Center, LLM embeddings improved 6–12 month prediction AUROCs from 0.60 to 0.67 and from 0.82 to 0.86, respectively. Excluding data from 0–3 months before diagnosis further improved AUROCs to 0.82 and 0.89. Our model achieved a higher positive predictive value (0.141) than an approach using traditional risk factors (0.004), and it identified many PC patients without these risk factors or known genetic variants. These findings suggest that the EHR-based model may serve as an independent approach for identifying high-risk individuals.
format Article
id doaj-art-254ee181164c40d5b13d2ff441ae5bc2
institution Kabale University
issn 2398-6352
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series npj Digital Medicine
spelling doaj-art-254ee181164c40d5b13d2ff441ae5bc2
2025-08-20T03:43:34Z
eng
Nature Portfolio
npj Digital Medicine
2398-6352
2025-07-01
8119
10.1038/s41746-025-01869-8
Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings
Jiheum Park (Department of Medicine, Columbia University Irving Medical Center)
Jason Patterson (Department of Biomedical Informatics, Columbia University)
Jose M. Acitores Cortina (Department of Computational Biomedicine, Cedars-Sinai Medical Center)
Tian Gu (Department of Biostatistics, Columbia Mailman School of Public Health)
Chin Hur (Department of Medicine, Columbia University Irving Medical Center)
Nicholas Tatonetti (Department of Biomedical Informatics, Columbia University)
https://doi.org/10.1038/s41746-025-01869-8
title Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings
url https://doi.org/10.1038/s41746-025-01869-8