Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings
Abstract: Pancreatic cancer (PC) is often diagnosed late, as early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Leveraging longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed...
| Main Authors: | Jiheum Park, Jason Patterson, Jose M. Acitores Cortina, Tian Gu, Chin Hur, Nicholas Tatonetti |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-025-01869-8 |
| _version_ | 1849341747492028416 |
|---|---|
| author | Jiheum Park Jason Patterson Jose M. Acitores Cortina Tian Gu Chin Hur Nicholas Tatonetti |
| author_facet | Jiheum Park Jason Patterson Jose M. Acitores Cortina Tian Gu Chin Hur Nicholas Tatonetti |
| author_sort | Jiheum Park |
| collection | DOAJ |
| description | Abstract: Pancreatic cancer (PC) is often diagnosed late, as early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Leveraging longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed a predictive model using large language model (LLM)-derived embeddings of medical condition names to enhance learning from EHR data. Across two sites (Columbia University Medical Center and Cedars-Sinai Medical Center), LLM embeddings improved 6–12 month prediction AUROCs from 0.60 to 0.67 and from 0.82 to 0.86, respectively. Excluding data from 0–3 months before diagnosis further improved AUROCs to 0.82 and 0.89. Our model achieved a higher positive predictive value (0.141) than traditional risk factors (0.004) and identified many PC patients without these risk factors or known genetic variants. These findings suggest that the EHR-based model may serve as an independent approach for identifying high-risk individuals. |
| format | Article |
| id | doaj-art-254ee181164c40d5b13d2ff441ae5bc2 |
| institution | Kabale University |
| issn | 2398-6352 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | doaj-art-254ee181164c40d5b13d2ff441ae5bc2; 2025-08-20T03:43:34Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2025-07-01; 8119; 10.1038/s41746-025-01869-8; Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings; Jiheum Park (0), Jason Patterson (1), Jose M. Acitores Cortina (2), Tian Gu (3), Chin Hur (4), Nicholas Tatonetti (5); (0) Department of Medicine, Columbia University Irving Medical Center; (1) Department of Biomedical Informatics, Columbia University; (2) Department of Computational Biomedicine, Cedars-Sinai Medical Center; (3) Department of Biostatistics, Columbia Mailman School of Public Health; (4) Department of Medicine, Columbia University Irving Medical Center; (5) Department of Biomedical Informatics, Columbia University; Abstract: Pancreatic cancer (PC) is often diagnosed late, as early symptoms and effective screening tools are lacking, and genetic or familial factors explain only ~10% of cases. Leveraging longitudinal electronic health record (EHR) data may offer a promising avenue for early detection. We developed a predictive model using large language model (LLM)-derived embeddings of medical condition names to enhance learning from EHR data. Across two sites (Columbia University Medical Center and Cedars-Sinai Medical Center), LLM embeddings improved 6–12 month prediction AUROCs from 0.60 to 0.67 and from 0.82 to 0.86, respectively. Excluding data from 0–3 months before diagnosis further improved AUROCs to 0.82 and 0.89. Our model achieved a higher positive predictive value (0.141) than traditional risk factors (0.004) and identified many PC patients without these risk factors or known genetic variants. These findings suggest that the EHR-based model may serve as an independent approach for identifying high-risk individuals.; https://doi.org/10.1038/s41746-025-01869-8 |
| spellingShingle | Jiheum Park Jason Patterson Jose M. Acitores Cortina Tian Gu Chin Hur Nicholas Tatonetti Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings npj Digital Medicine |
| title | Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings |
| title_full | Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings |
| title_fullStr | Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings |
| title_full_unstemmed | Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings |
| title_short | Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings |
| title_sort | enhancing ehr based pancreatic cancer prediction with llm derived embeddings |
| url | https://doi.org/10.1038/s41746-025-01869-8 |