366 Universal representation of human diseases using large language models

Objectives/Goals: Understanding the interconnections among over 20,000 human diseases spanning organ systems could inform more precise diagnosis and treatment of diseases. Here, we examine whether the ability of large language models (LLMs) to learn universal representations of concepts can be lev...

Bibliographic Details
Main Authors: Geoffrey Siwo, Ellen R. Bowen, Akbar K. Waljee
Format: Article
Language: English
Published: Cambridge University Press 2025-04-01
Series: Journal of Clinical and Translational Science
Online Access: https://www.cambridge.org/core/product/identifier/S2059866124009919/type/journal_article
author Geoffrey Siwo
Ellen R. Bowen
Akbar K. Waljee
collection DOAJ
description Objectives/Goals: Understanding the interconnections among over 20,000 human diseases spanning organ systems could inform more precise diagnosis and treatment of diseases. Here, we examine whether the ability of large language models (LLMs) to learn universal representations of concepts can be leveraged to discover complex relationships across human diseases. Methods/Study Population: To address the challenge of computationally representing thousands of diseases spanning multiple organ systems, we used the internal representations of concepts in LLMs to encode diseases based on their descriptions from standard disease ontologies (ICD-10 and Phecodes). To do this, we leveraged the application programming interfaces (APIs) of three LLMs (GPT-3.5, Mistral, and Voyage) to encode disease relationships. We then performed unsupervised clustering of the diseases using their encodings (embeddings) from each LLM to determine whether the resulting clusters reflect disease relationships. To enable deeper exploration of disease relationships, we developed interactive plots that provide a system-level view of the relationships between thousands of diseases and their association with specific organ systems. Results/Anticipated Results: We found that unsupervised analysis of disease relationships using the LLM encodings reveals high similarities among diseases based on the organ systems they affect. All the LLMs clustered diseases into groups largely defined by the organ systems they affect, without being trained to classify diseases by organ system. Tumors were an exception: most tumors clustered together as a group irrespective of the organs they affect. Interestingly, we found that tumors affecting anatomically related organs show higher similarity to each other than to those affecting distantly related organs. In addition to anatomical relationships between diseases, we found that the LLM embeddings capture genetic relationships between diseases. Discussion/Significance of Impact: Overall, we found that the LLM-derived encodings uphold biologically and clinically significant relationships across organ systems and disease types. These results suggest that LLM encodings could provide a universal framework for representing diseases as computable phenotypes and enable the discovery of complex disease relationships.
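The embed-then-cluster workflow described in the Methods can be illustrated with a minimal sketch. This is not the authors' code, which the record does not include. The function `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector) for a call to an LLM embedding API, and the threshold-based single-linkage grouping is a simple substitute chosen for illustration, since the abstract does not name the clustering algorithm that was used. The disease descriptions in the demo are made-up examples, not ontology entries.

```python
import math
import zlib
from collections import Counter


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an LLM embedding API call.

    Builds a hashed bag-of-words vector and scales it to unit length.
    A real pipeline would send the disease description to an embedding
    endpoint and receive a dense vector instead.
    """
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(embeddings, threshold):
    """Group items whose cosine similarity is at least `threshold`.

    Links every sufficiently similar pair and returns the connected
    components (single-linkage clustering via union-find).
    """
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Illustrative descriptions only; not taken from ICD-10 or Phecodes.
    descriptions = {
        "heart failure": "chronic disease of the heart muscle",
        "myocarditis": "inflammation of the heart muscle",
        "hepatitis": "inflammation of the liver",
    }
    vectors = {name: toy_embed(text) for name, text in descriptions.items()}
    print(cluster(vectors, threshold=0.5))
```

With real LLM embeddings, the same two steps apply: one vector per disease description, then an unsupervised grouping over pairwise similarities. The threshold here is arbitrary and would need tuning for any real embedding model.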
format Article
id doaj-art-2566a13c0c4a4ef39b95fd2de1216e09
institution DOAJ
issn 2059-8661
language English
publishDate 2025-04-01
publisher Cambridge University Press
record_format Article
series Journal of Clinical and Translational Science
spelling doaj-art-2566a13c0c4a4ef39b95fd2de1216e09 | 2025-08-20T02:40:51Z | eng | Cambridge University Press | Journal of Clinical and Translational Science | 2059-8661 | 2025-04-01 | 9 | 113 | 113 | 10.1017/cts.2024.991 | 366 Universal representation of human diseases using large language models | Geoffrey Siwo (University of Michigan Medical School); Ellen R. Bowen (University of Michigan Medical School); Akbar K. Waljee (University of Michigan Medical School) | https://www.cambridge.org/core/product/identifier/S2059866124009919/type/journal_article
title 366 Universal representation of human diseases using large language models
url https://www.cambridge.org/core/product/identifier/S2059866124009919/type/journal_article