A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches

Abstract Rare diseases (RDs) are a group of pathologies that individually affect less than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges,...

Bibliographic Details
Main Authors: Matias Rolando, Victor Raggio, Hugo Naya, Lucia Spangenberg, Leticia Cagnina
Format: Article
Language: English
Published: Nature Portfolio 2025-02-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-90450-0
author Matias Rolando
Victor Raggio
Hugo Naya
Lucia Spangenberg
Leticia Cagnina
collection DOAJ
description Abstract Rare diseases (RDs) are a group of pathologies that individually affect less than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges, with an average diagnosis time of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses, and in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from different sources such as in-house data, PubMed and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. Applying various supervised ML methods, including logistic regression, decision trees, support vector machine (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), we validated the use of the proposed resource, achieving 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML in redirecting RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinements.
format Article
id doaj-art-eedb4e98a5b44b52b4d7a761f6c5e3b3
institution DOAJ
issn 2045-2322
language English
publishDate 2025-02-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-eedb4e98a5b44b52b4d7a761f6c5e3b3
2025-08-20T03:04:20Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-02-01
151110
10.1038/s41598-025-90450-0
A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches
Matias Rolando (Bioinformatics Unit, Institut Pasteur de Montevideo)
Victor Raggio (Departamento de Genética, Facultad de Medicina, Universidad de la República)
Hugo Naya (Bioinformatics Unit, Institut Pasteur de Montevideo)
Lucia Spangenberg (Bioinformatics Unit, Institut Pasteur de Montevideo)
Leticia Cagnina (Universidad Nacional de San Luis)
https://doi.org/10.1038/s41598-025-90450-0
title A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches
url https://doi.org/10.1038/s41598-025-90450-0