A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches
Abstract Rare diseases (RDs) are a group of pathologies that individually affect less than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges,...
| Main Authors: | Matias Rolando, Victor Raggio, Hugo Naya, Lucia Spangenberg, Leticia Cagnina |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-02-01 |
| Series: | Scientific Reports |
| Online Access: | https://doi.org/10.1038/s41598-025-90450-0 |
| _version_ | 1849767176047689728 |
|---|---|
| author | Matias Rolando; Victor Raggio; Hugo Naya; Lucia Spangenberg; Leticia Cagnina |
| author_sort | Matias Rolando |
| collection | DOAJ |
| description | Abstract Rare diseases (RDs) are a group of pathologies that individually affect less than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges, with an average diagnosis time of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses, and in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from different sources such as in-house data, PubMed and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. Applying various supervised ML methods, including logistic regression, decision trees, support vector machine (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), we validated the use of the proposed resource, achieving 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML in redirecting RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinements. |
| format | Article |
| id | doaj-art-eedb4e98a5b44b52b4d7a761f6c5e3b3 |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-eedb4e98a5b44b52b4d7a761f6c5e3b3; 2025-08-20T03:04:20Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-02-01; vol. 15, no. 1, pp. 1-10; 10.1038/s41598-025-90450-0; Matias Rolando (Bioinformatics Unit, Institut Pasteur de Montevideo); Victor Raggio (Departamento de Genética, Facultad de Medicina, Universidad de la República); Hugo Naya (Bioinformatics Unit, Institut Pasteur de Montevideo); Lucia Spangenberg (Bioinformatics Unit, Institut Pasteur de Montevideo); Leticia Cagnina (Universidad Nacional de San Luis); https://doi.org/10.1038/s41598-025-90450-0 |
| title | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| url | https://doi.org/10.1038/s41598-025-90450-0 |