Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss
Clinician notes are a rich source of patient information, but often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downs...
| Main Authors: | Daniel B. Hier, Michael A. Carrithers, Steven K. Platt, Anh Nguyen, Ioannis Giannopoulos, Tayo Obafemi-Ajayi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Information |
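| Subjects: | electronic health records; physician notes; human phenotype ontology; Doc2Hpo; large language models; data interoperability |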
| Online Access: | https://www.mdpi.com/2078-2489/16/6/446 |
| _version_ | 1849433891146825728 |
|---|---|
| author | Daniel B. Hier; Michael A. Carrithers; Steven K. Platt; Anh Nguyen; Ioannis Giannopoulos; Tayo Obafemi-Ajayi |
| author_facet | Daniel B. Hier; Michael A. Carrithers; Steven K. Platt; Anh Nguyen; Ioannis Giannopoulos; Tayo Obafemi-Ajayi |
| author_sort | Daniel B. Hier |
| collection | DOAJ |
| description | Clinician notes are a rich source of patient information, but often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downstream computational applications that rely on these notes as input, such as quality improvement, population health analytics, precision medicine, clinical decision support, and research. We present a large-language-model (LLM) approach to the preprocessing of 1618 neurology notes. The LLM corrected spelling and grammatical errors, expanded acronyms, and standardized terminology and formatting, without altering clinical content. Expert review of randomly sampled notes confirmed that no significant information was lost. To evaluate downstream impact, we applied an ontology-based NLP pipeline (Doc2Hpo) to extract biomedical concepts from the notes before and after editing. F1 scores for Human Phenotype Ontology extraction improved from 0.40 to 0.61, confirming our hypothesis that better inputs yielded better outputs. We conclude that LLM-based preprocessing is an effective error correction strategy that improves data quality at the level of free text in clinical notes. This approach may enhance the performance of a broad class of downstream applications that derive their input from unstructured clinical documentation. |
| format | Article |
| id | doaj-art-11fba727df524a25be7df1ba2a00e165 |
| institution | Kabale University |
| issn | 2078-2489 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Information |
| spelling | doaj-art-11fba727df524a25be7df1ba2a00e165; 2025-08-20T03:26:52Z; eng; MDPI AG; Information; ISSN 2078-2489; 2025-05-01; vol. 16, no. 6, art. 446; DOI 10.3390/info16060446; Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss; Daniel B. Hier (Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA); Michael A. Carrithers (Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA); Steven K. Platt (Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA); Anh Nguyen (Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA); Ioannis Giannopoulos (Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA); Tayo Obafemi-Ajayi (Engineering Program, Missouri State University, Springfield, MO 65897, USA); https://www.mdpi.com/2078-2489/16/6/446; keywords: electronic health records; physician notes; human phenotype ontology; Doc2Hpo; large language models; data interoperability |
| title | Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss |
| topic | electronic health records; physician notes; human phenotype ontology; Doc2Hpo; large language models; data interoperability |
| url | https://www.mdpi.com/2078-2489/16/6/446 |