Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss

Bibliographic Details
Main Authors: Daniel B. Hier, Michael A. Carrithers, Steven K. Platt, Anh Nguyen, Ioannis Giannopoulos, Tayo Obafemi-Ajayi
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Information
Subjects:
Online Access: https://www.mdpi.com/2078-2489/16/6/446
author Daniel B. Hier
Michael A. Carrithers
Steven K. Platt
Anh Nguyen
Ioannis Giannopoulos
Tayo Obafemi-Ajayi
collection DOAJ
description Clinician notes are a rich source of patient information, but often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downstream computational applications that rely on these notes as input, such as quality improvement, population health analytics, precision medicine, clinical decision support, and research. We present a large-language-model (LLM) approach to the preprocessing of 1618 neurology notes. The LLM corrected spelling and grammatical errors, expanded acronyms, and standardized terminology and formatting, without altering clinical content. Expert review of randomly sampled notes confirmed that no significant information was lost. To evaluate downstream impact, we applied an ontology-based NLP pipeline (Doc2Hpo) to extract biomedical concepts from the notes before and after editing. F1 scores for Human Phenotype Ontology extraction improved from 0.40 to 0.61, confirming our hypothesis that better inputs yielded better outputs. We conclude that LLM-based preprocessing is an effective error correction strategy that improves data quality at the level of free text in clinical notes. This approach may enhance the performance of a broad class of downstream applications that derive their input from unstructured clinical documentation.
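The abstract reports the F1 score for Human Phenotype Ontology extraction improving from 0.40 to 0.61. F1 is the harmonic mean of precision and recall over the sets of extracted concepts; a minimal sketch of how such a score is computed for HPO term IDs (the IDs and example sets below are illustrative, not data from the study):

```python
def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over extracted concept IDs."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # true positives: concepts in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)     # fraction of extracted concepts that are correct
    recall = tp / len(gold)             # fraction of gold concepts that were found
    return 2 * precision * recall / (precision + recall)

# Illustrative HPO IDs only (hypothetical, not from the paper):
gold = {"HP:0001250", "HP:0002072", "HP:0001337"}    # seizure, chorea, tremor
before = {"HP:0001250"}                              # raw note: 1 of 3 concepts found
after = {"HP:0001250", "HP:0002072", "HP:0001337"}   # LLM-edited note: all 3 found
```

Here `f1_score(before, gold)` is 0.5 while `f1_score(after, gold)` is 1.0, mirroring (in toy form) how cleaner input text can raise extraction F1 without any change to the extraction pipeline itself.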
format Article
id doaj-art-11fba727df524a25be7df1ba2a00e165
institution Kabale University
issn 2078-2489
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-11fba727df524a25be7df1ba2a00e165
Information, vol. 16, iss. 6, art. 446 (2025-05-01), MDPI AG, ISSN 2078-2489, doi:10.3390/info16060446
Author affiliations:
Daniel B. Hier: Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA
Michael A. Carrithers: Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA
Steven K. Platt: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Anh Nguyen: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Ioannis Giannopoulos: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Tayo Obafemi-Ajayi: Engineering Program, Missouri State University, Springfield, MO 65897, USA
title Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss
topic electronic health records
physician notes
human phenotype ontology
Doc2Hpo
large language models
data interoperability
url https://www.mdpi.com/2078-2489/16/6/446