Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss

Bibliographic Details
Main Authors: Daniel B. Hier, Michael A. Carrithers, Steven K. Platt, Anh Nguyen, Ioannis Giannopoulos, Tayo Obafemi-Ajayi
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Information
Subjects:
Online Access: https://www.mdpi.com/2078-2489/16/6/446
author Daniel B. Hier
Michael A. Carrithers
Steven K. Platt
Anh Nguyen
Ioannis Giannopoulos
Tayo Obafemi-Ajayi
collection DOAJ
description Clinician notes are a rich source of patient information, but often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downstream computational applications that rely on these notes as input, such as quality improvement, population health analytics, precision medicine, clinical decision support, and research. We present a large-language-model (LLM) approach to the preprocessing of 1618 neurology notes. The LLM corrected spelling and grammatical errors, expanded acronyms, and standardized terminology and formatting, without altering clinical content. Expert review of randomly sampled notes confirmed that no significant information was lost. To evaluate downstream impact, we applied an ontology-based NLP pipeline (Doc2Hpo) to extract biomedical concepts from the notes before and after editing. F1 scores for Human Phenotype Ontology extraction improved from 0.40 to 0.61, confirming our hypothesis that better inputs yielded better outputs. We conclude that LLM-based preprocessing is an effective error correction strategy that improves data quality at the level of free text in clinical notes. This approach may enhance the performance of a broad class of downstream applications that derive their input from unstructured clinical documentation.
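The abstract reports the F1 score for Human Phenotype Ontology extraction improving from 0.40 to 0.61. F1 is the harmonic mean of precision and recall over the sets of extracted concepts; a minimal sketch of how such a score is computed for HPO term IDs (the IDs and example sets below are illustrative, not data from the study):

```python
def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over extracted concept IDs."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # true positives: concepts in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)     # fraction of extracted concepts that are correct
    recall = tp / len(gold)             # fraction of gold concepts that were found
    return 2 * precision * recall / (precision + recall)

# Illustrative HPO IDs only (hypothetical, not from the paper):
gold = {"HP:0001250", "HP:0002072", "HP:0001337"}    # seizure, chorea, tremor
before = {"HP:0001250"}                              # raw note: 1 of 3 concepts found
after = {"HP:0001250", "HP:0002072", "HP:0001337"}   # LLM-edited note: all 3 found
```

Here `f1_score(before, gold)` is 0.5 while `f1_score(after, gold)` is 1.0, mirroring (in toy form) how cleaner input text can raise extraction F1 without any change to the extraction pipeline itself.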
format Article
id doaj-art-11fba727df524a25be7df1ba2a00e165
institution Kabale University
issn 2078-2489
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-11fba727df524a25be7df1ba2a00e165
Information, vol. 16, iss. 6, art. 446 (2025-05-01), MDPI AG, ISSN 2078-2489, doi:10.3390/info16060446
Author affiliations:
Daniel B. Hier: Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA
Michael A. Carrithers: Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA
Steven K. Platt: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Anh Nguyen: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Ioannis Giannopoulos: Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
Tayo Obafemi-Ajayi: Engineering Program, Missouri State University, Springfield, MO 65897, USA
title Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss
topic electronic health records
physician notes
human phenotype ontology
Doc2Hpo
large language models
data interoperability
url https://www.mdpi.com/2078-2489/16/6/446