Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study


Bibliographic Details
Main Authors:	Andrew Houston, Sophie Williams, William Ricketts, Charles Gutteridge, Chris Tackaberry, John Conibear
Format:	Article
Language:	English
Published:	BMC 2024-12-01
Series:	BMC Medical Informatics and Decision Making
Subjects:	Electronic health records, Natural language processing, Cancer, Diagnostics, SNOMED-CT, Machine learning
Online Access:	https://doi.org/10.1186/s12911-024-02790-y

collection	DOAJ
description	Abstract
Background: The digitisation of healthcare records has generated vast amounts of unstructured data, which offers opportunities to improve disease diagnosis where clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach that uses natural language processing to extract clinical concepts from free text, which are then used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data.
Methods: Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included. ICD-10 codes and unstructured data were pulled from their electronic health records (EHRs) for the 12 months preceding the CXR. The unstructured data were processed using named entity recognition to extract symptoms, which were mapped to SNOMED-CT codes. Subsumption of features up the SNOMED-CT hierarchy was used to mitigate sparse features, and a frequency-based criterion, combined with univariate logarithmic probabilities, was applied to select candidate features to take forward to the model development phase. A genetic algorithm was employed to identify the most discriminating features to form the diagnostic criteria.
Results: A total of 75,002 patients were included, with 1,012 lung cancer diagnoses made within 12 months of the CXR. The best-performing model achieved an AUROC of 0.72. An existing ‘disorder of the lung’, such as pneumonia, and a ‘cough’ increased the probability of a lung cancer diagnosis. ‘Anomalies of great vessel’, ‘disorder of the retroperitoneal compartment’ and ‘context-dependent findings’, such as pain, statistically reduced the risk of lung cancer, making other diagnoses more likely. The developed model was compared with existing cancer risk scores and outperformed them.
Conclusions: The proposed methods successfully leveraged unstructured secondary-care data to derive diagnostic criteria for lung cancer, outperforming existing risk tools. These advancements show potential for enhancing patient care and outcomes. However, the limitations of the approach need to be addressed by integrating primary care data, so that diagnostic criteria are developed more thoroughly and with less bias. The study also highlights the importance of contextualising SNOMED-CT concepts into meaningful terminology that resonates with clinicians, giving a clearer and more tangible understanding of the criteria applied.
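The subsumption and frequency-based selection steps named in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact algorithm, so everything below is an assumption made for illustration: the toy child-to-parent hierarchy (plain labels, not real SNOMED-CT identifiers), the function names `ancestors`, `expand` and `frequent_features`, and the threshold `min_count`. The idea shown is that each extracted concept also counts towards all of its ancestors, so rare specific findings pool into more frequent general ones before a frequency filter is applied. The univariate log-probability step and the genetic algorithm are not covered here.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> parent). The labels are illustrative
# only and are NOT real SNOMED-CT concept identifiers.
PARENT = {
    "productive cough": "cough",
    "dry cough": "cough",
    "cough": "respiratory finding",
    "pneumonia": "disorder of lung",
    "bronchiectasis": "disorder of lung",
}


def ancestors(concept):
    """Return the ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def expand(concepts):
    """Add every ancestor of every concept (subsumption up the hierarchy)."""
    expanded = set(concepts)
    for concept in concepts:
        expanded.update(ancestors(concept))
    return expanded


def frequent_features(patients, min_count):
    """Expand each patient's concepts, then keep only concepts that occur
    in at least min_count patients. Returns (features, counts)."""
    expanded = [expand(p) for p in patients]
    # Each patient contributes at most once per concept because p is a set.
    counts = Counter(c for p in expanded for c in p)
    keep = {c for c, n in counts.items() if n >= min_count}
    return [p & keep for p in expanded], counts


# Four hypothetical patients, each with concepts extracted from free text.
patients = [
    {"productive cough"},
    {"dry cough"},
    {"cough", "pneumonia"},
    {"bronchiectasis"},
]

features, counts = frequent_features(patients, min_count=2)
for i, f in enumerate(features):
    print(i, sorted(f))
```

In this toy data every specific finding occurs in a single patient, so none would pass the threshold alone; after expansion the pooled concepts 'cough' and 'disorder of lung' do. A real pipeline would query a terminology server or a SNOMED-CT release for the hierarchy, which allows a concept to have more than one parent.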
id	doaj-art-fbcd42d6ad6b48029bc049e86d162234
institution	Kabale University
issn	1472-6947
spelling	Record timestamp: 2024-12-08T12:32:48Z. Volume 24, issue 1, pages 1-10. Author affiliations: Andrew Houston (Barts Life Sciences, Barts Health NHS Trust); Sophie Williams (Barts Life Sciences, Barts Health NHS Trust); William Ricketts (Respiratory Medicine, Barts Health NHS Trust); Charles Gutteridge (Barts Life Sciences, Barts Health NHS Trust); Chris Tackaberry (Clinithink Ltd.); John Conibear (Barts Cancer Centre, Barts Health NHS Trust)
