Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study

Abstract Background The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities for improvements in disease diagnosis when clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach using natural lan...

Bibliographic Details
Main Authors: Andrew Houston, Sophie Williams, William Ricketts, Charles Gutteridge, Chris Tackaberry, John Conibear
Format: Article
Language: English
Published: BMC 2024-12-01
Series: BMC Medical Informatics and Decision Making
Subjects:
Online Access: https://doi.org/10.1186/s12911-024-02790-y
_version_ 1846137078405922816
author Andrew Houston
Sophie Williams
William Ricketts
Charles Gutteridge
Chris Tackaberry
John Conibear
collection DOAJ
description Abstract Background: The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities to improve disease diagnosis where clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach that uses natural language processing to extract clinical concepts from free text, which are then used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data. Methods: Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included. ICD-10 and unstructured data were pulled from their electronic health records (EHRs) for the 12 months preceding the CXR. The unstructured data were processed using named entity recognition to extract symptoms, which were mapped to SNOMED-CT codes. Subsumption of features up the SNOMED-CT hierarchy was used to mitigate feature sparsity, and a frequency-based criterion, combined with univariate logarithmic probabilities, was applied to select candidate features for the model development phase. A genetic algorithm was employed to identify the most discriminating features to form the diagnostic criteria. Results: A total of 75,002 patients were included, with 1,012 lung cancer diagnoses made within 12 months of the CXR. The best-performing model achieved an AUROC of 0.72. An existing ‘disorder of the lung’, such as pneumonia, and a ‘cough’ increased the probability of a lung cancer diagnosis. ‘Anomalies of great vessel’, ‘disorder of the retroperitoneal compartment’ and ‘context-dependent findings’, such as pain, statistically reduced the risk of lung cancer, making other diagnoses more likely. The developed model outperformed the existing cancer risk scores against which it was compared.
Conclusions: The proposed methods successfully leveraged unstructured secondary-care data to derive diagnostic criteria for lung cancer, outperforming existing risk tools. These advancements show potential for enhancing patient care and outcomes. However, the study's limitations need to be addressed by integrating primary care data, so that diagnostic criteria are developed more thoroughly and with less bias. The study also highlights the importance of contextualising SNOMED-CT concepts into terminology that is meaningful to clinicians, giving a clearer and more tangible understanding of the criteria applied.
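The subsumption and candidate-selection steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny hierarchy, the plain-text concept labels standing in for SNOMED-CT codes, the function names, the thresholds and the use of a smoothed log odds ratio as the univariate score are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log

# Hypothetical child -> parents edges; labels stand in for SNOMED-CT concept IDs.
PARENTS = {
    "pneumonia": ["disorder of lung"],
    "productive cough": ["cough"],
    "disorder of lung": [],
    "cough": [],
}


def ancestors_or_self(concept, parents=PARENTS):
    """Return the concept plus every ancestor reachable through is-a edges."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def subsume(patient_concepts, parents=PARENTS):
    """Expand each patient's extracted concepts with all of their ancestors."""
    return [
        set().union(*(ancestors_or_self(c, parents) for c in concepts))
        for concepts in patient_concepts
    ]


def log_odds_ratio(features, labels, concept, smoothing=0.5):
    """Smoothed log odds ratio of the outcome given presence of a concept.

    Assumed stand-in for the 'univariate logarithmic probabilities' in the abstract.
    """
    a = b = c = d = smoothing  # cells of the 2x2 presence-by-outcome table
    for feats, y in zip(features, labels):
        present = concept in feats
        if present and y:
            a += 1
        elif present:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return log((a * d) / (b * c))


def select_candidates(features, labels, min_count=2, min_abs_log_odds=0.5):
    """Keep concepts that are frequent enough and univariately informative."""
    counts = defaultdict(int)
    for feats in features:
        for concept in feats:
            counts[concept] += 1
    selected = {}
    for concept, n in counts.items():
        if n < min_count:
            continue  # frequency-based criterion
        score = log_odds_ratio(features, labels, concept)
        if abs(score) >= min_abs_log_odds:
            selected[concept] = score
    return selected


# Toy data: concepts extracted per patient, and whether lung cancer was diagnosed.
patients = [{"pneumonia"}, {"productive cough"},
            {"pneumonia", "productive cough"}, set(), {"cough"}, set()]
labels = [1, 1, 1, 0, 0, 0]

features = subsume(patients)
candidates = select_candidates(features, labels)
for concept, score in sorted(candidates.items()):
    print(f"{concept}: {score:+.2f}")
```

Rolling concepts up the hierarchy means a rare leaf concept such as 'pneumonia' also counts towards its broader parent, so the parent can pass a frequency threshold that the leaf alone might not. In the study, the surviving candidates are then passed to a genetic algorithm for final selection, which this sketch does not cover.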
format Article
id doaj-art-fbcd42d6ad6b48029bc049e86d162234
institution Kabale University
issn 1472-6947
language English
publishDate 2024-12-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj-art-fbcd42d6ad6b48029bc049e86d162234; 2024-12-08T12:32:48Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2024-12-01; vol. 24, no. 1, pp. 1-10; 10.1186/s12911-024-02790-y
Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study
Andrew Houston (Barts Life Sciences, Barts Health NHS Trust)
Sophie Williams (Barts Life Sciences, Barts Health NHS Trust)
William Ricketts (Respiratory Medicine, Barts Health NHS Trust)
Charles Gutteridge (Barts Life Sciences, Barts Health NHS Trust)
Chris Tackaberry (Clinithink Ltd.)
John Conibear (Barts Cancer Centre, Barts Health NHS Trust)
https://doi.org/10.1186/s12911-024-02790-y
Electronic health records; Natural language processing; Cancer; Diagnostics; SNOMED-CT; Machine learning
title Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study
topic Electronic health records
Natural language processing
Cancer
Diagnostics
SNOMED-CT
Machine learning
url https://doi.org/10.1186/s12911-024-02790-y