A comparative analysis of large language models versus traditional information extraction methods for real-world evidence of patient symptomatology in acute and post-acute sequelae of SARS-CoV-2.


Bibliographic Details
Main Authors: Vedansh Thakkar, Greg M Silverman, Abhinab Kc, Nicholas E Ingraham, Emma K Jones, Samantha King, Genevieve B Melton, Rui Zhang, Christopher J Tignanelli
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series:PLoS ONE
Online Access:https://doi.org/10.1371/journal.pone.0323535
Collection: DOAJ
Description:
<h4>Background</h4>Patient symptoms, crucial for disease progression and diagnosis, are often captured in unstructured clinical notes. Large language models (LLMs) offer potential advantages in extracting patient symptoms compared to traditional rule-based information extraction (IE) systems.<h4>Methods</h4>This study compared fine-tuned LLMs (LLaMA2-13B and LLaMA3-8B) against BioMedICUS, a rule-based IE system, for extracting symptoms related to acute and post-acute sequelae of SARS-CoV-2 from clinical notes. The study used three corpora: UMN-COVID, UMN-PASC, and N3C-COVID. Prevalence, keyword, and fairness analyses were conducted to assess symptom distribution and model equity across demographics.<h4>Results</h4>BioMedICUS outperformed the fine-tuned LLMs in most cases. On the UMN-PASC dataset, BioMedICUS achieved a macro-averaged F1-score of 0.70 for positive mention detection, compared with 0.66 for LLaMA2-13B and 0.62 for LLaMA3-8B. On the N3C-COVID dataset, BioMedICUS scored 0.75 for positive mention detection, while LLaMA2-13B and LLaMA3-8B scored 0.53 and 0.68, respectively. However, the LLMs performed better in specific instances, such as detecting positive mentions of change in sleep in the UMN-PASC dataset, where LLaMA2-13B (0.79) and LLaMA3-8B (0.65) outperformed BioMedICUS (0.60). In the fairness analysis, BioMedICUS generally showed stronger performance across patient demographics. Keyword analysis using ANOVA on symptom distributions across all three corpora showed that both corpus (df = 2, p < 0.001) and symptom (df = 79, p < 0.001) have a statistically significant effect on log-transformed term frequency-inverse document frequency (TF-IDF) values, with corpus accounting for 52% of the variance in log-transformed TF-IDF values and symptom accounting for 35%.<h4>Conclusion</h4>While BioMedICUS generally outperformed the LLMs, the LLMs, particularly LLaMA3-8B, showed promising results in specific areas such as identifying negative symptom mentions. However, both LLaMA models faced challenges in demographic fairness and generalizability. These findings underscore the need for diverse, high-quality training datasets and robust annotation processes to enhance LLMs' performance and reliability in clinical applications.
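The F1-scores reported above are macro-averaged: an F1-score is computed per class and the per-class scores are averaged with equal weight, so rare symptom classes count as much as common ones. The sketch below illustrates that calculation on hypothetical per-mention status labels; it is not the authors' evaluation code, and the label names are invented for illustration.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: one-vs-rest F1 per class, then an unweighted mean.

    Illustrative sketch only; label names and data are hypothetical,
    not taken from the study's annotation scheme.
    """
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        scores.append(f1)
    # Equal weight per class, regardless of class frequency.
    return sum(scores) / len(scores)


# Hypothetical toy example: gold vs. predicted symptom-mention statuses.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(round(macro_f1(gold, pred, ["positive", "negative"]), 2))  # prints 0.73
```

Because each class contributes equally to the mean, macro-averaging penalizes a model that does well only on frequent symptoms, which is why it is a common choice for imbalanced clinical extraction tasks.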
ISSN: 1932-6203