Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records

Bibliographic Details
Main Authors: Judit M. Wulcan, Kevin L. Jacques, Mary Ann Lee, Samantha L. Kovacs, Nicole Dausend, Lauren E. Prince, Jonatan Wulcan, Sina Marsilio, Stefan M. Keller
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
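Series: Frontiers in Veterinary Science
Subjects: machine learning; artificial intelligence; generative-pretrained transformers; Chat-GPT; text mining; feline chronic enteropathy
Online Access: https://www.frontiersin.org/articles/10.3389/fvets.2024.1490030/full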
author Judit M. Wulcan
Kevin L. Jacques
Mary Ann Lee
Samantha L. Kovacs
Nicole Dausend
Lauren E. Prince
Jonatan Wulcan
Sina Marsilio
Stefan M. Keller
collection DOAJ
description Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of hyperparameter settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. When compared to the majority opinion of human respondents, GPT-4o demonstrated 96.9% sensitivity [interquartile range (IQR) 92.9–99.3%], 97.6% specificity (IQR 96.5–98.5%), 80.7% positive predictive value (IQR 70.8–84.6%), 99.5% negative predictive value (IQR 99.0–99.9%), 84.4% F1 score (IQR 77.3–90.4%), and 96.3% balanced accuracy (IQR 95.0–97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9–84.8%). GPT-4o demonstrated greater reproducibility than human pairs, with an average Cohen's kappa of 0.98 (IQR 0.98–0.99) compared to 0.80 (IQR 0.78–0.81) with humans. Most GPT-4o errors occurred in instances where humans disagreed [35/43 errors (81.4%)], suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction, but requires validation for the intended setting to ensure accuracy and reliability.
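The metrics named in the description can be illustrated with a short sketch. The Python below is not the authors' analysis code (this record does not include it); it is a minimal example, assuming binary labels (1 = clinical sign present, 0 = absent) and an odd number of human raters so that a majority always exists. The function names and the toy ratings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(ratings):
    """Return the per-record majority label from an odd number of binary raters."""
    return [1 if 2 * sum(item) > len(item) else 0 for item in ratings]


def confusion_counts(reference, predicted):
    """Count true/false positives and negatives against a reference standard."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard definitions; assumes no denominator is zero (true for the toy data)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def cohens_kappa(a, b):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: six records, five human raters each, one model run.
human_ratings = [
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
model_labels = [1, 0, 1, 0, 1, 1]

reference = majority_vote(human_ratings)
metrics = classification_metrics(*confusion_counts(reference, model_labels))
kappa = cohens_kappa(reference, model_labels)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"kappa (model vs. majority): {kappa:.3f}")
```

In this toy example the model's single disagreement with the majority falls on a record where the human raters were themselves split 1 to 4, which is the kind of case the description attributes to ambiguity in the EHR text. The description reports kappa between repeated model runs and between human pairs; the same function applies to any two label sequences.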
format Article
id doaj-art-e48ba2f1e52345868511a68da5de3d72
institution Kabale University
issn 2297-1769
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
series Frontiers in Veterinary Science
spelling doaj-art-e48ba2f1e52345868511a68da5de3d72
Record timestamp: 2025-01-16T13:48:55Z
Volume: 11
DOI: 10.3389/fvets.2024.1490030
Article number: 1490030
Author affiliations:
Judit M. Wulcan: Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Kevin L. Jacques: Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Mary Ann Lee: College of Veterinary Medicine and Biomedical Sciences, James L. Voss Veterinary Teaching Hospital, Colorado State University, Fort Collins, CO, United States
Samantha L. Kovacs: Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Nicole Dausend: Department of Medicine and Epidemiology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Lauren E. Prince: Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Jonatan Wulcan: Independent Researcher, Malmö, Sweden
Sina Marsilio: Department of Medicine and Epidemiology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
Stefan M. Keller: Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
title Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
topic machine learning
artificial intelligence
generative-pretrained transformers
Chat-GPT
text mining
feline chronic enteropathy
url https://www.frontiersin.org/articles/10.3389/fvets.2024.1490030/full