Evaluating chain-of-thought prompting in a GPT chatbot for BCID2 interpretation and stewardship: how does AI compare to human experts?

Abstract

Background: Rapid molecular diagnostics, such as the BIOFIRE® Blood Culture Identification 2 (BCID2) panel, have shortened the time to pathogen identification in bloodstream infections. However, accurate interpretation and antimicrobial optimization require Infectious Disease (ID) expertise, which may not always be readily available. GPT-powered chatbots could support antimicrobial stewardship programs (ASPs) by assisting non-specialist providers with BCID2 result interpretation and treatment recommendations. This study evaluates the performance of a GPT-4 chatbot against ASP prospective audit and feedback interventions.

Methods: This prospective observational study assessed 43 consecutive real-world cases of bacteremia at a 399-bed VA Medical Center from January to May 2024. The GPT chatbot used "chain-of-thought" prompting and external knowledge integration to generate recommendations. Two independent ID physicians evaluated chatbot and ASP recommendations across four domains: BCID2 interpretation, source control, antibiotic therapy, and additional diagnostic workup. The primary endpoint was the combined rate of harmful or inadequate recommendations; secondary endpoints assessed that rate within each domain.

Results: The chatbot had a significantly higher rate of harmful or inadequate recommendations than ASP (13% vs. 4%, p = 0.047). The largest discrepancy was in antibiotic therapy, where harmful recommendations occurred in up to 10% of chatbot evaluations (p < 0.05). The chatbot performed well in BCID2 interpretation (100% accuracy) but gave more inadequate responses on source control (10% vs. 2% for ASP, p = 0.022).

Conclusions: GPT-powered chatbots show potential for supporting antimicrobial stewardship but should complement, not replace, human expertise in infectious disease management.
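The study's actual prompt and external knowledge sources are not published in this record. Purely as an illustration of what "chain-of-thought" prompting over the four evaluated domains could look like, here is a minimal template sketch; every field name and instruction string is hypothetical:

```python
# Hypothetical sketch of a chain-of-thought prompt for BCID2 result
# interpretation. The study's real prompt and knowledge integration are
# not published here, so all wording below is illustrative only.

BCID2_COT_TEMPLATE = """You are assisting an antimicrobial stewardship review.

BCID2 panel result: {bcid2_result}
Gram stain: {gram_stain}
Current antibiotics: {current_therapy}
Relevant history: {history}

Reason step by step before answering:
1. Interpret the BCID2 targets and resistance markers detected.
2. Assess whether source control measures should be considered.
3. Recommend antibiotic therapy (escalation, de-escalation, or no change).
4. List any additional diagnostic workup that is indicated.

After your reasoning, give a final recommendation for each of the four
domains above."""


def build_prompt(bcid2_result, gram_stain, current_therapy, history):
    """Fill the template with one case's data."""
    return BCID2_COT_TEMPLATE.format(
        bcid2_result=bcid2_result,
        gram_stain=gram_stain,
        current_therapy=current_therapy,
        history=history,
    )


if __name__ == "__main__":
    print(build_prompt(
        "Staphylococcus aureus detected; mecA/mecC not detected",
        "Gram-positive cocci in clusters",
        "vancomycin",
        "no hardware, no endocarditis history",
    ))
```

The numbered "reason step by step" scaffold mirrors the four evaluation domains (interpretation, source control, therapy, workup), which is the structural idea behind chain-of-thought prompting: elicit intermediate reasoning before the final recommendation.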
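The abstract reports rates (13% vs. 4%, p = 0.047) but not the underlying counts or the exact test used. For readers who want to reproduce this style of comparison, a two-sided Fisher's exact test on a 2x2 table, a common choice for comparing such rates, can be sketched with only the standard library; the counts below are made up for illustration and are not the study's data:

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b],
     [c, d]],
    e.g. rows = chatbot vs. ASP, columns = harmful/inadequate vs. acceptable.
    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def p_table(x):
        # Probability of observing x "events" in row 1, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts only: (harmful/inadequate, acceptable) per arm.
chatbot = (22, 150)
asp = (7, 165)
p = fisher_exact_two_sided(chatbot[0], chatbot[1], asp[0], asp[1])
print(f"two-sided Fisher p = {p:.4f}")
```

With real data one would simply substitute the study's observed numerators and denominators; `scipy.stats.fisher_exact` gives the same p-value for a 2x2 table if SciPy is available.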

Bibliographic Details
Main Authors: Daniel M. Tassone, Matthew M. Hitchcock, Connor J. Rossier, Douglas Fletcher, Julia Ye, Ian Langford, Julie Boatman, J. Daniel Markley
Format: Article
Language:English
Published: Cambridge University Press 2025-01-01
Series:Antimicrobial Stewardship & Healthcare Epidemiology
Online Access:https://www.cambridge.org/core/product/identifier/S2732494X25100594/type/journal_article
ISSN: 2732-494X
DOI: 10.1017/ash.2025.10059