Accuracy, appropriateness, and readability of ChatGPT-4 and ChatGPT-3.5 in answering pediatric emergency medicine post-discharge questions
Purpose Large language models (LLMs) like ChatGPT (OpenAI) are increasingly used in healthcare, raising questions about their accuracy and reliability for medical information. This study compared 2 versions of ChatGPT in answering post-discharge follow-up questions in the area of pediatric emergency medicine (PEM).
| Main Authors: | Mitul Gupta, Aiza Kahlun, Ria Sur, Pramiti Gupta, Andrew Kienstra, Winnie Whitaker, Graham Aufricht |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Korean Society of Pediatric Emergency Medicine, 2025-04-01 |
| Series: | Pediatric Emergency Medicine Journal |
| Subjects: | artificial intelligence, patient discharge, patient education as topic, pediatric emergency medicine, language |
| Online Access: | http://pemj.org/upload/pdf/pemj-2024-01074.pdf |
| _version_ | 1849734631668056064 |
|---|---|
| author | Mitul Gupta, Aiza Kahlun, Ria Sur, Pramiti Gupta, Andrew Kienstra, Winnie Whitaker, Graham Aufricht |
| author_sort | Mitul Gupta |
| collection | DOAJ |
| description | Purpose Large language models (LLMs) like ChatGPT (OpenAI) are increasingly used in healthcare, raising questions about their accuracy and reliability for medical information. This study compared 2 versions of ChatGPT in answering post-discharge follow-up questions in the area of pediatric emergency medicine (PEM). Methods Twenty-three common post-discharge questions were posed to ChatGPT-4 and -3.5, with responses generated before and after a simplification request. Two blinded PEM physicians evaluated appropriateness and accuracy as the primary endpoints. Secondary endpoints included word count and readability. Six established reading scales were averaged: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease. T-tests and Cohen's kappa were used to determine differences and inter-rater agreement, respectively. Results The physician evaluations showed high appropriateness for both default responses (ChatGPT-4, 91.3%-100% vs. ChatGPT-3.5, 91.3%) and simplified responses (both 87.0%-91.3%). Accuracy was also high for default (87.0%-95.7% vs. 87.0%-91.3%) and simplified responses (both 82.6%-91.3%). Inter-rater agreement was fair overall (κ = 0.37; P < 0.001). For default responses, ChatGPT-4 produced longer outputs than ChatGPT-3.5 (233.0 ± 97.1 vs. 199.6 ± 94.7 words; P = 0.043), with similar readability (13.3 ± 1.9 vs. 13.5 ± 1.8; P = 0.404). After simplification, both LLMs improved in word count and readability (P < 0.001), with ChatGPT-4 achieving a readability suitable for eighth-grade students in the United States (7.7 ± 1.3 vs. 8.2 ± 1.5; P = 0.027). Conclusion The responses of ChatGPT-4 and -3.5 to post-discharge questions were deemed appropriate and accurate by the PEM physicians. While ChatGPT-4 showed an edge in simplifying language, neither LLM consistently met the recommended sixth-grade reading level. These findings suggest a potential role for LLMs in communicating with guardians. |
| format | Article |
| id | doaj-art-ecbeca31b7f341b7be7421a3fd49d411 |
| institution | DOAJ |
| issn | 2383-4897, 2508-5506 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Korean Society of Pediatric Emergency Medicine |
| record_format | Article |
| series | Pediatric Emergency Medicine Journal |
| spelling | Pediatric Emergency Medicine Journal. 2025-04-01;12(2):62-72. doi: 10.22470/pemj.2024.01074. Affiliations: Mitul Gupta, Aiza Kahlun, and Ria Sur, Department of Diagnostic Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA; Pramiti Gupta, Undergraduate Program, The University of Texas at Austin, Austin, TX, USA; Andrew Kienstra, Winnie Whitaker, and Graham Aufricht, Department of Pediatric Emergency Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA. |
| title | Accuracy, appropriateness, and readability of ChatGPT-4 and ChatGPT-3.5 in answering pediatric emergency medicine post-discharge questions |
| topic | artificial intelligence, patient discharge, patient education as topic, pediatric emergency medicine, language |
| url | http://pemj.org/upload/pdf/pemj-2024-01074.pdf |
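The two statistics named in the abstract's methods can be illustrated with a short sketch. This is not the study's code: the syllable counter below is a crude vowel-group heuristic (the readability tools the authors likely used rely on more careful syllabification), and the rater labels are invented toy data. It shows one of the six averaged scales (Flesch-Kincaid Grade Level) and Cohen's kappa for two raters' appropriateness judgments.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Standard Flesch-Kincaid Grade Level formula:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) \
        + 11.8 * (syllables / len(words)) - 15.59

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # Observed agreement: fraction of items the raters labeled identically.
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    pe = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Hypothetical discharge instruction and toy appropriateness ratings.
text = ("Give your child plenty of fluids. Return to the emergency "
        "department if the fever lasts more than three days.")
ratings_a = ["yes", "yes", "no", "yes", "no", "yes"]
ratings_b = ["yes", "no", "no", "yes", "yes", "yes"]

print(round(flesch_kincaid_grade(text), 1))
print(round(cohens_kappa(ratings_a, ratings_b), 2))
```

Averaging six scales, as the study did, would repeat the `flesch_kincaid_grade` pattern with each formula's coefficients; note that Flesch Reading Ease runs on an inverted 0-100 scale rather than a grade level.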