Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted to ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results The interrater agreement between evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). The two versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was identified for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation-strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating recommendations concordant with clinical guidelines for PF management. However, performance varied with prompt strategy, recommendation grade, and recommendation type, so the models should still be used with caution.
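The group-wise Likert-score comparisons described in the abstract can be sketched in a few lines. The ratings below are hypothetical (the study's per-query scores are not reproduced here), and the Mann-Whitney U test is an assumed choice for comparing ordinal scores; the abstract does not name the statistical test the authors used.

```python
# Illustrative sketch with hypothetical five-point Likert ratings:
# comparing two model versions on one evaluation dimension.
from statistics import mean, stdev
from scipy.stats import mannwhitneyu

ratings_4o    = [4, 5, 4, 3, 4, 5, 4, 4, 5, 4]   # hypothetical ChatGPT-4o scores
ratings_turbo = [4, 4, 5, 4, 3, 5, 4, 4, 4, 5]   # hypothetical ChatGPT-4 Turbo scores

# Summary in the abstract's "mean ± SD" reporting style
print(f"ChatGPT-4o:      {mean(ratings_4o):.1f} ± {stdev(ratings_4o):.1f}")
print(f"ChatGPT-4 Turbo: {mean(ratings_turbo):.1f} ± {stdev(ratings_turbo):.1f}")

# Ordinal Likert data are commonly compared with a non-parametric test
stat, p = mannwhitneyu(ratings_4o, ratings_turbo, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, P = {p:.3f}")
```

A P value above 0.05 here would indicate no significant difference between the two versions, mirroring the model-vs-model results reported in the abstract.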
| Main Authors: | Le Zhang, Tianyi Wang, Yinfeng Zheng, Xiaochuan Kong, Gang Hong, Lei Zang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-04-01 |
| Series: | Journal of Orthopaedic Surgery and Research |
| Subjects: | Plantar fasciitis; Artificial intelligence; ChatGPT; Large Language models; Clinical guidelines; Physical therapy |
| Online Access: | https://doi.org/10.1186/s13018-025-05831-y |
| author | Le Zhang, Tianyi Wang, Yinfeng Zheng, Xiaochuan Kong, Gang Hong, Lei Zang |
|---|---|
| collection | DOAJ |
| description | Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted to ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results The interrater agreement between evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). The two versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was identified for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation-strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating recommendations concordant with clinical guidelines for PF management. However, performance varied with prompt strategy, recommendation grade, and recommendation type, so the models should still be used with caution. |
| format | Article |
| id | doaj-art-7fe675ebb1f74603811a2e1913c8b9d8 |
| institution | OA Journals |
| issn | 1749-799X |
| language | English |
| publishDate | 2025-04-01 |
| publisher | BMC |
| record_format | Article |
| series | Journal of Orthopaedic Surgery and Research |
| affiliation | Department of Orthopedics, Beijing Chaoyang Hospital, Capital Medical University (all authors) |
| title | Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management |
| topic | Plantar fasciitis; Artificial intelligence; ChatGPT; Large Language models; Clinical guidelines; Physical therapy |
| url | https://doi.org/10.1186/s13018-025-05831-y |