Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management

Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries we...

Bibliographic Details
Main Authors: Le Zhang, Tianyi Wang, Yinfeng Zheng, Xiaochuan Kong, Gang Hong, Lei Zang
Format: Article
Language:English
Published: BMC 2025-04-01
Series:Journal of Orthopaedic Surgery and Research
Subjects:
Online Access:https://doi.org/10.1186/s13018-025-05831-y
_version_ 1850284553758834688
author Le Zhang
Tianyi Wang
Yinfeng Zheng
Xiaochuan Kong
Gang Hong
Lei Zang
author_facet Le Zhang
Tianyi Wang
Yinfeng Zheng
Xiaochuan Kong
Gang Hong
Lei Zang
author_sort Le Zhang
collection DOAJ
description Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted into ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results Interrater agreement between the evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). Both versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was observed for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating guideline-concordant recommendations for PF management. However, notable issues remained, including performance variation across prompt strategies, recommendation grades, and recommendation types, so the models should still be used with caution.
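The abstract above reports interrater agreement as intraclass correlation coefficients (0.573–0.757) but does not name the ICC form used. As an illustration only, a minimal sketch assuming the common ICC(2,1) form (two-way random effects, absolute agreement, single rater) and hypothetical two-rater Likert scores, not the study's own data or code:

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater
    intraclass correlation for rows of [rater1, rater2, ...] scores."""
    n = len(ratings)     # subjects (rated items)
    k = len(ratings[0])  # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sum-of-squares decomposition
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical five-point Likert scores from two raters (not the study's data)
scores = [[4, 5], [3, 3], [5, 4], [2, 3], [4, 4]]
print(round(icc2_1(scores), 3))
```

Under Koo and Li's widely used thresholds, ICC values between 0.5 and 0.75 indicate moderate reliability and 0.75–0.9 good reliability, which matches how the abstract characterizes its 0.573–0.757 range.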
format Article
id doaj-art-7fe675ebb1f74603811a2e1913c8b9d8
institution OA Journals
issn 1749-799X
language English
publishDate 2025-04-01
publisher BMC
record_format Article
series Journal of Orthopaedic Surgery and Research
spelling doaj-art-7fe675ebb1f74603811a2e1913c8b9d82025-08-20T01:47:32ZengBMCJournal of Orthopaedic Surgery and Research1749-799X2025-04-012011710.1186/s13018-025-05831-yAssessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis managementLe Zhang0Tianyi Wang1Yinfeng Zheng2Xiaochuan Kong3Gang Hong4Lei Zang5Department of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityDepartment of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityDepartment of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityDepartment of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityDepartment of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityDepartment of Orthopedics, Beijing Chaoyang Hospital, Capital Medical UniversityAbstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted into ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results Interrater agreement between the evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). Both versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was observed for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating guideline-concordant recommendations for PF management. However, notable issues remained, including performance variation across prompt strategies, recommendation grades, and recommendation types, so the models should still be used with caution.https://doi.org/10.1186/s13018-025-05831-yPlantar fasciitisArtificial intelligenceChatGPTLarge language modelsClinical guidelinesPhysical therapy
spellingShingle Le Zhang
Tianyi Wang
Yinfeng Zheng
Xiaochuan Kong
Gang Hong
Lei Zang
Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
Journal of Orthopaedic Surgery and Research
Plantar fasciitis
Artificial intelligence
ChatGPT
Large language models
Clinical guidelines
Physical therapy
title Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
title_full Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
title_fullStr Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
title_full_unstemmed Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
title_short Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
title_sort assessment of chatgpt s adherence to evidence based clinical practice guidelines for plantar fasciitis management
topic Plantar fasciitis
Artificial intelligence
ChatGPT
Large language models
Clinical guidelines
Physical therapy
url https://doi.org/10.1186/s13018-025-05831-y
work_keys_str_mv AT lezhang assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement
AT tianyiwang assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement
AT yinfengzheng assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement
AT xiaochuankong assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement
AT ganghong assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement
AT leizang assessmentofchatgptsadherencetoevidencebasedclinicalpracticeguidelinesforplantarfasciitismanagement