Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management
Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted to ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results The interrater agreement between evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). The two versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was identified for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation-strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating recommendations concordant with clinical guidelines for PF management. However, performance varied with prompt strategy, recommendation grade, and recommendation type, so the models should still be used with caution.
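The group-wise Likert-score comparisons described in the abstract can be sketched in a few lines. The ratings below are hypothetical (the study's per-query scores are not reproduced here), and the Mann-Whitney U test is an assumed choice for comparing ordinal scores; the abstract does not name the statistical test the authors used.

```python
# Illustrative sketch with hypothetical five-point Likert ratings:
# comparing two model versions on one evaluation dimension.
from statistics import mean, stdev
from scipy.stats import mannwhitneyu

ratings_4o    = [4, 5, 4, 3, 4, 5, 4, 4, 5, 4]   # hypothetical ChatGPT-4o scores
ratings_turbo = [4, 4, 5, 4, 3, 5, 4, 4, 4, 5]   # hypothetical ChatGPT-4 Turbo scores

# Summary in the abstract's "mean ± SD" reporting style
print(f"ChatGPT-4o:      {mean(ratings_4o):.1f} ± {stdev(ratings_4o):.1f}")
print(f"ChatGPT-4 Turbo: {mean(ratings_turbo):.1f} ± {stdev(ratings_turbo):.1f}")

# Ordinal Likert data are commonly compared with a non-parametric test
stat, p = mannwhitneyu(ratings_4o, ratings_turbo, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, P = {p:.3f}")
```

A P value above 0.05 here would indicate no significant difference between the two versions, mirroring the model-vs-model results reported in the abstract.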
| Main Authors: | Le Zhang, Tianyi Wang, Yinfeng Zheng, Xiaochuan Kong, Gang Hong, Lei Zang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-04-01 |
| Series: | Journal of Orthopaedic Surgery and Research |
| Subjects: | Plantar fasciitis; Artificial intelligence; ChatGPT; Large Language models; Clinical guidelines; Physical therapy |
| Online Access: | https://doi.org/10.1186/s13018-025-05831-y |
| author | Le Zhang, Tianyi Wang, Yinfeng Zheng, Xiaochuan Kong, Gang Hong, Lei Zang |
|---|---|
| collection | DOAJ |
| description | Abstract Purpose This study aimed to test the multidimensional performance of Chat-Generative Pre-trained Transformer (ChatGPT) in generating recommendations for the management of plantar fasciitis (PF) that adhere to well-established clinical practice guidelines. Materials and methods Twenty-one queries were derived from the 2023 APTA guideline recommendations for PF and prompted to ChatGPT-4o and ChatGPT-4 Turbo. Two experienced orthopaedic physicians evaluated the responses for accuracy, consistency, self-awareness, and fabrication and falsification using five-point Likert scales. Group-wise comparisons were conducted between the two models and across subgroups. Results The interrater agreement between evaluators was moderate to good (intraclass correlation coefficients of 0.573–0.757). The two versions of ChatGPT performed well and comparably across all dimensions, including accuracy ([4.1 ± 0.8] vs. [4.1 ± 0.7], P = 0.959), consistency ([4.6 ± 0.5] vs. [4.6 ± 0.6], P = 0.890), self-awareness ([4.3 ± 0.6] vs. [4.5 ± 0.5], P = 0.407), and fabrication and falsification ([4.6 ± 0.6] vs. [4.5 ± 0.4], P = 0.681). In the subgroup comparisons, better performance was identified for closed-ended questions and for positive rather than negative recommendations (P < 0.05). No significant differences were found between recommendation-strength subgroups, except in fabrication and falsification ([4.4 ± 0.6] vs. [5.0 ± 0], P = 0.001). Conclusions The two mainstream versions of ChatGPT showed comparably strong performance in generating recommendations concordant with clinical guidelines for PF management. However, performance varied with prompt strategy, recommendation grade, and recommendation type, so the models should still be used with caution. |
| format | Article |
| id | doaj-art-7fe675ebb1f74603811a2e1913c8b9d8 |
| institution | OA Journals |
| issn | 1749-799X |
| language | English |
| publishDate | 2025-04-01 |
| publisher | BMC |
| record_format | Article |
| series | Journal of Orthopaedic Surgery and Research |
| affiliation | Department of Orthopedics, Beijing Chaoyang Hospital, Capital Medical University (all authors) |
| title | Assessment of ChatGPT’s adherence to evidence-based clinical practice guidelines for plantar fasciitis management |
| topic | Plantar fasciitis; Artificial intelligence; ChatGPT; Large Language models; Clinical guidelines; Physical therapy |
| url | https://doi.org/10.1186/s13018-025-05831-y |