Can AI-Based ChatGPT Models Accurately Analyze Hand–Wrist Radiographs? A Comparative Study

Background/Aims: The aim of this study was to evaluate the effectiveness of large language model (LLM)-based chatbot systems in predicting bone age and identifying growth stages, and to explore their potential as practical, infrastructure-independent alternatives to conventional m...

Bibliographic Details
Main Authors: Ahmet Yıldırım, Orhan Cicek, Yavuz Selim Genç
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Diagnostics
Subjects: large language models; artificial intelligence; deep learning; ChatGPT; bone age; growth stage
Online Access: https://www.mdpi.com/2075-4418/15/12/1513
author Ahmet Yıldırım
Orhan Cicek
Yavuz Selim Genç
collection DOAJ
description Background/Aims: The aim of this study was to evaluate the effectiveness of large language model (LLM)-based chatbot systems in predicting bone age and identifying growth stages, and to explore their potential as practical, infrastructure-independent alternatives to conventional methods and convolutional neural network (CNN)-based deep learning models. Methods: This study evaluated the performance of three ChatGPT-based models (GPT-4o, GPT-o4-mini-high, and GPT-o1-pro) in predicting bone age and growth stage using 90 anonymized hand–wrist radiographs (30 from each growth stage: pre-peak, peak, and post-peak, with equal male and female distribution). Reference standards were established by expert orthodontists using Fishman’s Skeletal Maturity Indicators (SMI) system and the Greulich–Pyle Atlas, and each radiograph was analyzed by the three GPT models using standardized prompts. Model performance was evaluated through statistical analyses assessing agreement and prediction accuracy. Results: All models showed significant agreement with the reference values in bone age prediction (p < 0.001), with GPT-o1-pro having the highest concordance (Pearson r = 0.546). No statistically significant difference was observed in the mean absolute error (MAE) among the models (p > 0.05). The GPT-o4-mini-high model achieved an accuracy rate of 72.2% within a ±2 year deviation range for bone age prediction. The GPT-o1-pro and GPT-o4-mini-high models showed bias in the Bland–Altman analysis of bone age predictions; however, GPT-o1-pro yielded more reliable predictions with narrower limits of agreement. In terms of growth stage classification, the GPT-4o model achieved the highest agreement with the reference values (κ = 0.283, p < 0.001). Conclusions: This study shows that general-purpose GPT models can support bone age and growth stage prediction, with each model having distinct strengths. While GPT models do not replace clinical examination, their contextual reasoning and ability to perform preliminary assessments without domain-specific training make them promising tools, though further development is needed.
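The agreement measures named in the description (MAE, accuracy within ±2 years, Pearson r, Bland–Altman bias and limits of agreement, and Cohen's κ) can be illustrated with a minimal sketch. The record does not state which software or exact formulas the authors used, so the function names, the 1.96 × SD limits, the unweighted κ, and all data values below are assumptions made purely for illustration; none of the numbers come from the study, and the significance tests (p-values) are not reproduced.

```python
import math
from statistics import mean, stdev


def mae(ref, pred):
    """Mean absolute error between reference and predicted bone ages."""
    return mean(abs(p - r) for r, p in zip(ref, pred))


def within_tolerance(ref, pred, tol=2.0):
    """Share of predictions within +/- tol years of the reference."""
    return sum(abs(p - r) <= tol for r, p in zip(ref, pred)) / len(ref)


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(ref, pred):
    """Bias (mean difference) and 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [p - r for r, p in zip(ref, pred)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohen_kappa(ref, pred):
    """Unweighted Cohen's kappa for categorical labels."""
    n = len(ref)
    labels = set(ref) | set(pred)
    p_observed = sum(r == p for r, p in zip(ref, pred)) / n
    p_expected = sum((ref.count(c) / n) * (pred.count(c) / n) for c in labels)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical values, NOT taken from the study.
ref_age = [10.0, 11.5, 12.0, 13.5, 14.0, 15.5]
pred_age = [11.0, 11.0, 14.5, 13.0, 13.0, 16.0]
ref_stage = ["pre", "pre", "peak", "peak", "post", "post"]
pred_stage = ["pre", "peak", "peak", "peak", "post", "pre"]

print("MAE (years):", mae(ref_age, pred_age))
print("Within +/-2 years:", within_tolerance(ref_age, pred_age))
print("Pearson r:", pearson_r(ref_age, pred_age))
print("Bland-Altman (bias, lower, upper):", bland_altman(ref_age, pred_age))
print("Cohen's kappa:", cohen_kappa(ref_stage, pred_stage))
```

In this sketch a positive Bland–Altman bias means the model overestimates bone age on average, and narrower limits of agreement correspond to the "more reliable predictions" wording used in the description.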
format Article
id doaj-art-518b4dfd60ce47bf83405a18854fdf5f
institution Kabale University
issn 2075-4418
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Diagnostics
spelling doaj-art-518b4dfd60ce47bf83405a18854fdf5f
2025-08-20T03:24:32Z
eng
MDPI AG
Diagnostics
2075-4418
2025-06-01
Volume 15, Issue 12, Article 1513
10.3390/diagnostics15121513
Can AI-Based ChatGPT Models Accurately Analyze Hand–Wrist Radiographs? A Comparative Study
Ahmet Yıldırım, Department of Orthodontics, Faculty of Dentistry, Zonguldak Bulent Ecevit University, Zonguldak 67600, Türkiye
Orhan Cicek, Department of Orthodontics, Faculty of Dentistry, Zonguldak Bulent Ecevit University, Zonguldak 67600, Türkiye
Yavuz Selim Genç, Samsun Oral and Dental Health Hospital, Samsun Provincial Health Directorate, Samsun 55060, Türkiye
https://www.mdpi.com/2075-4418/15/12/1513
large language models; artificial intelligence; deep learning; ChatGPT; bone age; growth stage
title Can AI-Based ChatGPT Models Accurately Analyze Hand–Wrist Radiographs? A Comparative Study
topic large language models
artificial intelligence
deep learning
ChatGPT
bone age
growth stage
url https://www.mdpi.com/2075-4418/15/12/1513