ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil

Bibliographic Details
Main Authors: Helvécio Neves Feitosa Filho, João Filipe Cavalcante Uchoa Furtado, Eduardo Correia Eulálio, Pedro Vianna Caldas Ribeiro, Lucas Macêdo Aurélio Paiva, Matheus Maia Gonçalves Bringel Correia, Geraldo Bezerra da Silva Júnior
Format: Article
Language: English
Published: Sociedade Brasileira de Nefrologia 2025-07-01
Series: Brazilian Journal of Nephrology
Online Access: http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-28002025000400302&lng=en&tlng=en
Description
Summary: Abstract Objective: This study evaluated the performance of ChatGPT-4 and ChatGPT-3.5 in answering nephrology questions from medical residency exams in Brazil. Methods: A total of 411 multiple-choice questions, with and without images, were analyzed and organized into four main themes: chronic kidney disease (CKD), hydroelectrolytic and acid-base disorders (HABD), tubulointerstitial diseases (TID), and glomerular diseases (GD). Questions with images were answered only by ChatGPT-4. Statistical analysis was performed using the chi-square test. Results: ChatGPT-4 achieved an overall accuracy of 79.80%, while ChatGPT-3.5 achieved 56.29%, a statistically significant difference (p < 0.001). Across the main themes, ChatGPT-4 performed better in HABD (79.11% vs. 55.17%), TID (88.23% vs. 52.23%), CKD (75.51% vs. 61.95%), and GD (79.31% vs. 55.29%), all with p < 0.001. ChatGPT-4 achieved an accuracy of 81.49% on questions without images and 54.54% on questions with images, including 60% on electrocardiogram analysis. The study is limited by the small number of image-based questions and the use of outdated examination items, which reduce its ability to assess visual diagnostic skills and current clinical relevance. Furthermore, covering only four areas of nephrology may not fully represent the breadth of nephrology practice. Conclusion: ChatGPT-3.5 showed limitations in nephrology reasoning compared with ChatGPT-4, revealing knowledge gaps. The study suggests that further exploration of other nephrology themes is needed to improve the use of these AI tools.
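
The Methods state that the two models' accuracies were compared with a chi-square test. The Python snippet below is a minimal illustrative sketch, not the authors' code: it reconstructs approximate 2x2 contingency tables from the reported overall accuracies and the total question count. The counts are hypothetical round-offs, since the exact per-model denominators are not recoverable from the abstract (only ChatGPT-4 answered the image questions).

from scipy.stats import chi2_contingency

total = 411  # total multiple-choice questions analyzed in the study

# Hypothetical correct/incorrect counts implied by the reported overall
# accuracies (79.80% for ChatGPT-4, 56.29% for ChatGPT-3.5); the true
# denominators differ slightly because only ChatGPT-4 saw image items.
gpt4_correct = round(0.7980 * total)   # ~328
gpt35_correct = round(0.5629 * total)  # ~231

table = [
    [gpt4_correct, total - gpt4_correct],
    [gpt35_correct, total - gpt35_correct],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")  # p << 0.001, consistent with the abstract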
ISSN: 2175-8239