Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-03-01 |
| Series: | BMC Medical Education |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s12909-025-07009-w |
| Summary: | Abstract Background There is an unprecedented increase in the use of generative AI in medical education, and these models' accuracy must be assessed to ensure patient safety. This study assesses the accuracy of ChatGPT, Gemini, and Copilot in answering multiple-choice questions (MCQs) compared to a qualified medical teacher. Methods Forty MCQs were randomly selected from past United States Medical Licensing Examination (USMLE) papers and posed to three LLMs: ChatGPT, Gemini, and Copilot. Each LLM's answers were then compared with those of a qualified medical teacher and with the answers of the other LLMs. Fleiss' Kappa was used to determine the concordance among the four responders (3 LLMs + 1 medical teacher); where overall agreement was poor, pairwise agreement was assessed with Cohen's Kappa. Results ChatGPT demonstrated the highest accuracy (70%, Cohen's Kappa = 0.84), followed by Copilot (60%, Cohen's Kappa = 0.69), while Gemini showed the lowest accuracy (50%, Cohen's Kappa = 0.53). The Fleiss' Kappa value of -0.056 indicated significant disagreement among all four responders. Conclusion The study provides an approach for assessing the accuracy of different LLMs. ChatGPT (70%) clearly outperformed the other LLMs on medical questions across different specialties, while, contrary to expectations, Gemini (50%) performed poorly. The low accuracy of the LLMs relative to the medical teacher suggests that general-purpose LLMs should be used with caution in medical education. |
| ISSN: | 1472-6920 |
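
The summary above reports overall concordance with Fleiss' Kappa and pairwise agreement with Cohen's Kappa. The sketch below shows how such statistics can be computed for MCQ answers using scikit-learn and statsmodels; the answer arrays are hypothetical placeholders (10 items, 4 options coded 0-3), not the study's data, and the variable names are illustrative only.

```python
# Minimal sketch of the agreement statistics named in the abstract:
# pairwise Cohen's Kappa against the teacher and Fleiss' Kappa across
# all four responders. Data are hypothetical, not from the study.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical answers to 10 MCQs (options coded 0-3); the study used 40 MCQs.
teacher = np.array([0, 2, 1, 3, 0, 1, 2, 2, 3, 0])
chatgpt = np.array([0, 2, 1, 3, 0, 1, 2, 0, 3, 0])
gemini  = np.array([1, 2, 0, 3, 0, 2, 2, 1, 3, 1])
copilot = np.array([0, 2, 1, 3, 1, 1, 2, 2, 0, 0])

# Per-model accuracy and pairwise agreement (Cohen's Kappa) with the teacher.
for name, answers in [("ChatGPT", chatgpt), ("Gemini", gemini), ("Copilot", copilot)]:
    accuracy = (answers == teacher).mean()
    kappa = cohen_kappa_score(teacher, answers)
    print(f"{name}: accuracy = {accuracy:.0%}, Cohen's kappa = {kappa:.2f}")

# Overall concordance among all four responders (Fleiss' Kappa).
# aggregate_raters expects a (subjects x raters) matrix of category codes
# and returns per-subject counts of each category.
ratings = np.column_stack([teacher, chatgpt, gemini, copilot])
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
```

With the study's 40 USMLE items the same calls apply; only the answer arrays change.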