Autonomous medical evaluation for guideline adherence of large language models
Abstract Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions, together with 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.
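The abstract's mention of 1337 weighted scoring elements distributed over 135 open-ended questions suggests a weighted-credit aggregation behind the reported 0–50 scores. The sketch below is illustrative only: the data structures, function names (`ScoringElement`, `question_score`, `case_score`), and the simple weighted-fraction aggregation scaled to 50 points are assumptions made for this example, not AMEGA's actual scoring logic, which lives in the authors' publicly available code and may differ.

```python
# Illustrative sketch only: not AMEGA's real scoring code.
from dataclasses import dataclass

@dataclass
class ScoringElement:
    description: str   # e.g. "answer orders an ECG" (hypothetical element)
    weight: float      # relative importance of this element
    matched: bool      # whether the model's answer covered it

def question_score(elements: list[ScoringElement]) -> float:
    """Fraction of the available weighted credit earned for one question."""
    total = sum(e.weight for e in elements)
    earned = sum(e.weight for e in elements if e.matched)
    return earned / total if total else 0.0

def case_score(questions: list[list[ScoringElement]], max_points: float = 50.0) -> float:
    """Average the per-question fractions and scale to a 0-50 range,
    mirroring the scale used for the overall results in the abstract."""
    if not questions:
        return 0.0
    return max_points * sum(question_score(q) for q in questions) / len(questions)

# Toy example: one question with three weighted elements, two covered by the answer.
q1 = [ScoringElement("orders ECG", 2.0, True),
      ScoringElement("orders troponin", 2.0, True),
      ScoringElement("considers aortic dissection", 1.0, False)]
print(round(case_score([q1]), 1))  # 40.0 of 50 for this single-question toy case
```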
| Main Authors: | Dennis Fast, Lisa C. Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, Alexander Löser, Keno K. Bressem |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-024-01356-6 |
| _version_ | 1850118513275961344 |
|---|---|
| author | Dennis Fast; Lisa C. Adams; Felix Busch; Conor Fallon; Marc Huppertz; Robert Siepmann; Philipp Prucker; Nadine Bayerl; Daniel Truhn; Marcus Makowski; Alexander Löser; Keno K. Bressem |
| author_facet | Dennis Fast; Lisa C. Adams; Felix Busch; Conor Fallon; Marc Huppertz; Robert Siepmann; Philipp Prucker; Nadine Bayerl; Daniel Truhn; Marcus Makowski; Alexander Löser; Keno K. Bressem |
| author_sort | Dennis Fast |
| collection | DOAJ |
| description | Abstract Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. It includes 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints. |
| format | Article |
| id | doaj-art-c8280213f8ff4f67a33a9ced4675b321 |
| institution | OA Journals |
| issn | 2398-6352 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | Record doaj-art-c8280213f8ff4f67a33a9ced4675b321, last indexed 2025-08-20T02:35:51Z; English; Nature Portfolio; npj Digital Medicine; ISSN 2398-6352; published 2024-12-01; https://doi.org/10.1038/s41746-024-01356-6. Autonomous medical evaluation for guideline adherence of large language models. Authors and affiliations: Dennis Fast (DATEXIS, Berliner Hochschule für Technik (BHT)); Lisa C. Adams (Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital); Felix Busch (Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital); Conor Fallon (DATEXIS, Berliner Hochschule für Technik (BHT)); Marc Huppertz (Department of Radiology, University Hospital Aachen); Robert Siepmann (Department of Radiology, University Hospital Aachen); Philipp Prucker (Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital); Nadine Bayerl (Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg); Daniel Truhn (Department of Radiology, University Hospital Aachen); Marcus Makowski (Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital); Alexander Löser (DATEXIS, Berliner Hochschule für Technik (BHT)); Keno K. Bressem (Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital). Abstract as given in the description field above. |
| spellingShingle | Dennis Fast; Lisa C. Adams; Felix Busch; Conor Fallon; Marc Huppertz; Robert Siepmann; Philipp Prucker; Nadine Bayerl; Daniel Truhn; Marcus Makowski; Alexander Löser; Keno K. Bressem; Autonomous medical evaluation for guideline adherence of large language models; npj Digital Medicine |
| title | Autonomous medical evaluation for guideline adherence of large language models |
| title_full | Autonomous medical evaluation for guideline adherence of large language models |
| title_fullStr | Autonomous medical evaluation for guideline adherence of large language models |
| title_full_unstemmed | Autonomous medical evaluation for guideline adherence of large language models |
| title_short | Autonomous medical evaluation for guideline adherence of large language models |
| title_sort | autonomous medical evaluation for guideline adherence of large language models |
| url | https://doi.org/10.1038/s41746-024-01356-6 |
| work_keys_str_mv | AT dennisfast autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT lisacadams autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT felixbusch autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT conorfallon autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT marchuppertz autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT robertsiepmann autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT philippprucker autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT nadinebayerl autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT danieltruhn autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT marcusmakowski autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT alexanderloser autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels AT kenokbressem autonomousmedicalevaluationforguidelineadherenceoflargelanguagemodels |