Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics

Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot...

Bibliographic Details
Main Authors: Luning Sun, Christopher Gibbons, José Hernández-Orallo, Xiting Wang, Liming Jiang, David Stillwell, Fang Luo, Xing Xie
Format: Article
Language: English
Published: JMIR Publications 2025-05-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e70901
author Luning Sun
Christopher Gibbons
José Hernández-Orallo
Xiting Wang
Liming Jiang
David Stillwell
Fang Luo
Xing Xie
author_sort Luning Sun
collection DOAJ
description Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI might fail (lacking explanatory power) or in what circumstances (lacking predictive power). To address these limitations, we propose a new methodology to improve the quality of GMAI evaluation using construct-oriented processes. Drawing on modern psychometric techniques, we introduce approaches to construct identification and present alternative assessment formats for different domains of professional skills, knowledge, and behaviors that are essential for safe practice. We also discuss the need for human oversight in future GMAI adoption.
format Article
id doaj-art-bf5d76370de843bc80cb9272ba5b1341
institution DOAJ
issn 1438-8871
language English
publishDate 2025-05-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-bf5d76370de843bc80cb9272ba5b1341
2025-08-20T03:23:59Z
eng
JMIR Publications
Journal of Medical Internet Research
1438-8871
2025-05-01
27
e70901
e70901
10.2196/70901
Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics
Luning Sun http://orcid.org/0000-0002-2470-4278
Christopher Gibbons http://orcid.org/0000-0002-4732-7305
José Hernández-Orallo http://orcid.org/0000-0001-9746-7632
Xiting Wang http://orcid.org/0000-0001-5768-1095
Liming Jiang http://orcid.org/0000-0001-6464-2326
David Stillwell http://orcid.org/0000-0003-0174-3212
Fang Luo http://orcid.org/0000-0003-3281-9574
Xing Xie http://orcid.org/0009-0009-3257-3077
Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI might fail (lacking explanatory power) or in what circumstances (lacking predictive power). To address these limitations, we propose a new methodology to improve the quality of GMAI evaluation using construct-oriented processes. Drawing on modern psychometric techniques, we introduce approaches to construct identification and present alternative assessment formats for different domains of professional skills, knowledge, and behaviors that are essential for safe practice. We also discuss the need for human oversight in future GMAI adoption.
https://www.jmir.org/2025/1/e70901
title Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics
url https://www.jmir.org/2025/1/e70901