Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics

Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot...

Bibliographic Details
Main Authors:	Luning Sun, Christopher Gibbons, José Hernández-Orallo, Xiting Wang, Liming Jiang, David Stillwell, Fang Luo, Xing Xie
Format:	Article
Language:	English
Published:	JMIR Publications 2025-05-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2025/1/e70901

_version_	1849683200555614208
author	Luning Sun Christopher Gibbons José Hernández-Orallo Xiting Wang Liming Jiang David Stillwell Fang Luo Xing Xie
author_facet	Luning Sun Christopher Gibbons José Hernández-Orallo Xiting Wang Liming Jiang David Stillwell Fang Luo Xing Xie
author_sort	Luning Sun
collection	DOAJ
description	Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI might fail (lacking explanatory power) or in what circumstances (lacking predictive power). To address these limitations, we propose a new methodology to improve the quality of GMAI evaluation using construct-oriented processes. Drawing on modern psychometric techniques, we introduce approaches to construct identification and present alternative assessment formats for different domains of professional skills, knowledge, and behaviors that are essential for safe practice. We also discuss the need for human oversight in future GMAI adoption.
format	Article
id	doaj-art-bf5d76370de843bc80cb9272ba5b1341
institution	DOAJ
issn	1438-8871
language	English
publishDate	2025-05-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj-art-bf5d76370de843bc80cb9272ba5b1341; 2025-08-20T03:23:59Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-05-01; 27; e70901; e70901; 10.2196/70901; Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics; Luning Sun, http://orcid.org/0000-0002-2470-4278; Christopher Gibbons, http://orcid.org/0000-0002-4732-7305; José Hernández-Orallo, http://orcid.org/0000-0001-9746-7632; Xiting Wang, http://orcid.org/0000-0001-5768-1095; Liming Jiang, http://orcid.org/0000-0001-6464-2326; David Stillwell, http://orcid.org/0000-0003-0174-3212; Fang Luo, http://orcid.org/0000-0003-3281-9574; Xing Xie, http://orcid.org/0009-0009-3257-3077; Abstract: Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI might fail (lacking explanatory power) or in what circumstances (lacking predictive power). To address these limitations, we propose a new methodology to improve the quality of GMAI evaluation using construct-oriented processes. Drawing on modern psychometric techniques, we introduce approaches to construct identification and present alternative assessment formats for different domains of professional skills, knowledge, and behaviors that are essential for safe practice. We also discuss the need for human oversight in future GMAI adoption.; https://www.jmir.org/2025/1/e70901
spellingShingle	Luning Sun Christopher Gibbons José Hernández-Orallo Xiting Wang Liming Jiang David Stillwell Fang Luo Xing Xie Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics Journal of Medical Internet Research
title	Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics
url	https://www.jmir.org/2025/1/e70901

Similar Items