Is a score enough? Pitfalls and solutions for AI severity scores

Abstract: Severity scores, which often refer to the likelihood or probability of a pathology, are commonly provided by artificial intelligence (AI) tools in radiology. However, little attention has been given to the use of these AI scores, and there is a lack of transparency into how they are generat...

Bibliographic Details
Main Authors:	Michael H. Bernstein, Marly van Assen, Michael A. Bruno, Elizabeth A. Krupinski, Carlo De Cecco, Grayson L. Baird
Format:	Article
Language:	English
Published:	SpringerOpen 2025-07-01
Series:	European Radiology Experimental
Subjects:	Artificial intelligence Bias Cognition Radiology Reproducibility of results
Online Access:	https://doi.org/10.1186/s41747-025-00603-z

author	Michael H. Bernstein Marly van Assen Michael A. Bruno Elizabeth A. Krupinski Carlo De Cecco Grayson L. Baird
collection	DOAJ
description	Abstract: Severity scores, which often refer to the likelihood or probability of a pathology, are commonly provided by artificial intelligence (AI) tools in radiology. However, little attention has been given to the use of these AI scores, and there is a lack of transparency into how they are generated. In this comment, we draw on key principles from psychological science and statistics to elucidate six human factors limitations of AI scores that undermine their utility: (1) variability across AI systems; (2) variability within AI systems; (3) variability between radiologists; (4) variability within radiologists; (5) unknown distribution of AI scores; and (6) perceptual challenges. We hypothesize that these limitations can be mitigated by providing the false discovery rate and false omission rate for each score as a threshold. We discuss how this hypothesis could be empirically tested. Key Points: The radiologist-AI interaction has not been given sufficient attention. The utility of AI scores is limited by six key human factors limitations. We propose a hypothesis for how to mitigate these limitations by using false discovery rate and false omission rate.
format	Article
id	doaj-art-dea8791f126b4a38b791ef0d87ff481c
institution	Kabale University
issn	2509-9280
language	English
publishDate	2025-07-01
publisher	SpringerOpen
record_format	Article
series	European Radiology Experimental
spelling	doaj-art-dea8791f126b4a38b791ef0d87ff481c; 2025-08-20T04:01:43Z; eng; SpringerOpen; European Radiology Experimental; 2509-9280; 2025-07-01; 9115; 10.1186/s41747-025-00603-z; Is a score enough? Pitfalls and solutions for AI severity scores; Michael H. Bernstein (0), Marly van Assen (1), Michael A. Bruno (2), Elizabeth A. Krupinski (3), Carlo De Cecco (4), Grayson L. Baird (5); (0) Department of Diagnostic Imaging, Brown Radiology Human Factors Lab, Rhode Island Hospital, Warren Alpert School of Medicine of Brown University; (1) Department of Radiology and Imaging Sciences, Emory University, School of Medicine; (2) Penn State College of Medicine, The Milton S. Hershey Medical Center, Penn State Health; (3) Department of Radiology and Imaging Sciences, Emory University, School of Medicine; (4) Department of Radiology and Imaging Sciences, Emory University, School of Medicine; (5) Department of Diagnostic Imaging, Brown Radiology Human Factors Lab, Rhode Island Hospital, Warren Alpert School of Medicine of Brown University; Abstract: Severity scores, which often refer to the likelihood or probability of a pathology, are commonly provided by artificial intelligence (AI) tools in radiology. However, little attention has been given to the use of these AI scores, and there is a lack of transparency into how they are generated. In this comment, we draw on key principles from psychological science and statistics to elucidate six human factors limitations of AI scores that undermine their utility: (1) variability across AI systems; (2) variability within AI systems; (3) variability between radiologists; (4) variability within radiologists; (5) unknown distribution of AI scores; and (6) perceptual challenges. We hypothesize that these limitations can be mitigated by providing the false discovery rate and false omission rate for each score as a threshold. We discuss how this hypothesis could be empirically tested. Key Points: The radiologist-AI interaction has not been given sufficient attention. The utility of AI scores is limited by six key human factors limitations. We propose a hypothesis for how to mitigate these limitations by using false discovery rate and false omission rate.; https://doi.org/10.1186/s41747-025-00603-z; Artificial intelligence; Bias; Cognition; Radiology; Reproducibility of results
title	Is a score enough? Pitfalls and solutions for AI severity scores
topic	Artificial intelligence Bias Cognition Radiology Reproducibility of results
url	https://doi.org/10.1186/s41747-025-00603-z

Similar Items