Evaluating the factors influencing accuracy, interpretability, and reproducibility in the use of machine learning classifiers in biology to enable standardization

Bibliographic Details
Main Authors: Kaitlyn M. Martinez, Kristen Wilding, Trent R. Llewellyn, Daniel E. Jacobsen, Makaela M. Montoya, Jessica Z. Kubicek-Sutherland, Sweta Batni, Carrie Manore, Harshini Mukundan
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Scientific Reports
Subjects: Machine learning, Biological data, Standardization, Lipopolysaccharide
Online Access: https://doi.org/10.1038/s41598-025-00245-6
author Kaitlyn M. Martinez
Kristen Wilding
Trent R. Llewellyn
Daniel E. Jacobsen
Makaela M. Montoya
Jessica Z. Kubicek-Sutherland
Sweta Batni
Carrie Manore
Harshini Mukundan
author_sort Kaitlyn M. Martinez
collection DOAJ
description Abstract: The complexity and variability of biological data have promoted the increased use of machine learning methods to understand processes and predict outcomes. These same features complicate reliable, reproducible, interpretable, and responsible use of such methods, resulting in questionable relevance of the derived outcomes. Here we systematically explore challenges associated with applying machine learning to predict and understand biological processes using a well-characterized in vitro experimental system. We evaluated factors that vary while applying machine learning classifiers: (1) type of biochemical signature (transcripts vs. proteins), (2) data curation methods (pre- and post-processing), and (3) choice of machine learning classifier. Using accuracy, generalizability, interpretability, and reproducibility as metrics, we found that the above factors significantly modulate outcomes even within a simple model system. Our results caution against the unregulated use of machine learning methods in the biological sciences, and strongly advocate the need for data standards and validation toolkits for such studies.
format Article
id doaj-art-da2ab013b04e4669abe50265fb0d76ae
institution OA Journals
issn 2045-2322
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-da2ab013b04e4669abe50265fb0d76ae; 2025-08-20T02:32:07Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-05-01; 15; 1; 1; 12; 10.1038/s41598-025-00245-6
Evaluating the factors influencing accuracy, interpretability, and reproducibility in the use of machine learning classifiers in biology to enable standardization
Kaitlyn M. Martinez (A-1 Information Systems and Modeling, Los Alamos National Laboratory)
Kristen Wilding (T-6 Theoretical Biology and Biophysics, Los Alamos National Laboratory)
Trent R. Llewellyn (C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory)
Daniel E. Jacobsen (C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory)
Makaela M. Montoya (C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory)
Jessica Z. Kubicek-Sutherland (C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory)
Sweta Batni (Defense Threat Reduction Agency)
Carrie Manore (T-6 Theoretical Biology and Biophysics, Los Alamos National Laboratory)
Harshini Mukundan (C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory)
https://doi.org/10.1038/s41598-025-00245-6
Machine learning
Biological data
Standardization
Lipopolysaccharide
title Evaluating the factors influencing accuracy, interpretability, and reproducibility in the use of machine learning classifiers in biology to enable standardization
topic Machine learning
Biological data
Standardization
Lipopolysaccharide
url https://doi.org/10.1038/s41598-025-00245-6