Enabling interpretable machine learning for biological data with reliability scores.


Bibliographic Details
Main Authors: K D Ahlquist, Lauren A Sugden, Sohini Ramachandran
Format: Article
Language:English
Published: Public Library of Science (PLoS) 2023-05-01
Series:PLoS Computational Biology
Online Access:https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011175&type=printable
author K D Ahlquist
Lauren A Sugden
Sohini Ramachandran
author_facet K D Ahlquist
Lauren A Sugden
Sohini Ramachandran
author_sort K D Ahlquist
collection DOAJ
description Machine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight.
format Article
id doaj-art-e64e8ab025c045458a7b1f0cbf59cd09
institution DOAJ
issn 1553-734X
1553-7358
language English
publishDate 2023-05-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj-art-e64e8ab025c045458a7b1f0cbf59cd092025-08-20T03:14:22ZengPublic Library of Science (PLoS)PLoS Computational Biology1553-734X1553-73582023-05-01195e101117510.1371/journal.pcbi.1011175Enabling interpretable machine learning for biological data with reliability scores.K D AhlquistLauren A SugdenSohini RamachandranMachine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight.https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011175&type=printable
spellingShingle K D Ahlquist
Lauren A Sugden
Sohini Ramachandran
Enabling interpretable machine learning for biological data with reliability scores.
PLoS Computational Biology
title Enabling interpretable machine learning for biological data with reliability scores.
title_full Enabling interpretable machine learning for biological data with reliability scores.
title_fullStr Enabling interpretable machine learning for biological data with reliability scores.
title_full_unstemmed Enabling interpretable machine learning for biological data with reliability scores.
title_short Enabling interpretable machine learning for biological data with reliability scores.
title_sort enabling interpretable machine learning for biological data with reliability scores
url https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011175&type=printable
work_keys_str_mv AT kdahlquist enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores
AT laurenasugden enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores
AT sohiniramachandran enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores