MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data

Abstract MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework...

Bibliographic Details
Main Authors: Katarzyna Arturi, Eliza J. Harris, Lilian Gasser, Beate I. Escher, Georg Braun, Robin Bosshard, Juliane Hollender
Format: Article
Language: English
Published: BMC 2025-01-01
Series: Journal of Cheminformatics
Subjects: ToxCast, Tox21, Toxicity, In vitro assay, Activity prediction, HRMS/MS
Online Access: https://doi.org/10.1186/s13321-025-00950-4
author Katarzyna Arturi
Eliza J. Harris
Lilian Gasser
Beate I. Escher
Georg Braun
Robin Bosshard
Juliane Hollender
collection DOAJ
description Abstract MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework comprising 490 independent XGBoost classifiers trained on molecular fingerprints from chemical structures and target-specific endpoints from the ToxCast/Tox21 invitroDBv4.1 database. For each analyzed HRMS feature, MLinvitroTox generates a 490-bit bioactivity fingerprint used as a basis for prioritization, focusing the time-consuming molecular identification efforts on features most likely to cause adverse effects. The practical advantages of MLinvitroTox are demonstrated for groundwater HRMS data. Among the 874 features for which molecular fingerprints were derived from spectra, including 630 nontargets, 185 spectral matches, and 59 targets, around 4% of the feature/endpoint relationship pairs were predicted to be active. Cross-checking the predictions for targets and spectral matches with invitroDB data confirmed the bioactivity of 120 active and 6791 non-active pairs while mislabeling 88 active and 56 non-active relationships. By filtering according to bioactivity probability, endpoint scores, and similarity to the training data, the number of potentially toxic features was reduced by at least one order of magnitude. This refinement makes the analytical confirmation of the toxicologically most relevant features feasible, offering significant benefits for cost-efficient chemical risk assessment. Scientific Contribution: In contrast to the classical ML-based approaches for toxicity prediction, MLinvitroTox predicts bioactivity for HRMS features (i.e., distinct m/z signals) based on MS2 fragmentation spectra rather than the chemical structures from the identified features.
While the original proof-of-concept study was accompanied by the release of an MLinvitroTox v1 KNIME workflow, in this study, we release a Python MLinvitroTox v2 package, which, in addition to automation, expands functionality to include predicting toxicity from structures, cleaning up and generating chemical fingerprints, customizing models, and retraining on custom data. Furthermore, as a result of improvements in bioactivity data processing, realized in the concurrently released pytcpl Python package for the custom processing of invitroDBv4.1 input data used for training MLinvitroTox, the current release introduces enhancements in model accuracy, coverage of biological mechanistic targets, and overall interpretability.
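The counts quoted in the description can be turned into summary metrics with a few lines of arithmetic. The sketch below is not part of the MLinvitroTox package; all variable names are illustrative. It assumes that the "88 active" mislabeled pairs are measured actives predicted as non-active (false negatives) and the "56 non-active" pairs are measured non-actives predicted as active (false positives). The abstract does not state this explicitly, so precision and recall would swap under the opposite reading, while the accuracy is unaffected.

```python
# Counts quoted in the abstract for targets and spectral matches
# cross-checked against invitroDB data.
confirmed_active = 120      # predicted active, measured active
confirmed_inactive = 6791   # predicted non-active, measured non-active
mislabeled_active = 88      # ASSUMPTION: measured active, predicted non-active
mislabeled_inactive = 56    # ASSUMPTION: measured non-active, predicted active

total_pairs = (confirmed_active + confirmed_inactive
               + mislabeled_active + mislabeled_inactive)

# Accuracy does not depend on how the two mislabeled groups are interpreted.
accuracy = (confirmed_active + confirmed_inactive) / total_pairs

# Recall and precision for the active class depend on the assumption above.
recall = confirmed_active / (confirmed_active + mislabeled_active)
precision = confirmed_active / (confirmed_active + mislabeled_inactive)

# Size of the full prediction matrix: one 490-bit fingerprint per feature.
n_features = 874
n_endpoints = 490
n_pairs = n_features * n_endpoints
# "Around 4%" is approximate, so this is only an order-of-magnitude figure.
approx_active_pairs = round(0.04 * n_pairs)

print(f"cross-checked pairs: {total_pairs}")
print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f} (under the stated assumption)")
print(f"precision: {precision:.4f} (under the stated assumption)")
print(f"feature/endpoint pairs: {n_pairs}, roughly {approx_active_pairs} predicted active")
```

Because non-active pairs dominate the cross-checked set, the accuracy alone says little about how well active pairs are recovered, which is why the active-class precision and recall are computed separately.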
format Article
id doaj-art-4a26c64768bf4aa297152bed52f5d19a
institution Kabale University
issn 1758-2946
language English
publishDate 2025-01-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj-art-4a26c64768bf4aa297152bed52f5d19a
2025-02-02T12:40:16Z
eng
BMC
Journal of Cheminformatics
1758-2946
2025-01-01
vol. 17, no. 1, pp. 1-20
10.1186/s13321-025-00950-4
Katarzyna Arturi: Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)
Eliza J. Harris: Swiss Data Science Center (SDSC)
Lilian Gasser: Swiss Data Science Center (SDSC)
Beate I. Escher: Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)
Georg Braun: Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)
Robin Bosshard: Department of Computer Science, Eidgenössische Technische Hochschule Zürich (ETH Zürich)
Juliane Hollender: Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)
title MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data
topic ToxCast
Tox21
Toxicity
In vitro assay
Activity prediction
HRMS/MS
url https://doi.org/10.1186/s13321-025-00950-4