MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data

Abstract MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework...

Bibliographic Details
Main Authors:	Katarzyna Arturi, Eliza J. Harris, Lilian Gasser, Beate I. Escher, Georg Braun, Robin Bosshard, Juliane Hollender
Format:	Article
Language:	English
Published:	BMC 2025-01-01
Series:	Journal of Cheminformatics
Subjects:	ToxCast; Tox21; Toxicity; In vitro assay; Activity prediction; HRMS/MS
Online Access:	https://doi.org/10.1186/s13321-025-00950-4

collection	DOAJ
description	Abstract: MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework comprising 490 independent XGBoost classifiers trained on molecular fingerprints from chemical structures and target-specific endpoints from the ToxCast/Tox21 invitroDBv4.1 database. For each analyzed HRMS feature, MLinvitroTox generates a 490-bit bioactivity fingerprint used as a basis for prioritization, focusing the time-consuming molecular identification efforts on features most likely to cause adverse effects. The practical advantages of MLinvitroTox are demonstrated for groundwater HRMS data. Among the 874 features for which molecular fingerprints were derived from spectra, including 630 nontargets, 185 spectral matches, and 59 targets, around 4% of the feature/endpoint relationship pairs were predicted to be active. Cross-checking the predictions for targets and spectral matches with invitroDB data confirmed the bioactivity of 120 active and 6791 nonactive pairs while mislabeling 88 active and 56 nonactive relationships. By filtering according to bioactivity probability, endpoint scores, and similarity to the training data, the number of potentially toxic features was reduced by at least one order of magnitude. This refinement makes the analytical confirmation of the toxicologically most relevant features feasible, offering significant benefits for cost-efficient chemical risk assessment. Scientific Contribution: In contrast to the classical ML-based approaches for toxicity prediction, MLinvitroTox predicts bioactivity for HRMS features (i.e., distinct m/z signals) based on MS2 fragmentation spectra rather than the chemical structures from the identified features. While the original proof-of-concept study was accompanied by the release of an MLinvitroTox v1 KNIME workflow, in this study, we release a Python MLinvitroTox v2 package, which, in addition to automation, expands functionality to include predicting toxicity from structures, cleaning up and generating chemical fingerprints, customizing models, and retraining on custom data. Furthermore, as a result of improvements in bioactivity data processing, realized in the concurrently released pytcpl Python package for the custom processing of invitroDBv4.1 input data used for training MLinvitroTox, the current release introduces enhancements in model accuracy, coverage of biological mechanistic targets, and overall interpretability.
id	doaj-art-4a26c64768bf4aa297152bed52f5d19a
institution	Kabale University
issn	1758-2946
spelling	doaj-art-4a26c64768bf4aa297152bed52f5d19a; 2025-02-02T12:40:16Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2025-01-01; vol. 17, no. 1, pp. 1-20; 10.1186/s13321-025-00950-4; Katarzyna Arturi [Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)]; Eliza J. Harris [Swiss Data Science Center (SDSC)]; Lilian Gasser [Swiss Data Science Center (SDSC)]; Beate I. Escher [Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)]; Georg Braun [Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)]; Robin Bosshard [Department of Computer Science, Eidgenössische Technische Hochschule Zürich (ETH Zürich)]; Juliane Hollender [Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)]
