MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data
Abstract: MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework...
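Main Authors: | Katarzyna Arturi, Eliza J. Harris, Lilian Gasser, Beate I. Escher, Georg Braun, Robin Bosshard, Juliane Hollender |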
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2025-01-01 |
Series: | Journal of Cheminformatics |
Subjects: | ToxCast; Tox21; Toxicity; In vitro assay; Activity prediction; HRMS/MS |
Online Access: | https://doi.org/10.1186/s13321-025-00950-4 |
_version_ | 1832571369357836288 |
---|---|
author | Katarzyna Arturi; Eliza J. Harris; Lilian Gasser; Beate I. Escher; Georg Braun; Robin Bosshard; Juliane Hollender |
collection | DOAJ |
description | MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework comprising 490 independent XGBoost classifiers trained on molecular fingerprints from chemical structures and target-specific endpoints from the ToxCast/Tox21 invitroDBv4.1 database. For each analyzed HRMS feature, MLinvitroTox generates a 490-bit bioactivity fingerprint used as a basis for prioritization, focusing the time-consuming molecular identification efforts on features most likely to cause adverse effects. The practical advantages of MLinvitroTox are demonstrated for groundwater HRMS data. Among the 874 features for which molecular fingerprints were derived from spectra, including 630 nontargets, 185 spectral matches, and 59 targets, around 4% of the feature/endpoint relationship pairs were predicted to be active. Cross-checking the predictions for targets and spectral matches with invitroDB data confirmed the bioactivity of 120 active and 6791 nonactive pairs while mislabeling 88 active and 56 nonactive relationships. By filtering according to bioactivity probability, endpoint scores, and similarity to the training data, the number of potentially toxic features was reduced by at least one order of magnitude. This refinement makes the analytical confirmation of the toxicologically most relevant features feasible, offering significant benefits for cost-efficient chemical risk assessment. Scientific Contribution: In contrast to the classical ML-based approaches for toxicity prediction, MLinvitroTox predicts bioactivity for HRMS features (i.e., distinct m/z signals) based on MS2 fragmentation spectra rather than the chemical structures from the identified features. While the original proof-of-concept study was accompanied by the release of an MLinvitroTox v1 KNIME workflow, in this study we release a Python MLinvitroTox v2 package, which, in addition to automation, expands functionality to include predicting toxicity from structures, cleaning up and generating chemical fingerprints, customizing models, and retraining on custom data. Furthermore, as a result of improvements in bioactivity data processing, realized in the concurrently released pytcpl Python package for the custom processing of invitroDBv4.1 input data used for training MLinvitroTox, the current release introduces enhancements in model accuracy, coverage of biological mechanistic targets, and overall interpretability. |
format | Article |
id | doaj-art-4a26c64768bf4aa297152bed52f5d19a |
institution | Kabale University |
issn | 1758-2946 |
language | English |
publishDate | 2025-01-01 |
publisher | BMC |
record_format | Article |
series | Journal of Cheminformatics |
spelling | doaj-art-4a26c64768bf4aa297152bed52f5d19a; 2025-02-02T12:40:16Z; eng; BMC; Journal of Cheminformatics; ISSN 1758-2946; 2025-01-01; vol. 17, no. 1, pp. 1-20; doi:10.1186/s13321-025-00950-4. Authors and affiliations: Katarzyna Arturi (Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)); Eliza J. Harris (Swiss Data Science Center (SDSC)); Lilian Gasser (Swiss Data Science Center (SDSC)); Beate I. Escher (Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)); Georg Braun (Cell Toxicology, Helmholtz Centre for Environmental Research (UFZ)); Robin Bosshard (Department of Computer Science, Eidgenössische Technische Hochschule Zürich (ETH Zürich)); Juliane Hollender (Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag)). |
title | MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data |
topic | ToxCast; Tox21; Toxicity; In vitro assay; Activity prediction; HRMS/MS |
url | https://doi.org/10.1186/s13321-025-00950-4 |
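
The cross-check counts quoted in the description (120 and 6791 confirmed pairs, 88 and 56 mislabeled pairs) can be turned into standard classification metrics. The sketch below is illustrative only and is not part of the MLinvitroTox package. The abstract does not say which error count is which, so the mapping is an assumption: reading A treats the 88 pairs as measured-active pairs predicted nonactive (false negatives) and the 56 pairs as measured-nonactive pairs predicted active (false positives); reading B swaps the two. The class name `Confusion` and the variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix (hypothetical helper, not from MLinvitroTox)."""
    tp: int  # active in invitroDB, predicted active
    tn: int  # nonactive in invitroDB, predicted nonactive
    fp: int  # nonactive in invitroDB, predicted active
    fn: int  # active in invitroDB, predicted nonactive

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


# ASSUMPTION (reading A): 88 = false negatives, 56 = false positives.
reading_a = Confusion(tp=120, tn=6791, fp=56, fn=88)
# ASSUMPTION (reading B): the two error counts are swapped.
reading_b = Confusion(tp=120, tn=6791, fp=88, fn=56)

for name, cm in (("A", reading_a), ("B", reading_b)):
    print(
        f"reading {name}: n={cm.total} "
        f"accuracy={cm.accuracy:.3f} precision={cm.precision:.3f} "
        f"recall={cm.recall:.3f} f1={cm.f1:.3f}"
    )
```

Accuracy and F1 are identical under both readings because they are symmetric in the two error counts; only precision and recall trade places. With nonactive pairs making up most of the cross-checked set, accuracy is dominated by the 6791 confirmed nonactive pairs, so precision and recall say more about how well active pairs are recovered.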