VitroBert: modeling DILI by pretraining BERT on in vitro data

Abstract: Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach to learning molecular representations for downstream tasks, it often lacks insight into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement on biochemistry-related tasks and a 16% gain on histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed on clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions (BCE, weighted BCE, Focal loss, and weighted Focal loss) and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting an appropriate loss function for highly imbalanced DILI-related tasks.
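The abstract's loss-function comparison can be made concrete. Below is a minimal NumPy sketch of binary weighted Focal loss as the technique is generally defined; it illustrates the idea only and is not the authors' implementation (the `gamma` and `pos_weight` values are illustrative assumptions):

```python
import numpy as np

def weighted_focal_loss(y_true, p_pred, gamma=2.0, pos_weight=10.0, eps=1e-7):
    """Binary weighted Focal loss (generic sketch, not the paper's code).

    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    while pos_weight up-weights the rare positive class -- the usual remedy
    for severely imbalanced endpoints such as DILI labels.
    """
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    pos_term = -pos_weight * y_true * (1.0 - p) ** gamma * np.log(p)
    neg_term = -(1.0 - y_true) * p ** gamma * np.log(1.0 - p)
    return np.mean(pos_term + neg_term)

# One rare positive among easy negatives; with gamma=0 and pos_weight=1
# the expression reduces to plain (unweighted) BCE.
y = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.3, 0.1, 0.2, 0.05])
print(weighted_focal_loss(y, p))
```

Setting `gamma=0.0, pos_weight=1.0` recovers standard BCE, which makes the four losses compared in the paper special cases of one family.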


Bibliographic Details
Main Authors: Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen
Format: Article
Language: English
Published: BMC, 2025-08-01
Series: Journal of Cheminformatics
Subjects: BERT, DILI, Toxicity, Molecular embeddings
Online Access: https://doi.org/10.1186/s13321-025-01048-7
collection DOAJ
id doaj-art-101a86803fb440a485e4952a65441fed
institution Kabale University
issn 1758-2946
affiliations Muhammad Arslan Masood (Johnson & Johnson); Anamya Ajjolli Nagaraja (Johnson & Johnson); Katia Belaid (Johnson & Johnson); Natalie Mesens (Johnson & Johnson); Hugo Ceulemans (Johnson & Johnson); Samuel Kaski (Aalto University); Dorota Herman (Johnson & Johnson); Markus Heinonen (Aalto University)
topic BERT
DILI
Toxicity
Molecular embeddings