VitroBert: modeling DILI by pretraining BERT on in vitro data
Abstract Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach for learning molecular representations for downstream tasks, it often lacks insight into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions (BCE, weighted BCE, Focal loss, and weighted Focal loss) and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions for improving model performance on highly imbalanced DILI-related tasks.
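Since the abstract's key methodological finding is the loss-function comparison, here is a minimal sketch (not the authors' code) of a weighted Focal loss for binary DILI endpoints, assuming a PyTorch setting; the `alpha` and `gamma` values are illustrative hyperparameters, not those used in the paper.

```python
# Minimal sketch of a weighted Focal loss for imbalanced binary endpoints.
# Assumptions: PyTorch, multi-label setup with one logit per DILI task;
# alpha/gamma defaults are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        alpha: float = 0.9,
                        gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss with a class-balancing weight.

    gamma down-weights easy, well-classified examples; alpha up-weights
    the rare positive class, addressing the imbalance the abstract
    highlights for DILI-related tasks.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # prob. of true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

Setting `gamma = 0` and `alpha = 0.5` recovers (a scaled) plain BCE, so the same function covers all four losses the abstract compares.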
| Main Authors: | Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-08-01 |
| Series: | Journal of Cheminformatics |
| Subjects: | BERT; DILI; Toxicity; Molecular embeddings |
| Online Access: | https://doi.org/10.1186/s13321-025-01048-7 |
| Field | Value |
|---|---|
| _version_ | 1849234971687911424 |
| author | Muhammad Arslan Masood; Anamya Ajjolli Nagaraja; Katia Belaid; Natalie Mesens; Hugo Ceulemans; Samuel Kaski; Dorota Herman; Markus Heinonen |
| author_sort | Muhammad Arslan Masood |
| collection | DOAJ |
| description | Abstract Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach for learning molecular representations for downstream tasks, it often lacks insight into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions (BCE, weighted BCE, Focal loss, and weighted Focal loss) and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions for improving model performance on highly imbalanced DILI-related tasks. |
| format | Article |
| id | doaj-art-101a86803fb440a485e4952a65441fed |
| institution | Kabale University |
| issn | 1758-2946 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | BMC |
| record_format | Article |
| series | Journal of Cheminformatics |
| spelling | doaj-art-101a86803fb440a485e4952a65441fed (2025-08-20T04:02:56Z); eng; BMC; Journal of Cheminformatics; 1758-2946; 2025-08-01; vol. 17, iss. 1, pp. 1-12; doi:10.1186/s13321-025-01048-7; VitroBert: modeling DILI by pretraining BERT on in vitro data; Muhammad Arslan Masood (Johnson & Johnson), Anamya Ajjolli Nagaraja (Johnson & Johnson), Katia Belaid (Johnson & Johnson), Natalie Mesens (Johnson & Johnson), Hugo Ceulemans (Johnson & Johnson), Samuel Kaski (Aalto University), Dorota Herman (Johnson & Johnson), Markus Heinonen (Aalto University); abstract as in the description field above; https://doi.org/10.1186/s13321-025-01048-7; topics: BERT, DILI, Toxicity, Molecular embeddings |
| title | VitroBert: modeling DILI by pretraining BERT on in vitro data |
| topic | BERT; DILI; Toxicity; Molecular embeddings |
| url | https://doi.org/10.1186/s13321-025-01048-7 |
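To make the record's abstract concrete, here is a hypothetical sketch of the two-stage idea it describes: pretrain a BERT-style encoder against in vitro assay labels, then reuse its frozen embeddings for in vivo DILI endpoints. The encoder interface, embedding size, vocabulary size, and task counts below are assumptions for illustration, not the published VitroBERT architecture.

```python
# Hypothetical sketch of the pipeline the abstract describes, not the
# authors' code: (1) multi-task pretraining on in vitro assay labels,
# (2) a lightweight DILI classifier over the frozen pretrained encoder.
# All sizes are assumed for illustration.
import torch
import torch.nn as nn

EMB_DIM, N_ASSAYS, N_DILI = 768, 1000, 12   # assumed dimensions/task counts

class DummySmilesEncoder(nn.Module):
    """Stand-in for a BERT-style SMILES encoder: embed token ids, mean-pool."""
    def __init__(self, vocab_size: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.emb(tokens).mean(dim=1)  # pooled [batch, EMB_DIM]

class AssayPretrainer(nn.Module):
    """Stage 1: encoder plus multi-task head over in vitro assay outcomes."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.assay_head = nn.Linear(EMB_DIM, N_ASSAYS)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.assay_head(self.encoder(tokens))  # one logit per assay

class DiliClassifier(nn.Module):
    """Stage 2: frozen, biologically informed embeddings + small DILI head."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():           # freeze pretrained weights
            p.requires_grad = False
        self.head = nn.Linear(EMB_DIM, N_DILI)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            emb = self.encoder(tokens)
        return self.head(emb)                          # logits per DILI endpoint
```

In this setup the DILI head would be trained with the weighted Focal loss sketched earlier, matching the abstract's finding on highly imbalanced endpoints.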