ADMET evaluation in drug discovery: 21. Application and industrial validation of machine learning algorithms for Caco-2 permeability prediction

Abstract: The Caco-2 cell model has been widely used to assess the intestinal permeability of drug candidates in vitro, owing to its morphological and functional similarity to human enterocytes. While the Caco-2 cell assay is considered safe and cost-effective, it is also characterized by being time-cons...

Bibliographic Details
Main Authors: Dong Wang, Jieyu Jin, Guqin Shi, Jingxiao Bao, Zheng Wang, Shimeng Li, Peichen Pan, Dan Li, Yu Kang, Tingjun Hou
Format: Article
Language: English
Published: BMC 2025-01-01
Series: Journal of Cheminformatics
Subjects: Caco-2 permeability; Machine learning; Matched molecular pair
Online Access: https://doi.org/10.1186/s13321-025-00947-z
collection DOAJ
description Abstract: The Caco-2 cell model has been widely used to assess the intestinal permeability of drug candidates in vitro, owing to its morphological and functional similarity to human enterocytes. While the Caco-2 cell assay is considered safe and cost-effective, it is also time-consuming. Therefore, computational models that predict Caco-2 permeability with high accuracy are crucial for enhancing the efficiency of oral drug development. In this study, we conducted an in-depth analysis of the characteristics of an augmented Caco-2 permeability dataset and evaluated a diverse range of machine learning algorithms in combination with different molecular representations. The results indicated that XGBoost generally provided better predictions than comparable models on the test sets. In addition, we investigated the transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets. Our findings, based on Shanghai Qilu’s in-house dataset, showed that the boosting models retained a degree of predictive efficacy when applied to industry data. Furthermore, a Y-randomization test and applicability domain analysis were employed to assess the robustness and generalizability of these models. Matched molecular pair analysis (MMPA) was used to extract chemical transformation rules. We believe that the model developed in this study could represent a reliable tool for assessing Caco-2 permeability during early-stage drug discovery, and that the chemical transformation rules derived here could provide insights for optimizing Caco-2 permeability.
Scientific contribution: A comprehensive validation of various machine learning algorithms combined with diverse molecular representations on a large dataset for predicting Caco-2 permeability is reported. The transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets was also investigated. Matched molecular pair analysis was carried out to provide reasonable suggestions for researchers to improve the Caco-2 permeability of compounds.
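The Y-randomization test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy only, swaps in an ordinary least-squares model as a stand-in for XGBoost, and runs on synthetic data rather than Caco-2 measurements. The function names (`fit_predict`, `y_randomization`) are hypothetical. The idea is that a model trained on shuffled labels should score far worse than one trained on the true labels; if it does not, the original fit is likely a chance correlation.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_predict(X_train, y_train, X_test):
    """Least-squares fit with an intercept (stand-in for XGBoost; an assumption)."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    return A_test @ coef


def y_randomization(X_train, y_train, X_test, y_test, n_rounds=20, seed=0):
    """Compare test R^2 of the real model with models trained on shuffled labels."""
    rng = np.random.default_rng(seed)
    true_r2 = r2_score(y_test, fit_predict(X_train, y_train, X_test))
    shuffled_r2 = []
    for _ in range(n_rounds):
        # Break the feature-label relationship while keeping the label distribution.
        y_perm = rng.permutation(y_train)
        shuffled_r2.append(r2_score(y_test, fit_predict(X_train, y_perm, X_test)))
    return true_r2, np.array(shuffled_r2)


# Synthetic descriptors and responses (illustrative only, not permeability data).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 5))
weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ weights + data_rng.normal(scale=0.1, size=300)

true_r2, shuffled_r2 = y_randomization(X[:200], y[:200], X[200:], y[200:])
print(f"R^2 with true labels:     {true_r2:.3f}")
print(f"R^2 with shuffled labels: mean {shuffled_r2.mean():.3f}, max {shuffled_r2.max():.3f}")
```

A large gap between the two scores suggests the model has learned a real structure-response relationship. The same loop should work with any regressor that exposes a fit-then-predict step.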
id doaj-art-52addb2516a84aeab780b3373d74aa1b
institution Kabale University
issn 1758-2946
spelling doaj-art-52addb2516a84aeab780b3373d74aa1b | 2025-01-12T12:37:25Z | eng | BMC | Journal of Cheminformatics | 1758-2946 | 2025-01-01 | vol. 17, no. 1, pp. 1-14 | 10.1186/s13321-025-00947-z
Author affiliations:
Dong Wang, Jieyu Jin, Shimeng Li, Peichen Pan, Dan Li, Yu Kang, Tingjun Hou: Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University
Guqin Shi, Jingxiao Bao, Zheng Wang: Shanghai Qilu Pharmaceutical R&D Center
title ADMET evaluation in drug discovery: 21. Application and industrial validation of machine learning algorithms for Caco-2 permeability prediction
topic Caco-2 permeability
Machine learning
Matched molecular pair