Data-driven organic solubility prediction at the limit of aleatoric uncertainty

Bibliographic Details
Main Authors: Lucas Attia, Jackson W. Burns, Patrick S. Doyle, William H. Green
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-62717-7
_version_ 1849226143941525504
author Lucas Attia
Jackson W. Burns
Patrick S. Doyle
William H. Green
author_sort Lucas Attia
collection DOAJ
description Abstract Small molecule solubility is a critically important property which affects the efficiency, environmental impact, and phase behavior of synthetic processes. Experimental determination of solubility is a time- and resource-intensive process and existing methods for in silico estimation of solubility are limited by their generality, speed, and accuracy. This work presents two models derived from the FASTPROP and CHEMPROP architectures and trained on BigSolDB which are capable of predicting solubility at arbitrary temperatures for a wide range of small molecules in organic solvent. Both extrapolate to unseen solutes 2–3 times more accurately than the current state-of-the-art model and we demonstrate that they are approaching the aleatoric limit (0.5–1 $\log S$) of available test data, suggesting that further improvements in prediction accuracy require more accurate datasets. The FASTPROP-derived model (called FASTSOLV) and the CHEMPROP-based model are open source, freely accessible via a Python package and web interface, highly reproducible, and up to 2 orders of magnitude faster than current alternatives.
format Article
id doaj-art-86732046b8024a0ab71e91d1e23076c3
institution Kabale University
issn 2041-1723
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling doaj-art-86732046b8024a0ab71e91d1e23076c3 | 2025-08-24T11:36:57Z | eng | Nature Portfolio | Nature Communications | 2041-1723 | 2025-08-01 | 16 | 1 | 1 | 10 | 10.1038/s41467-025-62717-7 | Data-driven organic solubility prediction at the limit of aleatoric uncertainty | Lucas Attia (Department of Chemical Engineering, MIT); Jackson W. Burns (Department of Chemical Engineering, MIT); Patrick S. Doyle (Department of Chemical Engineering, MIT); William H. Green (Department of Chemical Engineering, MIT) | https://doi.org/10.1038/s41467-025-62717-7
title Data-driven organic solubility prediction at the limit of aleatoric uncertainty
url https://doi.org/10.1038/s41467-025-62717-7