Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory
Abstract: An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze information contents from simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information contents in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential developments, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events in systems such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
| Main Authors: | Daniel Schwalbe-Koda, Sebastien Hamel, Babak Sadigh, Fei Zhou, Vincenzo Lordi (all Lawrence Livermore National Laboratory) |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Nature Communications |
| Online Access: | https://doi.org/10.1038/s41467-025-59232-0 |
| ISSN: | 2041-1723 |
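The abstract centers on estimating the information entropy of a distribution of atom-centered environments and using it for model-free UQ and out-of-distribution detection. As a rough illustration only, the sketch below shows a generic kernel-based entropy estimate over descriptor vectors; the Gaussian kernel, the bandwidth `h`, and the random toy "descriptors" are all assumptions made here for demonstration, not the authors' exact estimator or data.

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    """Pairwise Gaussian kernel between rows of a (n, d) and b (m, d)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * h**2))

def dataset_entropy(X, h=1.0):
    """Kernel estimate of the information entropy of a set of
    atom-centered descriptors X with shape (n_envs, n_features)."""
    K = gaussian_kernel(X, X, h)
    # Kernel density around each environment, averaged in log space.
    return -np.mean(np.log(np.mean(K, axis=1)))

def delta_entropy(Y, X, h=1.0):
    """Per-sample 'surprise' of query environments Y relative to a
    reference set X; large values flag out-of-distribution samples."""
    K = gaussian_kernel(Y, X, h)
    return -np.log(np.mean(K, axis=1))

# Toy usage: a reference set, in-distribution queries, and displaced
# (out-of-distribution) queries built from random stand-in descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
Y_in = rng.normal(size=(5, 8))
Y_out = Y_in + 6.0
print("dataset entropy:", dataset_entropy(X))
print("dH in-dist.:    ", delta_entropy(Y_in, X))
print("dH out-of-dist.:", delta_entropy(Y_out, X))
```

In this sketch, the displaced queries receive a much larger per-sample entropy than the in-distribution ones, which is the qualitative behavior the abstract attributes to its UQ signal: environments far from the training distribution carry high "surprise" and can be flagged without reference to any trained model.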