Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory

Abstract: An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze information contents from simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information contents in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential developments, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events in systems such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
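
The abstract describes scoring the information entropy of a distribution of atom-centered environments and using it for model-free UQ and outlier detection. As a rough, hypothetical illustration of that general idea (not the authors' actual implementation; the function names, the Gaussian-kernel density estimate, and the `bandwidth` parameter are all illustrative assumptions), one can estimate an entropy over a set of descriptor vectors and flag new environments with low density under the training distribution:

```python
# Illustrative sketch only: a kernel-based entropy estimate over
# atom-centered descriptor vectors, and a "surprise" score that grows
# for environments far from the training distribution.
import numpy as np

def kernel_entropy(X, bandwidth=1.0):
    """Estimate H = -mean_i log p(x_i), with p(x) a Gaussian kernel
    density built from the dataset X of descriptor vectors."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    p = K.mean(axis=1)  # kernel density evaluated at each sample
    return -np.mean(np.log(p))

def surprise(X_train, x_new, bandwidth=1.0):
    """Negative log kernel density of a new environment under the
    training set; high values flag out-of-distribution samples."""
    d2 = np.sum((X_train - x_new) ** 2, axis=-1)
    p = np.exp(-d2 / (2.0 * bandwidth ** 2)).mean()
    return -np.log(p)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))     # toy descriptors: 200 envs, 8 features
H = kernel_entropy(X)
s_in = surprise(X, X[0])          # an in-distribution environment
s_out = surprise(X, X[0] + 10.0)  # a far-away (outlier) environment
print(H, s_in < s_out)            # outliers receive a higher surprise
```

In this toy setting the score is model-free in the same spirit as the paper's method: no ML potential is trained, and the uncertainty signal comes purely from the descriptor distribution itself.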

Bibliographic Details
Main Authors: Daniel Schwalbe-Koda, Sebastien Hamel, Babak Sadigh, Fei Zhou, Vincenzo Lordi
Format: Article
Language: English
Published: Nature Portfolio, 2025-04-01
Series: Nature Communications
Online Access:https://doi.org/10.1038/s41467-025-59232-0
Collection: DOAJ
Record ID: doaj-art-715f9073a53a4edc93e044eb2b7b2f45
Institution: OA Journals
ISSN: 2041-1723
Citation: Nature Communications 16 (2025), published 2025-04-01, doi:10.1038/s41467-025-59232-0
Affiliations: Daniel Schwalbe-Koda, Sebastien Hamel, Babak Sadigh, Fei Zhou, and Vincenzo Lordi, all at Lawrence Livermore National Laboratory