The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations
Abstract: Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data that was used to train the model. In this work, we introduce the QCML dataset: A c...
| Main Authors: | Stefan Ganscha, Oliver T. Unke, Daniel Ahlin, Hartmut Maennel, Sergii Kashubin, Klaus-Robert Müller |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-03-01 |
| Series: | Scientific Data |
| Online Access: | https://doi.org/10.1038/s41597-025-04720-7 |
| _version_ | 1850252093674225664 |
|---|---|
| author | Stefan Ganscha Oliver T. Unke Daniel Ahlin Hartmut Maennel Sergii Kashubin Klaus-Robert Müller |
| collection | DOAJ |
| description | Abstract Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data that was used to train the model. In this work, we introduce the QCML dataset: A comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules consisting of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities, e.g., Kohn-Sham matrices. We provide a first demonstration of the utility of our dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations. |
| format | Article |
| id | doaj-art-cc3d261a9ee04fe0b42b1333accde9c0 |
| institution | OA Journals |
| issn | 2052-4463 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Data |
| spelling | doaj-art-cc3d261a9ee04fe0b42b1333accde9c0; 2025-08-20T01:57:44Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2025-03-01; 121115; 10.1038/s41597-025-04720-7; The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations; Stefan Ganscha (Google DeepMind), Oliver T. Unke (Google DeepMind), Daniel Ahlin (Google DeepMind), Hartmut Maennel (Google DeepMind), Sergii Kashubin (Google DeepMind), Klaus-Robert Müller (Google DeepMind); https://doi.org/10.1038/s41597-025-04720-7 |
| title | The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations |
| url | https://doi.org/10.1038/s41597-025-04720-7 |