IR-NMR multimodal computational spectra dataset for 177K patent-extracted organic molecules

Bibliographic Details
Main Authors: Federico Zipoli, Marvin Alberts, Teodoro Laino
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05729-8
Description
Summary: The construction of predictive models in molecular science increasingly relies on large, high-quality datasets. Synthetic data generation is becoming a foundational strategy for advancing model accuracy and enabling fast discovery workflows. To support the development of structure elucidation and spectral property prediction models, we present a comprehensive synthetic dataset of infrared (IR) and nuclear magnetic resonance (NMR) spectra for a diverse ensemble of organic molecules. The data were generated using a hybrid computational approach that integrates molecular dynamics (MD) simulations, density functional theory (DFT) calculations, and machine learning (ML) models. The dataset primarily consists of IR spectra for 177,461 molecules, derived from long-timescale MD simulations with ML-accelerated dipole moment predictions. In addition, it includes a smaller subset of 1H-NMR and 13C-NMR chemical shifts for 1,255 molecules. This unique combination of spectral data offers a valuable resource for benchmarking and validating computational methodologies, developing and enhancing artificial intelligence (AI) models for molecular property prediction, and facilitating the interpretation of experimental spectroscopic results. The dataset is publicly available through Zenodo, encouraging its broad utilization within the scientific community.
ISSN: 2052-4463