IR-NMR multimodal computational spectra dataset for 177K patent-extracted organic molecules

Bibliographic Details
Main Authors: Federico Zipoli, Marvin Alberts, Teodoro Laino
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05729-8
author Federico Zipoli
Marvin Alberts
Teodoro Laino
collection DOAJ
description Abstract The construction of predictive models in molecular science increasingly relies on large, high-quality datasets. Synthetic data generation is becoming a foundational strategy for advancing model accuracy and enabling fast discovery workflows. To support the development of structure elucidation and spectral property prediction models, we present a comprehensive synthetic dataset of infrared (IR) and nuclear magnetic resonance (NMR) spectra for a diverse ensemble of organic molecules. The data were generated using a hybrid computational approach that integrates molecular dynamics (MD) simulations, density functional theory (DFT) calculations, and machine learning (ML) models. The dataset primarily consists of IR spectra for 177,461 molecules, derived from long-timescale MD simulations with ML-accelerated dipole moment predictions. In addition, it includes a smaller subset of 1H-NMR and 13C-NMR chemical shifts for 1,255 molecules. This unique combination of spectral data offers a valuable resource for benchmarking and validating computational methodologies, developing and enhancing artificial intelligence (AI) models for molecular property prediction, and facilitating the interpretation of experimental spectroscopic results. The dataset is publicly available through Zenodo, encouraging its broad utilization within the scientific community.
format Article
id doaj-art-c66dc7f5d6974acaab8ef68f8891e424
institution Kabale University
issn 2052-4463
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-c66dc7f5d6974acaab8ef68f8891e424; 2025-08-20T03:42:47Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2025-08-01; 121114; 10.1038/s41597-025-05729-8; IR-NMR multimodal computational spectra dataset for 177K patent-extracted organic molecules; Federico Zipoli (IBM Research Europe), Marvin Alberts (IBM Research Europe), Teodoro Laino (IBM Research Europe); https://doi.org/10.1038/s41597-025-05729-8
title IR-NMR multimodal computational spectra dataset for 177K patent-extracted organic molecules
url https://doi.org/10.1038/s41597-025-05729-8
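
The description above states that the IR spectra were derived from MD simulations with ML-predicted dipole moments, but this record does not give the post-processing details. The following is a minimal sketch of one common way to turn a dipole-moment trajectory into an IR line shape: take the power spectrum of the dipole time derivative, which by the Wiener-Khinchin theorem equals the Fourier transform of its autocorrelation function. It is not presented as the authors' pipeline. The function name `ir_spectrum_from_dipoles`, the Hann window, the femtosecond time unit, and the omission of any quantum correction factor are all assumptions made for illustration.

```python
import numpy as np

# Speed of light in cm/fs, used to convert frequency (1/fs) to wavenumber (1/cm).
C_CM_PER_FS = 2.99792458e-5


def ir_spectrum_from_dipoles(dipoles, dt_fs, max_wavenumber=4000.0):
    """Estimate an unnormalised IR line shape from a dipole trajectory.

    dipoles: array of shape (n_steps, 3), one dipole vector per MD frame
             (hypothetical input layout, not taken from the dataset).
    dt_fs:   time between frames in femtoseconds.
    Returns (wavenumbers in 1/cm, intensities in arbitrary units).
    """
    mu = np.asarray(dipoles, dtype=float)
    # Finite-difference time derivative of the dipole moment.
    mu_dot = np.gradient(mu, dt_fs, axis=0)
    n_steps = mu_dot.shape[0]
    # Hann window to reduce spectral leakage from the finite trajectory.
    window = np.hanning(n_steps)[:, None]
    # Power spectrum per Cartesian component, then summed over x, y, z.
    power = np.abs(np.fft.rfft(mu_dot * window, axis=0)) ** 2
    intensity = power.sum(axis=1)
    # Frequencies in cycles per fs, converted to wavenumbers.
    wavenumber = np.fft.rfftfreq(n_steps, d=dt_fs) / C_CM_PER_FS
    keep = wavenumber <= max_wavenumber
    return wavenumber[keep], intensity[keep]


if __name__ == "__main__":
    # Synthetic check: a dipole oscillating along x at 1700 1/cm.
    dt = 0.5                      # fs per frame (assumed)
    t = np.arange(20000) * dt     # 10 ps trajectory
    omega = 2.0 * np.pi * 1700.0 * C_CM_PER_FS
    traj = np.zeros((t.size, 3))
    traj[:, 0] = np.cos(omega * t)
    nu, inten = ir_spectrum_from_dipoles(traj, dt)
    print("peak position (1/cm):", nu[np.argmax(inten)])
```

The spectral resolution of this sketch is set by the trajectory length (about 3.3 cm⁻¹ for the 10 ps synthetic example), which is one reason long MD trajectories are useful for resolving IR bands.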