CrysMTM: a multiphase, temperature-resolved, multimodal dataset for crystalline materials

We present CrysMTM, a large-scale, multimodal dataset designed to benchmark temperature- and phase-sensitive machine learning models for crystalline materials. The dataset comprises approximately 30 000 atomistic samples covering the three primary polymorphs of titanium dioxide–anatase, brookite, an...

Bibliographic Details
Main Authors: Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Format: Article
Language: English
Published: IOP Publishing 2025-01-01
Series: Machine Learning: Science and Technology
Subjects: LLM; DFTB; temperature dependence; benchmark; dataset; explainability
Online Access: https://doi.org/10.1088/2632-2153/adf9bc
collection DOAJ
description We present CrysMTM, a large-scale, multimodal dataset designed to benchmark temperature- and phase-sensitive machine learning models for crystalline materials. The dataset comprises approximately 30 000 atomistic samples covering the three primary polymorphs of titanium dioxide (anatase, brookite, and rutile), each evaluated across a temperature range spanning cryogenic, ambient, and elevated conditions. Each data entry integrates three complementary modalities: (1) three-dimensional atomic coordinates, (2) RGBA molecular visualizations, and (3) structured textual metadata encompassing geometric descriptors, local bonding environments, and phase transformation parameters. This multimodal structure enables both supervised and self-supervised learning across graph-based, image-based, and language-based architectures. CrysMTM supports rigorous evaluation of model robustness under thermal perturbations and crystallographic phase transitions. Baseline benchmarking across 18 models, including graph neural networks (GNNs), convolutional neural networks, and foundation models, reveals significant property-specific challenges. For example, bandgap predictions exhibit errors exceeding 25%, while volumetric expansion and atomic displacement estimations frequently deviate by more than 100%. Even state-of-the-art GNNs, which achieve an average in-distribution (ID) mean absolute percentage error of approximately 20%, show a threefold increase in that error under out-of-distribution (OOD) thermal conditions. In contrast, a few-shot multimodal large language model reduces global prediction error from 96% to 23% and narrows the performance gap between ID and OOD cases to just four percentage points. These results highlight both the selective difficulty posed by temperature-sensitive geometric targets and the considerable room for innovation in model design. All dataset files, model implementations, and pretrained checkpoints are available at https://github.com/KurbanIntelligenceLab/CrysMTM.
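The abstract reports results as mean absolute percentage error (MAPE) and as a gap between in-distribution and out-of-distribution error. The sketch below shows one common way such figures could be computed. It is not taken from the CrysMTM repository: the function names `mape` and `id_ood_gap` are hypothetical, and the numbers in the usage lines are toy values, not dataset results.

```python
from typing import Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        # MAPE is undefined when a reference value is zero.
        raise ValueError("reference values must be non-zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def id_ood_gap(
    id_true: Sequence[float],
    id_pred: Sequence[float],
    ood_true: Sequence[float],
    ood_pred: Sequence[float],
) -> float:
    """OOD MAPE minus ID MAPE, in percentage points."""
    return mape(ood_true, ood_pred) - mape(id_true, id_pred)


if __name__ == "__main__":
    # Toy values for illustration only (assumed, not from the paper).
    id_true, id_pred = [100.0, 200.0], [90.0, 220.0]
    ood_true, ood_pred = [100.0, 200.0], [70.0, 260.0]
    print(f"ID MAPE:  {mape(id_true, id_pred):.1f}%")
    print(f"OOD MAPE: {mape(ood_true, ood_pred):.1f}%")
    print(f"Gap: {id_ood_gap(id_true, id_pred, ood_true, ood_pred):.1f} points")
```

Reporting the gap in percentage points rather than as a ratio matches the abstract's "four percentage points" phrasing; a ratio of the two MAPE values would instead correspond to its "threefold increase" wording.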
id doaj-art-aa6e21bf9cd84786a8d3ab85f6edf2c7
institution Kabale University
issn 2632-2153
spelling doaj-art-aa6e21bf9cd84786a8d3ab85f6edf2c7; 2025-08-21T12:42:28Z; eng; IOP Publishing; Machine Learning: Science and Technology; 2632-2153; 2025-01-01; 6; 3; 030603; 10.1088/2632-2153/adf9bc
CrysMTM: a multiphase, temperature-resolved, multimodal dataset for crystalline materials
Can Polat (https://orcid.org/0000-0002-1458-302X), Electrical & Computer Engineering, Texas A&M University, College Station, TX, United States of America
Erchin Serpedin, Electrical & Computer Engineering, Texas A&M University, College Station, TX, United States of America
Mustafa Kurban (https://orcid.org/0000-0002-7263-0234), Department of Prosthetics and Orthotics, Ankara University, Ankara, Turkey; Electrical & Computer Engineering, Texas A&M University at Qatar, Doha, Qatar
Hasan Kurban (https://orcid.org/0000-0003-3142-2866), College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
https://doi.org/10.1088/2632-2153/adf9bc
LLM; DFTB; temperature dependence; benchmark; dataset; explainability
topic LLM
DFTB
temperature dependence
benchmark
dataset
explainability