A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information

Reconstructing the physical complexity of many-body dynamical systems can be a hard task. Starting from the trajectories of their constitutive units (raw data), typical approaches require choosing adequate parameters/descriptors to convert them into time-series that are then analyzed to extract huma...

Bibliographic Details
Main Authors: Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M Pavan
Format: Article
Language: English
Published: IOP Publishing 2025-01-01
Series: Machine Learning: Science and Technology
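Subjects: high-dimensional analysis; dimensionality reduction; information extraction; time-series analysis; unsupervised clustering; descriptors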
Online Access: https://doi.org/10.1088/2632-2153/adfa66
author Simone Martino
Domiziano Doria
Chiara Lionello
Matteo Becchi
Giovanni M Pavan
collection DOAJ
description Reconstructing the physical complexity of many-body dynamical systems can be a hard task. Starting from the trajectories of their constitutive units (raw data), typical approaches require choosing adequate parameters/descriptors to convert them into time-series that are then analyzed to extract human-interpretable information. However, identifying the best descriptor is often far from trivial. Here we report a data-driven approach that makes it possible to compare the efficiency of different types of descriptors in extracting information from noisy trajectories and translating them into physically-relevant information. As a prototypical example of a system with non-trivial internal complexity, we analyze molecular dynamics trajectories of an atomistic model system where ice and water coexist dynamically at the solid/liquid transition temperature. We compare different types of general or specific descriptors often used to study aqueous systems, e.g. number of neighbors, molecular velocities, smooth overlap of atomic positions (SOAP), local environments and neighbors shuffling (LENS), orientational tetrahedral order, and distance from the fifth neighbor (d_5). We use Onion clustering (an efficient unsupervised clustering method for time-series analysis) to assess the maximum amount of information that can be extracted from the noisy trajectories by the various descriptors, which we then rank via a high-dimensional metric. Our results demonstrate how advanced descriptors, such as SOAP and LENS, outperform classical ones thanks to higher signal-to-noise ratios. Nonetheless, even the simplest descriptor can become as efficient as advanced ones, or even more so, upon local denoising of its signal. This is the case of, e.g., d_5, among the worst performing descriptors, which after denoising becomes by far the best one in resolving the non-strictly-local dynamical complexity of such an ice/water system. This work highlights the critical role of noise in the process of information extraction and offers a data-driven approach to identify optimal descriptors for systems with characteristic internal complexity.
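To make two of the ideas in the abstract concrete, the following minimal sketch computes d_5 (the distance from each particle to its fifth nearest neighbor) for a single frame and then smooths it by averaging over nearby particles. It is an illustration only: the abstract does not specify how the local denoising is performed, so the spatial average within a cutoff used here is an assumption, as are the function names, the random test coordinates, the cutoff value, and the absence of periodic boundary conditions.

```python
import numpy as np


def pairwise_distances(positions):
    """All-pairs Euclidean distances for an (N, 3) array (no periodic boundaries)."""
    diff = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def kth_neighbor_distance(positions, k=5):
    """Distance from each particle to its k-th nearest neighbor (d_5 for k=5)."""
    dist = pairwise_distances(positions)
    # After sorting each row, column 0 is the particle itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dist, axis=1)[:, k]


def local_average(values, positions, cutoff):
    """Assumed denoising step: average each value over particles within `cutoff`."""
    mask = pairwise_distances(positions) <= cutoff  # includes the particle itself
    return (mask * values[None, :]).sum(axis=1) / mask.sum(axis=1)


# Hypothetical single frame: 200 particles placed at random in a 10 x 10 x 10 box.
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 10.0, size=(200, 3))

d5 = kth_neighbor_distance(positions, k=5)
d5_smooth = local_average(d5, positions, cutoff=3.0)

# Averaging reduces the particle-to-particle scatter of the descriptor.
print("std of raw d_5:     ", d5.std())
print("std of smoothed d_5:", d5_smooth.std())
```

In a real analysis the positions would come from the molecular dynamics trajectory, the descriptor would be evaluated at every frame to obtain one time-series per molecule, and those time-series would then be passed to the clustering step.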
format Article
id doaj-art-5cf7c24657ce49128bf1d53c00fd17bd
institution Kabale University
issn 2632-2153
language English
publishDate 2025-01-01
publisher IOP Publishing
record_format Article
series Machine Learning: Science and Technology
spelling doaj-art-5cf7c24657ce49128bf1d53c00fd17bd
2025-08-26T07:33:45Z
eng
IOP Publishing
Machine Learning: Science and Technology, ISSN 2632-2153, 2025-01-01, volume 6, issue 3, article 035039
10.1088/2632-2153/adfa66
Simone Martino, https://orcid.org/0009-0009-7369-3809
Domiziano Doria, https://orcid.org/0000-0002-6176-3576
Chiara Lionello, https://orcid.org/0000-0002-7491-8952
Matteo Becchi, https://orcid.org/0000-0002-6306-5229
Giovanni M Pavan, https://orcid.org/0000-0002-3473-8471
Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy (all authors)
title A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information
topic high-dimensional analysis
dimensionality reduction
information extraction
time-series analysis
unsupervised clustering
descriptors
url https://doi.org/10.1088/2632-2153/adfa66