A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information
Reconstructing the physical complexity of many-body dynamical systems can be a hard task. Starting from the trajectories of their constitutive units (raw data), typical approaches require choosing adequate parameters/descriptors to convert them into time-series that are then analyzed to extract huma...
| Main Authors: | Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M Pavan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IOP Publishing, 2025-01-01 |
| Series: | Machine Learning: Science and Technology |
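| Subjects: | high-dimensional analysis, dimensionality reduction, information extraction, time-series analysis, unsupervised clustering, descriptors |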
| Online Access: | https://doi.org/10.1088/2632-2153/adfa66 |
| _version_ | 1849222162318098432 |
|---|---|
| author | Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M Pavan |
| author_facet | Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M Pavan |
| author_sort | Simone Martino |
| collection | DOAJ |
| description | Reconstructing the physical complexity of many-body dynamical systems can be a hard task. Starting from the trajectories of their constitutive units (raw data), typical approaches require choosing adequate parameters/descriptors to convert them into time-series that are then analyzed to extract human-interpretable information. However, identifying the best descriptor is often far from trivial. Here we report a data-driven approach that makes it possible to compare the efficiency of different types of descriptors in extracting information from noisy trajectories and translating it into physically-relevant information. As a prototypical example of a system with non-trivial internal complexity, we analyze molecular dynamics trajectories of an atomistic model system where ice and water coexist dynamically at the solid/liquid transition temperature. We compare different types of general or specific descriptors often used to study aqueous systems, e.g. number of neighbors, molecular velocities, smooth overlap of atomic positions (SOAP), local environments and neighbors shuffling (LENS), orientational tetrahedral order, and distance from the fifth neighbor (d_5). We use Onion clustering (an efficient unsupervised clustering method for time-series analysis) to assess the maximum amount of information that can be extracted from the noisy trajectories by the various descriptors, which we then rank via a high-dimensional metric. Our results demonstrate how advanced descriptors, such as SOAP and LENS, outperform classical ones thanks to higher signal-to-noise ratios. Nonetheless, even the simplest descriptor can become as efficient as advanced ones, or even more so, upon local denoising of its signal. This is the case, e.g., for d_5, which is among the worst-performing descriptors but becomes, after denoising, by far the best one in resolving the non-strictly-local dynamical complexity of such an ice/water system. This work highlights the critical role of noise in the process of information extraction and offers a data-driven approach to identify optimal descriptors for systems with characteristic internal complexity. |
| format | Article |
| id | doaj-art-5cf7c24657ce49128bf1d53c00fd17bd |
| institution | Kabale University |
| issn | 2632-2153 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IOP Publishing |
| record_format | Article |
| series | Machine Learning: Science and Technology |
| spelling | doaj-art-5cf7c24657ce49128bf1d53c00fd17bd; 2025-08-26T07:33:45Z; eng; IOP Publishing; Machine Learning: Science and Technology; 2632-2153; 2025-01-01; 6; 3; 035039; 10.1088/2632-2153/adfa66; A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information; Simone Martino (0; https://orcid.org/0009-0009-7369-3809); Domiziano Doria (1; https://orcid.org/0000-0002-6176-3576); Chiara Lionello (2; https://orcid.org/0000-0002-7491-8952); Matteo Becchi (3; https://orcid.org/0000-0002-6306-5229); Giovanni M Pavan (4; https://orcid.org/0000-0002-3473-8471); Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy; https://doi.org/10.1088/2632-2153/adfa66; high-dimensional analysis; dimensionality reduction; information extraction; time-series analysis; unsupervised clustering; descriptors |
| title | A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information |
| title_sort | data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically relevant information |
| topic | high-dimensional analysis; dimensionality reduction; information extraction; time-series analysis; unsupervised clustering; descriptors |
| url | https://doi.org/10.1088/2632-2153/adfa66 |
| work_keys_str_mv | AT simonemartino adatadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT domizianodoria adatadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT chiaralionello adatadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT matteobecchi adatadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT giovannimpavan adatadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT simonemartino datadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT domizianodoria datadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT chiaralionello datadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT matteobecchi datadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation AT giovannimpavan datadrivenapproachtoclassifydescriptorsbasedontheirefficiencyintranslatingnoisytrajectoriesintophysicallyrelevantinformation |
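
The description above names the distance from the fifth neighbor (d_5) and a local denoising step, but this record gives no formulas for either. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a cubic periodic box, and it assumes that local denoising means averaging each particle's signal over all particles within a cutoff radius. The function names `kth_neighbor_distance` and `local_denoise`, the cutoff value, and the random test configuration are all hypothetical.

```python
import numpy as np


def kth_neighbor_distance(positions, box, k=5):
    """Return (d_k, dist): distance to the k-th nearest neighbor of each
    particle, and the full pairwise distance matrix.

    Assumes a cubic periodic box of side `box` (an assumption of this sketch).
    """
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box * np.round(delta / box)  # minimum-image convention
    dist = np.linalg.norm(delta, axis=-1)
    # Each sorted row starts with the self-distance (0), so column k
    # holds the distance to the k-th nearest neighbor.
    return np.sort(dist, axis=1)[:, k], dist


def local_denoise(values, dist, cutoff):
    """Average each particle's value over all particles within `cutoff`
    (the particle itself included). This averaging rule is an assumed
    form of local denoising, chosen only for illustration."""
    mask = (dist <= cutoff).astype(float)
    return mask @ values / mask.sum(axis=1)


# Usage on a random configuration (illustrative numbers only).
rng = np.random.default_rng(42)
positions = rng.uniform(0.0, 10.0, size=(200, 3))
d5, dist = kth_neighbor_distance(positions, box=10.0)
d5_smooth = local_denoise(d5, dist, cutoff=2.0)
```

For a trajectory, the same two calls would be repeated frame by frame to build one d_5 time-series per particle, which is the kind of input a time-series clustering method would then analyze.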