The topology of molecular representations and its influence on machine learning performance
| Main Authors: | Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | Journal of Cheminformatics |
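| Subjects: | Molecular property prediction; QSAR; Persistent homology; Topological data analysis; Generalizability |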
| Online Access: | https://doi.org/10.1186/s13321-025-01045-w |
| _version_ | 1849761343887900672 |
|---|---|
| author | Florian Rottach; Sebastian Schieferdecker; Carsten Eickhoff |
| collection | DOAJ |
| description | Abstract Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations. Scientific contribution Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model. |
| id | doaj-art-a988feb44e12476c98b46f2ba0c2fad3 |
| institution | DOAJ |
| issn | 1758-2946 |
| spelling | doaj-art-a988feb44e12476c98b46f2ba0c2fad3; 2025-08-20T03:06:04Z; eng; BMC; Journal of Cheminformatics; ISSN 1758-2946; 2025-07-01; vol. 17, no. 1, pp. 1-25; 10.1186/s13321-025-01045-w; Florian Rottach (Central Data Science, Boehringer Ingelheim GmbH), Sebastian Schieferdecker (Computational Toxicology, Boehringer Ingelheim Pharma GmbH & Co. KG), Carsten Eickhoff (School of Medicine, University of Tübingen) |
| title | The topology of molecular representations and its influence on machine learning performance |
| topic | Molecular property prediction; QSAR; Persistent homology; Topological data analysis; Generalizability |