The topology of molecular representations and its influence on machine learning performance

Bibliographic Details
Main Authors: Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff
Format: Article
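Language: English
Published: BMC 2025-07-01
Series: Journal of Cheminformatics
Subjects: Molecular property prediction; QSAR; Persistent homology; Topological data analysis; Generalizability
Online Access: https://doi.org/10.1186/s13321-025-01045-w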
author Florian Rottach
Sebastian Schieferdecker
Carsten Eickhoff
collection DOAJ
description Abstract: Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations. Scientific contribution: Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model.
format Article
id doaj-art-a988feb44e12476c98b46f2ba0c2fad3
institution DOAJ
issn 1758-2946
language English
publishDate 2025-07-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj-art-a988feb44e12476c98b46f2ba0c2fad3 2025-08-20T03:06:04Z eng BMC Journal of Cheminformatics 1758-2946 2025-07-01 17 1 1 25 10.1186/s13321-025-01045-w The topology of molecular representations and its influence on machine learning performance Florian Rottach (Central Data Science, Boehringer Ingelheim GmbH); Sebastian Schieferdecker (Computational Toxicology, Boehringer Ingelheim Pharma GmbH & Co. KG); Carsten Eickhoff (School of Medicine, University of Tübingen) https://doi.org/10.1186/s13321-025-01045-w Molecular property prediction; QSAR; Persistent homology; Topological data analysis; Generalizability
title The topology of molecular representations and its influence on machine learning performance
topic Molecular property prediction
QSAR
Persistent homology
Topological data analysis
Generalizability
url https://doi.org/10.1186/s13321-025-01045-w