Data splitting to avoid information leakage with DataSAIL

Bibliographic Details
Main Authors:	Roman Joeres, David B. Blumenthal, Olga V. Kalinina
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-04-01
Series:	Nature Communications
Online Access:	https://doi.org/10.1038/s41467-025-58606-8

Description
Summary:	Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leakage happens during a model’s training, the model risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package that facilitates leakage-reduced data splitting, enabling realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL’s impact on evaluating biomedical machine learning models.
ISSN:	2041-1723
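
Illustration: the summary describes splitting data so that similar samples do not end up on both sides of a split. The sketch below is not DataSAIL's API or algorithm. It assumes cluster labels have already been computed by some similarity clustering, and it replaces the integer linear program with a simple greedy assignment. The names `cluster_split`, `cluster_of`, and `fractions` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def cluster_split(cluster_of, fractions):
    """Assign whole clusters to splits so no cluster spans two splits.

    cluster_of: dict mapping sample ID -> cluster label (assumed precomputed).
    fractions:  dict mapping split name -> target fraction of all samples.
    Greedy stand-in for an exact optimization: this is an assumption of the
    sketch, not the method used by DataSAIL.
    """
    # Group sample IDs by their cluster label.
    clusters = defaultdict(list)
    for sample, cluster in cluster_of.items():
        clusters[cluster].append(sample)

    total = len(cluster_of)
    targets = {name: frac * total for name, frac in fractions.items()}
    splits = {name: [] for name in fractions}

    # Place the largest clusters first; each goes to the split that is
    # currently furthest below its target size.
    order = sorted(clusters, key=lambda c: (-len(clusters[c]), str(c)))
    for cluster in order:
        name = max(splits, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(clusters[cluster])
    return splits
```

Because clusters are moved as indivisible units, the resulting split sizes only approximate the requested fractions. How closely they match depends on the cluster size distribution.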


Similar Items