DataSAIL: Data Splitting Against Information Leakage
Information Leakage is an increasing problem in machine learning research. It is common practice to benchmark models against the state of the art by reporting their performance on the test splits of datasets. If two or more splits of a dataset contain identical or highly similar samples, a model risks simply memorizing them, and hence its true performance is overestimated. Depending on the application of the model, the challenge is to find splits that minimize the similarity between data points in different splits. Frequently, reducing the similarity between training and test sets leads to a considerable drop in performance, which signals that Information Leakage has been removed. Recent work has shown that Information Leakage is an emerging problem in model performance assessment. This work presents DataSAIL, a tool for splitting biological datasets while minimizing Information Leakage in different settings. DataSAIL splits the dataset such that the total similarity between samples assigned to different splits is minimized. To this end, we formulate data splitting as a Binary Linear Program (BLP) following the rules of Disciplined Quasi-Convex Programming (DQCP) and optimize a solution. DataSAIL can split one-dimensional data, e.g., for property prediction, and two-dimensional data, e.g., data organized as a matrix of binding affinities between two sets of molecules, accounting for similarities along each dimension and for missing values. We compute splits of the MoleculeNet benchmarks using DeepChem, the LoHi splitter, GraphPart, and DataSAIL to compare their computational speed and quality. We show that DataSAIL can impose more complex learning tasks on machine learning models and allows for a better assessment of how well a model generalizes beyond the data presented during training.
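The splitting objective described in the abstract can be illustrated with a minimal sketch. This is not DataSAIL's implementation or its BLP formulation: it replaces the solver with an exhaustive search over a toy dataset, and the names `cross_split_similarity` and `best_split`, the integer similarity scores, and the exact split sizes are all assumptions made for illustration.

```python
from itertools import product


def cross_split_similarity(sim, assignment):
    """Sum the similarity of every sample pair placed in different splits."""
    n = len(assignment)
    return sum(
        sim[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if assignment[i] != assignment[j]
    )


def best_split(sim, sizes):
    """Try every assignment with the requested split sizes (toy-scale only)."""
    n = len(sim)
    best = None
    for assignment in product(range(len(sizes)), repeat=n):
        # Keep only assignments that match the requested size of each split.
        if any(assignment.count(k) != size for k, size in enumerate(sizes)):
            continue
        cost = cross_split_similarity(sim, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical data: samples 0-3 form one group of near-duplicates, 4-5 another.
# Similarities are integers (percent) to keep the arithmetic exact.
clusters = [0, 0, 0, 0, 1, 1]
sim = [[90 if a == b else 10 for b in clusters] for a in clusters]

cost, assignment = best_split(sim, sizes=(4, 2))
leaky_cost = cross_split_similarity(sim, (0, 0, 0, 1, 0, 1))

print("best assignment:", assignment, "cross-split similarity:", cost)
print("leaky assignment cost:", leaky_cost)
```

In this toy case the search keeps each group of near-duplicates inside a single split, whereas the hand-picked leaky assignment separates similar samples and scores a much higher cross-split similarity. Exhaustive search grows exponentially with the number of samples, which is why a realistic tool needs an optimization formulation instead.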
Blumenthal David B, Joeres Roman, Kalinina Olga V
Biological science research methods and techniques; computing technology and computer technology; biological science theory and methods
Blumenthal David B, Joeres Roman, Kalinina Olga V. DataSAIL: Data Splitting Against Information Leakage [EB/OL]. (2025-03-28) [2025-04-24]. https://www.biorxiv.org/content/10.1101/2023.11.15.566305.