
SourceSplice: Source Selection for Machine Learning Tasks

Source: arXiv
Abstract

Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks, a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing, a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and on synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.
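The abstract describes SourceGrasp only as a metaheuristic combining a greediness criterion with randomization. As a rough illustration of that general idea, and not the authors' algorithm, the sketch below applies a generic GRASP-style loop (greedy randomized construction followed by a simple drop-one local search) to source selection. Everything named here is an assumption made for illustration: the function `grasp_select`, the parameters `iterations`, `alpha`, and `seed`, and the toy additive `toy_utility` that stands in for training and scoring a downstream model.

```python
import random
from typing import Callable, FrozenSet, Sequence, Tuple

Utility = Callable[[FrozenSet[str]], float]

# Hypothetical per-source contributions; a real utility would train and
# evaluate the downstream ML model on the combined sources instead.
QUALITY = {"a": 0.5, "b": 0.3, "c": -0.2, "d": 0.1}


def toy_utility(subset: FrozenSet[str]) -> float:
    """Toy stand-in for task utility: sum of per-source contributions."""
    return sum(QUALITY[s] for s in subset)


def grasp_select(
    sources: Sequence[str],
    utility: Utility,
    iterations: int = 20,
    alpha: float = 0.3,
    seed: int = 0,
) -> Tuple[FrozenSet[str], float]:
    """Generic GRASP-style subset selection (illustrative assumption only)."""
    rng = random.Random(seed)
    best_set: FrozenSet[str] = frozenset()
    best_score = utility(best_set)

    for _ in range(iterations):
        # Construction phase: repeatedly add a source drawn at random from
        # the restricted candidate list (RCL) of near-best positive gains.
        current: FrozenSet[str] = frozenset()
        score = utility(current)
        while True:
            gains = [
                (utility(current | {s}) - score, s)
                for s in sources
                if s not in current
            ]
            gains = [(g, s) for g, s in gains if g > 0]
            if not gains:
                break
            g_max = max(g for g, _ in gains)
            g_min = min(g for g, _ in gains)
            threshold = g_max - alpha * (g_max - g_min)
            rcl = [s for g, s in gains if g >= threshold]
            current = current | {rng.choice(rcl)}
            score = utility(current)

        # Local search phase: drop a single source while that improves utility.
        improved = True
        while improved:
            improved = False
            for s in sorted(current):
                candidate = current - {s}
                candidate_score = utility(candidate)
                if candidate_score > score:
                    current, score, improved = candidate, candidate_score, True
                    break

        # Keep the best subset seen across all restarts.
        if score > best_score:
            best_set, best_score = current, score

    return best_set, best_score
```

With the toy utility above, calling `grasp_select(["a", "b", "c", "d"], toy_utility)` should settle on the sources with positive contributions. The `alpha` parameter trades greediness (0 picks only the best candidate) against randomization (1 picks uniformly among all improving candidates).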

Ambarish Singh, Romila Pradhan

Computing technology, computer technology

Ambarish Singh, Romila Pradhan. SourceSplice: Source Selection for Machine Learning Tasks [EB/OL]. (2025-07-31) [2025-08-06]. https://arxiv.org/abs/2507.22186.
