Data quality or data quantity? Prioritizing data collection under distribution shift with the data usefulness coefficient

Source: arXiv
Abstract

Researchers often have access to multiple data sources of varying quality. For example, in psychology, a researcher may decide between running an experiment on an online platform or on a representative sample of the population of interest. Collecting a representative sample will result in higher quality data but is often more expensive. This raises the question of how to optimally prioritize data collection under resource constraints. We study this question in a setting where the distribution shift arises through many independent random changes in the population. We introduce a "data usefulness coefficient" (DUC) and show that it allows us to predict how much the risk of empirical risk minimization would decrease if a specific data set were added to the training data. An advantage of our procedure is that it does not require access to any outcome data $Y$. Instead, we rely on a random shift assumption, which implies that the strength of covariate ($X$) shift is predictive of the shift in $Y \mid X$. We also derive methods for sampling under budget and size constraints. We demonstrate the benefits of data collection based on DUC and our optimal sampling strategy in several numerical experiments.

Ivy Zhang, Dominik Rothenhäusler

Computing technology, computer technology

Ivy Zhang, Dominik Rothenhäusler. Data quality or data quantity? Prioritizing data collection under distribution shift with the data usefulness coefficient [EB/OL]. (2025-04-09) [2025-05-21]. https://arxiv.org/abs/2504.06570.