
Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Source: arXiv
Abstract

Data reduction is essential to data-centric AI: it improves the effectiveness of model training by locating the most informative examples in massive datasets. The main challenge is to select the best examples, rather than using the complete dataset, so as to improve data quality and training efficiency. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). First, as a static method, we use PVI to quantify instance difficulty and remove low-difficulty instances. Experiments show that classifier performance is maintained, with only a 0.0001% to 0.76% drop in accuracy, when 10%-30% of the data is removed. Second, we train classifiers with a progressive learning strategy on examples sorted by increasing PVI, which accelerates convergence and yields a 0.8% accuracy gain over conventional training. Our findings suggest that, combined with an effective data reduction strategy, training a classifier on the selected optimal subset can improve model performance and training efficiency. Furthermore, we have adapted the PVI framework, previously limited to English datasets, to a variety of Chinese NLP tasks and base models, yielding insightful results for faster training and cross-lingual data reduction. The code is released at https://github.com/zhouwenchi/DatasetReductionStrategy.
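The two steps the abstract describes (PVI-based filtering and PVI-sorted progressive training) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trained conditional model and the null (empty-input) model are abstracted as hypothetical callables `cond_model(x, y) -> P(y|x)` and `null_model(y) -> P(y)`, and the mapping "high PVI = low difficulty" is an assumption inferred from the abstract's wording (the paper itself uses fine-tuned base models to estimate these probabilities).

```python
import math

def pvi(cond_prob, null_prob):
    """PVI(x -> y) = -log2 g_null(y) + log2 g_cond(y | x)."""
    return math.log2(cond_prob) - math.log2(null_prob)

def pvi_filter(examples, cond_model, null_model, drop_fraction=0.2):
    """Static reduction: score each (x, y) pair with PVI and drop the
    lowest-difficulty fraction (assumed here to be the highest-PVI
    instances, i.e. those whose label the input makes easiest to predict).
    """
    scored = [(pvi(cond_model(x, y), null_model(y)), x, y) for x, y in examples]
    scored.sort(key=lambda t: t[0])                 # ascending PVI: hard -> easy
    keep = len(examples) - int(len(examples) * drop_fraction)
    return [(x, y) for _, x, y in scored[:keep]]

def curriculum_order(examples, cond_model, null_model):
    """Progressive training: present examples sorted by increasing PVI."""
    return sorted(examples,
                  key=lambda xy: pvi(cond_model(*xy), null_model(xy[1])))
```

With dummy probability tables in place of trained models, `pvi_filter` drops the example whose conditional probability rises most over the label marginal, and `curriculum_order` yields the remaining examples from lowest to highest PVI.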

Fei Chen、Wenchi Zhou

Subject: computing technology; computer technology

Fei Chen, Wenchi Zhou. Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [EB/OL]. (2025-07-14) [2025-07-16]. https://arxiv.org/abs/2507.00038.