GneissWeb: Preparing High Quality Data for LLMs at Scale
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost an LLM's ability to generalize to a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform those trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained on the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of the average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained on GneissWeb still achieve a 1.75-percentage-point advantage over those trained on FineWeb-V1.1.0.
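The exact sub-string deduplication step mentioned above can be illustrated with a simplified sketch. The paper's recipe operates at scale with sharding (typically via suffix-array based matching); the toy version below is an assumption for illustration only, using a fixed-length window hash to find text repeated verbatim across documents and removing later occurrences:

```python
from collections import defaultdict

def find_duplicate_spans(docs, min_len=20):
    """Return, per document, character spans whose text also appears
    verbatim in an earlier document. A simplified stand-in for
    suffix-array based exact sub-string matching."""
    seen = {}                     # window text -> (doc, start) of first sighting
    spans = defaultdict(list)
    for d, text in enumerate(docs):
        for start in range(0, len(text) - min_len + 1):
            window = text[start:start + min_len]
            first = seen.setdefault(window, (d, start))
            # Only flag cross-document repeats; keep the first occurrence.
            if first != (d, start) and first[0] != d:
                spans[d].append((start, start + min_len))
    return spans

def drop_duplicates(docs, min_len=20):
    """Remove every cross-document duplicated window from later
    documents, leaving the first occurrence intact."""
    spans = find_duplicate_spans(docs, min_len)
    cleaned = []
    for d, text in enumerate(docs):
        keep = [True] * len(text)
        for s, e in spans.get(d, []):
            for i in range(s, e):
                keep[i] = False
        cleaned.append("".join(c for c, k in zip(text, keep) if k))
    return cleaned
```

Overlapping flagged windows merge naturally into one removed span, so a long copied passage is deleted wholesale from the later document while the original copy survives. Production-scale deduplication would shard the corpus and use suffix arrays rather than this quadratic-memory windowing.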
Nathalie Baracaldo Angel, Takuyo Ohko, Herbert Woisetschlager, Kun-Lung Wu, Abdulhamid Adebayo, Yan Koyfman, Praneet Adusumilli, Santosh Borse, Shalisha Witherspoon, Ran Iwamoto, Pablo Pesce, Bishwaranjan Bhattacharjee, Petros Zerfos, Farhan Ahmed, Yuan-Chi Chang, Syed Zawad, Xuan-Hong Dang, Maroun Touma, Issei Yoshida, Shiqiang Wang, Syed Yousaf Shah, Constantin Adam, Swanand Ravindra Kadhe, Wei-Han Lee, Yi Zhou, David Wood, Changchang Liu, Ravital Eres, Hajar Emami Gohari, Nirmit Desai, Alexei Karve, Boris Lublinsky
Computing Technology, Computer Technology
GneissWeb: Preparing High Quality Data for LLMs at Scale [EB/OL]. (2025-02-18) [2025-05-19]. https://arxiv.org/abs/2502.14907.