
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

Source: arXiv
Abstract

Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
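The pipeline itself is not reproduced on this page, but as a rough illustration of the combination the abstract describes (cheap heuristic filters followed by a model-based quality classifier), here is a minimal Python sketch. All names here (Document, passes_heuristics, score_quality, the thresholds) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of a heuristic + model-based data curation pipeline of the
# kind described in the abstract. Hypothetical names throughout; the paper's
# real pipeline is more elaborate.

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    text: str
    url: str


def passes_heuristics(doc: Document) -> bool:
    """Cheap rule-based filters, applied before any model is run."""
    words = doc.text.split()
    if len(words) < 50:  # drop very short pages
        return False
    # Drop markup/boilerplate-heavy pages with a low alphabetic ratio.
    alpha_ratio = sum(c.isalpha() for c in doc.text) / max(len(doc.text), 1)
    return alpha_ratio >= 0.6


def curate(
    docs: Iterable[Document],
    score_quality: Callable[[str], float],  # model-based quality scorer (hypothetical)
    threshold: float = 0.5,
) -> Iterator[Document]:
    """Keep documents that pass the heuristics AND score above the model threshold."""
    for doc in docs:
        if passes_heuristics(doc) and score_quality(doc.text) >= threshold:
            yield doc


if __name__ == "__main__":
    corpus = [
        Document("Ein langer, informativer deutscher Artikel ... " * 20, "https://example.de/a"),
        Document("Cookie-Banner OK", "https://example.de/b"),
    ]
    # Stand-in scorer; a real pipeline would call a trained quality classifier here.
    dummy_scorer = lambda text: 0.9
    for kept in curate(corpus, dummy_scorer):
        print("kept:", kept.url)
```

The synthetic-data stage the abstract mentions would sit downstream of such a filter, conditioning a generator on the retained organic web documents rather than sampling from scratch.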

Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum, Volker Stampa, Bastian Harren, Björn Deiseroth

Subjects: Commonly used foreign languages; computing technology, computer technology

Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum, Volker Stampa, Bastian Harren, Björn Deiseroth. Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation [EB/OL]. (2025-04-24) [2025-06-28]. https://arxiv.org/abs/2505.00022.
