The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
While model architecture and training objectives are well studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% relative to the conventional data-randomization approach. Our tokenizer achieves a more than 40% improvement in average token-to-word ratio over state-of-the-art multilingual Indic models, a gain that translates into measurable improvements in both model performance and inference speed. These results position tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
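The metric driving the abstract's claims is the token-to-word ratio: the average number of tokens a tokenizer emits per word, where values closer to 1.0 indicate less word fragmentation. The sketch below is an illustrative way to compute it, not the authors' code; it assumes whitespace-delimited words (reasonable for Indic scripts), and the model name in the usage comment is a placeholder.

```python
from typing import Callable, List

def token_to_word_ratio(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens emitted per whitespace-delimited word.

    Values near 1.0 mean the tokenizer rarely fragments words, which
    preserves effective context length and speeds up inference.
    """
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / max(total_words, 1)  # guard against empty input

# Illustrative usage with a Hugging Face tokenizer (model name is a placeholder):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("your/multilingual-model")
# print(token_to_word_ratio(corpus_lines, tok.tokenize))
```

Whitespace splitting is a simplifying heuristic; a reported 6% reduction in this ratio means proportionally fewer tokens are spent per word of input, freeing context length and reducing decoding steps.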
Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Viraj Thakur, Piyush Sawarkar, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan
Subject: Linguistics, Austroasiatic language family
Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Viraj Thakur, Piyush Sawarkar, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan. The Art of Breaking Words: Rethinking Multilingual Tokenizer Design [EB/OL]. (2025-08-03) [2025-08-24]. https://arxiv.org/abs/2508.06533.