Causal Estimation of Tokenisation Bias
Causal Estimation of Tokenisation Bias
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., \textit{``hello''}). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
Pietro Lesci、Clara Meister、Thomas Hofmann、Andreas Vlachos、Tiago Pimentel
计算技术、计算机技术
Pietro Lesci,Clara Meister,Thomas Hofmann,Andreas Vlachos,Tiago Pimentel.Causal Estimation of Tokenisation Bias[EB/OL].(2025-06-03)[2025-06-30].https://arxiv.org/abs/2506.03149.点此复制
评论