The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
Computing Technology; Computer Technology
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu. The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models [EB/OL]. (2025-05-20) [2025-06-14]. https://arxiv.org/abs/2505.14172.