
The emergence of sparse attention: impact of data distribution and benefits of repetition

Source: arXiv

English Abstract

Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
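The abstract refers to small Transformers trained on a "linear regression variant," a common in-context learning setup. The sketch below shows what such a task might look like: each sequence carries its own regression weights, so the model can only succeed by attending to the in-context (x, y) pairs. This is a minimal, hypothetical illustration assuming a standard formulation of in-context linear regression; the paper's actual task construction may differ.

```python
import numpy as np

def make_regression_sequence(num_pairs=8, dim=4, noise=0.0, rng=None):
    """Generate one in-context linear regression sequence (a common
    formulation; assumed here, not taken from the paper).

    The sequence is (x_1, y_1), ..., (x_n, y_n), x_query; a model must
    infer the per-sequence weight vector w from context and predict
    y_query = w @ x_query.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                    # task weights, fresh per sequence
    xs = rng.normal(size=(num_pairs + 1, dim))  # context inputs + final query
    ys = xs @ w + noise * rng.normal(size=num_pairs + 1)
    # Stack each (x_i, y_i) pair into one context row of width dim + 1.
    context = np.concatenate([xs[:-1], ys[:-1, None]], axis=1)
    return context, xs[-1], ys[-1]              # context, query input, target
```

Because w is resampled for every sequence, weights memorized during training are useless at the query position; retrieving the relevant context tokens (e.g. via a sparse attention pattern) is the natural solution, which is why tasks of this family are used to study attention emergence.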

Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan

Subject: Computing Technology, Computer Science

Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan. The emergence of sparse attention: impact of data distribution and benefits of repetition [EB/OL]. (2025-05-23) [2025-07-21]. https://arxiv.org/abs/2505.17863.
