AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. Adding new tokens can address this problem, provided their new embeddings are well initialized. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist outperforms even strong baselines.
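To make the core idea concrete, the following is a minimal, illustrative sketch of distilling an input embedding for a newly added token: hidden states produced under the original (multi-piece) tokenization serve as the teacher target, and only the new token's embedding row is optimized to reproduce them. The model choice, target layer, loss, word, and hyperparameters are assumptions for illustration, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any open-weight causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

context = "Patients with"
new_word = " pneumonoultramicroscopic"  # hypothetical domain-specific word

# 1) Teacher pass: the original tokenizer splits the new word into several pieces.
teacher_ids = tok(context + new_word, return_tensors="pt").input_ids
with torch.no_grad():
    teacher_hidden = model(teacher_ids, output_hidden_states=True).hidden_states[-1][:, -1]

# 2) Add the new token and make only its input embedding trainable.
tok.add_tokens([new_word])
model.resize_token_embeddings(len(tok))
emb = model.get_input_embeddings()
new_id = tok.convert_tokens_to_ids(new_word)

for p in model.parameters():
    p.requires_grad_(False)
new_embedding = emb.weight[new_id].clone().requires_grad_(True)
optimizer = torch.optim.Adam([new_embedding], lr=1e-2)

# With the extended tokenizer, the new word is now a single (final) token.
student_ids = tok(context + new_word, return_tensors="pt").input_ids

# 3) Distill: match the hidden state at the new token's position to the
#    teacher hidden state obtained with the original tokenization.
for _ in range(50):
    optimizer.zero_grad()
    inputs_embeds = emb(student_ids).clone()
    inputs_embeds[0, -1] = new_embedding  # gradient flows only through the new row
    student_hidden = model(
        inputs_embeds=inputs_embeds, output_hidden_states=True
    ).hidden_states[-1][:, -1]
    loss = torch.nn.functional.mse_loss(student_hidden, teacher_hidden)
    loss.backward()
    optimizer.step()

# Write the learned embedding back into the embedding matrix.
with torch.no_grad():
    emb.weight[new_id] = new_embedding
```

In this sketch only the last layer's hidden state at a single position is matched; which layers and positions are supervised, and how attention is taken into account, are design choices of the actual method that the abstract does not specify.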
Konstantin Dobler, Desmond Elliott, Gerard de Melo
Linguistics
Konstantin Dobler, Desmond Elliott, Gerard de Melo. AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings. arXiv:2505.20133 (2025-05-26) [accessed 2025-07-18]. https://arxiv.org/abs/2505.20133