An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can produce misleading results because it does not account for erasure.
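The erasure failure mode described above can be illustrated with a minimal sketch of direct logit attribution. All array names, shapes, and values below are illustrative assumptions, not taken from the paper: DLA projects a component's write to the residual stream onto a token's unembedding direction, and this projection can overstate the component's true effect when a later layer cancels that write.

```python
import numpy as np

# Hedged sketch of direct logit attribution (DLA). All names and shapes
# are illustrative assumptions, not the paper's actual setup.
rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50

head_output = rng.normal(size=d_model)     # one head's write to the residual stream
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

token_id = 7
# DLA: the head's direct contribution to this token's logit.
dla = head_output @ W_U[:, token_id]

# Hypothetical erasure: a later layer writes the exact negation,
# clearing the head's direction from the residual stream.
later_erasure = -head_output
net_residual = head_output + later_erasure
net_effect = net_residual @ W_U[:, token_id]  # ~0: DLA misses this cancellation
print(dla, net_effect)
```

In this toy setting the head's DLA score is nonzero even though its write is fully erased before the unembedding, so its net effect on the logit is zero, which is the kind of adversarial case for DLA the paper documents.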
Can Rager, Jett Janiak, James Dao, Yeu-Tong Lau
10.18653/v1/2024.blackboxnlp-1.15
Computing Technology, Computer Technology
Can Rager, Jett Janiak, James Dao, Yeu-Tong Lau. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L [EB/OL]. (2023-10-11) [2025-05-18]. https://arxiv.org/abs/2310.07325.