DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Xiangyu Hong Che Jiang Kai Tian Biqing Qi Youbang Sun Ning Ding Bowen Zhou

Source: arXiv

Abstract

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with the attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
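The idea of propagating additive components through frozen attention scores and MLP activations can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single-head, single-layer block with no biases and no layer normalization, a ReLU MLP whose on/off pattern is frozen, and one component per input token. All names (`propagate`, `scores`, `relu_mask`, and so on) are hypothetical. Under these assumptions the block becomes linear in its input, so the propagated components sum back to the ordinary output.

```python
import math
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def gate(m, mask):
    # Elementwise product with a fixed 0/1 mask (the frozen ReLU pattern).
    return [[v * g for v, g in zip(rm, rg)] for rm, rg in zip(m, mask)]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, D, H = 3, 4, 5  # tokens, model width, MLP hidden width (toy sizes)

X = rand_matrix(T, D, rng)
Wq, Wk, Wv, Wo = (rand_matrix(D, D, rng) for _ in range(4))
W1, W2 = rand_matrix(D, H, rng), rand_matrix(H, D, rng)

# Ordinary forward pass; record attention scores and the ReLU on/off pattern.
logits = matmul(matmul(X, Wq), transpose(matmul(X, Wk)))
scores = softmax_rows([[v / math.sqrt(D) for v in row] for row in logits])
after_attn = add(X, matmul(matmul(scores, matmul(X, Wv)), Wo))
pre_act = matmul(after_attn, W1)
relu_mask = [[1.0 if v > 0 else 0.0 for v in row] for row in pre_act]
full_out = add(after_attn, matmul(gate(pre_act, relu_mask), W2))

def propagate(component):
    # Same block, but with `scores` and `relu_mask` frozen, so it is linear.
    a = add(component, matmul(matmul(scores, matmul(component, Wv)), Wo))
    return add(a, matmul(gate(matmul(a, W1), relu_mask), W2))

# Decompose the input into one additive component per token.
components = [
    [[X[t][d] if t == k else 0.0 for d in range(D)] for t in range(T)]
    for k in range(T)
]
parts = [propagate(c) for c in components]

# The propagated components should sum to the ordinary output.
total = [[0.0] * D for _ in range(T)]
for p in parts:
    total = add(total, p)
max_err = max(abs(a - b) for ra, rb in zip(total, full_out) for a, b in zip(ra, rb))
```

Here `parts[k]` holds the share of every output position that traces back to input token `k`, and `max_err` measures how closely the shares reconstruct the full output. Any other additive split of `X`, for example by subspace, could be passed through `propagate` in the same way.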

Cite this article

Xiangyu Hong,Che Jiang,Kai Tian,Biqing Qi,Youbang Sun,Ning Ding,Bowen Zhou.DePass: Unified Feature Attributing by Simple Decomposed Forward Pass[EB/OL].(2025-10-24)[2026-04-02].https://arxiv.org/abs/2510.18462.

Subject classification

Computing technology, computer technology

First published: 2025-10-24
