Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
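To make the abstract's "input-space optimization" concrete, below is a minimal, hypothetical sketch of the kind of adversarial objective described: perturb the input so that a target SAE feature is suppressed while the base LLM's predictions stay close to the clean run. This is not the authors' exact formulation. For simplicity it optimizes continuous token embeddings (a true input-space attack would additionally need to map back to discrete tokens), and `model`, `sae`, `sae.encode`, `sae.layer`, and all hyperparameters are assumptions.

```python
# Illustrative sketch only: PGD-style perturbation of token embeddings that
# drives one SAE feature's activation toward zero while penalizing drift in
# the base model's next-token distribution. All object names are hypothetical.
import torch
import torch.nn.functional as F

def attack_sae_feature(model, sae, embeds, target_feature,
                       eps=0.05, step_size=0.01, n_steps=50):
    # Cache the clean output distribution as the faithfulness reference.
    with torch.no_grad():
        clean_logits = model(inputs_embeds=embeds).logits
    delta = torch.zeros_like(embeds, requires_grad=True)

    for _ in range(n_steps):
        out = model(inputs_embeds=embeds + delta, output_hidden_states=True)
        acts = out.hidden_states[sae.layer]              # residual-stream activations
        feature_act = sae.encode(acts)[..., target_feature].mean()
        # Keep the LLM's predictions close to the clean run.
        drift = F.kl_div(F.log_softmax(out.logits, dim=-1),
                         F.softmax(clean_logits, dim=-1),
                         reduction="batchmean")
        loss = feature_act + 10.0 * drift                # minimize feature, limit drift
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()       # signed gradient step
            delta.clamp_(-eps, eps)                      # keep the perturbation tiny
        delta.grad.zero_()
    return delta.detach()
```

Maximizing a feature (to create a spurious concept detection) would simply flip the sign of the first loss term; the trade-off weight on the drift penalty controls how strictly the base model's behavior is preserved.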
Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju
Computing technology; computer technology
Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju. Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations [EB/OL]. (2025-05-21) [2025-07-20]. https://arxiv.org/abs/2505.16004