Validating Mechanistic Interpretations: An Axiomatic Approach
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms by validating mechanistic interpretations in an existing, well-known interpretability study as well as in a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
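For readers unfamiliar with the case-study task, the sketch below shows a minimal brute-force satisfiability check for 2-SAT instances. It is illustrative only and not taken from the paper: the paper's Transformer is trained to predict satisfiability, not to run a classical algorithm, and 2-SAT is in fact solvable in polynomial time via implication graphs; exhaustive enumeration is used here purely for brevity.

```python
from itertools import product

# A 2-SAT instance: each clause is a pair of literals; a literal is a
# nonzero int, where k means variable x_k and -k means its negation.
def is_satisfiable(clauses, num_vars):
    # Enumerate all 2^num_vars truth assignments (brute force).
    for bits in product([False, True], repeat=num_vars):
        # Literal lit is true iff the sign of lit matches the assigned value.
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return True
    return False

# (x1 or x2) and (not x1 or x2) and (x1 or not x2): satisfiable (x1 = x2 = True)
print(is_satisfiable([(1, 2), (-1, 2), (1, -2)], num_vars=2))  # True
# (x1 or x1) and (not x1 or not x1): unsatisfiable
print(is_satisfiable([(1, 1), (-1, -1)], num_vars=1))          # False
```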
Somesh Jha, Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu
Computing Technology, Computer Technology
Somesh Jha, Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu. Validating Mechanistic Interpretations: An Axiomatic Approach [EB/OL]. (2025-06-20) [2025-07-16]. https://arxiv.org/abs/2407.13594.