Validating Mechanistic Interpretations: An Axiomatic Approach

1 citation

#623 in ICML 2025, of 3340 papers

Authors

Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina Pasareanu, Somesh Jha

Abstract

Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.

Citation History: Jan 28, 2026