2025 "mechanistic interpretability" Papers
24 papers found
Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers
Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Guan Zhe Hong, Nishanth Dikkala, Enming Luo et al.
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
Deep Networks Learn Features From Local Discontinuities in the Label Function
Prithaj Banerjee, Harish G. Ramaswamy, Mahesh Yadav et al.
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
Yingdong Shi, Changming Li, Yifan Wang et al.
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.
Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Sengim Karayalcin, Marina Krček, Stjepan Picek
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush, Stephen Chung, Usman Anwar et al.
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
Mechanistic Interpretability of RNNs emulating Hidden Markov Models
Elia Torre, Michele Viscione, Lucas Pompe et al.
Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
Monet: Mixture of Monosemantic Experts for Transformers
Jungwoo Park, Young Jin Ahn, Kee-Eung Kim et al.
Not All Language Model Features Are One-Dimensionally Linear
Josh Engels, Eric Michaud, Isaac Liao et al.
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
Prompting as Scientific Inquiry
Ari Holtzman, Chenhao Tan
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Patrick Leask, Bart Bussmann, Michael Pearce et al.
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Denis Sutter, Julian Minder, Thomas Hofmann et al.
The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling
Ruochen Zhang, Qinan Yu, Matianyu Zang et al.
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.