"mechanistic interpretability" Papers
16 papers found
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Guan Zhe Hong, Nishanth Dikkala, Enming Luo et al.
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush, Stephen Chung, Usman Anwar et al.
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
Mechanistic Interpretability of RNNs emulating Hidden Markov Models
Elia Torre, Michele Viscione, Lucas Pompe et al.
Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
Prompting as Scientific Inquiry
Ari Holtzman, Chenhao Tan
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.
Don't trust your eyes: on the (un)reliability of feature visualizations
Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.
From Neurons to Neutrons: A Case Study in Interpretability
Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.
Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky, Arman Cohan
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Aaditya Singh, Ted Moskovitz, Felix Hill et al.