"mechanistic interpretability" Papers

16 papers found

A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Guan Zhe Hong, Nishanth Dikkala, Enming Luo et al.

NeurIPS 2025, spotlight, arXiv:2411.04105, 3 citations

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NeurIPS 2025, poster, arXiv:2511.20273

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.

NeurIPS 2025, poster, arXiv:2510.25512, 1 citation

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025, poster, arXiv:1901.03559, 124 citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025, poster

Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe et al.

NeurIPS 2025, poster, arXiv:2510.25674

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025, poster, arXiv:2410.07656, 14 citations

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025, poster, arXiv:2410.13708, 40 citations

Prompting as Scientific Inquiry

Ari Holtzman, Chenhao Tan

NeurIPS 2025, oral, arXiv:2507.00163

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025, poster

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025, poster, arXiv:2406.14144, 24 citations

Transformers Struggle to Learn to Search

Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.

ICLR 2025, poster, arXiv:2412.04703, 15 citations

Don't trust your eyes: on the (un)reliability of feature visualizations

Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.

ICML 2024, poster

From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.

ICML 2024, poster

Observable Propagation: Uncovering Feature Vectors in Transformers

Jacob Dunefsky, Arman Cohan

ICML 2024, poster

What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation

Aaditya Singh, Ted Moskovitz, Felix Hill et al.

ICML 2024, spotlight