Poster "mechanistic interpretability" Papers
15 papers found
Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers
Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son
ICLR 2025 poster
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
NeurIPS 2025 poster, arXiv:2511.20273
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.
NeurIPS 2025 poster, arXiv:2510.25512
1 citation
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush, Stephen Chung, Usman Anwar et al.
ICLR 2025 poster, arXiv:1901.03559
124 citations
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
ICLR 2025 poster
Mechanistic Interpretability of RNNs emulating Hidden Markov Models
Elia Torre, Michele Viscione, Lucas Pompe et al.
NeurIPS 2025 poster, arXiv:2510.25674
Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
ICLR 2025 poster, arXiv:2410.07656
14 citations
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025 poster, arXiv:2410.13708
40 citations
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NeurIPS 2025 poster, arXiv:2505.10039
3 citations
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
NeurIPS 2025 poster
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NeurIPS 2025 poster, arXiv:2406.14144
24 citations
Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.
ICLR 2025 poster, arXiv:2412.04703
15 citations
Don't trust your eyes: on the (un)reliability of feature visualizations
Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.
ICML 2024 poster
From Neurons to Neutrons: A Case Study in Interpretability
Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.
ICML 2024 poster
Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky, Arman Cohan
ICML 2024 poster