Poster "mechanistic interpretability" Papers

15 papers found

Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son

ICLR 2025poster

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NeurIPS 2025posterarXiv:2511.20273

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.

NeurIPS 2025posterarXiv:2510.25512
1
citations

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025posterarXiv:1901.03559
124
citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025poster

Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe et al.

NeurIPS 2025posterarXiv:2510.25674

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025posterarXiv:2410.07656
14
citations

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025posterarXiv:2410.13708
40
citations

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang et al.

NeurIPS 2025posterarXiv:2505.10039
3
citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025poster

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025posterarXiv:2406.14144
24
citations

Transformers Struggle to Learn to Search

Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.

ICLR 2025posterarXiv:2412.04703
15
citations

Don't trust your eyes: on the (un)reliability of feature visualizations

Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.

ICML 2024poster

From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.

ICML 2024poster

Observable Propagation: Uncovering Feature Vectors in Transformers

Jacob Dunefsky, Arman Cohan

ICML 2024poster