Poster "mechanistic interpretability" Papers

15 papers found

Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son

ICLR 2025, poster

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NeurIPS 2025, poster, arXiv:2511.20273

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.

NeurIPS 2025, poster, arXiv:2510.25512

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025, poster, arXiv:1901.03559

124 citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025, poster

Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe et al.

NeurIPS 2025, poster, arXiv:2510.25674

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025, poster, arXiv:2410.07656

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025, poster, arXiv:2410.13708

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang et al.

NeurIPS 2025, poster, arXiv:2505.10039

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025, poster

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025, poster, arXiv:2406.14144

Transformers Struggle to Learn to Search

Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.

ICLR 2025, poster, arXiv:2412.04703

Don't trust your eyes: on the (un)reliability of feature visualizations

Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.

ICML 2024, poster

From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.

ICML 2024, poster

Observable Propagation: Uncovering Feature Vectors in Transformers

Jacob Dunefsky, Arman Cohan

ICML 2024, poster