2025 Poster Papers on "Mechanistic Interpretability"

21 papers found

Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son

ICLR 2025 (poster)

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NeurIPS 2025 (poster) · arXiv:2511.20273

Deep Networks Learn Features From Local Discontinuities in the Label Function

Prithaj Banerjee, Harish G Ramaswamy, Mahesh Yadav et al.

ICLR 2025 (poster)

Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Yingdong Shi, Changming Li, Yifan Wang et al.

CVPR 2025 (poster) · arXiv:2503.20483 · 14 citations

EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.

NeurIPS 2025 (poster) · arXiv:2502.06852 · 9 citations

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.

ICLR 2025 (poster) · arXiv:2407.00886 · 14 citations

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.

NeurIPS 2025 (poster) · arXiv:2510.25512 · 1 citation

Interpreting Emergent Features in Deep Learning-based Side-channel Analysis

Sengim Karayalcin, Marina Krček, Stjepan Picek

NeurIPS 2025 (poster) · arXiv:2502.00384

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025 (poster) · arXiv:1901.03559 · 124 citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025 (poster)

Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe et al.

NeurIPS 2025 (poster) · arXiv:2510.25674

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025 (poster) · arXiv:2410.07656 · 14 citations

Monet: Mixture of Monosemantic Experts for Transformers

Jungwoo Park, Young Jin Ahn, Kee-Eung Kim et al.

ICLR 2025 (poster) · arXiv:2412.04139 · 9 citations

Not All Language Model Features Are One-Dimensionally Linear

Josh Engels, Eric Michaud, Isaac Liao et al.

ICLR 2025 (poster) · arXiv:2405.14860 · 89 citations

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025 (poster) · arXiv:2410.13708 · 40 citations

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang et al.

NeurIPS 2025 (poster) · arXiv:2505.10039 · 3 citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025 (poster)

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask, Bart Bussmann, Michael Pearce et al.

ICLR 2025 (poster) · arXiv:2502.04878 · 37 citations

The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling

Ruochen Zhang, Qinan Yu, Matianyu Zang et al.

ICLR 2025 (poster) · arXiv:2410.09223 · 16 citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025 (poster) · arXiv:2406.14144 · 24 citations

Transformers Struggle to Learn to Search

Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.

ICLR 2025 (poster) · arXiv:2412.04703 · 15 citations