ICLR "mechanistic interpretability" Papers
13 papers found
Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers
Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son
ICLR 2025 poster
Deep Networks Learn Features From Local Discontinuities in the Label Function
Prithaj Banerjee, Harish G Ramaswamy, Mahesh Yadav et al.
ICLR 2025 poster
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.
ICLR 2025 poster, arXiv:2407.00886
14 citations
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush, Stephen Chung, Usman Anwar et al.
ICLR 2025 poster, arXiv:1901.03559
124 citations
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
ICLR 2025 poster
Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
ICLR 2025 poster, arXiv:2410.07656
14 citations
Monet: Mixture of Monosemantic Experts for Transformers
Jungwoo Park, Young Jin Ahn, Kee-Eung Kim et al.
ICLR 2025 poster, arXiv:2412.04139
9 citations
Not All Language Model Features Are One-Dimensionally Linear
Josh Engels, Eric Michaud, Isaac Liao et al.
ICLR 2025 poster, arXiv:2405.14860
89 citations
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025 poster, arXiv:2410.13708
40 citations
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Patrick Leask, Bart Bussmann, Michael Pearce et al.
ICLR 2025 poster, arXiv:2502.04878
37 citations
The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling
Ruochen Zhang, Qinan Yu, Matianyu Zang et al.
ICLR 2025 poster, arXiv:2410.09223
16 citations
Towards a Unified and Verified Understanding of Group-Operation Networks
Wilson Wu, Louis Jaburi, Jacob Drori et al.
ICLR 2025 poster, arXiv:2410.07476
2 citations
Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.
ICLR 2025 poster, arXiv:2412.04703
15 citations