ICLR "mechanistic interpretability" Papers

13 papers found

Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

Quoc-Vinh Lai-Dang, Taemin Kang, Seungah Son

ICLR 2025poster

Deep Networks Learn Features From Local Discontinuities in the Label Function

Prithaj Banerjee, Harish G Ramaswamy, Mahesh Yadav et al.

ICLR 2025poster

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Aliyah Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri et al.

ICLR 2025posterarXiv:2407.00886
14
citations

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025posterarXiv:1901.03559
124
citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025poster

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025posterarXiv:2410.07656
14
citations

Monet: Mixture of Monosemantic Experts for Transformers

Jungwoo Park, Young Jin Ahn, Kee-Eung Kim et al.

ICLR 2025posterarXiv:2412.04139
9
citations

Not All Language Model Features Are One-Dimensionally Linear

Josh Engels, Eric Michaud, Isaac Liao et al.

ICLR 2025posterarXiv:2405.14860
89
citations

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025posterarXiv:2410.13708
40
citations

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask, Bart Bussmann, Michael Pearce et al.

ICLR 2025posterarXiv:2502.04878
37
citations

The Same but Different: Structural Similarities and Differences in Multilingual Language Modeling

Ruochen Zhang, Qinan Yu, Matianyu Zang et al.

ICLR 2025posterarXiv:2410.09223
16
citations

Towards a Unified and Verified Understanding of Group-Operation Networks

Wilson Wu, Louis Jaburi, jacob drori et al.

ICLR 2025posterarXiv:2410.07476
2
citations

Transformers Struggle to Learn to Search

Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar et al.

ICLR 2025posterarXiv:2412.04703
15
citations