NEURIPS 2025 "mechanistic interpretability" Papers

11 papers found

A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Guan Zhe Hong, Nishanth Dikkala, Enming Luo et al.

NEURIPS 2025spotlightarXiv:2411.04105
3
citations

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NEURIPS 2025posterarXiv:2511.20273

EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.

NEURIPS 2025posterarXiv:2502.06852
9
citations

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.

NEURIPS 2025posterarXiv:2510.25512
1
citations

Interpreting Emergent Features in Deep Learning-based Side-channel Analysis

Sengim Karayalcin, Marina Krček, Stjepan Picek

NEURIPS 2025posterarXiv:2502.00384

Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe et al.

NEURIPS 2025posterarXiv:2510.25674

Prompting as Scientific Inquiry

Ari Holtzman, Chenhao Tan

NEURIPS 2025oralarXiv:2507.00163

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang et al.

NEURIPS 2025posterarXiv:2505.10039
3
citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NEURIPS 2025poster

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter, Julian Minder, Thomas Hofmann et al.

NEURIPS 2025spotlightarXiv:2507.08802
9
citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NEURIPS 2025posterarXiv:2406.14144
24
citations