NEURIPS 2025 "mechanistic interpretability" Papers
11 papers found
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Guan Zhe Hong, Nishanth Dikkala, Enming Luo et al.
NEURIPS 2025spotlightarXiv:2411.04105
3
citations
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
NEURIPS 2025posterarXiv:2511.20273
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.
NEURIPS 2025posterarXiv:2502.06852
9
citations
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.
NEURIPS 2025posterarXiv:2510.25512
1
citations
Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Sengim Karayalcin, Marina Krček, Stjepan Picek
NEURIPS 2025posterarXiv:2502.00384
Mechanistic Interpretability of RNNs emulating Hidden Markov Models
Elia Torre, Michele Viscione, Lucas Pompe et al.
NEURIPS 2025posterarXiv:2510.25674
Prompting as Scientific Inquiry
Ari Holtzman, Chenhao Tan
NEURIPS 2025oralarXiv:2507.00163
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NEURIPS 2025posterarXiv:2505.10039
3
citations
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
NEURIPS 2025poster
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Denis Sutter, Julian Minder, Thomas Hofmann et al.
NEURIPS 2025spotlightarXiv:2507.08802
9
citations
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NEURIPS 2025posterarXiv:2406.14144
24
citations