NEURIPS Poster "mechanistic interpretability" Papers
8 papers found
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
NEURIPS 2025, poster, arXiv:2511.20273, 1 citation
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Lin Zhang, Wenshuo Dong, Zhuoran Zhang et al.
NEURIPS 2025, poster, arXiv:2502.06852, 9 citations
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer et al.
NEURIPS 2025, poster, arXiv:2510.25512, 1 citation
Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Sengim Karayalcin, Marina Krček, Stjepan Picek
NEURIPS 2025, poster, arXiv:2502.00384
Mechanistic Interpretability of RNNs emulating Hidden Markov Models
Elia Torre, Michele Viscione, Lucas Pompe et al.
NEURIPS 2025, poster, arXiv:2510.25674
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NEURIPS 2025, poster, arXiv:2505.10039, 3 citations
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
NEURIPS 2025, poster
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NEURIPS 2025, poster, arXiv:2406.14144, 24 citations