NeurIPS "language model interpretability" Papers
2 papers found
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NeurIPS 2025, poster, arXiv:2505.10039
3 citations
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
James Oldfield, Shawn Im, Sharon Li et al.
NeurIPS 2025, poster, arXiv:2505.21364