2025 "language model interpretability" Papers
5 papers found
From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering
Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller et al.
ICLR 2025, poster, arXiv:2412.17701
3 citations
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang et al.
NeurIPS 2025, poster, arXiv:2505.10039
3 citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025, poster, arXiv:2501.06254
9 citations
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
ICLR 2025, poster, arXiv:2406.04093
298 citations
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
James Oldfield, Shawn Im, Sharon Li et al.
NeurIPS 2025, poster, arXiv:2505.21364