"language model interpretability" Papers
6 papers found
From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering
Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller et al.
ICLR 2025posterarXiv:2412.17701
3
citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025posterarXiv:2501.06254
9
citations
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
ICLR 2025posterarXiv:2406.04093
298
citations
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
James Oldfield, Shawn Im, Sharon Li et al.
NeurIPS 2025posterarXiv:2505.21364
Explorations of Self-Repair in Language Models
Cody Rushing, Neel Nanda
ICML 2024poster
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun, Avi Caciularu, Adam Pearce et al.
ICML 2024poster