ICLR 2025 "language model interpretability" Papers
3 papers found
From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering
Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller et al.
ICLR 2025posterarXiv:2412.17701
3
citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025posterarXiv:2501.06254
9
citations
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
ICLR 2025posterarXiv:2406.04093
298
citations