2025 "llm interpretability" Papers
2 papers found
Do LLMs "know" internally when they follow instructions?
Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar et al.
ICLR 2025 (poster) · arXiv:2410.14516
22 citations
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Patrick Leask, Bart Bussmann, Michael Pearce et al.
ICLR 2025 (poster) · arXiv:2502.04878
37 citations