2025 Poster "sparse autoencoders" Papers
9 papers found
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.
ICLR 2025posterarXiv:2411.14257
77
citations
Large Language Models Think Too Fast To Explore Effectively
Lan Pan, Hanbo Xie, Robert Wilson
NeurIPS 2025posterarXiv:2501.18009
6
citations
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
ICLR 2025poster
Representation Consistency for Accurate and Coherent LLM Answer Aggregation
Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.
NeurIPS 2025posterarXiv:2506.21590
2
citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025posterarXiv:2501.06254
9
citations
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
NeurIPS 2025poster
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
ICLR 2025posterarXiv:2406.04093
298
citations
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
ICLR 2025posterarXiv:2405.08366
63
citations
Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers
Omer Sahin Tas, Royden Wagner
ICLR 2025posterarXiv:2406.11624
4
citations