2025 "sparse autoencoders" Papers
10 papers found
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Satvik Golechha, Adrià Garriga-Alonso
NeurIPS 2025spotlightarXiv:2504.04072
7
citations
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.
ICLR 2025posterarXiv:2411.14257
77
citations
Large Language Models Think Too Fast To Explore Effectively
Lan Pan, Hanbo Xie, Robert Wilson
NeurIPS 2025posterarXiv:2501.18009
6
citations
Representation Consistency for Accurate and Coherent LLM Answer Aggregation
Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.
NeurIPS 2025posterarXiv:2506.21590
2
citations
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.
ICLR 2025posterarXiv:2501.06254
9
citations
Revising and Falsifying Sparse Autoencoder Feature Explanations
George Ma, Samuel Pfrommer, Somayeh Sojoudi
NeurIPS 2025poster
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
ICLR 2025posterarXiv:2406.04093
298
citations
SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering
Ruimeng Liu, Xin Zou, Chang Tang et al.
NeurIPS 2025spotlight
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
ICLR 2025posterarXiv:2405.08366
63
citations
Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers
Omer Sahin Tas, Royden Wagner
ICLR 2025posterarXiv:2406.11624
4
citations