2025 "sparse autoencoders" Papers

10 papers found

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

NeurIPS 2025spotlightarXiv:2504.04072
7
citations

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.

ICLR 2025posterarXiv:2411.14257
77
citations

Large Language Models Think Too Fast To Explore Effectively

Lan Pan, Hanbo Xie, Robert Wilson

NeurIPS 2025posterarXiv:2501.18009
6
citations

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.

NeurIPS 2025posterarXiv:2506.21590
2
citations

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.

ICLR 2025posterarXiv:2501.06254
9
citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025poster

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman et al.

ICLR 2025posterarXiv:2406.04093
298
citations

SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering

Ruimeng Liu, Xin Zou, Chang Tang et al.

NeurIPS 2025spotlight

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025posterarXiv:2405.08366
63
citations

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

ICLR 2025posterarXiv:2406.11624
4
citations