Poster "sparse autoencoders" Papers

11 papers found

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.

ICLR 2025posterarXiv:2411.14257
77
citations

Large Language Models Think Too Fast To Explore Effectively

Lan Pan, Hanbo Xie, Robert Wilson

NeurIPS 2025posterarXiv:2501.18009
6
citations

Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations

Yiming Liu, Yuhui Zhang, Serena Yeung

ICLR 2025poster

Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025posterarXiv:2410.07656
14
citations

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Junqi Jiang, Tom Bewley, Salim I. Amoukou et al.

NeurIPS 2025posterarXiv:2506.21590
2
citations

Residual Stream Analysis with Multi-Layer SAEs

Tim Lawson, Lucy Farnik, Conor Houghton et al.

ICLR 2025posterarXiv:2409.04185
11
citations

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Gouki Gouki, Hiroki Furuta, Yusuke Iwasawa et al.

ICLR 2025posterarXiv:2501.06254
9
citations

Revising and Falsifying Sparse Autoencoder Feature Explanations

George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025poster

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman et al.

ICLR 2025posterarXiv:2406.04093
298
citations

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025posterarXiv:2405.08366
63
citations

Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Omer Sahin Tas, Royden Wagner

ICLR 2025posterarXiv:2406.11624
4
citations