Papers by Lee Sharkey
3 papers found
Bilinear MLPs enable weight-based mechanistic interpretability
Michael Pearce, Thomas Dooms, Alice Rigg et al.
ICLR 2025 poster, arXiv:2410.08417
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Patrick Leask, Bart Bussmann, Michael Pearce et al.
ICLR 2025 poster, arXiv:2502.04878, 37 citations
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Robert Huben, Hoagy Cunningham, Logan Smith et al.
ICLR 2024 poster