Papers by Aidan Ewart
2 papers found
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo, Aaquib Syed, Abhay Sheshadri et al.
ICML 2025 (spotlight), arXiv:2410.12949
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Robert Huben, Hoagy Cunningham, Logan Smith et al.
ICLR 2024 (poster)