2024 "mechanistic interpretability" Papers
4 papers found
Don't trust your eyes: on the (un)reliability of feature visualizations
Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau et al.
ICML 2024 poster arXiv:2306.04719
From Neurons to Neutrons: A Case Study in Interpretability
Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz et al.
ICML 2024 poster arXiv:2405.17425
Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky, Arman Cohan
ICML 2024 poster arXiv:2312.16291
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Aaditya Singh, Ted Moskovitz, Felix Hill et al.
ICML 2024 spotlight arXiv:2404.07129