"interpretable features" Papers
2 papers found
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep S Lubana et al.
NeurIPS 2025, poster, arXiv:2506.03093
10 citations
Not All Language Model Features Are One-Dimensionally Linear
Josh Engels, Eric Michaud, Isaac Liao et al.
ICLR 2025, poster, arXiv:2405.14860
89 citations