Martin Wattenberg

Papers: 4

Total Citations: 51

Papers (4)

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

ICLR: In-Context Learning of Representations

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Q-Probe: A Lightweight Approach to Reward Maximization for Language Models