Jacob Steinhardt
Papers: 9
Total citations: 193
Papers (9)
Language Models Learn to Mislead Humans via RLHF
ICLR 2025, arXiv
73
citations
Describing Differences in Image Sets with Natural Language
CVPR 2024
51
citations
Which Attention Heads Matter for In-Context Learning?
ICML 2025
34
citations
Monitoring Latent World States in Language Models with Propositional Probes
ICLR 2025
21
citations
Establishing Best Practices in Building Rigorous Agentic Benchmarks
NeurIPS 2025
12
citations
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
ICLR 2025, arXiv
2
citations
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
ICML 2024
0
citations
Feedback Loops With Language Models Drive In-Context Reward Hacking
ICML 2024
0
citations
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
ICML 2024
0
citations