Jacob Steinhardt

Papers: 9

Total Citations: 193

Papers (9)

Language Models Learn to Mislead Humans via RLHF

Describing Differences in Image Sets with Natural Language

Which Attention Heads Matter for In-Context Learning?

Monitoring Latent World States in Language Models with Propositional Probes

Establishing Best Practices in Building Rigorous Agentic Benchmarks

Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

Feedback Loops With Language Models Drive In-Context Reward Hacking

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation