Anca Dragan

25 Papers

84 Total Citations

Papers (25)

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Learning Optimal Advantage from Preferences and Mistaking It for Reward

Context Steering: Controllable Personalization at Inference Time

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

AssistanceZero: Scalably Solving Assistance Games

Learning to Model the World With Language

Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

Cooperative Inverse Reinforcement Learning

NeurIPS 2016 · arXiv

Inverse Reward Design

NeurIPS 2017 · arXiv

AI Alignment with Changing and Influenceable Reward Functions

Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Brain Stimulation

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

Learning a Prior over Intent via Meta-Inverse Reinforcement Learning

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior

On the Utility of Learning about Humans for Human-AI Coordination

Reward-rational (implicit) choice: A unifying formalism for reward learning

AvE: Assistance via Empowerment

Preference learning along multiple criteria: A game-theoretic perspective

Pragmatic Image Compression for Human-in-the-Loop Decision-Making

First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization

Uni[MASK]: Unified Inference in Sequential Decision Problems

Learning to Influence Human Behavior with Offline Reinforcement Learning

Bridging RL Theory and Practice with the Effective Horizon

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning