He He

Papers: 8

Total Citations: 98

Papers (8)

Language Models Learn to Mislead Humans via RLHF

A Credit Assignment Compiler for Joint Prediction

NeurIPS 2016 (arXiv)

Predicting Empirical AI Research Outcomes with Language Models

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

Opponent Modeling in Deep Reinforcement Learning

IRM—when it works and when it doesn't: A test case of natural language inference

SeqPATE: Differentially Private Text Generation via Knowledge Distillation

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples