Peter Henderson

Papers: 10

Total Citations: 420

Papers (10)

Safety Alignment Should be Made More Than Just a Few Tokens Deep

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Position: A Safe Harbor for AI Evaluation and Red Teaming

Position: On the Societal Impact of Open Foundation Models

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Separating value functions across time-scales