Peter Henderson

Papers: 7

Total Citations: 424

Papers (7)

Safety Alignment Should be Made More Than Just a Few Tokens Deep

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Dynamic Risk Assessments for Offensive Cybersecurity Agents

NeurIPS 2025, arXiv

Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI

Position: A Safe Harbor for AI Evaluation and Red Teaming

Position: On the Societal Impact of Open Foundation Models

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications