Peter Henderson
7
Papers
424
Total Citations
Papers (7)
Safety Alignment Should be Made More Than Just a Few Tokens Deep
ICLR 2025
277
citations
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
ICLR 2025, arXiv
141
citations
Dynamic Risk Assessments for Offensive Cybersecurity Agents
NeurIPS 2025, arXiv
4
citations
Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI
ICML 2025
2
citations
Position: A Safe Harbor for AI Evaluation and Red Teaming
ICML 2024
0
citations
Position: On the Societal Impact of Open Foundation Models
ICML 2024
0
citations
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
ICML 2024
0
citations