by Adam Gleave Papers
3 papers found
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy, Adam Gleave
NeurIPS 2025posterarXiv:2505.13787
4
citations
Scaling Trends in Language Model Robustness
Nikolaus Howe, Ian McKenzie, Oskar Hollinsworth et al.
ICML 2025spotlight
STARC: A General Framework For Quantifying Differences Between Reward Functions
Joar Skalse, Lucy Farnik, Sumeet Motwani et al.
ICLR 2024poster