NeurIPS 2025 "human preference alignment" Papers
2 papers found
Direct Alignment with Heterogeneous Preferences
Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar et al.
NeurIPS 2025 (poster) · arXiv:2502.16320 · 8 citations
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling et al.
NeurIPS 2025 (spotlight) · arXiv:2506.19248 · 2 citations