"reward hacking mitigation" Papers
3 papers found
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
ICLR 2025posterarXiv:2409.13156
44
citations
Transforming and Combining Rewards for Aligning Large Language Models
Zihao Wang, Chirag Nagpal, Jonathan Berant et al.
ICML 2024poster
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.
ICML 2024poster