ICML "reward hacking" Papers
3 papers found
Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.
ICML 2024spotlight
Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan et al.
ICML 2024poster
ODIN: Disentangled Reward Mitigates Hacking in RLHF
Lichang Chen, Chen Zhu, Jiuhai Chen et al.
ICML 2024poster