ICLR 2025 "language model alignment" Papers
9 papers found
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.
ICLR 2025 poster · arXiv:2410.18252
39 citations
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
ICLR 2025 poster · arXiv:2410.14872
51 citations
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
ICLR 2025 poster · arXiv:2409.12822
73 citations
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min et al.
ICLR 2025 poster · arXiv:2410.16184
97 citations
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
ICLR 2025 poster · arXiv:2411.00418
18 citations
SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins
Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.
ICLR 2025 poster
5 citations
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao, Yige Yuan, Zhengyu Chen et al.
ICLR 2025 poster · arXiv:2502.00883
23 citations
Variational Best-of-N Alignment
Afra Amini, Tim Vieira, Elliott Ash et al.
ICLR 2025 poster · arXiv:2407.06057
37 citations
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.
ICLR 2025 poster · arXiv:2410.18640
14 citations