2025 "language model alignment" Papers

13 papers found

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.

ICLR 2025posterarXiv:2410.18252
39
citations

Checklists Are Better Than Reward Models For Aligning Language Models

Vijay Viswanathan, Yanchao Sun, Xiang Kong et al.

NeurIPS 2025spotlightarXiv:2507.18624
23
citations

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025posterarXiv:2410.14872
51
citations

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025posterarXiv:2409.12822
73
citations

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Nguyen Phuc, Ngoc-Hieu Nguyen, Duy M. H. Nguyen et al.

NeurIPS 2025posterarXiv:2506.08681

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025posterarXiv:2410.16184
97
citations

Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Masahiro Fujisawa, Masaki Adachi, Michael A Osborne

NeurIPS 2025posterarXiv:2505.17859
1
citations

SELF-EVOLVED REWARD LEARNING FOR LLMS

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025posterarXiv:2411.00418
18
citations

SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.

ICLR 2025poster
5
citations

SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Teng Xiao, Yige Yuan, Zhengyu Chen et al.

ICLR 2025posterarXiv:2502.00883
23
citations

Sparta Alignment: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang, Wenxuan Ding, Shangbin Feng et al.

NeurIPS 2025posterarXiv:2506.04721
3
citations

Variational Best-of-N Alignment

Afra Amini, Tim Vieira, Elliott Ash et al.

ICLR 2025posterarXiv:2407.06057
37
citations

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.

ICLR 2025posterarXiv:2410.18640
14
citations