ICLR Poster "reinforcement learning from human feedback" Papers

17 papers found

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 poster · arXiv:2410.13828
5 citations

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.

ICLR 2025 poster · arXiv:2410.18252
39 citations

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.

ICLR 2025 poster · arXiv:2408.15313
23 citations

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 poster · arXiv:2407.14622
50 citations

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

Cassidy Laidlaw, Shivam Singhal, Anca Dragan

ICLR 2025 poster · arXiv:2403.03185
24 citations

HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai et al.

ICLR 2025 poster · arXiv:2410.05116
2 citations

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025 poster · arXiv:2410.14872
51 citations

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 poster · arXiv:2409.12822
73 citations

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Pengxiang Li, Lu Yin, Shiwei Liu

ICLR 2025 poster · arXiv:2412.13795
25 citations

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju

ICLR 2025 poster · arXiv:2404.18870
10 citations

On a Connection Between Imitation Learning and RLHF

Teng Xiao, Yige Yuan, Mingxiao Li et al.

ICLR 2025 poster · arXiv:2503.05079
13 citations

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Javier Rando, Tony Wang, Stewart Slocum et al.

ICLR 2025 poster · arXiv:2307.15217
733 citations

Reward Learning from Multiple Feedback Types

Yannick Metz, Andras Geiszl, Raphaël Baur et al.

ICLR 2025 poster · arXiv:2502.21038
4 citations

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025 poster · arXiv:2410.16184
97 citations

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models

Ángela López-Cardona, Carlos Segura, Alexandros Karatzoglou et al.

ICLR 2025 poster · arXiv:2410.01532
8 citations

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 poster · arXiv:2411.00418
18 citations

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025 poster
3 citations