ICLR Poster Papers matching "reinforcement learning from human feedback"
17 papers found
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Hui Yuan, Yifan Zeng, Yue Wu et al.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai et al.
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li, Lu Yin, Shiwei Liu
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju
On a Connection Between Imitation Learning and RLHF
Teng Xiao, Yige Yuan, Mingxiao Li et al.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Javier Rando, Tony Wang, Stewart Slocum et al.
Reward Learning from Multiple Feedback Types
Yannick Metz, Andras Geiszl, Raphaël Baur et al.
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min et al.
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
Ángela López-Cardona, Carlos Segura, Alexandros Karatzoglou et al.
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback
Zexu Sun, Yiju Guo, Yankai Lin et al.