ICML 2024 "reinforcement learning from human feedback" Papers
28 papers found
Active Preference Learning for Large Language Models
William Muldrew, Peter Hayes, Mingtian Zhang et al.
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Gokul Swamy, Christoph Dann, Rahul Kidambi et al.
BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
Gaurav Pandey, Yatin Nandwani, Tahira Naseem et al.
CogBench: a large language model walks into a psychology lab
Julian Coda-Forno, Marcel Binz, Jane Wang et al.
Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.
Dense Reward for Free in Reinforcement Learning from Human Feedback
Alexander Chan, Hao Sun, Samuel Holt et al.
Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
Yihan Du, Anna Winnicki, Gal Dalal et al.
Exploring the LLM Journey from Cognition to Expression with Linear Representations
Yuzi Yan, Jialian Li, Yipin Zhang et al.
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf, Noam Wies, Oshri Avnery et al.
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta et al.
Human Alignment of Large Language Models through Online Preference Optimisation
Daniele Calandriello, Zhaohan Guo, Rémi Munos et al.
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu, Wei Fu, Jiaxuan Gao et al.
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
Banghua Zhu, Michael Jordan, Jiantao Jiao
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
Wei Xiong, Hanze Dong, Chenlu Ye et al.
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Songyang Gao, Qiming Ge, Wei Shen et al.
MaxMin-RLHF: Alignment with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.
MusicRL: Aligning Music Generation to Human Preferences
Geoffrey Cideron, Sertan Girgin, Mauro Verzetti et al.
Nash Learning from Human Feedback
Rémi Munos, Michal Valko, Daniele Calandriello et al.
ODIN: Disentangled Reward Mitigates Hacking in RLHF
Lichang Chen, Chen Zhu, Jiuhai Chen et al.
Position: Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
Vincent Conitzer, Rachel Freedman, Jobst Heitzig et al.
Privacy-Preserving Instructions for Aligning Large Language Models
Da Yu, Peter Kairouz, Sewoong Oh et al.
Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization
Li Ding, Jenny Zhang, Jeff Clune et al.
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
Ziniu Li, Tian Xu, Yushun Zhang et al.
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban et al.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor et al.
RLVF: Learning from Verbal Feedback without Overgeneralization
Moritz Stephan, Alexander Khazatsky, Eric Mitchell et al.
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot et al.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Kirchner et al.