"reinforcement learning from human feedback" Papers

33 papers found

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 poster · arXiv:2409.12822 · 73 citations

Active Preference Learning for Large Language Models

William Muldrew, Peter Hayes, Mingtian Zhang et al.

ICML 2024 poster

A Minimaximalist Approach to Reinforcement Learning from Human Feedback

Gokul Swamy, Christoph Dann, Rahul Kidambi et al.

ICML 2024 poster

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Gaurav Pandey, Yatin Nandwani, Tahira Naseem et al.

ICML 2024 poster

CogBench: a large language model walks into a psychology lab

Julian Coda-Forno, Marcel Binz, Jane Wang et al.

ICML 2024 oral

Decoding-time Realignment of Language Models

Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.

ICML 2024 spotlight

Dense Reward for Free in Reinforcement Learning from Human Feedback

Alexander Chan, Hao Sun, Samuel Holt et al.

ICML 2024 poster

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Yihan Du, Anna Winnicki, Gal Dalal et al.

ICML 2024 poster

Exploring the LLM Journey from Cognition to Expression with Linear Representations

Yuzi Yan, Jialian Li, Yipin Zhang et al.

ICML 2024 poster

Fundamental Limitations of Alignment in Large Language Models

Yotam Wolf, Noam Wies, Oshri Avnery et al.

ICML 2024 poster

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

Ryan Liu, Theodore R. Sumers, Ishita Dasgupta et al.

ICML 2024 poster

Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello, Zhaohan Guo, Rémi Munos et al.

ICML 2024 poster

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao et al.

ICML 2024 poster

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Banghua Zhu, Michael Jordan, Jiantao Jiao

ICML 2024 poster

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint

Wei Xiong, Hanze Dong, Chenlu Ye et al.

ICML 2024 poster

Learning Optimal Advantage from Preferences and Mistaking It for Reward

W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.

AAAI 2024 paper · arXiv:2310.02456 · 15 citations

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Songyang Gao, Qiming Ge, Wei Shen et al.

ICML 2024 poster

MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.

ICML 2024 poster

MusicRL: Aligning Music Generation to Human Preferences

Geoffrey Cideron, Sertan Girgin, Mauro Verzetti et al.

ICML 2024 poster · arXiv:2301.11325

Nash Learning from Human Feedback

Rémi Munos, Michal Valko, Daniele Calandriello et al.

ICML 2024 spotlight

ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Jiuhai Chen et al.

ICML 2024 poster

Position: Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback

Vincent Conitzer, Rachel Freedman, Jobst Heitzig et al.

ICML 2024 poster

Preference Ranking Optimization for Human Alignment

Feifan Song, Bowen Yu, Minghao Li et al.

AAAI 2024 paper · arXiv:2306.17492 · 334 citations

Privacy-Preserving Instructions for Aligning Large Language Models

Da Yu, Peter Kairouz, Sewoong Oh et al.

ICML 2024 poster

Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization

Li Ding, Jenny Zhang, Jeff Clune et al.

ICML 2024 poster

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Ziniu Li, Tian Xu, Yushun Zhang et al.

ICML 2024 poster

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban et al.

ICML 2024 poster

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor et al.

ICML 2024 poster

RLVF: Learning from Verbal Feedback without Overgeneralization

Moritz Stephan, Alexander Khazatsky, Eric Mitchell et al.

ICML 2024 poster

Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution

Emily McMilin

AAAI 2024 paper · arXiv:2210.00131

WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.

ICML 2024 poster

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Kirchner et al.

ICML 2024 poster

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue

Songhua Yang, Hanjie Zhao, Senbin Zhu et al.

AAAI 2024 paper · arXiv:2308.03549 · 204 citations