RLHF
Reinforcement learning from human feedback
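Most of the papers below build on, extend, or critique the canonical RLHF recipe: fit a reward model r_phi on human preference comparisons, then fine-tune the policy pi_theta to maximize that reward while staying close to a reference model pi_ref. As a rough sketch in standard notation (symbols here are the conventional ones, not drawn from any single paper listed), the KL-regularized objective is

    max_theta  E_{x ~ D, y ~ pi_theta(.|x)} [ r_phi(x, y) ]  -  beta * KL( pi_theta(.|x) || pi_ref(.|x) )

where beta controls how far the tuned policy may drift from the reference.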
Top Papers
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Jipeng Zhang, Hanze Dong, Tong Zhang et al.
Eureka: Human-Level Reward Design via Coding Large Language Models
Yecheng Jason Ma, William Liang, Guanzhi Wang et al.
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu, Yuan Yao, Haoye Zhang et al.
Preference Ranking Optimization for Human Alignment
Feifan Song, Bowen Yu, Minghao Li et al.
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.
Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He et al.
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi, Xiao Liu, Iat Long Iong et al.
Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang et al.
HelpSteer2-Preference: Complementing Ratings with Preferences
Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.
Confronting Reward Model Overoptimization with Constrained RLHF
Ted Moskovitz, Aaditya Singh, DJ Strouse et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell et al.
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao et al.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang et al.
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang et al.
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.
Provable Offline Preference-Based Reinforcement Learning
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.
Making RL with Preference-based Feedback Efficient via Randomization
Runzhe Wu, Wen Sun
CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Han Zhang, Yu Lei, Lin Gui et al.
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Hritik Bansal, John Dang, Aditya Grover
EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading
Molei Qin, Shuo Sun, Wentao Zhang et al.
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
Yun Qu, Yuhang Jiang, Boyuan Wang et al.
Teaching Language Models to Critique via Reinforcement Learning
Zhihui Xie, Jie Chen, Liyu Chen et al.
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
Zhenghao Peng, Wenjie Luo, Yiren Lu et al.
ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.
SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data
Wenkai Fang, Shunyu Liu, Yang Zhou et al.
Online Preference Alignment for Language Models via Count-based Exploration
Chenjia Bai, Yang Zhang, Shuang Qiu et al.
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Zijing Hu, Fengda Zhang, Long Chen et al.
Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nick Hansen, Jyothir S V, Vlad Sobal et al.
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
Learning Optimal Advantage from Preferences and Mistaking It for Reward
W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Xiaoyuan Liu, Tian Liang, Zhiwei He et al.
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.
Reinforcement Learning Friendly Vision-Language Model for Minecraft
Haobin Jiang, Junpeng Yue, Hao Luo et al.
Rating-Based Reinforcement Learning
Devin White, Mingkang Wu, Ellen Novoseller et al.
PILAF: Optimal Human Preference Sampling for Reward Modeling
Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.
Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Zhenfang Chen, Delin Chen, Rui Sun et al.
Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang, Zihan Qiu, Zili Wang et al.
Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data
Chongyi Zheng, Benjamin Eysenbach, Homer Walke et al.
ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
Tonghe Zhang, Chao Yu, Sichang Su et al.
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying
Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study
Lili Zhao, Yang Wang, Qi Liu et al.
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Dongyoung Kim, Huiwon Jang, Sumin Park et al.
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
Patrick Yin, Tyler Westenbroek, Ching-An Cheng et al.
Re2LLM: Reflective Reinforcement Large Language Model for Session-based Recommendation
Ziyan Wang, Yingpeng Du, Zhu Sun et al.
REvolve: Reward Evolution with Large Language Models using Human Feedback
Rishi Hazra, Alkis Sygkounas, Andreas Persson et al.
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
Joey Hong, Anca Dragan, Sergey Levine
DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing
Vint Lee, Pieter Abbeel, Youngwoon Lee
DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks
Tongzhou Mu, Minghua Liu, Hao Su
Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models
Minh-Tung Luu, Younghwan Lee, Donghoon Lee et al.
Learning from negative feedback, or positive feedback or both
Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari et al.
Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching
Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother et al.
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
Jiangpeng He, Zhihao Duan, Fengqing Zhu
Aligning Language Models Using Follow-up Likelihood as Reward Signal
Chen Zhang, Dading Chong, Feng Jiang et al.
Inverse Reinforcement Learning by Estimating Expertise of Demonstrators
Mark Beliaev, Ramtin Pedarsani
SORREL: Suboptimal-Demonstration-Guided Reinforcement Learning for Learning to Branch
Shengyu Feng, Yiming Yang
PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning
Tianmeng Hu, Biao Luo
Real-Time Recurrent Reinforcement Learning
Julian Lemmel, Radu Grosu
ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation
Qiuzhen Lin, Yangfan Chen, Lijia Ma et al.
UTILITY: Utilizing Explainable Reinforcement Learning to Improve Reinforcement Learning
Shicheng Liu, Minghui Zhu
ReNeg: Learning Negative Embedding with Reward Guidance
Xiaomin Li, Yixuan Liu, Takashi Isobe et al.
On the Robustness of Reward Models for Language Model Alignment
Jiwoo Hong, Noah Lee, Eunki Kim et al.
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang, Zhihan Liu, Boyi Liu et al.
SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins
Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.
Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning
Kwanyoung Park, Youngwoon Lee
Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation
Yi-Chen Li, Fuxiang Zhang, Wenjie Qiu et al.
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Ruiyang Zhou, Shuozhe Li, Amy Zhang et al.
ReDit: Reward Dithering for Improved LLM Policy Optimization
Chenxing Wei, Jiarui Yu, Ying He et al.
SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents
Wanxin Tian, Shijie Zhang, Kevin Zhang et al.
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang, Bingcong Li, Christoph Dann et al.
Enhancing Human Experience in Human-Agent Collaboration: A Human-Centered Modeling Approach Based on Positive Human Gain
Yiming Gao, Feiyu Liu, Liang Wang et al.
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
Tianze Wang, Dongnan Gui, Yifan Hu et al.
Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF
Nuoya Xiong, Aarti Singh
Position: Lifetime tuning is incompatible with continual reinforcement learning
Golnaz Mesbahi, Parham Mohammad Panahi, Olya Mastikhina et al.
On Corruption-Robustness in Performative Reinforcement Learning
Vasilis Pollatos, Debmalya Mandal, Goran Radanovic
MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components
Min Wang, Xin Li, Leiji Zhang et al.
Visual Reinforcement Learning with Residual Action
Zhenxian Liu, Peixi Peng, Yonghong Tian
Capturing Individual Human Preferences with Reward Features
Andre Barreto, Vincent Dumoulin, Yiran Mao et al.
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
Yanting Yang, Minghao Chen, Qibo Qiu et al.
Doubly Robust Alignment for Large Language Models
Erhan Xu, Kai Ye, Hongyi Zhou et al.
Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning
Chenjie Hao, Weyl Lu, Yifan Xu et al.
SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
Lin Zhang, Xianfang Zeng, Kangcong Li et al.
Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Muleilan Pei, Shaoshuai Shi, Xuesong Chen et al.
Reinforcement Learning from Imperfect Corrective Actions and Proxy Rewards
Zhaohui Jiang, Xuening Feng, Paul Weng et al.
Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback
Zexu Sun, Yiju Guo, Yankai Lin et al.
Differentiable Information Enhanced Model-Based Reinforcement Learning
Xiaoyuan Zhang, Xinyan Cai, Bo Liu et al.