RLHF
Reinforcement learning from human feedback
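For context on the papers listed below, here is the standard two-stage RLHF formulation as it conventionally appears in the literature; it is a general sketch, not taken from any single paper on this list. A reward model r_phi is first fit to pairwise human preferences with a Bradley–Terry loss, and the policy pi_theta is then optimized against that learned reward under a KL penalty toward a frozen reference model:

```latex
% Stage 1: fit a reward model to human preference pairs,
% where y^+ was preferred over y^- for prompt x
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[
    \log \sigma\!\big(r_{\phi}(x, y^{+}) - r_{\phi}(x, y^{-})\big)
  \right]

% Stage 2: optimize the policy against the learned reward, with a
% KL penalty (coefficient \beta) keeping it close to the reference model
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
    \!\left[ r_{\phi}(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[
    \pi_{\theta}(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x)
  \right]
```

Many of the papers below either instantiate this recipe, replace a stage of it (e.g., ranking-based fine-tuning, direct preference optimization, AI feedback), or study its failure modes (reward hacking, overoptimization, deceptive behavior).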
Top Papers
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Jipeng Zhang, Hanze Dong, Tong Zhang et al.
Eureka: Human-Level Reward Design via Coding Large Language Models
Yecheng Jason Ma, William Liang, Guanzhi Wang et al.
Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models
Seungone Kim, Jamin Shin, Yejin Cho et al.
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu, Yuan Yao, Haoye Zhang et al.
Preference Ranking Optimization for Human Alignment
Feifan Song, Bowen Yu, Minghao Li et al.
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
Guan Wang, Sijie Cheng, Xianyuan Zhan et al.
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.
Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan et al.
Habitat 3.0: A Co-Habitat for Humans, Avatars, and Robots
Xavier Puig, Eric Undersander, Andrew Szot et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He et al.
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi, Xiao Liu, Iat Long Iong et al.
Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang et al.
HelpSteer2-Preference: Complementing Ratings with Preferences
Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.
Human Feedback is not Gold Standard
Tom Hosking, Phil Blunsom, Max Bartolo
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.
TLControl: Trajectory and Language Control for Human Motion Synthesis
Weilin Wan, Zhiyang Dou, Taku Komura et al.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.
OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach et al.
Confronting Reward Model Overoptimization with Constrained RLHF
Ted Moskovitz, Aaditya Singh, DJ Strouse et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine et al.
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell et al.
CycleResearcher: Improving Automated Research via Automated Review
Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao et al.
RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts
Hjalmar Wijk, Tao Lin, Joel Becker et al.
Self-Improvement in Language Models: The Sharpening Mechanism
Audrey Huang, Adam Block, Dylan Foster et al.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control
Guy Tevet, Sigal Raab, Setareh Cohan et al.
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.
VinePPO: Refining Credit Assignment in RL Training of LLMs
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.
Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment
Siyao Li, Tianpei Gu, Zhitao Yang et al.
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang et al.
Provable Offline Preference-Based Reinforcement Learning
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.
Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.
Making RL with Preference-based Feedback Efficient via Randomization
Runzhe Wu, Wen Sun
Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu et al.
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.
Random Feature Amplification: Feature Learning and Generalization in Neural Networks
Spencer Frei, Niladri Chatterji, Peter L. Bartlett
CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Han Zhang, Yu Lei, Lin Gui et al.
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Hritik Bansal, John Dang, Aditya Grover
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Chaofeng Chen, Annan Wang, Haoning Wu et al.
Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages
Guozheng Ma, Lu Li, Sen Zhang et al.
RLIF: Interactive Imitation Learning as Reinforcement Learning
Jianlan Luo, Perry Dong, Yuexiang Zhai et al.
Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning
Yinmin Zhang, Jie Liu, Chuming Li et al.
AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning
Duojun Huang, Xinyu Xiong, Jie Ma et al.
EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading
Molei Qin, Shuo Sun, Wentao Zhang et al.
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
Zhenghao Peng, Wenjie Luo, Yiren Lu et al.
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
Yun Qu, Yuhang Jiang, Boyuan Wang et al.
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
Reward Guided Latent Consistency Distillation
William Wang, Jiachen Li, Weixi Feng et al.
Teaching Language Models to Critique via Reinforcement Learning
Zhihui Xie, Jie Chen, Liyu Chen et al.
Self-Consistency Preference Optimization
Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang et al.
Robust Tracking via Mamba-based Context-aware Token Learning
Jinxia Xie, Bineng Zhong, Qihua Liang et al.
Reinforced Lifelong Editing for Language Models
Zherui Li, Houcheng Jiang, Hao Chen et al.
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Jiaru Zou, Ling Yang, Jingwen Gu et al.
ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.
SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data
Wenkai Fang, Shunyu Liu, Yang Zhou et al.
HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation
Huaxin Zhang, Xiang Wang, Xiaohao Xu et al.
Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nick Hansen, Jyothir S V, Vlad Sobal et al.
Online Preference Alignment for Language Models via Count-based Exploration
Chenjia Bai, Yang Zhang, Shuang Qiu et al.
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Zijing Hu, Fengda Zhang, Long Chen et al.
Progress or Regress? Self-Improvement Reversal in Post-training
Ting Wu, Xuefeng Li, Pengfei Liu
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)
Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Hao Liang, Zhi-Quan Luo
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
Guanxing Lu, Ziwei Wang, Changliu Liu et al.
Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning
Longkang Li, Siyuan Liang, Zihao Zhu et al.
Horizon Reduction Makes RL Scalable
Seohong Park, Kevin Frans, Deepinder Mann et al.
RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Tianjun Wei, Wei Wen, Ruizhi Qiao et al.
Learning Optimal Advantage from Preferences and Mistaking It for Reward
W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Xiaoyuan Liu, Tian Liang, Zhiwei He et al.
Reinforcement Learning Friendly Vision-Language Model for Minecraft
Haobin Jiang, Junpeng Yue, Hao Luo et al.
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
Bo Wang, Qinyuan Cheng, Runyu Peng et al.
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.
Ctrl-U: Robust Conditional Image Generation via Uncertainty-aware Reward Modeling
Guiyu Zhang, Huan-ang Gao, Zijian Jiang et al.
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
Taewoong Kim, Cheolhong Min, Byeonghwi Kim et al.
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.
Rating-Based Reinforcement Learning
Devin White, Mingkang Wu, Ellen Novoseller et al.
Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Zhenfang Chen, Delin Chen, Rui Sun et al.
PILAF: Optimal Human Preference Sampling for Reward Modeling
Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.
Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang, Zihan Qiu, Zili Wang et al.