RLHF
Reinforcement learning from human feedback
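For context on the papers listed below, here is the standard two-stage RLHF formulation as it conventionally appears in the literature; it is a general sketch, not taken from any single paper on this list. A reward model r_phi is first fit to pairwise human preferences with a Bradley–Terry loss, and the policy pi_theta is then optimized against that learned reward under a KL penalty toward a frozen reference model:

```latex
% Stage 1: fit a reward model to human preference pairs,
% where y^+ was preferred over y^- for prompt x
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[
    \log \sigma\!\big(r_{\phi}(x, y^{+}) - r_{\phi}(x, y^{-})\big)
  \right]

% Stage 2: optimize the policy against the learned reward, with a
% KL penalty (coefficient \beta) keeping it close to the reference model
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
    \!\left[ r_{\phi}(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[
    \pi_{\theta}(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x)
  \right]
```

Many of the papers below either instantiate this recipe, replace a stage of it (e.g., ranking-based fine-tuning, direct preference optimization, AI feedback), or study its failure modes (reward hacking, overoptimization, deceptive behavior).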
Top Papers
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Jipeng Zhang, Hanze Dong, Tong Zhang et al.
Eureka: Human-Level Reward Design via Coding Large Language Models
Yecheng Jason Ma, William Liang, Guanzhi Wang et al.
Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models
Seungone Kim, Jamin Shin, Yejin Cho et al.
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu, Yuan Yao, Haoye Zhang et al.
Preference Ranking Optimization for Human Alignment
Feifan Song, Bowen Yu, Minghao Li et al.
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
Guan Wang, Sijie Cheng, Xianyuan Zhan et al.
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.
Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan et al.
Habitat 3.0: A Co-Habitat for Humans, Avatars, and Robots
Xavier Puig, Eric Undersander, Andrew Szot et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He et al.
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi, Xiao Liu, Iat Long Iong et al.
Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang et al.
HelpSteer2-Preference: Complementing Ratings with Preferences
Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.
Human Feedback is not Gold Standard
Tom Hosking, Phil Blunsom, Max Bartolo
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.
TLControl: Trajectory and Language Control for Human Motion Synthesis
Weilin Wan, Zhiyang Dou, Taku Komura et al.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.
OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach et al.
Confronting Reward Model Overoptimization with Constrained RLHF
Ted Moskovitz, Aaditya Singh, DJ Strouse et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine et al.
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell et al.
CycleResearcher: Improving Automated Research via Automated Review
Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao et al.
RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts
Hjalmar Wijk, Tao Lin, Joel Becker et al.
Self-Improvement in Language Models: The Sharpening Mechanism
Audrey Huang, Adam Block, Dylan Foster et al.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control
Guy Tevet, Sigal Raab, Setareh Cohan et al.
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.
VinePPO: Refining Credit Assignment in RL Training of LLMs
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.
Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment
Siyao Li, Tianpei Gu, Zhitao Yang et al.
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang et al.
Provable Offline Preference-Based Reinforcement Learning
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.
Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.
Making RL with Preference-based Feedback Efficient via Randomization
Runzhe Wu, Wen Sun
Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu et al.
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.
Random Feature Amplification: Feature Learning and Generalization in Neural Networks
Spencer Frei, Niladri Chatterji, Peter L. Bartlett
CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Han Zhang, Yu Lei, Lin Gui et al.
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Hritik Bansal, John Dang, Aditya Grover
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Chaofeng Chen, Annan Wang, Haoning Wu et al.
Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages
Guozheng Ma, Lu Li, Sen Zhang et al.
RLIF: Interactive Imitation Learning as Reinforcement Learning
Jianlan Luo, Perry Dong, Yuexiang Zhai et al.
Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning
Yinmin Zhang, Jie Liu, Chuming Li et al.
AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning
Duojun Huang, Xinyu Xiong, Jie Ma et al.
EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading
Molei Qin, Shuo Sun, Wentao Zhang et al.
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
Zhenghao Peng, Wenjie Luo, Yiren Lu et al.
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
Yun Qu, Yuhang Jiang, Boyuan Wang et al.
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
Reward Guided Latent Consistency Distillation
William Wang, Jiachen Li, Weixi Feng et al.
Teaching Language Models to Critique via Reinforcement Learning
Zhihui Xie, Jie Chen, Liyu Chen et al.
Self-Consistency Preference Optimization
Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang et al.
Robust Tracking via Mamba-based Context-aware Token Learning
Jinxia Xie, Bineng Zhong, Qihua Liang et al.
Reinforced Lifelong Editing for Language Models
Zherui Li, Houcheng Jiang, Hao Chen et al.
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Jiaru Zou, Ling Yang, Jingwen Gu et al.
ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.
SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data
Wenkai Fang, Shunyu Liu, Yang Zhou et al.
HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation
Huaxin Zhang, Xiang Wang, Xiaohao Xu et al.
Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nick Hansen, Jyothir S V, Vlad Sobal et al.
Online Preference Alignment for Language Models via Count-based Exploration
Chenjia Bai, Yang Zhang, Shuang Qiu et al.
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Zijing Hu, Fengda Zhang, Long Chen et al.
Progress or Regress? Self-Improvement Reversal in Post-training
Ting Wu, Xuefeng Li, Pengfei Liu
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)
Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Hao Liang, Zhi-Quan Luo
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
Guanxing Lu, Ziwei Wang, Changliu Liu et al.
Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning
Longkang Li, Siyuan Liang, Zihao Zhu et al.
Horizon Reduction Makes RL Scalable
Seohong Park, Kevin Frans, Deepinder Mann et al.
RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Tianjun Wei, Wei Wen, Ruizhi Qiao et al.
Learning Optimal Advantage from Preferences and Mistaking It for Reward
W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Xiaoyuan Liu, Tian Liang, Zhiwei He et al.
Reinforcement Learning Friendly Vision-Language Model for Minecraft
Haobin Jiang, Junpeng Yue, Hao Luo et al.
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
Bo Wang, Qinyuan Cheng, Runyu Peng et al.
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.
Ctrl-U: Robust Conditional Image Generation via Uncertainty-aware Reward Modeling
Guiyu Zhang, Huan-ang Gao, Zijian Jiang et al.
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
Taewoong Kim, Cheolhong Min, Byeonghwi Kim et al.
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.
Rating-Based Reinforcement Learning
Devin White, Mingkang Wu, Ellen Novoseller et al.
Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Zhenfang Chen, Delin Chen, Rui Sun et al.
PILAF: Optimal Human Preference Sampling for Reward Modeling
Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.
Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang, Zihan Qiu, Zili Wang et al.