🧬 Language Models

RLHF

Reinforcement learning from human feedback

100 papers · 7,019 total citations
[Publication trend chart, Feb '24 – Jan '26: 1,024 papers]
Also includes: reinforcement learning from human feedback, RLHF, preference learning, human feedback, DPO
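Many of the entries below are tagged with preference learning or DPO, and they share one pairwise preference objective. As orientation, here is a minimal, illustrative PyTorch sketch of a DPO-style loss; it is not taken from any listed paper, and the function name, argument layout, and default beta are assumptions made for readability.

```python
# Illustrative sketch only (not from any paper listed on this page): the
# DPO-style pairwise preference loss that many "preference learning" entries
# build on. Assumes the summed token log-probabilities of the chosen and
# rejected responses are already computed under the policy being trained
# and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss over a batch; beta controls deviation from the reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on preferred responses
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on dispreferred responses
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()                             # push preferred above dispreferred
```

Classical RLHF instead fits an explicit reward model on the same preference pairs and optimizes the policy against it (typically with PPO); DPO folds that step into the single loss sketched above.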

Top Papers

#1

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Jipeng Zhang, Hanze Dong, Tong Zhang et al.

ICLR 2025
642
citations
#2

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang et al.

ICLR 2024
471
citations
#3

Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin, Yejin Cho et al.

ICLR 2024
378
citations
#4

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Tianyu Yu, Yuan Yao, Haoye Zhang et al.

CVPR 2024
344
citations
#5

Preference Ranking Optimization for Human Alignment

Feifan Song, Bowen Yu, Minghao Li et al.

AAAI 2024 · arXiv:2306.17492
preference ranking optimization · human alignment · reinforcement learning from human feedback · large language models · +2
334
citations
#6

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Guan Wang, Sijie Cheng, Xianyuan Zhan et al.

ICLR 2024
309
citations
#7

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

ICLR 2024
267
citations
#8

Self-Play Preference Optimization for Language Model Alignment

Yue Wu, Zhiqing Sun, Rina Hughes et al.

ICLR 2025
207
citations
#9

Habitat 3.0: A Co-Habitat for Humans, Avatars, and Robots

Xavier Puig, Eric Undersander, Andrew Szot et al.

ICLR 2024
206
citations
#10

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025
169
citations
#11

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024
164
citations
#12

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

NeurIPS 2025
152
citations
#13

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

ICLR 2024
133
citations
#14

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

NeurIPS 2025
118
citations
#15

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ICLR 2025
110
citations
#16

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando, Florian Tramer

ICLR 2024
108
citations
#17

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang et al.

NeurIPS 2025
106
citations
#18

HelpSteer2-Preference: Complementing Ratings with Preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.

ICLR 2025
102
citations
#19

Human Feedback is not Gold Standard

Tom Hosking, Phil Blunsom, Max Bartolo

ICLR 2024
83
citations
#20

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.

CVPR 2024
80
citations
#21

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.

ICLR 2025
80
citations
#22

TLControl: Trajectory and Language Control for Human Motion Synthesis

Weilin Wan, Zhiyang Dou, Taku Komura et al.

ECCV 2024 · arXiv:2311.17135
human motion synthesis · trajectory control · language control · vq-vae · +4
77
citations
#23

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.

NeurIPS 2025 · arXiv:2506.01347
reinforcement learning · mathematical reasoning · language models · policy gradients · +4
74
citations
#24

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025 · arXiv:2410.20092
offline reinforcement learning · goal-conditioned rl · benchmark evaluation · offline gcrl algorithms · +3
74
citations
#25

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 · arXiv:2409.12822
language model alignment · reinforcement learning from human feedback · model deception detection · human evaluation accuracy · +4
73
citations
#26

Confronting Reward Model Overoptimization with Constrained RLHF

Ted Moskovitz, Aaditya Singh, DJ Strouse et al.

ICLR 2024
73
citations
#27

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.

ICLR 2025
70
citations
#28

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine et al.

ICML 2025
68
citations
#29

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025
63
citations
#30

CycleResearcher: Improving Automated Research via Automated Review

Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.

ICLR 2025
62
citations
#31

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025 · arXiv:2504.07954
58
citations
#32

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

ICML 2025
56
citations
#33

Self-Improvement in Language Models: The Sharpening Mechanism

Audrey Huang, Adam Block, Dylan Foster et al.

ICLR 2025 · arXiv:2412.01951
self-improvement in language models · sharpening mechanism · verification capabilities · policy sharpening · +4
55
citations
#34

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025
54
citations
#35

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan et al.

ICLR 2025
53
citations
#36

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025
50
citations
#37

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025
50
citations
#38

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

ICML 2025
48
citations
#39

Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Siyao Li, Tianpei Gu, Zhitao Yang et al.

ICLR 2024
45
citations
#40

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang et al.

NeurIPS 2025
43
citations
#41

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

ICLR 2025
42
citations
#42

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, Micah Carroll, Adhyyan Narang et al.

ICLR 2025
41
citations
#43

Provable Offline Preference-Based Reinforcement Learning

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.

ICLR 2024
39
citations
#44

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

ICLR 2024
39
citations
#45

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.

NeurIPS 2025
39
citations
#46

Making RL with Preference-based Feedback Efficient via Randomization

Runzhe Wu, Wen Sun

ICLR 2024
37
citations
#47

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, Pei Xu et al.

ICCV 2025
36
citations
#48

Preference Optimization for Reasoning with Pseudo Feedback

Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.

ICLR 2025
33
citations
#49

Random Feature Amplification: Feature Learning and Generalization in Neural Networks

Spencer Frei, Niladri Chatterji, Peter L. Bartlett

ICLR 2024
32
citations
#50

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024
32
citations
#51

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.

NeurIPS 2025
31
citations
#52

Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

Hritik Bansal, John Dang, Aditya Grover

ICLR 2024
26
citations
#53

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Chaofeng Chen, Annan Wang, Haoning Wu et al.

ECCV 2024
26
citations
#54

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

Guozheng Ma, Lu Li, Sen Zhang et al.

ICLR 2024
25
citations
#55

Moral Alignment for LLM Agents

Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

ICLR 2025 · arXiv:2410.01639
moral alignment · llm agents · intrinsic rewards · reinforcement learning fine-tuning · +4
25
citations
#56

RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo, Perry Dong, Yuexiang Zhai et al.

ICLR 2024
25
citations
#57

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Yinmin Zhang, Jie Liu, Chuming Li et al.

AAAI 2024 · arXiv:2312.07685
offline reinforcement learning · q-value estimation · online finetuning · offline-to-online rl · +3
25
citations
#58

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

Duojun Huang, Xinyu Xiong, Jie Ma et al.

CVPR 2024
24
citations
#59

EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading

Molei Qin, Shuo Sun, Wentao Zhang et al.

AAAI 2024 · arXiv:2309.12891
hierarchical reinforcement learning · high frequency trading · cryptocurrency market · dynamic programming · +4
24
citations
#60

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

ECCV 2024
23
citations
#61

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

AAAI 2025
23
citations
#62

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.

ECCV 2024
23
citations
#63

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.

ICLR 2025
23
citations
#64

Reward Guided Latent Consistency Distillation

William Wang, Jiachen Li, Weixi Feng et al.

ICLR 2025
23
citations
#65

Self-Consistency Preference Optimization

Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang et al.

ICML 2025
23
citations
#66

Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie, Jie Chen, Liyu Chen et al.

ICML 2025
23
citations
#67

Robust Tracking via Mamba-based Context-aware Token Learning

Jinxia Xie, Bineng Zhong, Qihua Liang et al.

AAAI 2025
22
citations
#68

Reinforced Lifelong Editing for Language Models

Zherui Li, Houcheng Jiang, Hao Chen et al.

ICML 2025
21
citations
#69

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu et al.

NeurIPS 2025
20
citations
#70

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

ICML 2025
20
citations
#71

HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation

Huaxin Zhang, Xiang Wang, Xiaohao Xu et al.

AAAI 2024 · arXiv:2308.12608
temporal action localization · point-supervised learning · hierarchical reliability propagation · snippet-level discrimination · +3
19
citations
#72

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025 · arXiv:2505.20347
reinforcement learning · large language models · self-instruction generation · self-rewarding mechanisms · +4
19
citations
#73

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Nick Hansen, Jyothir S V, Vlad Sobal et al.

ICLR 2025
19
citations
#74

Online Preference Alignment for Language Models via Count-based Exploration

Chenjia Bai, Yang Zhang, Shuang Qiu et al.

ICLR 2025
19
citations
#75

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

CVPR 2025
19
citations
#76

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.

ICLR 2025
18
citations
#77

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025
18
citations
#78

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.

NeurIPS 2025 · arXiv:2505.16394
reinforcement learning · autonomous driving · world models · model-based reinforcement learning · +4
18
citations
#79

Progress or Regress? Self-Improvement Reversal in Post-training

Ting Wu, Xuefeng Li, Pengfei Liu

ICLR 2025
18
citations
#80

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Hao Liang, Zhiquan Luo

NeurIPS 2025
18
citations
#81

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Guanxing Lu, Ziwei Wang, Changliu Liu et al.

ICLR 2025
17
citations
#82

Diverse Preference Learning for Capabilities and Alignment

Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell

ICLR 2025
17
citations
#83

Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning

Longkang Li, Siyuan Liang, Zihao Zhu et al.

AAAI 2024 · arXiv:2210.17178
permutation flow shop scheduling · graph-based imitation learning · manufacturing systems optimization · large-scale scheduling problems · +4
16
citations
#84

Horizon Reduction Makes RL Scalable

Seohong Park, Kevin Frans, Deepinder Mann et al.

NeurIPS 2025
15
citations
#85

RocketEval: Efficient automated LLM evaluation via grading checklist

Tianjun Wei, Wei Wen, Ruizhi Qiao et al.

ICLR 2025
15
citations
#86

Learning Optimal Advantage from Preferences and Mistaking It for Reward

W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.

AAAI 2024 · arXiv:2310.02456
reward function learning · human preference modeling · regret preference model · partial return assumption · +4
15
citations
#87

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

NeurIPS 2025
15
citations
#88

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024
14
citations
#89

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Bo Wang, Qinyuan Cheng, Runyu Peng et al.

NeurIPS 2025 · arXiv:2507.00018
supervised fine tuning · direct preference optimization · implicit reward learning · preference learning · +4
14
citations
#90

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025
14
citations
#91

SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.

NeurIPS 2025
14
citations
#92

Ctrl-U: Robust Conditional Image Generation via Uncertainty-aware Reward Modeling

Guiyu Zhang, Huan-ang Gao, Zijian Jiang et al.

ICLR 2025
13
citations
#93

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Taewoong Kim, Cheolhong Min, Byeonghwi Kim et al.

ECCV 2024
13
citations
#94

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025
13
citations
#95

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024 · arXiv:2307.16348
reinforcement learning · human ratings · preference-based learning · rating prediction model · +3
13
citations
#96

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Zhenfang Chen, Delin Chen, Rui Sun et al.

ICLR 2025
13
citations
#97

PILAF: Optimal Human Preference Sampling for Reward Modeling

Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.

ICML 2025
13
citations
#98

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.

ECCV 2024
12
citations
#99

Post-hoc Reward Calibration: A Case Study on Length Bias

Zeyu Huang, Zihan Qiu, Zili Wang et al.

ICLR 2025
12
citations
#100

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Yanming Wan, Jiaxing Wu, Marwa Abdulhai et al.

NeurIPS 2025 · arXiv:2504.03206
personalized dialogue systems · multi-turn reinforcement learning · curiosity reward mechanism · user modeling · +4
12
citations