🧬 Language Models

RLHF

Reinforcement learning from human feedback

470 papers (showing top 100) · 4,840 total citations
Publication timeline: Mar '24 – Feb '26, 391 papers
Also includes: reinforcement learning from human feedback, rlhf, preference learning, human feedback, dpo

Top Papers

#1

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Jipeng Zhang, Hanze Dong, Tong Zhang et al.

ICLR 2025
642 citations
#2

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang et al.

ICLR 2024 · arXiv:2310.12931
471 citations
#3

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Tianyu Yu, Yuan Yao, Haoye Zhang et al.

CVPR 2024 · arXiv:2312.00849
344 citations
#4

Preference Ranking Optimization for Human Alignment

Feifan Song, Bowen Yu, Minghao Li et al.

AAAI 2024 · arXiv:2306.17492
Keywords: preference ranking optimization, human alignment, reinforcement learning from human feedback, large language models (+2 more)
334 citations
#5

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

ICLR 2024 · arXiv:2310.06452
267 citations
#6

Self-Play Preference Optimization for Language Model Alignment

Yue Wu, Zhiqing Sun, Rina Hughes et al.

ICLR 2025
207 citations
#7

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025 · arXiv:2504.08837
169 citations
#8

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024 · arXiv:2303.09618
164 citations
#9

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

NeurIPS 2025
152 citations
#10

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

ICLR 2024 · arXiv:2310.12921
133 citations
#11

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

NeurIPS 2025 · arXiv:2504.16084
Keywords: test-time reinforcement learning, reward estimation, large language models, reasoning tasks (+4 more)
118 citations
#12

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ICLR 2025 · arXiv:2411.02337
Keywords: llm web agents, online curriculum reinforcement learning, self-evolving curriculum, outcome-supervised reward model (+3 more)
110 citations
#13

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando, Florian Tramer

ICLR 2024 · arXiv:2311.14455
108 citations
#14

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang et al.

NeurIPS 2025 · arXiv:2501.13918
106 citations
#15

HelpSteer2-Preference: Complementing Ratings with Preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.

ICLR 2025 · arXiv:2410.01257
Keywords: reward modeling, bradley-terry model, preference annotation, instruction following alignment (+4 more)
102 citations
#16

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.

CVPR 2024 · arXiv:2312.12490
80 citations
#17

Confronting Reward Model Overoptimization with Constrained RLHF

Ted Moskovitz, Aaditya Singh, DJ Strouse et al.

ICLR 2024 · arXiv:2310.04373
73 citations
#18

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 · arXiv:2409.12822
Keywords: language model alignment, reinforcement learning from human feedback, model deception detection, human evaluation accuracy (+4 more)
71 citations
#19

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025 · arXiv:2311.18232
63 citations
#20

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025 · arXiv:2504.07954
58 citations
#21

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025 · arXiv:2405.17220
Keywords: hallucination reduction, multimodal large language models, preference learning, open-source alignment (+4 more)
54 citations
#22

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 · arXiv:2407.14622
Keywords: reinforcement learning from human feedback, best-of-n sampling, distribution matching, jeffreys divergence (+4 more)
50 citations
#23

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025 · arXiv:2410.14872
Keywords: reward model evaluation, reinforcement learning from human feedback, human preference datasets, proxy task evaluation (+3 more)
50 citations
#24

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang et al.

NeurIPS 2025
43 citations
#25

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, Micah Carroll, Adhyyan Narang et al.

ICLR 2025 · arXiv:2411.02306
41 citations
#26

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

ICLR 2024 · arXiv:2302.08560
39 citations
#27

Provable Offline Preference-Based Reinforcement Learning

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.

ICLR 2024 · arXiv:2305.14816
39 citations
#28

Making RL with Preference-based Feedback Efficient via Randomization

Runzhe Wu, Wen Sun

ICLR 2024 · arXiv:2310.14554
37 citations
#29

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024
32 citations
#30

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.

NeurIPS 2025 · arXiv:2505.11475
Keywords: preference datasets, reinforcement learning from human feedback, reward models, instruction-following language models (+4 more)
31 citations
#31

Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

Hritik Bansal, John Dang, Aditya Grover

ICLR 2024 · arXiv:2308.15812
26 citations
#32

EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading

Molei Qin, Shuo Sun, Wentao Zhang et al.

AAAI 2024 · arXiv:2309.12891
Keywords: hierarchical reinforcement learning, high frequency trading, cryptocurrency market, dynamic programming (+4 more)
24 citations
#33

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.

ICLR 2025 · arXiv:2408.15313
23 citations
#34

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

AAAI 2025 · arXiv:2412.11120
23 citations
#35

Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie, Jie Chen, Liyu Chen et al.

ICML 2025 · arXiv:2502.03492
23 citations
#36

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

ECCV 2024 · arXiv:2409.18343
Keywords: autonomous driving, agent behavior modeling, reinforcement learning fine-tuning, distribution shift (+4 more)
23 citations
#37

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

ICML 2025 · arXiv:2505.07395
20 citations
#38

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025 · arXiv:2505.20347
Keywords: reinforcement learning, large language models, self-instruction generation, self-rewarding mechanisms (+4 more)
19 citations
#39

Online Preference Alignment for Language Models via Count-based Exploration

Chenjia Bai, Yang Zhang, Shuang Qiu et al.

ICLR 2025 · arXiv:2501.12735
19 citations
#40

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

CVPR 2025
19 citations
#41

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Nick Hansen, Jyothir S V, Vlad Sobal et al.

ICLR 2025 · arXiv:2405.18418
Keywords: whole-body control, humanoid robotics, visual observations, hierarchical world model (+4 more)
19 citations
#42

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 · arXiv:2411.00418
Keywords: reinforcement learning from human feedback, reward model training, self-evolved learning, language model alignment (+3 more)
18 citations
#43

Learning Optimal Advantage from Preferences and Mistaking It for Reward

W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.

AAAI 2024 · arXiv:2310.02456
Keywords: reward function learning, human preference modeling, regret preference model, partial return assumption (+4 more)
15 citations
#44

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

NeurIPS 2025 · arXiv:2505.13445
15 citations
#45

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025 · arXiv:2410.04612
14 citations
#46

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024 · arXiv:2303.10571
14 citations
#47

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024 · arXiv:2307.16348
Keywords: reinforcement learning, human ratings, preference-based learning, rating prediction model (+3 more)
13 citations
#48

PILAF: Optimal Human Preference Sampling for Reward Modeling

Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.

ICML 2025 · arXiv:2502.04270
13 citations
#49

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025 · arXiv:2506.20520
13 citations
#50

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Zhenfang Chen, Delin Chen, Rui Sun et al.

ICLR 2025 · arXiv:2502.12130
13 citations
#51

Post-hoc Reward Calibration: A Case Study on Length Bias

Zeyu Huang, Zihan Qiu, Zili Wang et al.

ICLR 2025
12 citations
#52

Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

Chongyi Zheng, Benjamin Eysenbach, Homer Walke et al.

ICLR 2024 · arXiv:2306.03346
11 citations
#53

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Tonghe Zhang, Chao Yu, Sichang Su et al.

NeurIPS 2025 · arXiv:2505.22094
Keywords: flow matching, reinforcement learning fine-tuning, robotic control, rectified flow (+4 more)
10 citations
#54

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju

ICLR 2025 · arXiv:2404.18870
Keywords: reinforcement learning from human feedback, large language models, model trustworthiness, preference alignment (+4 more)
10 citations
#55

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

ICLR 2025 · arXiv:2409.17401
9 citations
#56

Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study

Lili Zhao, Yang Wang, Qi Liu et al.

ICLR 2025
9 citations
#57

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Huiwon Jang, Sumin Park et al.

NeurIPS 2025 · arXiv:2506.00070
Keywords: reinforcement learning, embodied reasoning, robot control, vision-language models (+4 more)
9 citations
#58

Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning

Patrick Yin, Tyler Westenbroek, Ching-An Cheng et al.

ICLR 2025
9 citations
#59

Re2LLM: Reflective Reinforcement Large Language Model for Session-based Recommendation

Ziyan Wang, Yingpeng Du, Zhu Sun et al.

AAAI 2025 · arXiv:2403.16427
8 citations
#60

REvolve: Reward Evolution with Large Language Models using Human Feedback

Rishi Hazra, Alkis Sygkounas, Andreas Persson et al.

ICLR 2025 · arXiv:2406.01309
Keywords: reward function design, reinforcement learning, large language models, human feedback integration (+3 more)
8 citations
#61

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

Joey Hong, Anca Dragan, Sergey Levine

ICLR 2025 · arXiv:2411.05193
8 citations
#62

DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

Vint Lee, Pieter Abbeel, Youngwoon Lee

ICLR 2024 · arXiv:2311.01450
8 citations
#63

DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Tongzhou Mu, Minghua Liu, Hao Su

ICLR 2024 · arXiv:2404.16779
7 citations
#64

Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models

Minh-Tung Luu, Younghwan Lee, Donghoon Lee et al.

ICML 2025 · arXiv:2506.12822
7 citations
#65

Learning from negative feedback, or positive feedback or both

Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari et al.

ICLR 2025 · arXiv:2410.04166
Keywords: preference optimization, negative feedback learning, expectation-maximization algorithms, human feedback training (+3 more)
7 citations
#66

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother et al.

ICLR 2025 · arXiv:2411.07007
Keywords: inverse reinforcement learning, successor feature matching, policy gradient descent, state-only imitation (+4 more)
6 citations
#67

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Jiangpeng He, Zhihao Duan, Fengqing Zhu

CVPR 2025 · arXiv:2505.24816
Keywords: class-incremental learning, parameter-efficient fine-tuning, low-rank adaptation, dual-adapter architecture (+4 more)
6 citations
#68

Aligning Language Models Using Follow-up Likelihood as Reward Signal

Chen Zhang, Dading Chong, Feng Jiang et al.

AAAI 2025 · arXiv:2409.13948
6 citations
#69

Inverse Reinforcement Learning by Estimating Expertise of Demonstrators

Mark Beliaev, Ramtin Pedarsani

AAAI 2025 · arXiv:2402.01886
6 citations
#70

SORREL: Suboptimal-Demonstration-Guided Reinforcement Learning for Learning to Branch

Shengyu Feng, Yiming Yang

AAAI 2025 · arXiv:2412.15534
5 citations
#71

PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning

Tianmeng Hu, Biao Luo

AAAI 2024
5 citations
#72

Real-Time Recurrent Reinforcement Learning

Julian Lemmel, Radu Grosu

AAAI 2025 · arXiv:2311.04830
5 citations
#73

ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation

Qiuzhen Lin, Yangfan Chen, Lijia Ma et al.

AAAI 2024
5 citations
#74

UTILITY: Utilizing Explainable Reinforcement Learning to Improve Reinforcement Learning

Shicheng Liu, Minghui Zhu

ICLR 2025
5 citations
#75

ReNeg: Learning Negative Embedding with Reward Guidance

Xiaomin Li, Yixuan Liu, Takashi Isobe et al.

CVPR 2025 · arXiv:2412.19637
5 citations
#76

On the Robustness of Reward Models for Language Model Alignment

Jiwoo Hong, Noah Lee, Eunki Kim et al.

ICML 2025 · arXiv:2505.07271
5 citations
#77

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Shenao Zhang, Zhihan Liu, Boyi Liu et al.

ICML 2025 · arXiv:2410.08067
5 citations
#78

SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.

ICLR 2025
Keywords: direct alignment algorithms, preference optimization, implicit reward modeling, off-policy alignment (+3 more)
5 citations
#79

Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

Kwanyoung Park, Youngwoon Lee

ICLR 2025 · arXiv:2407.00699
5 citations
#80

Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation

Yi-Chen Li, Fuxiang Zhang, Wenjie Qiu et al.

ICLR 2025
5 citations
#81

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Ruiyang Zhou, Shuozhe Li, Amy Zhang et al.

NeurIPS 2025
4 citations
#82

ReDit: Reward Dithering for Improved LLM Policy Optimization

Chenxing Wei, Jiarui Yu, Ying He et al.

NeurIPS 2025 · arXiv:2506.18631
4 citations
#83

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Wanxin Tian, Shijie Zhang, Kevin Zhang et al.

NeurIPS 2025 · arXiv:2506.21669
4 citations
#84

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang, Bingcong Li, Christoph Dann et al.

ICML 2025 · arXiv:2502.19255
4 citations
#85

Enhancing Human Experience in Human-Agent Collaboration: A Human-Centered Modeling Approach Based on Positive Human Gain

Yiming Gao, Feiyu Liu, Liang Wang et al.

ICLR 2024 · arXiv:2401.16444
4 citations
#86

MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Tianze Wang, Dongnan Gui, Yifan Hu et al.

ICML 2025 · arXiv:2502.18699
4 citations
#87

Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Nuoya Xiong, Aarti Singh

ICML 2025 · arXiv:2502.15145
4 citations
#88

Position: Lifetime tuning is incompatible with continual reinforcement learning

Golnaz Mesbahi, Parham Mohammad Panahi, Olya Mastikhina et al.

ICML 2025 · arXiv:2404.02113
4 citations
#89

On Corruption-Robustness in Performative Reinforcement Learning

Vasilis Pollatos, Debmalya Mandal, Goran Radanovic

AAAI 2025 · arXiv:2505.05609
4 citations
#90

MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components

Min Wang, Xin Li, Leiji Zhang et al.

AAAI 2024
4 citations
#91

Visual Reinforcement Learning with Residual Action

Zhenxian Liu, Peixi Peng, Yonghong Tian

AAAI 2025
4 citations
#92

Capturing Individual Human Preferences with Reward Features

Andre Barreto, Vincent Dumoulin, Yiran Mao et al.

NeurIPS 2025
4 citations
#93

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Yanting Yang, Minghao Chen, Qibo Qiu et al.

ECCV 2024 · arXiv:2407.14872
Keywords: vision-language models, language-conditioned reward, robotic planning, reinforcement learning (+4 more)
4 citations
#94

Doubly Robust Alignment for Large Language Models

Erhan Xu, Kai Ye, Hongyi Zhou et al.

NeurIPS 2025
4 citations
#95

Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning

Chenjie Hao, Weyl Lu, Yifan Xu et al.

CVPR 2025 · arXiv:2504.07095
4 citations
#96

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Lin Zhang, Xianfang Zeng, Kangcong Li et al.

ICCV 2025 · arXiv:2508.06125
3 citations
#97

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

Muleilan Pei, Shaoshuai Shi, Xuesong Chen et al.

ICCV 2025 · arXiv:2507.12083
Keywords: trajectory prediction, motion forecasting, autonomous driving, inverse reinforcement learning (+4 more)
3 citations
#98

Reinforcement Learning from Imperfect Corrective Actions and Proxy Rewards

Zhaohui Jiang, Xuening Feng, Paul Weng et al.

ICLR 2025 · arXiv:2410.05782
3 citations
#99

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

Zexu Sun, Yiju Guo, Yankai Lin et al.

ICLR 2025
Keywords: reward modeling, human preference learning, policy optimization, off-distribution problem (+4 more)
3 citations
#100

Differentiable Information Enhanced Model-Based Reinforcement Learning

Xiaoyuan Zhang, Xinyan Cai, Bo Liu et al.

AAAI 2025 · arXiv:2503.01178
3 citations