Reinforcement Learning

Learning through interaction and rewards

100 papers · 2,813 total citations
Coverage: Feb '24 – Jan '26 · 730 papers
Also includes: reinforcement learning, rl, reward learning, policy

Top Papers

#1

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang et al.

ICLR 2024 · 471 citations

#2

Self-Play Preference Optimization for Language Model Alignment

Yue Wu, Zhiqing Sun, Rina Hughes et al.

ICLR 2025 · 207 citations

#3

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025 · 169 citations

#4

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

NeurIPS 2025 · 152 citations

#5

HelpSteer2-Preference: Complementing Ratings with Preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.

ICLR 2025 · 102 citations

#6

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.

NeurIPS 2025 · arXiv:2506.01347 · 74 citations
Keywords: reinforcement learning, mathematical reasoning, language models, policy gradients (+4 more)

#7

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

ICLR 2024 · 69 citations

#8

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Jiawang Bai, Kuofeng Gao, Shaobo Min et al.

CVPR 2024 · 68 citations

#9

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025 · 63 citations

#10

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025 · arXiv:2504.07954 · 58 citations

#11

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan et al.

ICLR 2025 · 53 citations

#12

Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Siyao Li, Tianpei Gu, Zhitao Yang et al.

ICLR 2024 · 45 citations

#13

OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code

Maxence Faldor, Jenny Zhang, Antoine Cully et al.

ICLR 2025 · 44 citations

#14

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025 · arXiv:2409.13156 · 44 citations
Keywords: reward model training, reward hacking mitigation, causal preference learning, data augmentation techniques (+4 more)

#15

Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Tao Huang, Guangqi Jiang, Yanjie Ze et al.

ECCV 2024 · 43 citations

#16

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, Micah Carroll, Adhyyan Narang et al.

ICLR 2025 · 41 citations

#17

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

ICLR 2024 · 39 citations

#18

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.

NeurIPS 2025 · 39 citations

#19

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024 · 32 citations

#20

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.

CVPR 2025 · arXiv:2501.03218 · 31 citations
Keywords: video large language models, active real-time interaction, streaming video processing, disentangled system architecture (+4 more)

#21

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Chaofeng Chen, Annan Wang, Haoning Wu et al.

ECCV 2024 · 26 citations

#22

Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

Dan Haramati, Tal Daniel, Aviv Tamar

ICLR 2024 · 25 citations

#23

RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo, Perry Dong, Yuexiang Zhai et al.

ICLR 2024 · 25 citations

#24

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024 · arXiv:2309.05915 · 25 citations
Keywords: decision transformer, offline policy optimization, advantage conditioning, dynamic programming (+3 more)

#25

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

Guozheng Ma, Lu Li, Sen Zhang et al.

ICLR 2024 · 25 citations

#26

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

Duojun Huang, Xinyu Xiong, Jie Ma et al.

CVPR 2024 · 24 citations

#27

Reward Guided Latent Consistency Distillation

William Wang, Jiachen Li, Weixi Feng et al.

ICLR 2025 · 23 citations

#28

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

AAAI 2025 · 23 citations

#29

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

ECCV 2024 · 23 citations

#30

Domain Randomization via Entropy Maximization

Gabriele Tiboni, Pascal Klink, Jan Peters et al.

ICLR 2024 · 20 citations

#31

DiffAIL: Diffusion Adversarial Imitation Learning

Bingzheng Wang, Guoqiang Wu, Teng Pang et al.

AAAI 2024 · arXiv:2312.06348 · 20 citations
Keywords: imitation learning, adversarial imitation learning, diffusion models, reward function learning (+4 more)

#32

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

ICML 2025 · 20 citations

#33

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

CVPR 2025 · 19 citations

#34

Online Preference Alignment for Language Models via Count-based Exploration

Chenjia Bai, Yang Zhang, Shuang Qiu et al.

ICLR 2025 · 19 citations

#35

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025 · arXiv:2505.20347 · 19 citations
Keywords: reinforcement learning, large language models, self-instruction generation, self-rewarding mechanisms (+4 more)

#36

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.

NeurIPS 2025 · arXiv:2505.16394 · 18 citations
Keywords: reinforcement learning, autonomous driving, world models, model-based reinforcement learning (+4 more)

#37

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 · arXiv:2411.00418 · 18 citations
Keywords: reinforcement learning from human feedback, reward model training, self-evolved learning, language model alignment (+3 more)

#38

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.

ICLR 2025 · 18 citations

#39

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Bhavya, Stelian Coros, Andreas Krause et al.

ICLR 2025 · 18 citations

#40

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Guanxing Lu, Ziwei Wang, Changliu Liu et al.

ICLR 2025 · 17 citations

#41

RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning

Jingdi Chen, Tian Lan, Carlee Joe-Wong

AAAI 2024 · arXiv:2308.03358 · 17 citations
Keywords: multi-agent reinforcement learning, discrete communication, return gap minimization, online clustering problem (+4 more)

#42

Learning Optimal Advantage from Preferences and Mistaking It for Reward

W Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson et al.

AAAI 2024 · arXiv:2310.02456 · 15 citations
Keywords: reward function learning, human preference modeling, regret preference model, partial return assumption (+4 more)

#43

Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Zhi Cen, Huaijin Pi, Sida Peng et al.

ICLR 2025 · 14 citations

#44

Simulating Human-like Daily Activities with Desire-driven Autonomy

Yiding Wang, Yuxuan Chen, Fangwei Zhong et al.

ICLR 2025 · 14 citations

#45

Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning

Fan-Ming Luo, Tian Xu, Xingchen Cao et al.

ICLR 2024 · 14 citations

#46

SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.

NeurIPS 2025 · 14 citations

#47

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025 · 14 citations

#48

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024 · 14 citations

#49

PILAF: Optimal Human Preference Sampling for Reward Modeling

Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng et al.

ICML 2025 · 13 citations

#50

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024 · arXiv:2307.16348 · 13 citations
Keywords: reinforcement learning, human ratings, preference-based learning, rating prediction model (+3 more)

#51

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025 · 13 citations

#52

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Zhenfang Chen, Delin Chen, Rui Sun et al.

ICLR 2025 · 13 citations

#53

AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning

Yuanfei Wang, Xiaojie Zhang, Ruihai Wu et al.

ICLR 2025 · arXiv:2502.11124 · 12 citations
Keywords: articulated object manipulation, adaptive manipulation policy, 3d visual diffusion, imitation learning (+4 more)

#54

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Yanming Wan, Jiaxing Wu, Marwa Abdulhai et al.

NeurIPS 2025 · arXiv:2504.03206 · 12 citations
Keywords: personalized dialogue systems, multi-turn reinforcement learning, curiosity reward mechanism, user modeling (+4 more)

#55

RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

Fengshuo Bai, Runze Liu, Yali Du et al.

AAAI 2025 · 12 citations

#56

ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning

Huiqun Li, Hanhan Zhou, Yifei Zou et al.

AAAI 2024 · arXiv:2312.15555 · 12 citations
Keywords: value function factorization, multi-agent reinforcement learning, non-monotonic mixing functions, concave representations (+3 more)

#57

Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition

Chuanguang Yang, XinQiang Yu, Han Yang et al.

AAAI 2025 · 12 citations

#58

MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity

Zuozhen Zhang, Junzhong Ji, Jinduo Liu

AAAI 2024 · 11 citations

#59

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai, Zihao Wang, Kewei Lian et al.

CVPR 2025 · 11 citations

#60

Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning

Ge Li, Hongyi Zhou, Dominik Roth et al.

ICLR 2024 · 11 citations

#61

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Zhihao Cao, ZiDong Wang, Siwen Xie et al.

CVPR 2024 · 10 citations

#62

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao, Xuqi Liu, Zhongqi Yue et al.

ICCV 2025 · 10 citations

#63

Open-World Reinforcement Learning over Long Short-Term Imagination

Jiajian Li, Qi Wang, Yunbo Wang et al.

ICLR 2025 · 10 citations

#64

ConfigX: Modular Configuration for Evolutionary Algorithms via Multitask Reinforcement Learning

Hongshu Guo, Zeyuan Ma, Jiacheng Chen et al.

AAAI 2025 · 9 citations

#65

Strategy Coopetition Explains the Emergence and Transience of In-Context Learning

Aaditya Singh, Ted Moskovitz, Sara Dragutinović et al.

ICML 2025 · 9 citations

#66

GENTEEL-NEGOTIATOR: LLM-Enhanced Mixture-of-Expert-Based Reinforcement Learning Approach for Polite Negotiation Dialogue

Priyanshu Priya, Rishikant Chigrupaatii, Mauajama Firdaus et al.

AAAI 2025 · 9 citations

#67

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Zun Wang, Jialu Li, Yicong Hong et al.

ICLR 2025 · 9 citations

#68

SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

Jiaqi Huang, Zunnan Xu, Jun Zhou et al.

NeurIPS 2025 · 8 citations

#69

DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

Vint Lee, Pieter Abbeel, Youngwoon Lee

ICLR 2024 · 8 citations

#70

Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization

Daniel Palenicek, Florian Vogt, Joe Watson et al.

NeurIPS 2025 · 8 citations

#71

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Seonghwan Kim, Junhee Cho et al.

NeurIPS 2025 · arXiv:2505.15277 · 8 citations
Keywords: process reward models, web navigation agents, step-level assessment, preference pair datasets (+3 more)

#72

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Zongkai Liu, Qian Lin, Chao Yu et al.

AAAI 2025 · 8 citations

#73

Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community

Arman Isajanyan, Artur Shatveryan, David Kocharian et al.

ICLR 2024 · 8 citations

#74

REvolve: Reward Evolution with Large Language Models using Human Feedback

Rishi Hazra, Alkis Sygkounas, Andreas Persson et al.

ICLR 2025 · 8 citations

#75

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti et al.

NeurIPS 2025 · 8 citations

#76

Bidirectional Progressive Transformer for Interaction Intention Anticipation

Zichen Zhang, Hongchen Luo, Wei Zhai et al.

ECCV 2024 · 8 citations

#77

Direct Alignment with Heterogeneous Preferences

Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar et al.

NeurIPS 2025 · arXiv:2502.16320 · 8 citations
Keywords: human preference alignment, heterogeneous preferences, direct alignment methods, reward function learning (+4 more)

#78

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

Borong Zhang, Yuhao Zhang, Jiaming Ji et al.

NeurIPS 2025 · 7 citations

#79

Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Jaehyeon Son, Soochan Lee, Gunhee Kim

ICLR 2025 · 7 citations

#80

DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Tongzhou Mu, Minghua Liu, Hao Su

ICLR 2024 · 7 citations

#81

Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models

Minh-Tung Luu, Younghwan Lee, Donghoon Lee et al.

ICML 2025 · 7 citations

#82

The Bandit Whisperer: Communication Learning for Restless Bandits

Yunfan Zhao, Tonghan Wang, Dheeraj Mysore Nagaraj et al.

AAAI 2025 · 7 citations

#83

What Makes Math Problems Hard for Reinforcement Learning: A Case Study

Ali Shehper, Anibal Medina-Mardones, Lucas Fagan et al.

NeurIPS 2025 · 7 citations

#84

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Yunheng Li, Jing Cheng, Shaoyong Jia et al.

NeurIPS 2025 · 6 citations

#85

Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling

Yuejiang Liu, Jubayer Hamid, Annie Xie et al.

ICLR 2025 · arXiv:2408.17355 · 6 citations
Keywords: action chunking, bidirectional decoding, robot learning, human demonstrations (+3 more)

#86

Advantage Alignment Algorithms

Juan Duque, Milad Aghajohari, Timotheus Cooijmans et al.

ICLR 2025 · 6 citations

#87

When Maximum Entropy Misleads Policy Optimization

Ruipeng Zhang, Ya-Chien Chang, Sicun Gao

ICML 2025 · 6 citations

#88

InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

Jinlu Zhang, Yixin Chen, Zan Wang et al.

CVPR 2025 · 6 citations

#89

Inverse Reinforcement Learning by Estimating Expertise of Demonstrators

Mark Beliaev, Ramtin Pedarsani

AAAI 2025 · 6 citations

#90

REVECA: Adaptive Planning and Trajectory-Based Validation in Cooperative Language Agents Using Information Relevance and Relative Proximity

SeungWon Seo, SeongRae Noh, Junhyeok Lee et al.

AAAI 2025 · 6 citations

#91

The Curse of Diversity in Ensemble-Based Exploration

Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin et al.

ICLR 2024 · 6 citations

#92

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother et al.

ICLR 2025 · arXiv:2411.07007 · 6 citations
Keywords: inverse reinforcement learning, successor feature matching, policy gradient descent, state-only imitation (+4 more)

#93

Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference

Matt Riemer, Gopeshh Raaj Subbaraj, Glen Berseth et al.

ICLR 2025 · 6 citations

#94

Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections

Xiaomeng Xu, Yifan Hou, Zeyi Liu et al.

NeurIPS 2025 · arXiv:2506.16685 · 5 citations
Keywords: contact-rich manipulation, dataset aggregation, human corrections, compliance control (+3 more)

#95

Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations

Zilin Wang, Haolin Zhuang, Lu Li et al.

AAAI 2024 · arXiv:2312.11442 · 5 citations
Keywords: 3d dance generation, reward model training, reinforcement learning, music-conditioned generation (+4 more)

#96

PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Yan Zhang, Yao Feng, Alpár Cseke et al.

ICCV 2025 · arXiv:2503.17544 · 5 citations
Keywords: avatar motor systems, generative motion models, human motion generation, foundation models (+4 more)

#97

Neural Interactive Proofs

Lewis Hammond, Sam Adam-Day

ICLR 2025 · 5 citations

#98

Unsupervised Object Interaction Learning with Counterfactual Dynamics Models

Jongwook Choi, Sungtae Lee, Xinyu Wang et al.

AAAI 2024 · 5 citations

#99

Noise-Resilient Symbolic Regression with Dynamic Gating Reinforcement Learning

Chenglu Sun, Shuo Shen, Wenzhi Tao et al.

AAAI 2025 · 5 citations

#100

Real-Time Recurrent Reinforcement Learning

Julian Lemmel, Radu Grosu

AAAI 2025 · 5 citations