Deep Reinforcement Learning

Deep learning for RL

718 papers (showing top 100) · 2,946 total citations
Mar '24 to Feb '26: 609 papers
Also includes: deep reinforcement learning, deep rl, drl, neural network rl

Top Papers

#1

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

ICLR 2024 · arXiv:2310.06452
267
citations
#2

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025 · arXiv:2503.21776
rule-based reinforcement learning, multimodal large language models, video reasoning, temporal modeling, +3 more
232
citations
#3

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025 · arXiv:2504.08837
169
citations
#4

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

NeurIPS 2025
152
citations
#5

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

ICLR 2024 · arXiv:2310.12921
133
citations
#6

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

NeurIPS 2025 · arXiv:2504.16084
test-time reinforcement learning, reward estimation, large language models, reasoning tasks, +4 more
118
citations
#7

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ICLR 2025 · arXiv:2411.02337
llm web agents, online curriculum reinforcement learning, self-evolving curriculum, outcome-supervised reward model, +3 more
110
citations
#8

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu et al.

NeurIPS 2025 · arXiv:2505.24864
reinforcement learning, reasoning capabilities, kl divergence control, reference policy resetting, +4 more
96
citations
#9

General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma, Qian Liu, Dongfu Jiang et al.

NeurIPS 2025 · arXiv:2505.14652
74
citations
#10

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

ICLR 2024 · arXiv:2312.10812
69
citations
#11

Large-scale Reinforcement Learning for Diffusion Models

Yinan Zhang, Eric Tzeng, Yilun Du et al.

ECCV 2024 · arXiv:2401.12244
69
citations
#12

METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park, Oleh Rybkin, Sergey Levine

ICLR 2024 · arXiv:2310.08887
68
citations
#13

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025 · arXiv:2311.18232
63
citations
#14

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025 · arXiv:2504.07954
58
citations
#15

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li et al.

NeurIPS 2025 · arXiv:2503.19470
reasoning with search, reinforcement learning, multi-hop question answering, search-guided reasoning, +3 more
57
citations
#16

Simplifying Deep Temporal Difference Learning

Matteo Gallici, Mattie Fellows, Benjamin Ellis et al.

ICLR 2025 · arXiv:2407.04811
temporal difference learning, off-policy learning, q-learning algorithms, reinforcement learning stability, +4 more
53
citations
#17

Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX

Clément Bonnet, Daniel Luo, Donal Byrne et al.

ICLR 2024 · arXiv:2306.09884
47
citations
#18

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang et al.

NeurIPS 2025
43
citations
#19

Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang et al.

ECCV 2024
reinforcement learning, autonomous driving, world model, latent state space, +4 more
43
citations
#20

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

ICLR 2024 · arXiv:2302.08560
39
citations
#21

Provable Offline Preference-Based Reinforcement Learning

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.

ICLR 2024 · arXiv:2305.14816
39
citations
#22

SafeDreamer: Safe Reinforcement Learning with World Models

Weidong Huang, Jiaming Ji, Chunhe Xia et al.

ICLR 2024 · arXiv:2307.07176
34
citations
#23

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024
32
citations
#24

Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Zhiyuan Zhou, Andy Peng, Qiyang Li et al.

ICLR 2025 · arXiv:2412.07762
reinforcement learning fine-tuning, offline reinforcement learning, online reinforcement learning, distribution mismatch, +4 more
27
citations
#25

BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning

Jing Cui, Yufei Han, Yuzhe Ma et al.

AAAI 2024 · arXiv:2312.12585
backdoor attacks, reinforcement learning security, sparse poisoning, targeted state observations, +3 more
26
citations
#26

Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

Dan Haramati, Tal Daniel, Aviv Tamar

ICLR 2024 · arXiv:2404.01220
25
citations
#27

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal et al.

NeurIPS 2025
25
citations
#28

Efficient Online Reinforcement Learning for Diffusion Policy

Haitong Ma, Tianyi Chen, Kai Wang et al.

ICML 2025 · arXiv:2502.00361
24
citations
#29

Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie et al.

ICLR 2024
23
citations
#30

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

AAAI 2025 · arXiv:2412.11120
23
citations
#31

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

ECCV 2024 · arXiv:2409.18343
autonomous driving, agent behavior modeling, reinforcement learning fine-tuning, distribution shift, +4 more
23
citations
#32

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

ICML 2025 · arXiv:2505.07395
20
citations
#33

Exploring the Promise and Limits of Real-Time Recurrent Learning

Kazuki Irie, Anand Gopalakrishnan, Jürgen Schmidhuber

ICLR 2024 · arXiv:2305.19044
20
citations
#34

Efficient Reinforcement Learning with Large Language Model Priors

Xue Yan, Yan Song, Xidong Feng et al.

ICLR 2025 · arXiv:2410.07927
20
citations
#35

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025 · arXiv:2505.20347
reinforcement learning, large language models, self-instruction generation, self-rewarding mechanisms, +4 more
19
citations
#36

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

CVPR 2025
19
citations
#37

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Bhavya, Stelian Coros, Andreas Krause et al.

ICLR 2025
18
citations
#38

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025 · arXiv:2411.00418
reinforcement learning from human feedback, reward model training, self-evolved learning, language model alignment, +3 more
18
citations
#39

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur et al.

NeurIPS 2025
15
citations
#40

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

NeurIPS 2025 · arXiv:2505.13445
15
citations
#41

Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning

Jiuqi Wang, Ethan Blaser, Hadi Daneshmand et al.

ICLR 2025 · arXiv:2405.13861
in-context reinforcement learning, temporal difference learning, policy evaluation, transformer architecture, +3 more
14
citations
#42

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025 · arXiv:2410.04612
14
citations
#43

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024 · arXiv:2303.10571
14
citations
#44

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024 · arXiv:2307.16348
reinforcement learning, human ratings, preference-based learning, rating prediction model, +3 more
13
citations
#45

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

NeurIPS 2025 · arXiv:2506.20520
13
citations
#46

AdaWM: Adaptive World Model based Planning for Autonomous Driving

Hang Wang, Xin Ye, Feng Tao et al.

ICLR 2025 · arXiv:2501.13072
world model reinforcement learning, autonomous driving planning, distribution shift, dynamics model mismatch, +4 more
12
citations
#47

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Hao Zhong, Muzhi Zhu, Zongze Du et al.

NeurIPS 2025 · arXiv:2505.20256
reinforcement learning, omnimodal reasoning, two-system architecture, keyframe selection, +4 more
12
citations
#48

Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition

Chuanguang Yang, XinQiang Yu, Han Yang et al.

AAAI 2025 · arXiv:2502.18510
12
citations
#49

RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

Fengshuo Bai, Runze Liu, Yali Du et al.

AAAI 2025 · arXiv:2412.10713
12
citations
#50

Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

Yongyuan Liang, Yanchao Sun, Ruijie Zheng et al.

ICLR 2024 · arXiv:2307.12062
12
citations
#51

Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

Chongyi Zheng, Benjamin Eysenbach, Homer Walke et al.

ICLR 2024 · arXiv:2306.03346
11
citations
#52

Dynamic Layer Tying for Parameter-Efficient Transformers

Tamir David-Hay, Lior Wolf

ICLR 2024
11
citations
#53

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

Alexander Nikulin, Ilya Zisman, Alexey Zemtsov et al.

ICLR 2025
11
citations
#54

UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Gengrui Zhang, Xiaoshuang Chen, Yao Wang et al.

AAAI 2024 · arXiv:2401.06470
reinforcement learning, multi-stage recommender systems, multi-agent reinforcement learning, long-term rewards, +4 more
11
citations
#55

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Eliot Xing, Vernon Luk, Jean Oh

ICLR 2025 · arXiv:2412.12089
11
citations
#56

Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing

Jinmin He, Kai Li, Yifan Zang et al.

AAAI 2024 · arXiv:2312.14472
multi-task reinforcement learning, dynamic depth routing, parameter sharing, routing network, +3 more
10
citations
#57

Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Zhiyong Wang, Dongruo Zhou, John C.S. Lui et al.

ICLR 2025 · arXiv:2408.08994
10
citations
#58

Open-World Reinforcement Learning over Long Short-Term Imagination

Jiajian Li, Qi Wang, Yunbo Wang et al.

ICLR 2025
10
citations
#59

MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

Claas Voelcker, Marcel Hussing, Eric Eaton et al.

ICLR 2025
10
citations
#60

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Tonghe Zhang, Chao Yu, Sichang Su et al.

NeurIPS 2025 · arXiv:2505.22094
flow matching, reinforcement learning fine-tuning, robotic control, rectified flow, +4 more
10
citations
#61

Massively Scalable Inverse Reinforcement Learning in Google Maps

Matt Barnes, Matthew Abueg, Oliver Lange et al.

ICLR 2024 · arXiv:2305.11290
10
citations
#62

A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks

Thomas Schmied, Thomas Adler, Vihang Patil et al.

ICML 2025 · arXiv:2410.22391
10
citations
#63

Learning from Sparse Offline Datasets via Conservative Density Estimation

Zhepeng Cen, Zuxin Liu, Zitong Wang et al.

ICLR 2024 · arXiv:2401.08819
10
citations
#64

Unlocking the Power of Representations in Long-term Novelty-based Exploration

Alaa Saade, Steven Kapturowski, Daniele Calandriello et al.

ICLR 2024 · arXiv:2305.01521
9
citations
#65

Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning

Patrick Yin, Tyler Westenbroek, Ching-An Cheng et al.

ICLR 2025
9
citations
#66

Prioritized Generative Replay

Ren Wang, Kevin Frans, Pieter Abbeel et al.

ICLR 2025 · arXiv:2410.18082
9
citations
#67

Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning

Zizhao Wang, Caroline Wang, Xuesu Xiao et al.

AAAI 2024 · arXiv:2401.12497
causal state abstractions, reinforcement learning, implicit dynamics models, factored state spaces, +4 more
9
citations
#68

A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals

Grace Liu, Michael Tang, Benjamin Eysenbach

ICLR 2025 · arXiv:2408.05804
contrastive reinforcement learning, skill emergence, directed exploration, reward-free learning, +2 more
9
citations
#69

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Huiwon Jang, Sumin Park et al.

NeurIPS 2025 · arXiv:2506.00070
reinforcement learning, embodied reasoning, robot control, vision-language models, +4 more
9
citations
#70

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

ICLR 2025 · arXiv:2409.17401
9
citations
#71

SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Zhiyu Mei, Wei Fu, Jiaxuan Gao et al.

ICLR 2024 · arXiv:2306.16688
8
citations
#72

DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

Vint Lee, Pieter Abbeel, Youngwoon Lee

ICLR 2024 · arXiv:2311.01450
8
citations
#73

Flow-Based Policy for Online Reinforcement Learning

Lei Lv, Yunfei Li, Yu Luo et al.

NeurIPS 2025 · arXiv:2506.12811
8
citations
#74

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Zongkai Liu, Qian Lin, Chao Yu et al.

AAAI 2025 · arXiv:2412.07639
8
citations
#75

Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning

Yinglun Xu, Qi Zeng, Gagandeep Singh

ICLR 2025 · arXiv:2205.14842
reward poisoning attacks, online deep reinforcement learning, adversarial mdp attacks, black-box attacks, +4 more
8
citations
#76

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

Joey Hong, Anca Dragan, Sergey Levine

ICLR 2025 · arXiv:2411.05193
8
citations
#77

REvolve: Reward Evolution with Large Language Models using Human Feedback

Rishi Hazra, Alkis Sygkounas, Andreas Persson et al.

ICLR 2025 · arXiv:2406.01309
reward function design, reinforcement learning, large language models, human feedback integration, +3 more
8
citations
#78

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Jie Cheng, Ruixi Qiao, Ma Yingwei et al.

ICLR 2025 · arXiv:2410.00564
offline reinforcement learning, model-based rl, world model, joint optimization, +4 more
7
citations
#79

Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models

Minh-Tung Luu, Younghwan Lee, Donghoon Lee et al.

ICML 2025 · arXiv:2506.12822
7
citations
#80

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Max Wilcoxson, Qiyang Li, Kevin Frans et al.

ICML 2025 · arXiv:2410.18076
7
citations
#81

Identifying Policy Gradient Subspaces

Jan Schneider, Pierre Schumacher, Simon Guist et al.

ICLR 2024 · arXiv:2401.06604
7
citations
#82

Causally Aligned Curriculum Learning

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

ICLR 2024 · arXiv:2503.16799
7
citations
#83

DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Tongzhou Mu, Minghua Liu, Hao Su

ICLR 2024 · arXiv:2404.16779
7
citations
#84

Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data

Yunhao Tang, Sid Wang, Lovish Madaan et al.

NeurIPS 2025
7
citations
#85

Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Jaehyeon Son, Soochan Lee, Gunhee Kim

ICLR 2025
7
citations
#86

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother et al.

ICLR 2025 · arXiv:2411.07007
inverse reinforcement learning, successor feature matching, policy gradient descent, state-only imitation, +4 more
6
citations
#87

Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners

Michal Nauman, Marek Cygan, Carmelo Sferrazza et al.

NeurIPS 2025 · arXiv:2505.23150
6
citations
#88

Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning

Guozheng Ma, Lu Li, Zilin Wang et al.

ICML 2025 · arXiv:2506.17204
6
citations
#89

EARL-BO: Reinforcement Learning for Multi-Step Lookahead, High-Dimensional Bayesian Optimization

Mujin Cheon, Jay Lee, Dong-Yeun Koh et al.

ICML 2025 · arXiv:2411.00171
6
citations
#90

Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

Hung Le, Dung Nguyen, Kien Do et al.

ICLR 2025 · arXiv:2410.10132
memory-augmented agents, partially observable environments, reinforcement learning, hadamard product, +4 more
6
citations
#91

Are Expressive Models Truly Necessary for Offline RL?

Guan Wang, Haoyi Niu, Jianxiong Li et al.

AAAI 2025 · arXiv:2412.11253
6
citations
#92

Inverse Reinforcement Learning by Estimating Expertise of Demonstrators

Mark Beliaev, Ramtin Pedarsani

AAAI 2025 · arXiv:2402.01886
6
citations
#93

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Jiangpeng He, Zhihao Duan, Fengqing Zhu

CVPR 2025 · arXiv:2505.24816
class-incremental learning, parameter-efficient fine-tuning, low-rank adaptation, dual-adapter architecture, +4 more
6
citations
#94

ReNeg: Learning Negative Embedding with Reward Guidance

Xiaomin Li, Yixuan Liu, Takashi Isobe et al.

CVPR 2025 · arXiv:2412.19637
5
citations
#95

Embedding Safety into RL: A New Take on Trust Region Methods

Nikola Milosevic, Johannes Müller, Nico Scherf

ICML 2025 · arXiv:2411.02957
5
citations
#96

REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes

David Ireland, Giovanni Montana

ICLR 2024 · arXiv:2401.08850
5
citations
#97

Hybrid Latent Reasoning via Reinforcement Learning

Zhenrui Yue, Bowen Jin, Huimin Zeng et al.

NeurIPS 2025 · arXiv:2505.18454
5
citations
#98

SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.

ICLR 2025
direct alignment algorithms, preference optimization, implicit reward modeling, off-policy alignment, +3 more
5
citations
#99

Reinforcement learning with combinatorial actions for coupled restless bandits

Lily Xu, Bryan Wilder, Elias Khalil et al.

ICLR 2025 · arXiv:2503.01919
reinforcement learning, combinatorial action spaces, restless bandits, mixed-integer programming, +3 more
5
citations
#100

Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

Kwanyoung Park, Youngwoon Lee

ICLR 2025 · arXiv:2407.00699
5
citations