🧬Reinforcement Learning

Deep Reinforcement Learning

Deep learning for RL

100 papers4,624 total citations
Compare with other topics
Mar '24 Feb '261208 papers
Also includes: deep reinforcement learning, deep rl, drl, neural network rl

Top Papers

#1

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu et al.

NeurIPS 2025arXiv:2504.13837
reinforcement learning with verifiable rewardsreasoning capacitylarge language modelsmathematics tasks+4
483
citations
#2

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

ICLR 2024
267
citations
#3

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025arXiv:2503.21776
rule-based reinforcement learningmultimodal large language modelsvideo reasoningtemporal modeling+3
236
citations
#4

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NeurIPS 2025
169
citations
#5

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

NeurIPS 2025
152
citations
#6

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

Huajian Xin, Z.Z. Ren, Junxiao Song et al.

ICLR 2025arXiv:2408.08152
theorem provingreinforcement learningmonte-carlo tree searchproof assistant feedback+4
134
citations
#7

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

ICLR 2024
133
citations
#8

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

NeurIPS 2025arXiv:2504.16084
test-time reinforcement learningreward estimationlarge language modelsreasoning tasks+4
122
citations
#9

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ICLR 2025arXiv:2411.02337
llm web agentsonline curriculum reinforcement learningself-evolving curriculumoutcome-supervised reward model+3
113
citations
#10

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu et al.

NeurIPS 2025arXiv:2505.24864
reinforcement learningreasoning capabilitieskl divergence controlreference policy resetting+4
99
citations
#11

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Siyan Zhao, Devaansh Gupta, Qinqing Zheng et al.

NeurIPS 2025arXiv:2504.12216
diffusion large language modelsreinforcement learningnon-autoregressive generationreasoning capabilities+4
75
citations
#12

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.

NeurIPS 2025arXiv:2506.01347
reinforcement learningmathematical reasoninglanguage modelspolicy gradients+4
74
citations
#13

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025arXiv:2410.20092
offline reinforcement learninggoal-conditioned rlbenchmark evaluationoffline gcrl algorithms+3
74
citations
#14

General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma, Qian Liu, Dongfu Jiang et al.

NeurIPS 2025
74
citations
#15

Large-scale Reinforcement Learning for Diffusion Models

Yinan Zhang, Eric Tzeng, Yilun Du et al.

ECCV 2024
69
citations
#16

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

ICLR 2024
69
citations
#17

METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park, Oleh Rybkin, Sergey Levine

ICLR 2024
68
citations
#18

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025
63
citations
#19

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025arXiv:2504.07954
58
citations
#20

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li et al.

NeurIPS 2025arXiv:2503.19470
reasoning with searchreinforcement learningmulti-hop question answeringsearch-guided reasoning+3
56
citations
#21

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

ICML 2025
56
citations
#22

Simplifying Deep Temporal Difference Learning

Matteo Gallici, Mattie Fellows, Benjamin Ellis et al.

ICLR 2025
53
citations
#23

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan et al.

ICLR 2025
53
citations
#24

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

ICML 2025
48
citations
#25

Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX

Clément Bonnet, Daniel Luo, Donal Byrne et al.

ICLR 2024
47
citations
#26

SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection

JUNSU KIM, Hoseong Cho, Jihyeon Kim et al.

CVPR 2024
47
citations
#27

Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Siyao Li, Tianpei Gu, Zhitao Yang et al.

ICLR 2024
45
citations
#28

TabM: Advancing tabular deep learning with parameter-efficient ensembling

Yury Gorishniy, Akim Kotelnikov, Artem Babenko

ICLR 2025
45
citations
#29

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025arXiv:2409.13156
reward model trainingreward hacking mitigationcausal preference learningdata augmentation techniques+4
44
citations
#30

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang et al.

NeurIPS 2025
43
citations
#31

Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang et al.

ECCV 2024
reinforcement learningautonomous drivingworld modellatent state space+4
43
citations
#32

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

ICLR 2024
39
citations
#33

Provable Offline Preference-Based Reinforcement Learning

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.

ICLR 2024
39
citations
#34

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.

NeurIPS 2025
39
citations
#35

Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi et al.

NeurIPS 2025arXiv:2507.07966
vision-language modelslong video reasoningreinforcement learningchain-of-thought fine-tuning+4
38
citations
#36

Agentic RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

Xinji Mai, Haotian Xu, Xing W et al.

NeurIPS 2025
reinforcement learning scalingtool-integrated reasoningspontaneous code executionmathematical problem solving+4
38
citations
#37

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Ruowen Zhao, James Jun Liang Chen Ye, Zhengyi Wang et al.

ICCV 2025arXiv:2503.15265
triangle mesh generationauto-regressive modelingmesh tokenization algorithmreinforcement learning for 3d+4
35
citations
#38

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Jorge (Zhoujun) Cheng, Shibo Hao, Tianyang Liu et al.

NeurIPS 2025arXiv:2506.14965
reinforcement learninglarge language model reasoningcross-domain trainingreward signal design+4
35
citations
#39

SafeDreamer: Safe Reinforcement Learning with World Models

Weidong Huang, Jiaming Ji, Chunhe Xia et al.

ICLR 2024
34
citations
#40

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024
32
citations
#41

Random Feature Amplification: Feature Learning and Generalization in Neural Networks

Spencer Frei, Niladri Chatterji, Peter L. Bartlett

ICLR 2024
32
citations
#42

Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

Xiu Yuan, Tongzhou Mu, Stone Tao et al.

ICLR 2025arXiv:2412.13630
imitation learningresidual policyonline refinementrobot learning+3
27
citations
#43

Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Zhiyuan Zhou, Andy Peng, Qiyang Li et al.

ICLR 2025arXiv:2412.07762
reinforcement learning fine-tuningoffline reinforcement learningonline reinforcement learningdistribution mismatch+4
27
citations
#44

BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning

Jing Cui, Yufei Han, Yuzhe Ma et al.

AAAI 2024arXiv:2312.12585
backdoor attacksreinforcement learning securitysparse poisoningtargeted state observations+3
26
citations
#45

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Chaofeng Chen, Annan Wang, Haoning Wu et al.

ECCV 2024
26
citations
#46

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Yinmin Zhang, Jie Liu, Chuming Li et al.

AAAI 2024arXiv:2312.07685
offline reinforcement learningq-value estimationonline finetuningoffline-to-online rl+3
25
citations
#47

RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo, Perry Dong, Yuexiang Zhai et al.

ICLR 2024
25
citations
#48

Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

Dan Haramati, Tal Daniel, Aviv Tamar

ICLR 2024
25
citations
#49

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024arXiv:2309.05915
decision transformeroffline policy optimizationadvantage conditioningdynamic programming+3
25
citations
#50

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

Guozheng Ma, Lu Li, Sen Zhang et al.

ICLR 2024
25
citations
#51

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal et al.

NeurIPS 2025
25
citations
#52

Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Chengzhuo Tong, Ziyu Guo, Renrui Zhang et al.

NeurIPS 2025arXiv:2505.17017
reinforcement learningautoregressive image generationchain-of-thought reasoningdirect preference optimization+4
25
citations
#53

Efficient Online Reinforcement Learning for Diffusion Policy

Haitong Ma, Tianyi Chen, Kai Wang et al.

ICML 2025
24
citations
#54

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

Duojun Huang, Xinyu Xiong, Jie Ma et al.

CVPR 2024
24
citations
#55

Implicit bias of SGD in $L_2$-regularized linear DNNs: One-way jumps from high to low rank

Zihan Wang, Arthur Jacot

ICLR 2024
23
citations
#56

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen, Qianyu He, Siyu Yuan et al.

NeurIPS 2025
23
citations
#57

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

AAAI 2025
23
citations
#58

DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

Julien Siems, Timur Carstensen, Arber Zela et al.

NeurIPS 2025
23
citations
#59

Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie et al.

ICLR 2024
23
citations
#60

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

ECCV 2024arXiv:2409.18343
autonomous drivingagent behavior modelingreinforcement learning fine-tuningdistribution shift+4
23
citations
#61

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.

ECCV 2024
23
citations
#62

Towards General-Purpose Model-Free Reinforcement Learning

Scott Fujimoto, Pierluca D'Oro, Amy Zhang et al.

ICLR 2025arXiv:2501.16142
model-free reinforcement learningmodel-based representationsvalue function linearizationgeneral-purpose rl+2
22
citations
#63

From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Clementine Domine, Nicolas Anguita, Alexandra M Proca et al.

ICLR 2025
22
citations
#64

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich et al.

NeurIPS 2025
22
citations
#65

Domain Prompt Learning with Quaternion Networks

Qinglong Cao, Zhengqin Xu, Yuntian Chen et al.

CVPR 2024
22
citations
#66

Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Desai Xie, Jiahao Li, Hao Tan et al.

CVPR 2024
21
citations
#67

DiffAIL: Diffusion Adversarial Imitation Learning

Bingzheng Wang, Guoqiang Wu, Teng Pang et al.

AAAI 2024arXiv:2312.06348
imitation learningadversarial imitation learningdiffusion modelsreward function learning+4
20
citations
#68

Efficient Reinforcement Learning with Large Language Model Priors

Xue Yan, Yan Song, Xidong Feng et al.

ICLR 2025
20
citations
#69

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu et al.

NeurIPS 2025
20
citations
#70

Exploring the Promise and Limits of Real-Time Recurrent Learning

Kazuki Irie, Anand Gopalakrishnan, Jürgen Schmidhuber

ICLR 2024
20
citations
#71

Domain Randomization via Entropy Maximization

Gabriele Tiboni, Pascal Klink, Jan Peters et al.

ICLR 2024
20
citations
#72

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

ICML 2025
20
citations
#73

A Rainbow in Deep Network Black Boxes

Florentin Guth, Brice Ménard, Gaspar Rochette et al.

ICLR 2025
19
citations
#74

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025arXiv:2505.20347
reinforcement learninglarge language modelsself-instruction generationself-rewarding mechanisms+4
19
citations
#75

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

CVPR 2025
19
citations
#76

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Bhavya, Stelian Coros, Andreas Krause et al.

ICLR 2025
18
citations
#77

Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Uladzislau Sobal, Wancong Zhang, Kyunghyun Cho et al.

NeurIPS 2025arXiv:2502.14819
reward-free offline learninglatent dynamics modelsmodel-based planninggoal-conditioned rl+4
18
citations
#78

SELF-EVOLVED REWARD LEARNING FOR LLMS

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

ICLR 2025arXiv:2411.00418
reinforcement learning from human feedbackreward model trainingself-evolved learninglanguage model alignment+3
18
citations
#79

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.

NeurIPS 2025arXiv:2505.16394
reinforcement learningautonomous drivingworld modelsmodel-based reinforcement learning+4
18
citations
#80

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.

ICLR 2025
18
citations
#81

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Hao Liang, Zhiquan Luo

NeurIPS 2025
18
citations
#82

Stitching Sub-trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL

Sungyoon Kim, Yunseon Choi, Daiki Matsunaga et al.

AAAI 2024arXiv:2402.07226
offline reinforcement learninggoal-conditioned rlconditional diffusion modelssub-trajectory stitching+4
17
citations
#83

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

NeurIPS 2025
15
citations
#84

Horizon Reduction Makes RL Scalable

Seohong Park, Kevin Frans, Deepinder Mann et al.

NeurIPS 2025
15
citations
#85

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Andy Zhou, Kevin Wu, Francesco Pinto et al.

NeurIPS 2025arXiv:2503.15754
autonomous red teaminglarge language modelsmulti-agent architectureattack vector discovery+3
15
citations
#86

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur et al.

NeurIPS 2025
15
citations
#87

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

ECCV 2024
14
citations
#88

DRoC: Elevating Large Language Models for Complex Vehicle Routing via Decomposed Retrieval of Constraints

Xia Jiang, Yaoxin Wu, Chenhao Zhang et al.

ICLR 2025
14
citations
#89

Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning

Jiuqi Wang, Ethan Blaser, Hadi Daneshmand et al.

ICLR 2025arXiv:2405.13861
in-context reinforcement learningtemporal difference learningpolicy evaluationtransformer architecture+3
14
citations
#90

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

ICLR 2025
14
citations
#91

SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.

NeurIPS 2025
14
citations
#92

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu et al.

NeurIPS 2025arXiv:2411.04625
kl-regularized rlreinforcement learning from human feedbackcontextual banditspolicy optimization+4
14
citations
#93

SURE: SUrvey REcipes for building reliable and robust deep networks

Yuting Li, Yingyi Chen, Xuanlong Yu et al.

CVPR 2024
14
citations
#94

Deep Distributed Optimization for Large-Scale Quadratic Programming

Augustinos Saravanos, Hunter Kuperman, Alex Oshin et al.

ICLR 2025arXiv:2412.12156
quadratic programmingdistributed optimizationoperator splittingconsensus approach+4
14
citations
#95

Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

Fangwei Zhong, Kui Wu, Hai Ci et al.

ECCV 2024arXiv:2404.09857
embodied visual trackingvisual foundation modelsoffline reinforcement learningsemantic segmentation masks+4
13
citations
#96

AdaWM: Adaptive World Model based Planning for Autonomous Driving

Hang Wang, Xin Ye, Feng Tao et al.

ICLR 2025arXiv:2501.13072
world model reinforcement learningautonomous driving planningdistribution shiftdynamics model mismatch+4
13
citations
#97

On a Connection Between Imitation Learning and RLHF

Teng Xiao, Yige Yuan, Mingxiao Li et al.

ICLR 2025arXiv:2503.05079
imitation learningreinforcement learning from human feedbackpreference data alignmentlarge language model alignment+3
13
citations
#98

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Tonghe Zhang, Chao Yu, Sichang Su et al.

NeurIPS 2025arXiv:2505.22094
flow matchingreinforcement learning fine-tuningrobotic controlrectified flow+4
13
citations
#99

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024arXiv:2307.16348
reinforcement learninghuman ratingspreference-based learningrating prediction model+3
13
citations
#100

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Taewoong Kim, Cheolhong Min, Byeonghwi Kim et al.

ECCV 2024
13
citations