Reinforcement Learning

Deep Reinforcement Learning

Deep learning for RL

100 papers, 3,814 total citations


Feb '24 to Jan '26: 1257 papers

Top Conferences

ICLR: 43, NeurIPS: 27, AAAI: 12, ECCV: 7, CVPR: 6, ICML: 5

Top Papers

#1

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025, arXiv:2503.21776

rule-based reinforcement learning, multimodal large language models, video reasoning, temporal modeling (+3 more)

236 citations

#3

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He et al.

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

NeurIPS 2025, arXiv:2504.16084

test-time reinforcement learning, reward estimation, large language models, reasoning tasks (+4 more)

122 citations

#7

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu et al.

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.

NeurIPS 2025, arXiv:2506.01347

reinforcement learning, mathematical reasoning, language models, policy gradients (+4 more)

74 citations

#10

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025, arXiv:2410.20092

offline reinforcement learning, goal-conditioned rl, benchmark evaluation, offline gcrl algorithms (+3 more)

74 citations

#11

General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma, Qian Liu, Dongfu Jiang et al.

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

Large-scale Reinforcement Learning for Diffusion Models

Yinan Zhang, Eric Tzeng, Yilun Du et al.

METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park, Oleh Rybkin, Sergey Levine

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NeurIPS 2025, arXiv:2504.07954

58 citations

#17

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li et al.

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

Simplifying Deep Temporal Difference Learning

Matteo Gallici, Mattie Fellows, Benjamin Ellis et al.

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan et al.

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection

Junsu Kim, Hoseong Cho, Jihyeon Kim et al.

Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX

Clément Bonnet, Daniel Luo, Donal Byrne et al.

Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

Siyao Li, Tianpei Gu, Zhitao Yang et al.

TabM: Advancing tabular deep learning with parameter-efficient ensembling

Yury Gorishniy, Akim Kotelnikov, Artem Babenko

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025, arXiv:2409.13156

reward model training, reward hacking mitigation, causal preference learning, data augmentation techniques (+4 more)

44 citations

#27

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang et al.

Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang et al.

ECCV 2024

reinforcement learning, autonomous driving, world model, latent state space (+4 more)

43 citations

#29

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.

Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

Harshit Sikchi, Qinqing Zheng, Amy Zhang et al.

Provable Offline Preference-Based Reinforcement Learning

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus et al.

SafeDreamer: Safe Reinforcement Learning with World Models

Weidong Huang, Jiaming Ji, Chunhe Xia et al.

Random Feature Amplification: Feature Learning and Generalization in Neural Networks

Spencer Frei, Niladri Chatterji, Peter L. Bartlett

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Zhiyuan Zhou, Andy Peng, Qiyang Li et al.

BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning

Jing Cui, Yufei Han, Yuzhe Ma et al.

AAAI 2024, arXiv:2312.12585

backdoor attacks, reinforcement learning security, sparse poisoning, targeted state observations (+3 more)

26 citations

#37

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Chaofeng Chen, Annan Wang, Haoning Wu et al.

RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo, Perry Dong, Yuexiang Zhai et al.

Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

Dan Haramati, Tal Daniel, Aviv Tamar

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Yinmin Zhang, Jie Liu, Chuming Li et al.

AAAI 2024, arXiv:2312.07685

offline reinforcement learning, q-value estimation, online finetuning, offline-to-online rl (+3 more)

25 citations

#41

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal et al.

Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages

Guozheng Ma, Lu Li, Sen Zhang et al.

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024, arXiv:2309.05915

decision transformer, offline policy optimization, advantage conditioning, dynamic programming (+3 more)

25 citations

#44

Efficient Online Reinforcement Learning for Diffusion Policy

Haitong Ma, Tianyi Chen, Kai Wang et al.

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

Duojun Huang, Xinyu Xiong, Jie Ma et al.

DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

Julien Siems, Timur Carstensen, Arber Zela et al.

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang et al.

Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning

Haoqi Yuan, Zhancun Mu, Feiyang Xie et al.

Implicit bias of SGD in $L_2$-regularized linear DNNs: One-way jumps from high to low rank

Zihan Wang, Arthur Jacot

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen, Qianyu He, Siyu Yuan et al.

Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving

Zhenghao Peng, Wenjie Luo, Yiren Lu et al.

From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Clementine Domine, Nicolas Anguita, Alexandra M Proca et al.

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich et al.

Domain Prompt Learning with Quaternion Networks

Qinglong Cao, Zhengqin Xu, Yuntian Chen et al.

Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Desai Xie, Jiahao Li, Hao Tan et al.

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Hongyin Zhang, Zifeng Zhuang, Han Zhao et al.

Efficient Reinforcement Learning with Large Language Model Priors

Xue Yan, Yan Song, Xidong Feng et al.

DiffAIL: Diffusion Adversarial Imitation Learning

Bingzheng Wang, Guoqiang Wu, Teng Pang et al.

AAAI 2024, arXiv:2312.06348

imitation learning, adversarial imitation learning, diffusion models, reward function learning (+4 more)

20 citations

#60

Domain Randomization via Entropy Maximization

Gabriele Tiboni, Pascal Klink, Jan Peters et al.

Exploring the Promise and Limits of Real-Time Recurrent Learning

Kazuki Irie, Anand Gopalakrishnan, Jürgen Schmidhuber

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu et al.

A Rainbow in Deep Network Black Boxes

Florentin Guth, Brice Ménard, Gaspar Rochette et al.

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou et al.

NeurIPS 2025, arXiv:2505.20347

reinforcement learning, large language models, self-instruction generation, self-rewarding mechanisms (+4 more)

19 citations

#65

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Zijing Hu, Fengda Zhang, Long Chen et al.

Self-Evolved Reward Learning for LLMs

Chenghua Huang, Zhizhen Fan, Lu Wang et al.

MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Bhavya, Stelian Coros, Andreas Krause et al.

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Haoqi Yuan, Bohan Zhou, Yuhui Fu et al.

Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li et al.

NeurIPS 2025, arXiv:2505.16394

reinforcement learning, autonomous driving, world models, model-based reinforcement learning (+4 more)

18 citations

#70

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Hao Liang, Zhiquan Luo

Stitching Sub-trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL

Sungyoon Kim, Yunseon Choi, Daiki Matsunaga et al.

AAAI 2024, arXiv:2402.07226

offline reinforcement learning, goal-conditioned rl, conditional diffusion models, sub-trajectory stitching (+4 more)

17 citations

#72

Horizon Reduction Makes RL Scalable

Seohong Park, Kevin Frans, Deepinder Mann et al.

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Andy Zhou, Kevin Wu, Francesco Pinto et al.

NeurIPS 2025, arXiv:2503.15754

autonomous red teaming, large language models, multi-agent architecture, attack vector discovery (+3 more)

15 citations

#74

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur et al.

SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Xiao Liang, Zhong-Zhi Li, Yeyun Gong et al.

Reinforcement Learning Friendly Vision-Language Model for Minecraft

Haobin Jiang, Junpeng Yue, Hao Luo et al.

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao, Wenhao Zhan, Jonathan Chang et al.

Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning

Jiuqi Wang, Ethan Blaser, Hadi Daneshmand et al.

DRoC: Elevating Large Language Models for Complex Vehicle Routing via Decomposed Retrieval of Constraints

Xia Jiang, Yaoxin Wu, Chenhao Zhang et al.

SURE: SUrvey REcipes for building reliable and robust deep networks

Yuting Li, Yingyi Chen, Xuanlong Yu et al.

Deep Distributed Optimization for Large-Scale Quadratic Programming

Augustinos Saravanos, Hunter Kuperman, Alex Oshin et al.

Implicit Search via Discrete Diffusion: A Study on Chess

Jiacheng Ye, Zhenyu Wu, Jiahui Gao et al.

AdaWM: Adaptive World Model based Planning for Autonomous Driving

Hang Wang, Xin Ye, Feng Tao et al.

ICLR 2025, arXiv:2501.13072

world model reinforcement learning, autonomous driving planning, distribution shift, dynamics model mismatch (+4 more)

13 citations

#85

Rating-Based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

AAAI 2024, arXiv:2307.16348

reinforcement learning, human ratings, preference-based learning, rating prediction model (+3 more)

13 citations

#86

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes et al.

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Taewoong Kim, Cheolhong Min, Byeonghwi Kim et al.

Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition

Chuanguang Yang, XinQiang Yu, Han Yang et al.

Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

Yongyuan Liang, Yanchao Sun, Ruijie Zheng et al.

Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

Jiyuan Wang, Chunyu Lin, Cheng Guan et al.

R-EDL: Relaxing Nonessential Settings of Evidential Deep Learning

Mengyuan Chen, Junyu Gao, Changsheng Xu

Coreset Selection via Reducible Loss in Continual Learning

Ruilin Tong, Yuhang Liu, Javen Qinfeng Shi et al.

ICLR 2025

coreset selection, continual learning, rehearsal memory, bilevel optimization (+3 more)

12 citations

#93

ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning

Huiqun Li, Hanhan Zhou, Yifei Zou et al.

AAAI 2024, arXiv:2312.15555

value function factorization, multi-agent reinforcement learning, non-monotonic mixing functions, concave representations (+3 more)

12 citations

#94

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Hao Zhong, Muzhi Zhu, Zongze Du et al.

RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

Fengshuo Bai, Runze Liu, Yali Du et al.

Pareto Deep Long-Tailed Recognition: A Conflict-Averse Solution

Zhipeng Zhou, Liu Liu, Peilin Zhao et al.

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

Alexander Nikulin, Ilya Zisman, Alexey Zemtsov et al.

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Eliot Xing, Vernon Luk, Jean Oh

UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Gengrui Zhang, Xiaoshuang Chen, Yao Wang et al.

AAAI 2024, arXiv:2401.06470

reinforcement learning, multi-stage recommender systems, multi-agent reinforcement learning, long-term rewards (+4 more)

11 citations

#100

MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity

Zuozhen Zhang, Junzhong Ji, Jinduo Liu

AAAI 2024

11 citations
