🧬 Reinforcement Learning

Policy Optimization

Policy gradient and optimization methods

100 papers · 4,929 total citations
Feb '24 – Jan '26 · 1,253 papers
Also includes: policy optimization, policy gradient, PPO, TRPO, actor-critic
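The tags above (policy gradient, PPO, actor-critic) all rest on the same score-function estimator. As a minimal illustration of the idea — not taken from any listed paper — here is a hedged REINFORCE sketch with a running-average baseline on a toy two-armed bandit; the reward values, step counts, and learning rates are illustrative assumptions only:

```python
import math
import random

random.seed(0)

# Toy two-armed bandit (assumed rewards): arm 0 pays 0.2, arm 1 pays 1.0.
REWARDS = [0.2, 1.0]

def softmax(prefs):
    """Turn action preferences (logits) into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(steps=2000, lr=0.1):
    """Vanilla policy gradient (REINFORCE) with a running-average baseline."""
    prefs = [0.0, 0.0]   # one preference per arm
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(prefs)
        action = 0 if random.random() < probs[0] else 1
        reward = REWARDS[action]
        baseline += 0.05 * (reward - baseline)   # variance-reducing baseline
        advantage = reward - baseline
        # d/dpref_i log pi(action) = 1[i == action] - probs[i]
        for i in range(2):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad_log_pi
    return softmax(prefs)

probs = reinforce()
print(probs)  # probability mass should concentrate on the better arm
```

PPO and actor-critic methods refine this same update with clipped surrogate objectives and learned value baselines, respectively.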

Top Papers

#1

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao

ECCV 2024 · arXiv:2402.13616
programmable gradient information, information bottleneck, reversible functions, gradient path planning (+4)
2,952
citations
#2

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

ICCV 2025
206
citations
#3

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Weiran Yao, Shelby Heinecke, Juan Carlos Niebles et al.

ICLR 2024
104
citations
#4

Theoretical guarantees on the best-of-n alignment policy

Ahmad Beirami, Alekh Agarwal, Jonathan Berant et al.

ICML 2025
89
citations
#5

GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time

Haoran Ye, Jiarui Wang, Helan Liang et al.

AAAI 2024 · arXiv:2312.08224
neural solvers, routing problems, travelling salesman problems, autoregressive neural heuristics (+4)
76
citations
#6

Offline Actor-Critic for Average Reward MDPs

William Powell, Jeongyeol Kwon, Qiaomin Xie et al.

NeurIPS 2025
offline policy optimization, average-reward mdps, pessimistic actor-critic, linear function approximation (+3)
73
citations
#7

Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin et al.

NeurIPS 2025
70
citations
#8

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie et al.

NeurIPS 2025 · arXiv:2503.22342
policy optimization, training acceleration, reasoning models, completion pruning (+2)
47
citations
#9

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

Audrey Huang, Wenhao Zhan, Tengyang Xie et al.

ICLR 2025
42
citations
#10

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

Jiangpeng He

CVPR 2024
39
citations
#11

Preference Optimization for Reasoning with Pseudo Feedback

Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.

ICLR 2025
33
citations
#12

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Han Zhang, Yu Lei, Lin Gui et al.

ICLR 2024
32
citations
#13

Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective

Sifan Wang, Ananyae Bhartari, Bowen Li et al.

NeurIPS 2025
32
citations
#14

Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity

Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury et al.

ICLR 2025
27
citations
#15

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Junkang Wu, Yuexiang Xie, Zhengyi Yang et al.

ICLR 2025
27
citations
#16

Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Dominik Grimm, Jonathan Pirnay

ICLR 2025
26
citations
#17

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024 · arXiv:2309.05915
decision transformer, offline policy optimization, advantage conditioning, dynamic programming (+3)
25
citations
#18

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Yinmin Zhang, Jie Liu, Chuming Li et al.

AAAI 2024 · arXiv:2312.07685
offline reinforcement learning, q-value estimation, online finetuning, offline-to-online rl (+3)
25
citations
#19

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

Shijie Wu, Yihang Zhu, Yunao Huang et al.

CVPR 2025 · arXiv:2412.03142
diffusion policy, robotic manipulation, affordance learning, contact point estimation (+4)
25
citations
#20

Efficient Online Reinforcement Learning for Diffusion Policy

Haitong Ma, Tianyi Chen, Kai Wang et al.

ICML 2025
24
citations
#21

The AdEMAMix Optimizer: Better, Faster, Older

Matteo Pagliardini, Pierre Ablin, David Grangier

ICLR 2025
23
citations
#22

Self-Consistency Preference Optimization

Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang et al.

ICML 2025
23
citations
#23

SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Teng Xiao, Yige Yuan, Zhengyu Chen et al.

ICLR 2025 · arXiv:2502.00883
preference optimization, language model alignment, hyperparameter-free training, inverse perplexity (+2)
23
citations
#24

3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Yuzi Yan, Yibo Miao, Jialian Li et al.

ICLR 2025
22
citations
#25

ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks

Qiang Liu, Mengyu Chu, Nils Thuerey

ICLR 2025
21
citations
#26

Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize

Sanket Shah, Bryan Wilder, Andrew Perrault et al.

AAAI 2024 · arXiv:2305.16830
predict-then-optimize, decision-making under uncertainty, task-specific loss functions, sample efficiency (+1)
20
citations
#27

GOAL: A Generalist Combinatorial Optimization Agent Learner

Darko Drakulić, Sofia Michel, Jean-Marc Andreoli

ICLR 2025
20
citations
#28

Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs

Kejun Tang, Jiayu Zhai, Xiaoliang Wan et al.

ICLR 2024
19
citations
#29

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Zhen Liu, Tim Xiao, Weiyang Liu et al.

ICLR 2025 · arXiv:2412.07775
diffusion model alignment, generative flow networks, reward finetuning, diversity preservation (+4)
19
citations
#30

Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence

Shengbo Wang, Ke Li

AAAI 2024 · arXiv:2312.03212
bayesian optimization, partial observability, constrained optimization, acquisition function design (+3)
19
citations
#31

Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding

Zhe Chen, Daniel Harabor, Jiaoyang Li et al.

AAAI 2024 · arXiv:2308.11234
multi-agent path finding, traffic flow optimization, collision-free path planning, congestion avoidance (+4)
18
citations
#32

B2Opt: Learning to Optimize Black-box Optimization with Little Budget

Xiaobin Li, Kai Wu, Xiaoyu Zhang et al.

AAAI 2025
18
citations
#33

Understanding Optimization in Deep Learning with Central Flows

Jeremy Cohen, Alex Damian, Ameet Talwalkar et al.

ICLR 2025
18
citations
#34

Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping

Zijie Pan, Jiachen Lu, Xiatian Zhu et al.

ICLR 2024
18
citations
#35

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

Hanyang Zhao, Genta Winata, Anirban Das et al.

ICLR 2025
17
citations
#36

u-$\mu$P: The Unit-Scaled Maximal Update Parametrization

Charles Blake, Constantin Eichenberg, Josef Dean et al.

ICLR 2025
17
citations
#37

No Preference Left Behind: Group Distributional Preference Optimization

Binwei Yao, Zefan Cai, Yun-Shiuan Chuang et al.

ICLR 2025 · arXiv:2412.20299
preference alignment, group distributional preferences, pluralistic alignment, belief-conditioned preferences (+3)
17
citations
#38

Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

Oussama Zekri, Nicolas Boulle

NeurIPS 2025
17
citations
#39

Aioli: A Unified Optimization Framework for Language Model Data Mixing

Mayee Chen, Michael Hu, Nicholas Lourie et al.

ICLR 2025
16
citations
#40

BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute

Dujian Ding, Ankur Mallick, Shaokun Zhang et al.

ICML 2025
16
citations
#41

ET-SEED: Efficient Trajectory-Level SE(3) Equivariant Diffusion Policy

Chenrui Tie, Yue Chen, Ruihai Wu et al.

ICLR 2025
15
citations
#42

Apollo-MILP: An Alternating Prediction-Correction Neural Solving Framework for Mixed-Integer Linear Programming

Haoyang Liu, Jie Wang, Zijie Geng et al.

ICLR 2025 · arXiv:2503.01129
mixed-integer linear programming, neural solving framework, trust-region search, problem reduction (+4)
15
citations
#43

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo, Lijie Xu, Jie Liu et al.

NeurIPS 2025
15
citations
#44

Multi-Objective Bayesian Optimization with Active Preference Learning

Ryota Ozaki, Kazuki Ishikawa, Youhei Kanzaki et al.

AAAI 2024 · arXiv:2311.13460
bayesian optimization, multi-objective optimization, preference learning, pareto front identification (+4)
14
citations
#45

Deep Distributed Optimization for Large-Scale Quadratic Programming

Augustinos Saravanos, Hunter Kuperman, Alex Oshin et al.

ICLR 2025
14
citations
#46

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

XiangCheng Zhang, Fang Kong, Baoxiang Wang et al.

ICLR 2025
14
citations
#47

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.

ICLR 2025 · arXiv:2410.18640
preference optimization, weak-to-strong generalization, model alignment, language model alignment (+2)
14
citations
#48

One Forward is Enough for Neural Network Training via Likelihood Ratio Method

Jinyang Jiang, Zeliang Zhang, Chenliang Xu et al.

ICLR 2024
14
citations
#49

Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization

Yue Zhang, Liqiang Jing, Vibhav Gogate

AAAI 2025
12
citations
#50

In Search of Adam’s Secret Sauce

Antonio Orvieto, Robert Gower

NeurIPS 2025
12
citations
#51

Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

Sara Klein, Simon Weissmann, Leif Döring

ICLR 2024
12
citations
#52

Pareto Deep Long-Tailed Recognition: A Conflict-Averse Solution

Zhipeng Zhou, Liu Liu, Peilin Zhao et al.

ICLR 2024
12
citations
#53

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Ziyi Wu, Anil Kag, Ivan Skorokhodov et al.

NeurIPS 2025
11
citations
#54

Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis

Jie Hao, Xiaochuan Gong, Mingrui Liu

ICLR 2024
11
citations
#55

SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning

Yuxin Deng, Jiayi Ma

AAAI 2024 · arXiv:2106.04434
local descriptor learning, gradient modulation, triplet loss, statistical characteristics (+3)
11
citations
#56

Learning to Pivot as a Smart Expert

Tianhao Liu, Shanwen Pu, Dongdong Ge et al.

AAAI 2024 · arXiv:2308.08171
linear programming, simplex method, pivot rules, interior point methods (+3)
11
citations
#57

QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning

Fang-Xiang Wu, Minghan Fu

AAAI 2024 · arXiv:2302.00252
learning rate adaptation, gradient descent optimization, hyperparameter-free training, convergence guarantee (+3)
11
citations
#58

Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

Zeman Li, Xinwei Zhang, Peilin Zhong et al.

ICLR 2025
11
citations
#59

PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

Qihan Huang, Weilong Dai, Jinlong Liu et al.

CVPR 2025 · arXiv:2412.03177
personalized image generation, direct preference optimization, patch-level optimization, finetuning-free generation (+3)
10
citations
#60

Cumulative Regret Analysis of the Piyavskii–Shubert Algorithm and Its Variants for Global Optimization

Kaan Gokcesu, Hakan Gökcesu

AAAI 2024 · arXiv:2108.10859
global optimization, cumulative regret analysis, lipschitz continuous functions, lipschitz smooth functions (+4)
10
citations
#61

Understanding and Improving Optimization in Predictive Coding Networks

Nicholas Alonso, Jeffrey Krichmar, Emre Neftci

AAAI 2024 · arXiv:2305.13562
predictive coding networks, inference learning algorithm, biological plausibility, optimization methods (+3)
10
citations
#62

$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Jin Zhou, Kaiwen Wang, Jonathan Chang et al.

NeurIPS 2025
10
citations
#63

Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models

Yingqing Guo, Yukang Yang, Hui Yuan et al.

NeurIPS 2025
10
citations
#64

Radiology Report Generation via Multi-objective Preference Optimization

Ting Xiao, Lei Shi, Peng Liu et al.

AAAI 2025
9
citations
#65

Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization

Alaleh Ahmadianshalchi, Syrine Belakaria, Janardhan Rao Doppa

AAAI 2024 · arXiv:2406.08799
multi-objective optimization, bayesian optimization, acquisition function selection, batch selection (+3)
9
citations
#66

Few for Many: Tchebycheff Set Scalarization for Many-Objective Optimization

Xi Lin, Yilu Liu, Xiaoyuan Zhang et al.

ICLR 2025
9
citations
#67

InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Yunhong Lu, Qichao Wang, Hengyuan Cao et al.

CVPR 2025
9
citations
#68

Efficient Alternating Minimization with Applications to Weighted Low Rank Approximation

Zhao Song, Mingquan Ye, Junze Yin et al.

ICLR 2025 · arXiv:2306.04169
weighted low rank approximation, alternating minimization, matrix completion, hadamard product (+2)
9
citations
#69

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Wenze Chen, Shiyu Huang, Yuan Chiang et al.

AAAI 2024 · arXiv:2207.05631
reinforcement learning, diverse strategy discovery, policy optimization, information-theoretic diversity (+3)
9
citations
#70

On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Bingrui Li, Wei Huang, Andi Han et al.

ICLR 2025
9
citations
#71

Transition Path Sampling with Improved Off-Policy Training of Diffusion Path Samplers

Kiyoung Seong, Seonghyun Park, Seonghwan Kim et al.

ICLR 2025 · arXiv:2405.19961
transition path sampling, diffusion path samplers, collective variables, molecular dynamics simulations (+4)
9
citations
#72

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Qining Zhang, Lei Ying

ICLR 2025
9
citations
#73

Improved Active Learning via Dependent Leverage Score Sampling

Atsushi Shimizu, Xiaoou Cheng, Christopher Musco et al.

ICLR 2024
9
citations
#74

Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization

Takuhiro Kaneko

CVPR 2024
9
citations
#75

Boost Your Human Image Generation Model via Direct Preference Optimization

Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee

CVPR 2025 · arXiv:2405.20216
human image generation, direct preference optimization, text-to-image synthesis, personalized image generation (+3)
8
citations
#76

Offline-to-Online Hyperparameter Transfer for Stochastic Bandits

Dravyansh Sharma, Arun Suggala

AAAI 2025
8
citations
#77

Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

Hongzhi Zang, Yulun Zhang, He Jiang et al.

AAAI 2025
8
citations
#78

Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments

Yun Qu, Cheems Wang, Yixiu Mao et al.

ICML 2025
8
citations
#79

Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Zhaoxin Wang, Handing Wang, Cong Tian et al.

ECCV 2024
8
citations
#80

Sharpness-Aware Minimization: General Analysis and Improved Rates

Dimitris Oikonomou, Nicolas Loizou

ICLR 2025
8
citations
#81

Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling

Jakob Hollenstein, Georg Martius, Justus Piater

AAAI 2024 · arXiv:2312.11091
proximal policy optimization, colored noise, action sampling, exploration strategies (+3)
8
citations
#82

Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations

Junpeng Fang, Gongduo Zhang, Qing Cui et al.

AAAI 2024
8
citations
#83

Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

Yifei Zhou, Ayush Sekhari, Yuda Song et al.

ICLR 2024
8
citations
#84

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Zongkai Liu, Qian Lin, Chao Yu et al.

AAAI 2025
8
citations
#85

Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization

Daniel Palenicek, Florian Vogt, Joe Watson et al.

NeurIPS 2025
8
citations
#86

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Gang Li, Ming Lin, Tomer Galanti et al.

NeurIPS 2025
8
citations
#87

Direct Alignment with Heterogeneous Preferences

Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar et al.

NeurIPS 2025 · arXiv:2502.16320
human preference alignment, heterogeneous preferences, direct alignment methods, reward function learning (+4)
8
citations
#88

Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

Seongho Son, William Bankes, Sayak Ray Chowdhury et al.

ICML 2025
8
citations
#89

Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Maria-Florina Balcan, Anh Nguyen, Dravyansh Sharma

NeurIPS 2025
8
citations
#90

Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Paria Rashidinejad, Yuandong Tian

ICLR 2025
8
citations
#91

Regret Analysis of Repeated Delegated Choice

Suho Shin, Keivan Rezaei, Mohammad Hajiaghayi et al.

AAAI 2024 · arXiv:2310.04884
repeated delegated choice, online learning variant, regret analysis, strategic agent behavior (+4)
7
citations
#92

Decision Tree Induction Through LLMs via Semantically-Aware Evolution

Tennison Liu, Nicolas Huynh, Mihaela van der Schaar

ICLR 2025
7
citations
#93

Identifying Policy Gradient Subspaces

Jan Schneider, Pierre Schumacher, Simon Guist et al.

ICLR 2024
7
citations
#94

Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning

Yang Xu, Washim Mondal, Vaneet Aggarwal

NeurIPS 2025
7
citations
#95

Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning

Chengzhengxu Li, Xiaoming Liu, Yichen Wang et al.

AAAI 2024 · arXiv:2308.07272
prompt-based learning, few-shot learning, discrete prompt optimization, policy gradient methods (+4)
7
citations
#96

Forward KL Regularized Preference Optimization for Aligning Diffusion Policies

Zhao Shan, Chenyou Fan, Shuang Qiu et al.

AAAI 2025
7
citations
#97

Learning Cross-hand Policies of High-DOF Reaching and Grasping

Qijin She, Shishun Zhang, Yunfan Ye et al.

ECCV 2024 · arXiv:2404.09150
cross-hand policy transfer, dexterous gripper control, robotic reaching and grasping, gripper-agnostic policy (+3)
7
citations
#98

Two-timescale Extragradient for Finding Local Minimax Points

Jiseok Chae, Kyuwon Kim, Donghwan Kim

ICLR 2024
7
citations
#99

EVOS: Efficient Implicit Neural Training via EVOlutionary Selector

Weixiang Zhang, Shuzhao Xie, Chengwei Ren et al.

CVPR 2025
6
citations
#100

Learning a Neural Solver for Parametric PDEs to Enhance Physics-Informed Methods

Lise Le Boudec, Emmanuel de Bézenac, Louis Serrano et al.

ICLR 2025
6
citations