Policy Optimization
Policy-gradient methods and related optimization techniques, spanning reinforcement learning, preference alignment, and classical optimization.
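Many of the papers below extend the vanilla policy gradient. As a reference point, here is a minimal REINFORCE sketch: it optimizes E[sum_t log pi(a_t|s_t) * G_t] by gradient ascent. It assumes PyTorch and Gymnasium are installed; CartPole-v1 is an arbitrary example environment, and the code illustrates the textbook estimator, not any specific paper's method.

```python
# Minimal REINFORCE sketch: grad J(theta) = E[ sum_t grad log pi(a_t|s_t) * G_t ].
# Assumes PyTorch and Gymnasium; CartPole-v1 is an illustrative choice.
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns G_t, accumulated backwards over the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)
    # Standardizing returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Ascent on expected return = descent on the negated surrogate loss.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```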
Top Papers
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
Weiran Yao, Shelby Heinecke, Juan Carlos Niebles et al.
Theoretical guarantees on the best-of-n alignment policy
Ahmad Beirami, Alekh Agarwal, Jonathan Berant et al.
GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time
Haoran Ye, Jiarui Wang, Helan Liang et al.
Offline Actor-Critic for Average Reward MDPs
William Powell, Jeongyeol Kwon, Qiaomin Xie et al.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin et al.
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
Zhihang Lin, Mingbao Lin, Yuan Xie et al.
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Audrey Huang, Wenhao Zhan, Tengyang Xie et al.
Gradient Reweighting: Towards Imbalanced Class-Incremental Learning
Jiangpeng He
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.
Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective
Sifan Wang, Ananyae Bhartari, Bowen Li et al.
CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Han Zhang, Yu Lei, Lin Gui et al.
Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity
Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury et al.
Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Junkang Wu, Yuexiang Xie, Zhengyi Yang et al.
Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement
Dominik Grimm, Jonathan Pirnay
AffordDP: Generalizable Diffusion Policy with Transferable Affordance
Shijie Wu, Yihang Zhu, Yunao Huang et al.
ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning
Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.
A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning
Yinmin Zhang, Jie Liu, Chuming Li et al.
Efficient Online Reinforcement Learning for Diffusion Policy
Haitong Ma, Tianyi Chen, Kai Wang et al.
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao, Yige Yuan, Zhengyu Chen et al.
Self-Consistency Preference Optimization
Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang et al.
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier
3D-Properties: Identifying Challenges in DPO and Charting a Path Forward
Yuzi Yan, Yibo Miao, Jialian Li et al.
ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks
Qiang Liu, Mengyu Chu, Nils Thuerey
GOAL: A Generalist Combinatorial Optimization Agent Learner
Darko Drakulić, Sofia Michel, Jean-Marc Andreoli
Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize
Sanket Shah, Bryan Wilder, Andrew Perrault et al.
Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence
Shengbo Wang, Ke Li
Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs
Kejun Tang, Jiayu Zhai, Xiaoliang Wan et al.
B2Opt: Learning to Optimize Black-box Optimization with Little Budget
Xiaobin Li, Kai Wu, Xiaoyu Zhang et al.
Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding
Zhe Chen, Daniel Harabor, Jiaoyang Li et al.
Understanding Optimization in Deep Learning with Central Flows
Jeremy Cohen, Alex Damian, Ameet Talwalkar et al.
Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping
Zijie Pan, Jiachen Lu, Xiatian Zhu et al.
No Preference Left Behind: Group Distributional Preference Optimization
Binwei Yao, Zefan Cai, Yun-Shiuan Chuang et al.
u-$\mu$P: The Unit-Scaled Maximal Update Parametrization
Charles Blake, Constantin Eichenberg, Josef Dean et al.
RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization
Hanyang Zhao, Genta Winata, Anirban Das et al.
Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
Oussama Zekri, Nicolas Boulle
Aioli: A Unified Optimization Framework for Language Model Data Mixing
Mayee Chen, Michael Hu, Nicholas Lourie et al.
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Dujian Ding, Ankur Mallick, Shaokun Zhang et al.
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Yiran Guo, Lijie Xu, Jie Liu et al.
Apollo-MILP: An Alternating Prediction-Correction Neural Solving Framework for Mixed-Integer Linear Programming
Haoyang Liu, Jie Wang, Zijie Geng et al.
ET-SEED: Efficient Trajectory-Level SE(3) Equivariant Diffusion Policy
Chenrui Tie, Yue Chen, Ruihai Wu et al.
One Forward is Enough for Neural Network Training via Likelihood Ratio Method
Jinyang Jiang, Zeliang Zhang, Chenliang Xu et al.
Multi-Objective Bayesian Optimization with Active Preference Learning
Ryota Ozaki, Kazuki Ishikawa, Youhei Kanzaki et al.
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.
Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
XiangCheng Zhang, Fang Kong, Baoxiang Wang et al.
Deep Distributed Optimization for Large-Scale Quadratic Programming
Augustinos Saravanos, Hunter Kuperman, Alex Oshin et al.
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Yue Zhang, Liqiang Jing, Vibhav Gogate
Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods
Sara Klein, Simon Weissmann, Leif Döring
Pareto Deep Long-Tailed Recognition: A Conflict-Averse Solution
Zhipeng Zhou, Liu Liu, Peilin Zhao et al.
In Search of Adam’s Secret Sauce
Antonio Orvieto, Robert Gower
Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis
Jie Hao, Xiaochuan Gong, Mingrui Liu
SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning
Yuxin Deng, Jiayi Ma
Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models
Zeman Li, Xinwei Zhang, Peilin Zhong et al.
Learning to Pivot as a Smart Expert
Tianhao Liu, Shanwen Pu, Dongdong Ge et al.
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
Ziyi Wu, Anil Kag, Ivan Skorokhodov et al.
QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning
Fang-Xiang Wu, Minghan Fu
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
Qihan Huang, Weilong Dai, Jinlong Liu et al.
Cumulative Regret Analysis of the Piyavskii–Shubert Algorithm and Its Variants for Global Optimization
Kaan Gokcesu, Hakan Gokcesu
Understanding and Improving Optimization in Predictive Coding Networks
Nicholas Alonso, Jeffrey Krichmar, Emre Neftci
Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models
Yingqing Guo, Yukang Yang, Hui Yuan et al.
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Jin Zhou, Kaiwen Wang, Jonathan Chang et al.
Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization
Takuhiro Kaneko
Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization
Alaleh Ahmadianshalchi, Syrine Belakaria, Janardhan Rao Doppa
Radiology Report Generation via Multi-objective Preference Optimization
Ting Xiao, Lei Shi, Peng Liu et al.
Few for Many: Tchebycheff Set Scalarization for Many-Objective Optimization
Xi Lin, Yilu Liu, Xiaoyuan Zhang et al.
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Yunhong Lu, Qichao Wang, Hengyuan Cao et al.
DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization
Wenze Chen, Shiyu Huang, Yuan Chiang et al.
On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent
Bingrui Li, Wei Huang, Andi Han et al.
Improved Active Learning via Dependent Leverage Score Sampling
Atsushi Shimizu, Xiaoou Cheng, Christopher Musco et al.
Transition Path Sampling with Improved Off-Policy Training of Diffusion Path Samplers
Kiyoung Seong, Seonghyun Park, Seonghwan Kim et al.
Efficient Alternating Minimization with Applications to Weighted Low Rank Approximation
Zhao Song, Mingquan Ye, Junze Yin et al.
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying
Boost Your Human Image Generation Model via Direct Preference Optimization
Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee
Offline-to-Online Hyperparameter Transfer for Stochastic Bandits
Dravyansh Sharma, Arun Suggala
Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations
Junpeng Fang, Gongduo Zhang, Qing Cui et al.
Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective
Zhaoxin Wang, Handing Wang, Cong Tian et al.
Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding
Hongzhi Zang, Yulun Zhang, He Jiang et al.
Sharpness-Aware Minimization: General Analysis and Improved Rates
Dimitris Oikonomou, Nicolas Loizou
Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments
Yun Qu, Cheems Wang, Yixiu Mao et al.
Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees
Yifei Zhou, Ayush Sekhari, Yuda Song et al.
Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
Daniel Palenicek, Florian Vogt, Joe Watson et al.
Direct Alignment with Heterogeneous Preferences
Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar et al.
Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization
Zongkai Liu, Qian Lin, Chao Yu et al.
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Gang Li, Ming Lin, Tomer Galanti et al.
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son, William Bankes, Sayak Ray Chowdhury et al.
Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function
Maria-Florina Balcan, Anh Nguyen, Dravyansh Sharma
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
Paria Rashidinejad, Yuandong Tian
Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling
Jakob Hollenstein, Georg Martius, Justus Piater
Decision Tree Induction Through LLMs via Semantically-Aware Evolution
Tennison Liu, Nicolas Huynh, Mihaela van der Schaar
Regret Analysis of Repeated Delegated Choice
Suho Shin, Keivan Rezaei, Mohammad Hajiaghayi et al.
Identifying Policy Gradient Subspaces
Jan Schneider, Pierre Schumacher, Simon Guist et al.
Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning
Yang Xu, Washim Mondal, Vaneet Aggarwal
Two-timescale Extragradient for Finding Local Minimax Points
Jiseok Chae, Kyuwon Kim, Donghwan Kim
Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning
Chengzhengxu Li, Xiaoming Liu, Yichen Wang et al.
Forward KL Regularized Preference Optimization for Aligning Diffusion Policies
Zhao Shan, Chenyou Fan, Shuang Qiu et al.
Learning Cross-hand Policies of High-DOF Reaching and Grasping
Qijin She, Shishun Zhang, Yunfan Ye et al.
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
Weixiang Zhang, Shuzhao Xie, Chengwei Ren et al.
Efficient Distributed Optimization under Heavy-Tailed Noise
Su Hyeong Lee, Manzil Zaheer, Tian Li
Revisiting a Design Choice in Gradient Temporal Difference Learning
Xiaochi Qian, Shangtong Zhang