"policy optimization" Papers
35 papers found
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
Zhihang Lin, Mingbao Lin, Yuan Xie et al.
EconGym: A Scalable AI Testbed with Diverse Economic Tasks
Qirui Mi, Qipeng Yang, Zijun Fan et al.
Non-convex entropic mean-field optimization via Best Response flow
Razvan-Andrei Lascu, Mateusz Majka
On the Convergence of Projected Policy Gradient for Any Constant Step Sizes
Jiacai Liu, Wenye Li, Dachao Lin et al.
Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding
Hanyin Wang, Zhenbang Wu, Gururaj Kolar et al.
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
Ruilin Luo, Zhuofan Zheng, Lei Wang et al.
Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning
Yen-Ju Chen, Nai-Chieh Huang, Ching-pei Lee et al.
Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
Yuancheng Xu, Chenghao Deng, Yanchao Sun et al.
Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations
Feng Gao, Liangzhi Shi, Shenao Zhang et al.
Bayesian Design Principles for Offline-to-Online Reinforcement Learning
Hao Hu, Yiqin Yang, Jianing Ye et al.
Constrained Reinforcement Learning Under Model Mismatch
Zhongchang Sun, Sihong He, Fei Miao et al.
Dealing With Unbounded Gradients in Stochastic Saddle-point Optimization
Gergely Neu, Nneka Okolo
Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration
Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim et al.
DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization
Wenze Chen, Shiyu Huang, Yuan Chiang et al.
EvoRainbow: Combining Improvements in Evolutionary Reinforcement Learning for Policy Search
Pengyi Li, Yan Zheng, Hongyao Tang et al.
Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
Yihan Du, Anna Winnicki, Gal Dalal et al.
Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation
JoonHo Lee, Jae Oh Woo, Juree Seok et al.
Information-Directed Pessimism for Offline Reinforcement Learning
Alec Koppel, Sujay Bhatt, Jiacheng Guo et al.
Iterative Regularized Policy Optimization with Imperfect Demonstrations
Xudong Gong, Feng Dawei, Kele Xu et al.
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Songyang Gao, Qiming Ge, Wei Shen et al.
Model-based Reinforcement Learning for Confounded POMDPs
Mao Hong, Zhengling Qi, Yanxun Xu
Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback
Asaf Cassel, Haipeng Luo, Aviv Rosenberg et al.
Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
Yuanzhao Zhai, Yiying Li, Zijian Gao et al.
Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes
David Klaska, Antonin Kucera, Vojtěch Kůr et al.
Position: Automatic Environment Shaping is the Next Frontier in RL
Younghyo Park, Gabriel Margolis, Pulkit Agrawal
Probabilistic Constrained Reinforcement Learning with Formal Interpretability
Yanran Wang, Qiuchen Qian, David Boyle
Provably Efficient Long-Horizon Exploration in Monte Carlo Tree Search through State Occupancy Regularization
Liam Schramm, Abdeslam Boularias
Provably Robust DPO: Aligning Language Models with Noisy Feedback
Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan
Rate-Optimal Policy Optimization for Linear Markov Decision Processes
Uri Sherman, Alon Cohen, Tomer Koren et al.
Reflective Policy Optimization
Yaozhong Gan, Yan Renye, Zhe Wu et al.
ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
Andrew Jesson, Christopher Lu, Gunshi Gupta et al.
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban et al.
Risk-Sensitive Policy Optimization via Predictive CVaR Policy Gradient
Ju-Hyun Kim, Seungki Min
Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
Juntao Dai, Yaodong Yang, Qian Zheng et al.