NeurIPS 2025 "policy optimization" Papers

12 papers found

A Differential and Pointwise Control Approach to Reinforcement Learning

Minh Nguyen, Chandrajit Bajaj

NeurIPS 2025 · poster · arXiv:2404.15617 · 1 citation

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie et al.

NeurIPS 2025 · poster · arXiv:2503.22342 · 47 citations

EconGym: A Scalable AI Testbed with Diverse Economic Tasks

Qirui Mi, Qipeng Yang, Zijun Fan et al.

NeurIPS 2025 · poster · arXiv:2506.12110 · 4 citations

EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution

Zhebei Shen, Qifan Yu, Juncheng Li et al.

NeurIPS 2025 · poster

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao, Lin Song, Yukang Chen et al.

NeurIPS 2025 · poster · arXiv:2505.13031 · 18 citations

Non-convex entropic mean-field optimization via Best Response flow

Razvan-Andrei Lascu, Mateusz Majka

NeurIPS 2025 · poster · arXiv:2505.22760 · 1 citation

On the Convergence of Projected Policy Gradient for Any Constant Step Sizes

Jiacai Liu, Wenye Li, Dachao Lin et al.

NeurIPS 2025 · poster · arXiv:2311.01104 · 4 citations

Progress Reward Model for Reinforcement Learning via Large Language Models

Xiuhui Zhang, Ning Gao, Xingyu Jiang et al.

NeurIPS 2025 · poster

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NeurIPS 2025 · poster · arXiv:2506.01300 · 28 citations

Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

Hanyin Wang, Zhenbang Wu, Gururaj Kolar et al.

NeurIPS 2025 · spotlight · arXiv:2505.21908 · 3 citations

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu et al.

NeurIPS 2025 · poster · arXiv:2411.04625 · 14 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo, Zhuofan Zheng, Lei Wang et al.

NeurIPS 2025 · poster · arXiv:2501.04686 · 29 citations