"kl-regularized rl" Papers
2 papers found
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Jin Zhou, Kaiwen Wang, Jonathan Chang et al.
NeurIPS 2025, poster, arXiv:2502.20548
10 citations
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao, Chenlu Ye, Quanquan Gu et al.
NeurIPS 2025, poster, arXiv:2411.04625
14 citations