Multi-Turn Code Generation Through Single-Step Rewards

0 citations

arXiv:2502.20380

#766 of 3340 papers in ICML 2025

Top Authors

Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander Rush, Wenting Zhao, Sanjiban Choudhury

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$CODE at utilizing the execution feedback.
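To make the generator/verifier interplay concrete, here is a minimal sketch of a multi-turn loop in which a verifier picks among candidates and execution feedback is carried into the next turn. It is an illustration under stated assumptions, not the authors' implementation: `run_tests`, `multi_turn_generate`, `toy_generate`, and `toy_score` are hypothetical names, the generator and verifier are hand-written stand-ins for learned models, and the iterative training described in the abstract is not shown.

```python
from typing import Callable, List, Tuple

# Earlier turns, stored as (code, execution feedback) pairs.
History = List[Tuple[str, str]]


def run_tests(code: str, tests: List[Tuple[int, int]]) -> Tuple[bool, str]:
    """Run candidate code that defines f(x) against (input, expected) pairs."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        for x, expected in tests:
            got = namespace["f"](x)
            if got != expected:
                return False, f"f({x}) returned {got}, expected {expected}"
    except Exception as err:
        return False, f"error: {err!r}"
    return True, "all tests passed"


def multi_turn_generate(
    generate: Callable[[History], List[str]],
    score: Callable[[History, str], float],
    tests: List[Tuple[int, int]],
    max_turns: int = 3,
) -> History:
    """Each turn: propose candidates, keep the verifier's favorite, execute it."""
    history: History = []
    for _ in range(max_turns):
        candidates = generate(history)
        best = max(candidates, key=lambda c: score(history, c))
        passed, feedback = run_tests(best, tests)
        history.append((best, feedback))
        if passed:
            break
    return history


def toy_generate(history: History) -> List[str]:
    # Stand-in generator (assumption): a real one would be a model
    # conditioned on the code and feedback accumulated in `history`.
    if not history:
        return ["def f(x): return x + 1", "def f(x): return x"]
    return ["def f(x): return x + 1", "def f(x): return x * 2"]


def toy_score(history: History, candidate: str) -> float:
    # Stand-in verifier (assumption): penalize code that was already tried.
    tried = {code for code, _ in history}
    return 0.0 if candidate in tried else 1.0


tests = [(1, 2), (3, 6)]  # target behavior: f(x) = 2 * x
history = multi_turn_generate(toy_generate, toy_score, tests)
for turn, (code, feedback) in enumerate(history):
    print(turn, code, "->", feedback)
```

In this toy run the first turn fails on the second test, and the second turn recovers a passing solution, which loosely mirrors the one-step recoverability idea: from any intermediate code state, a correct program is reachable in a single turn.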
