MALT: Improving Reasoning with Multi-Agent LLM Training


Abstract

Large Language Models (LLMs) often produce answers with a single chain-of-thought, which restricts their ability to explore alternative reasoning paths or self-correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps using a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, where final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach allows each agent to specialize by learning from both correct and incorrect trajectories, ultimately improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT achieves relative improvements of 15.66%, 7.42%, and 9.40% over the same baseline LLM. It also generalizes to more challenging benchmarks, marking an early advance in multi-agent cooperative training.
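
The sketch below illustrates the data-generation loop the abstract describes: repeatedly sampling a generator, verifier, and refiner to build a multi-agent search tree, grading leaves against ground truth, and backing up values to each role. It is a minimal illustration, not the paper's implementation; the function names (generator, verifier, refiner, grade), the branching factor, and the mean-value backup rule are assumptions made for clarity.

```python
# Minimal sketch of a MALT-style search tree with value backup.
# All callables and the BRANCH constant are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Callable, List

BRANCH = 3  # samples drawn per agent at each level (illustrative choice)


@dataclass
class Node:
    text: str                                   # output produced at this step
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                          # reward backed up from graded leaves


def build_tree(question: str,
               generator: Callable[[str], str],
               verifier: Callable[[str, str], str],
               refiner: Callable[[str, str, str], str],
               grade: Callable[[str], float]) -> Node:
    """Sample generator -> verifier -> refiner, grade the final answers,
    and propagate mean values back up the tree (a simple value-iteration backup)."""
    root = Node(text=question)
    for _ in range(BRANCH):
        gen = Node(text=generator(question))
        for _ in range(BRANCH):
            ver = Node(text=verifier(question, gen.text))
            for _ in range(BRANCH):
                ref = Node(text=refiner(question, gen.text, ver.text))
                ref.value = grade(ref.text)     # e.g. 1.0 if the answer matches ground truth
                ver.children.append(ref)
            ver.value = sum(c.value for c in ver.children) / len(ver.children)
            gen.children.append(ver)
        gen.value = sum(c.value for c in gen.children) / len(gen.children)
        root.children.append(gen)
    root.value = sum(c.value for c in root.children) / len(root.children)
    return root
```

One plausible use of the backed-up values, consistent with the abstract's off-policy specialization, is to sort each role's sampled outputs into correct and incorrect trajectories and use them as role-conditioned post-training data.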