Agent-as-a-Judge: Evaluate Agents with Agents

ICML 2025 · 13 authors

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes---ignoring the step-by-step nature of the thinking done by agentic systems---or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, including a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems. To support this, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge
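
To make the idea concrete, below is a minimal sketch of how an agentic judge might check a task's hierarchical requirements against a developer agent's workspace and return intermediate, per-requirement verdicts rather than a single end-to-end score. This is not the authors' implementation: the `Requirement` class, `gather_evidence`, `judge_requirements`, and the `ask_llm` callable are all hypothetical names introduced here for illustration.

```python
# Illustrative Agent-as-a-Judge loop (hypothetical sketch, not the paper's code).
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Requirement:
    """One node in a task's hierarchical requirement list."""
    rid: str
    text: str
    depends_on: List[str] = field(default_factory=list)


def gather_evidence(workspace: Path, max_files: int = 20, max_chars: int = 2000) -> str:
    """Collect lightweight evidence (file names plus truncated contents)
    from the developer agent's workspace for the judge to inspect."""
    chunks = []
    for path in sorted(workspace.rglob("*.py"))[:max_files]:
        chunks.append(f"# {path.name}\n{path.read_text(errors='ignore')[:max_chars]}")
    return "\n\n".join(chunks)


def judge_requirements(
    requirements: List[Requirement],
    workspace: Path,
    ask_llm: Callable[[str], str],
) -> Dict[str, bool]:
    """Judge each requirement in dependency order, producing intermediate
    verdicts instead of a single final pass/fail."""
    verdicts: Dict[str, bool] = {}
    evidence = gather_evidence(workspace)
    for req in requirements:
        # If a prerequisite requirement failed, mark this one unmet without
        # querying the model again.
        if any(not verdicts.get(dep, False) for dep in req.depends_on):
            verdicts[req.rid] = False
            continue
        prompt = (
            "You are judging whether a coding agent satisfied a requirement.\n"
            f"Requirement: {req.text}\n\n"
            f"Workspace evidence:\n{evidence}\n\n"
            "Answer strictly YES or NO."
        )
        verdicts[req.rid] = ask_llm(prompt).strip().upper().startswith("YES")
    return verdicts
```

In this sketch, dependency-aware skipping and per-requirement prompting stand in for the richer agentic features (workspace inspection, trajectory reading, targeted retrieval) that an actual agentic judge would use; any LLM client can be passed in as `ask_llm`.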
