Benchmarking Agentic Workflow Generation

19 citations

arXiv:2410.07869

Citations rank: #367 of 3827 papers in ICLR 2025

Authors

Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Topics

workflow generation, agentic planning, graph workflow structures, subgraph matching, LLM agent capabilities, sequence planning, graph planning, reasoning tasks

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at https://github.com/zjunlp/WorfBench.
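To make the idea of subsequence matching concrete, the following is a minimal sketch of how a predicted linear workflow might be scored against a gold one. It is not the WorfEval implementation: the function names `lcs_length` and `sequence_f1` are hypothetical, steps are compared by exact string equality (an assumption made here for simplicity), and the graph (subgraph matching) side of the protocol is not covered.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Length of the longest common subsequence of two step lists."""
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:  # assumption: exact match stands in for step equivalence
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def sequence_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """F1 over steps that appear in both workflows in the same relative order."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # share of predicted steps that are matched
    recall = matched / len(gold)     # share of gold steps that are recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical workflows: the prediction swaps two middle steps.
gold_steps = ["search", "filter", "rank", "answer"]
pred_steps = ["search", "rank", "filter", "answer"]
score = sequence_f1(pred_steps, gold_steps)
```

In this toy case three of the four steps can be aligned in order, so precision and recall are both 3/4. An order-sensitive score of this kind penalizes a workflow that contains the right steps in the wrong sequence, which a plain set-overlap metric would not.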
