Synthesizing Software Engineering Data in a Test-Driven Manner

ICML 2025 (#766 of 3340 papers), 0 citations

Abstract

We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images on GitHub.
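
The abstract does not say how the RDG is built or how the schedule is derived from it, so the following is only a minimal sketch of the general idea, not the SWE-Flow implementation. It assumes that function interactions can be observed by tracing a unit test with Python's `sys.settrace`, and that a development schedule can be approximated by ordering functions so that callees are implemented before their callers. The functions `normalize`, `tokenize`, `count_words`, `build_call_graph`, and `development_schedule` are hypothetical names chosen for illustration.

```python
import sys
from graphlib import TopologicalSorter  # Python 3.9+


# --- A toy "codebase" and its unit test (hypothetical example code) ---
def normalize(text):
    return text.strip().lower()


def tokenize(text):
    return normalize(text).split()


def count_words(text):
    return len(tokenize(text))


def test_count_words():
    assert count_words("  Hello World ") == 2


def build_call_graph(test_fn):
    """Run test_fn under a tracer and record caller -> {callees} edges.

    Assumption: functions are identified by bare name only, so same-named
    functions in different modules would be merged in this sketch.
    """
    graph = {}
    stack = []  # names of the Python functions currently executing

    def tracer(frame, event, arg):
        name = frame.f_code.co_name
        if event == "call":
            graph.setdefault(name, set())  # leaf functions become nodes too
            if stack and stack[-1] != name:  # ignore direct self-recursion
                graph[stack[-1]].add(name)
            stack.append(name)
        elif event == "return" and stack:
            stack.pop()
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return graph


def development_schedule(graph, test_name):
    """Order functions so that each one comes after everything it calls.

    TopologicalSorter treats the mapped values as predecessors, so callees
    are emitted before callers. Mutual recursion would raise CycleError.
    """
    order = [f for f in TopologicalSorter(graph).static_order() if f != test_name]
    return [
        {"step": i + 1, "implement": target, "already_available": order[:i]}
        for i, target in enumerate(order)
    ]


graph = build_call_graph(test_count_words)
steps = development_schedule(graph, "test_count_words")
for step in steps:
    print(step)
```

In this sketch each step pairs a target function with the functions already in place, loosely mirroring the "partial codebase plus required modification" structure the abstract describes. A real pipeline would also need to handle cycles, multiple tests, and module-qualified names.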

Citation History

Jan 28, 2026: 0 citations