CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Citations: 1
Rank: #940 of 5858 papers in NeurIPS 2025
Authors: 5
Data Points: 4

Abstract

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically constructs datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This enables continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across 214 repositories, spanning seven languages. Evaluating state-of-the-art models reveals a substantial gap: while models achieve 70-83% accuracy on Stack Overflow-style questions, they solve only 7.22-16.49% of CAB issues from post-training-cutoff repositories. These results highlight a fundamental challenge: current LLMs struggle to provide assistance in realistic, project-specific contexts despite strong performance on traditional Q&A benchmarks. CAB provides a scalable, reproducible framework for advancing research in multi-turn, codebase-grounded programming agents. The benchmark and pipeline are fully automated and publicly available at https://github.com/amazon-science/CodeAssistBench/.
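
The abstract outlines an automated construction pipeline: filter question-tagged GitHub issues, extract runnable context, build an executable container, and verify the environment. The sketch below illustrates that flow in Python under loose assumptions; the helper names (looks_like_question, build_and_verify), the label-based filter standing in for the LLM-driven noise filter, and the git/Docker commands are illustrative stand-ins, not CAB's actual implementation (see the repository linked above for the real pipeline).

    """Minimal, hypothetical sketch of a CAB-style dataset construction step."""
    import subprocess
    import tempfile
    from pathlib import Path


    def looks_like_question(issue: dict) -> bool:
        # Placeholder for the paper's LLM-driven noise filter: here we simply
        # keep issues that carry an explicit "question" label.
        labels = {label.lower() for label in issue.get("labels", [])}
        return "question" in labels


    def build_and_verify(repo_url: str, image_tag: str, verify_cmd: list[str]) -> bool:
        """Clone the repository, build a container image, and run a smoke test."""
        with tempfile.TemporaryDirectory() as workdir:
            src = Path(workdir) / "repo"
            subprocess.run(["git", "clone", "--depth", "1", repo_url, str(src)], check=True)
            subprocess.run(["docker", "build", "-t", image_tag, str(src)], check=True)
            # Environment-correctness check: the image must run a verification
            # command (e.g. the project's test suite) and exit successfully.
            result = subprocess.run(["docker", "run", "--rm", image_tag, *verify_cmd])
            return result.returncode == 0


    if __name__ == "__main__":
        issue = {"title": "How do I configure the cache backend?", "labels": ["question"]}
        if looks_like_question(issue):
            ok = build_and_verify(
                "https://github.com/amazon-science/CodeAssistBench",  # stand-in repo URL
                "cab-example:latest",
                ["python", "-c", "print('environment ok')"],
            )
            print("issue retained" if ok else "issue dropped")

In this sketch an issue is kept only if its repository yields a working container, mirroring the paper's requirement that benchmark items come with verified, executable project environments.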

Citation History

Date           Citations
Jan 26, 2026   0
Jan 27, 2026   0
Jan 27, 2026   0
Feb 2, 2026    1 (+1)