"benchmark construction" Papers
9 papers found
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
ICLR 2025 poster, arXiv:2406.09961
65 citations
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
ICLR 2025 poster, arXiv:2410.12784
150 citations
Linguini: A benchmark for language-agnostic linguistic reasoning
Eduardo Sánchez, Belen Alastruey, Christophe Ropers et al.
NeurIPS 2025 poster, arXiv:2409.12126
12 citations
MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu, Hao Fei, Yuhui Zhang et al.
NeurIPS 2025 poster, arXiv:2509.25851
SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI
Yuzhou Nie, Zhun Wang, Yu Yang et al.
NeurIPS 2025 poster
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Xeron Du, Yifan Yao, Kaijing Ma et al.
NeurIPS 2025 poster, arXiv:2502.14739
118 citations
WritingBench: A Comprehensive Benchmark for Generative Writing
Yuning Wu, Jiahao Mei, Ming Yan et al.
NeurIPS 2025 poster, arXiv:2503.05244
41 citations
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Yu Du, Fangyun Wei, Hongyang Zhang
ICML 2024 poster
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
Fangyun Wei, Xi Chen, Lin Luo
ICML 2024 poster