2025 "benchmark construction" Papers

10 papers found

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Cheng Yang, Chufan Shi, Yaxin Liu et al.

ICLR 2025 poster, arXiv:2406.09961
65 citations

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Chengyou Jia, Changliang Xia, Zhuohang Dang et al.

CVPR 2025 poster, arXiv:2411.17176
7 citations

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.

NeurIPS 2025 poster, arXiv:2506.18322
3 citations

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.

ICLR 2025 poster, arXiv:2410.12784
150 citations

Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez, Belen Alastruey, Christophe Ropers et al.

NeurIPS 2025 poster, arXiv:2409.12126
12 citations

MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang et al.

NeurIPS 2025 poster, arXiv:2509.25851

SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Yuzhou Nie, Zhun Wang, Yu Yang et al.

NeurIPS 2025 poster

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xeron Du, Yifan Yao, Kaijing Ma et al.

NeurIPS 2025 poster, arXiv:2502.14739
118 citations

SysBench: Can LLMs Follow System Message?

Yanzhao Qin, Tao Zhang, Tao Zhang et al.

ICLR 2025 poster
5 citations

WritingBench: A Comprehensive Benchmark for Generative Writing

Yuning Wu, Jiahao Mei, Ming Yan et al.

NeurIPS 2025 poster, arXiv:2503.05244
41 citations