2025 Poster "benchmark construction" Papers
15 papers found
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu et al.
CVPR 2025, poster, arXiv:2501.03225
21 citations
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
ICLR 2025, poster, arXiv:2406.09961
65 citations
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
Chengyou Jia, Changliang Xia, Zhuohang Dang et al.
CVPR 2025, poster, arXiv:2411.17176
7 citations
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Yiwei Yang, Chung Peng Lee, Shangbin Feng et al.
NeurIPS 2025, poster, arXiv:2506.18322
3 citations
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
ICLR 2025, poster, arXiv:2410.12784
150 citations
Linguini: A benchmark for language-agnostic linguistic reasoning
Eduardo Sánchez, Belen Alastruey, Christophe Ropers et al.
NeurIPS 2025, poster, arXiv:2409.12126
12 citations
MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu, Hao Fei, Yuhui Zhang et al.
NeurIPS 2025, poster, arXiv:2509.25851
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Cunxiang Wang, Ruoxi Ning, Boqi Pan et al.
ICLR 2025, poster, arXiv:2403.12766
23 citations
SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI
Yuzhou Nie, Zhun Wang, Yu Yang et al.
NeurIPS 2025, poster
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Xeron Du, Yifan Yao, Kaijing Ma et al.
NeurIPS 2025, poster, arXiv:2502.14739
118 citations
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
Jinyang Li, Xiaolong Li, Ge Qu et al.
NeurIPS 2025, poster, arXiv:2506.18951
8 citations
SysBench: Can LLMs Follow System Message?
Yanzhao Qin, Tao Zhang, Tao Zhang et al.
ICLR 2025, poster
5 citations
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
HONG LI, Nanxi Li, Yuanjie Chen et al.
ICLR 2025, poster, arXiv:2410.01417
3 citations
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Yuanxin Liu, Rui Zhu, Shuhuai Ren et al.
NeurIPS 2025, poster, arXiv:2503.09949
2 citations
WritingBench: A Comprehensive Benchmark for Generative Writing
Yuning Wu, Jiahao Mei, Ming Yan et al.
NeurIPS 2025, poster, arXiv:2503.05244
41 citations