"benchmark evaluation" Papers

32 papers

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al.

NeurIPS 2025 (spotlight) · arXiv:2507.02554 · 15 citations

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 (poster) · arXiv:2410.18325 · 17 citations

Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

Yifan Feng, Chengwu Yang, Xingliang Hou et al.

ICLR 2025 (poster) · arXiv:2410.10083 · 10 citations

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Zhibo Yang, Jun Tang, Zhaohai Li et al.

ICCV 2025 (poster) · arXiv:2412.02210 · 42 citations

C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto, Martin Gubri, Tommaso Green et al.

NeurIPS 2025 (poster) · arXiv:2506.11097 · 2 citations

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal et al.

ICLR 2025 (poster) · arXiv:2407.01725 · 36 citations

HELMET: How to Evaluate Long-context Models Effectively and Thoroughly

Howard Yen, Tianyu Gao, Minmin Hou et al.

ICLR 2025 (poster) · 23 citations

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.

ICCV 2025 (poster) · arXiv:2504.18406

Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li, Jiazhen Yan, Ziwen He et al.

NeurIPS 2025 (poster) · arXiv:2505.12335 · 15 citations

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Yuhao Wu, Ming Shan Hee, Zhiqiang Hu et al.

ICLR 2025 (poster) · arXiv:2409.02076 · 34 citations

Massive Sound Embedding Benchmark (MSEB)

Georg Heigold, Ehsan Variani, Tom Bagby et al.

NeurIPS 2025 (poster)

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NeurIPS 2025 (poster) · arXiv:2511.07250 · 2 citations

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025 (poster) · arXiv:2410.20092 · 74 citations

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Saihui Hou, Panjian Huang, Zengbin Wang et al.

ICCV 2025 (poster) · arXiv:2410.00204 · 2 citations

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.

NeurIPS 2025 (poster) · arXiv:2509.19282 · 1 citation

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Ziyang Zhang et al.

NeurIPS 2025 (poster) · arXiv:2505.24063 · 2 citations

This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli et al.

NeurIPS 2025 (poster) · arXiv:2505.14766 · 11 citations

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza, Leo Fillioux, Sofiène Boutaj et al.

NeurIPS 2025 (spotlight) · arXiv:2507.07860 · 3 citations

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NeurIPS 2025 (poster) · arXiv:2511.22154

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark

Fangjun Li, David C. Hogg, Anthony G. Cohn

AAAI 2024 (paper) · arXiv:2401.03991 · 51 citations

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han et al.

AAAI 2024 (paper) · arXiv:2309.01431 · 458 citations

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Denis Blessing, Xiaogang Jia, Johannes Esslinger et al.

ICML 2024 (poster)

CurBench: Curriculum Learning Benchmark

Yuwei Zhou, Zirui Pan, Xin Wang et al.

ICML 2024 (poster)

EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Shengjie Wang, Shaohuai Liu, Weirui Ye et al.

ICML 2024 (spotlight)

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Mingrui Wu, Jiayi Ji, Oucheng Huang et al.

ICML 2024 (poster)

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei et al.

ICML 2024 (poster)

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang et al.

ICML 2024 (poster)

Position: Towards Implicit Prompt For Text-To-Image Models

Yue Yang, Yuqi Lin, Hong Liu et al.

ICML 2024 (poster)

Premise Order Matters in Reasoning with Large Language Models

Xinyun Chen, Ryan Chi, Xuezhi Wang et al.

ICML 2024 (poster)

RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting

Lei Shu, Liangchen Luo, Jayakumar Hoskere et al.

AAAI 2024 (paper) · arXiv:2305.15685 · 76 citations

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu et al.

ICML 2024 (poster)

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang, Jiangjie Chen et al.

ICML 2024 (spotlight)