NeurIPS "benchmark evaluation" Papers

13 papers found

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al.

NeurIPS 2025 (spotlight), arXiv:2507.02554, 15 citations

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

Mucong Ding, Bang An, Tahseen Rabbani et al.

NeurIPS 2025 (poster)

C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto, Martin Gubri, Tommaso Green et al.

NeurIPS 2025 (poster), arXiv:2506.11097, 2 citations

DGCBench: A Deep Graph Clustering Benchmark

Benyu Wu, Yue Liu, Qiaoyu Tan et al.

NeurIPS 2025 (poster)

Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li, Jiazhen Yan, Ziwen He et al.

NeurIPS 2025 (poster), arXiv:2505.12335, 15 citations

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Rui Li, Zixuan Hu, Wenxi Qu et al.

NeurIPS 2025 (poster), arXiv:2505.22634, 2 citations

Massive Sound Embedding Benchmark (MSEB)

Georg Heigold, Ehsan Variani, Tom Bagby et al.

NeurIPS 2025 (poster)

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NeurIPS 2025 (poster), arXiv:2511.07250, 2 citations

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.

NeurIPS 2025 (poster), arXiv:2509.19282, 1 citation

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Ziyang Zhang et al.

NeurIPS 2025 (poster), arXiv:2505.24063, 2 citations

This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli et al.

NeurIPS 2025 (poster), arXiv:2505.14766, 11 citations

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza, Leo Fillioux, Sofiène Boutaj et al.

NeurIPS 2025 (spotlight), arXiv:2507.07860, 3 citations

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NeurIPS 2025 (poster), arXiv:2511.22154