NeurIPS 2025 "benchmark evaluation" Papers
12 papers found
AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
Edan Toledo, Karen Hambardzumyan, Martin Josifoski et al.
NeurIPS 2025 | spotlight | arXiv:2507.02554
15 citations
A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks
Mucong Ding, Bang An, Tahseen Rabbani et al.
NeurIPS 2025 | poster
C-SEO Bench: Does Conversational SEO Work?
Haritz Puerto, Martin Gubri, Tommaso Green et al.
NeurIPS 2025 | poster | arXiv:2506.11097
2 citations
DGCBench: A Deep Graph Clustering Benchmark
Benyu Wu, Yue Liu, Qiaoyu Tan et al.
NeurIPS 2025 | poster
Is Artificial Intelligence Generated Image Detection a Solved Problem?
Ziqiang Li, Jiazhen Yan, Ziwen He et al.
NeurIPS 2025 | poster | arXiv:2505.12335
15 citations
Massive Sound Embedding Benchmark (MSEB)
Georg Heigold, Ehsan Variani, Tom Bagby et al.
NeurIPS 2025 | poster
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.
NeurIPS 2025 | poster | arXiv:2511.07250
2 citations
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps
Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.
NeurIPS 2025 | poster | arXiv:2509.19282
1 citation
TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
Jiacheng Xie, Yang Yu, Ziyang Zhang et al.
NeurIPS 2025 | poster | arXiv:2505.24063
2 citations
This Time is Different: An Observability Perspective on Time Series Foundation Models
Ben Cohen, Emaad Khwaja, Youssef Doubli et al.
NeurIPS 2025 | poster | arXiv:2505.14766
11 citations
THUNDER: Tile-level Histopathology image UNDERstanding benchmark
Pierre Marza, Leo Fillioux, Sofiène Boutaj et al.
NeurIPS 2025 | spotlight | arXiv:2507.07860
3 citations
WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
Eun Chang, Zhuangqun Huang, Yiwei Liao et al.
NeurIPS 2025 | poster | arXiv:2511.22154