2025 Poster "benchmark evaluation" Papers

29 papers found

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

Mucong Ding, Bang An, Tahseen Rabbani et al.

NeurIPS 2025 poster

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 poster · arXiv:2410.18325 · 17 citations

Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

Yifan Feng, Chengwu Yang, Xingliang Hou et al.

ICLR 2025 poster · arXiv:2410.10083 · 10 citations

BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Evan Antoniuk, Shehtab Zaman, Tal Ben-Nun et al.

NeurIPS 2025 poster · arXiv:2505.01912 · 6 citations

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Zhibo Yang, Jun Tang, Zhaohai Li et al.

ICCV 2025 poster · arXiv:2412.02210 · 42 citations

C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto, Martin Gubri, Tommaso Green et al.

NeurIPS 2025 poster · arXiv:2506.11097 · 2 citations

DGCBench: A Deep Graph Clustering Benchmark

Benyu Wu, Yue Liu, Qiaoyu Tan et al.

NeurIPS 2025 poster

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal et al.

ICLR 2025 poster · arXiv:2407.01725 · 36 citations

HELMET: How to Evaluate Long-context Models Effectively and Thoroughly

Howard Yen, Tianyu Gao, Minmin Hou et al.

ICLR 2025 poster · 23 citations

HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.

ICCV 2025 poster · arXiv:2504.18406

IDEA-Bench: How Far are Generative Models from Professional Designing?

Chen Liang, Lianghua Huang, Jingwu Fang et al.

CVPR 2025 poster · arXiv:2412.11767 · 4 citations

Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li, Jiazhen Yan, Ziwen He et al.

NeurIPS 2025 poster · arXiv:2505.12335 · 15 citations

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Rui Li, Zixuan Hu, Wenxi Qu et al.

NeurIPS 2025 poster · arXiv:2505.22634 · 2 citations

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Yuhao Wu, Ming Shan Hee, Zhiqiang Hu et al.

ICLR 2025 poster · arXiv:2409.02076 · 34 citations

Massive Sound Embedding Benchmark (MSEB)

Georg Heigold, Ehsan Variani, Tom Bagby et al.

NeurIPS 2025 poster

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu, Ziyi He, Haoran Hu et al.

NeurIPS 2025 poster · arXiv:2505.12371 · 13 citations

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.

NeurIPS 2025 poster · arXiv:2511.07250 · 2 citations

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025 poster · arXiv:2410.20092 · 74 citations

OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

Saihui Hou, Panjian Huang, Zengbin Wang et al.

ICCV 2025 poster · arXiv:2410.00204 · 2 citations

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.

NeurIPS 2025 poster · arXiv:2509.19282 · 1 citation

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Atharva Gundawar, Som Sagar, Ransalu Senanayake

NeurIPS 2025 poster · arXiv:2506.23725 · 3 citations

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.

NeurIPS 2025 poster · arXiv:2504.16074 · 26 citations

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025 poster · arXiv:2410.16184 · 97 citations

Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

Shilin Lu, Zihan Zhou, Jiayou Lu et al.

ICLR 2025 poster · arXiv:2410.18775 · 82 citations

ScImage: How good are multimodal large language models at scientific text-to-image generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng et al.

ICLR 2025 poster · arXiv:2412.02368 · 4 citations

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Jiacheng Xie, Yang Yu, Ziyang Zhang et al.

NeurIPS 2025 poster · arXiv:2505.24063 · 2 citations

This Time is Different: An Observability Perspective on Time Series Foundation Models

Ben Cohen, Emaad Khwaja, Youssef Doubli et al.

NeurIPS 2025 poster · arXiv:2505.14766 · 11 citations

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Xin Xu, Jiaxin Zhang, Tianhao Chen et al.

ICLR 2025 poster · arXiv:2501.13766 · 13 citations

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world Scenarios

Eun Chang, Zhuangqun Huang, Yiwei Liao et al.

NeurIPS 2025 poster · arXiv:2511.22154