ICLR 2025 "benchmark evaluation" Papers
10 papers found
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.
ICLR 2025 · poster · arXiv:2410.18325
17 citations
Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?
Yifan Feng, Chengwu Yang, Xingliang Hou et al.
ICLR 2025 · poster · arXiv:2410.10083
10 citations
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal et al.
ICLR 2025 · poster · arXiv:2407.01725
36 citations
HELMET: How to Evaluate Long-context Models Effectively and Thoroughly
Howard Yen, Tianyu Gao, Minmin Hou et al.
ICLR 2025 · poster
23 citations
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqiang Hu et al.
ICLR 2025 · poster · arXiv:2409.02076
34 citations
OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach et al.
ICLR 2025 · poster · arXiv:2410.20092
74 citations
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min et al.
ICLR 2025 · poster · arXiv:2410.16184
97 citations
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu, Zihan Zhou, Jiayou Lu et al.
ICLR 2025 · poster · arXiv:2410.18775
82 citations
ScImage: How good are multimodal large language models at scientific text-to-image generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng et al.
ICLR 2025 · poster · arXiv:2412.02368
4 citations
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Xin Xu, Jiaxin Zhang, Tianhao Chen et al.
ICLR 2025 · poster · arXiv:2501.13766
13 citations