Poster "benchmark evaluation" Papers
39 papers found
A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks
Mucong Ding, Bang An, Tahseen Rabbani et al.
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.
Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?
Yifan Feng, Chengwu Yang, Xingliang Hou et al.
BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models
Evan Antoniuk, Shehtab Zaman, Tal Ben-Nun et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
C-SEO Bench: Does Conversational SEO Work?
Haritz Puerto, Martin Gubri, Tommaso Green et al.
DGCBench: A Deep Graph Clustering Benchmark
Benyu Wu, Yue Liu, Qiaoyu Tan et al.
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal et al.
HELMET: How to Evaluate Long-context Models Effectively and Thoroughly
Howard Yen, Tianyu Gao, Minmin Hou et al.
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang, Wenliang Zheng, Aashrith Madasu et al.
IDEA-Bench: How Far are Generative Models from Professional Designing?
Chen Liang, Lianghua Huang, Jingwu Fang et al.
Is Artificial Intelligence Generated Image Detection a Solved Problem?
Ziqiang Li, Jiazhen Yan, Ziwen He et al.
LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents
Rui Li, Zixuan Hu, Wenxi Qu et al.
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqiang Hu et al.
Massive Sound Embedding Benchmark (MSEB)
Georg Heigold, Ehsan Variani, Tom Bagby et al.
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
Yinghao Zhu, Ziyi He, Haoran Hu et al.
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng, Haochen Wang, Yuanxing Zhang et al.
OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach et al.
OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization
Saihui Hou, Panjian Huang, Zengbin Wang et al.
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps
Bingnan Li, Chen-Yu Wang, Haiyang Xu et al.
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Atharva Gundawar, Som Sagar, Ransalu Senanayake
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min et al.
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu, Zihan Zhou, Jiayou Lu et al.
ScImage: How good are multimodal large language models at scientific text-to-image generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng et al.
TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
Jiacheng Xie, Yang Yu, Ziyang Zhang et al.
This Time is Different: An Observability Perspective on Time Series Foundation Models
Ben Cohen, Emaad Khwaja, Youssef Doubli et al.
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Xin Xu, Jiaxin Zhang, Tianhao Chen et al.
WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-World Scenarios
Eun Chang, Zhuangqun Huang, Yiwei Liao et al.
A Comparative Study of Image Restoration Networks for General Backbone Network Design
Xiangyu Chen, Zheyuan Li, Yuandong Pu et al.
Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling
Denis Blessing, Xiaogang Jia, Johannes Esslinger et al.
CurBench: Curriculum Learning Benchmark
Yuwei Zhou, Zirui Pan, Xin Wang et al.
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
Jin Gao, Lei Gan, Yuankai Li et al.
Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models
Mingrui Wu, Jiayi Ji, Oucheng Huang et al.
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Xueyu Hu, Ziyu Zhao, Shuang Wei et al.
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Qian Huang, Jian Vora, Percy Liang et al.
Position: Towards Implicit Prompt For Text-To-Image Models
Yue Yang, Yuqi Lin, Hong Liu et al.
Premise Order Matters in Reasoning with Large Language Models
Xinyun Chen, Ryan Chi, Xuezhi Wang et al.
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Xiaoxuan Wang, Ziniu Hu, Pan Lu et al.