"automatic evaluation" Papers
6 papers found
Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Aparna Elangovan, Lei Xu, Jongwoo Ko et al.
ICLR 2025 (poster) · arXiv:2410.03775
22 citations
Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos et al.
AAAI 2025 (paper) · arXiv:2408.08781
37 citations
M-Prometheus: A Suite of Open Multilingual LLM Judges
José Pombal, Dongkeun Yoon, Patrick Fernandes et al.
COLM 2025 (paper) · arXiv:2504.04953
23 citations
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.
ICCV 2025 (poster) · arXiv:2507.21391
7 citations
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José Pombal, Nuno M Guerreiro, Ricardo Rei et al.
COLM 2025 (paper) · arXiv:2504.01001
8 citations
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Xueyu Hu, Ziyu Zhao, Shuang Wei et al.
ICML 2024 (poster) · arXiv:2401.05507