"llm-as-a-judge" Papers
15 papers found
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie, Qingqiu Li, Zhuohao Yu et al.
ICLR 2025 · poster · arXiv:2502.10709
14 citations
Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Aparna Elangovan, Lei Xu, Jongwoo Ko et al.
ICLR 2025 · poster · arXiv:2410.03775
22 citations
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Peng Lai, Jianjie Zheng, Sijie Cheng et al.
NeurIPS 2025 · poster · arXiv:2508.03550
2 citations
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin et al.
NeurIPS 2025 · poster · arXiv:2508.12792
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
ICLR 2025 · poster · arXiv:2410.02736
207 citations
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.
NeurIPS 2025 · poster · arXiv:2510.26937
M-Prometheus: A Suite of Open Multilingual LLM Judges
José Pombal, Dongkeun Yoon, Patrick Fernandes et al.
COLM 2025 · paper · arXiv:2504.04953
23 citations
ReMA: Learning to Meta-Think for LLMs with Multi-agent Reinforcement Learning
Ziyu Wan, Yunxiang Li, Xiaoyu Wen et al.
NeurIPS 2025 · poster · arXiv:2503.09501
36 citations
Reverse Engineering Human Preferences with Reinforcement Learning
Lisa Alazraki, Yi-Chern Tan, Jon Ander Campos et al.
NeurIPS 2025 · spotlight · arXiv:2505.15795
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng YU et al.
ICLR 2025 · poster · arXiv:2410.05193
16 citations
Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity
Qiyao Wei, Edward R Morrell, Lea Goetz et al.
NeurIPS 2025 · poster · arXiv:2511.19925
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng et al.
ICLR 2025 · poster · arXiv:2406.14598
141 citations
Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
Jihan Yao, Wenxuan Ding, Shangbin Feng et al.
ICLR 2025 · poster · arXiv:2410.11055
4 citations
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang et al.
NeurIPS 2025 · spotlight · arXiv:2503.13657
188 citations
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José Pombal, Nuno M Guerreiro, Ricardo Rei et al.
COLM 2025 · paper · arXiv:2504.01001
8 citations