Poster "llm-as-a-judge" Papers
11 papers found
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie, Qingqiu Li, Zhuohao Yu et al.
ICLR 2025 (poster) · arXiv:2502.10709
14 citations
Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Aparna Elangovan, Lei Xu, Jongwoo Ko et al.
ICLR 2025 (poster) · arXiv:2410.03775
22 citations
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Peng Lai, Jianjie Zheng, Sijie Cheng et al.
NeurIPS 2025 (poster) · arXiv:2508.03550
2 citations
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin et al.
NeurIPS 2025 (poster) · arXiv:2508.12792
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
ICLR 2025 (poster) · arXiv:2410.02736
207 citations
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.
NeurIPS 2025 (poster) · arXiv:2510.26937
ReMA: Learning to Meta-Think for LLMs with Multi-agent Reinforcement Learning
Ziyu Wan, Yunxiang Li, Xiaoyu Wen et al.
NeurIPS 2025 (poster) · arXiv:2503.09501
36 citations
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng Yu et al.
ICLR 2025 (poster) · arXiv:2410.05193
16 citations
Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity
Qiyao Wei, Edward R Morrell, Lea Goetz et al.
NeurIPS 2025 (poster) · arXiv:2511.19925
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Tinghao Xie, Xiangyu Qi, Yi Zeng et al.
ICLR 2025 (poster) · arXiv:2406.14598
141 citations
Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
Jihan Yao, Wenxuan Ding, Shangbin Feng et al.
ICLR 2025 (poster) · arXiv:2410.11055
4 citations