Poster "llm-as-a-judge" Papers

11 papers found

An Empirical Analysis of Uncertainty in Large Language Model Evaluations

Qiujie Xie, Qingqiu Li, Zhuohao Yu et al.

ICLR 2025 poster · arXiv:2502.10709 · 14 citations

Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Aparna Elangovan, Lei Xu, Jongwoo Ko et al.

ICLR 2025 poster · arXiv:2410.03775 · 22 citations

Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Peng Lai, Jianjie Zheng, Sijie Cheng et al.

NeurIPS 2025 poster · arXiv:2508.03550 · 2 citations

Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin et al.

NeurIPS 2025 poster · arXiv:2508.12792

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Jiayi Ye, Yanbo Wang, Yue Huang et al.

ICLR 2025 poster · arXiv:2410.02736 · 207 citations

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang, Jinxin Ke, Xiaoxuan Fan et al.

NeurIPS 2025 poster · arXiv:2510.26937

ReMA: Learning to Meta-Think for LLMs with Multi-agent Reinforcement Learning

Ziyu Wan, Yunxiang Li, Xiaoyu Wen et al.

NeurIPS 2025 poster · arXiv:2503.09501 · 36 citations

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Qiyuan Zhang, Yufei Wang, Tiezheng Yu et al.

ICLR 2025 poster · arXiv:2410.05193 · 16 citations

Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

Qiyao Wei, Edward R Morrell, Lea Goetz et al.

NeurIPS 2025 poster · arXiv:2511.19925

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Tinghao Xie, Xiangyu Qi, Yi Zeng et al.

ICLR 2025 poster · arXiv:2406.14598 · 141 citations

Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

Jihan Yao, Wenxuan Ding, Shangbin Feng et al.

ICLR 2025 poster · arXiv:2410.11055 · 4 citations