"evaluation" Papers
13 papers found
Breakpoint: Stress-testing systems-level reasoning in LLM agents
Kaivalya Hariharan, Uzay Girit, Zifan Wang et al.
COLM 2025
Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal et al.
COLM 2025, arXiv:2504.11829, 6 citations
Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering
Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis et al.
COLM 2025, arXiv:2504.07583, 8 citations
EvalAgents: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa, Zayne Rea Sprague, Chaitanya Malaviya et al.
COLM 2025, arXiv:2504.15219, 4 citations
Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Kabir Ahuja, Melanie Sclar, Yulia Tsvetkov
COLM 2025, arXiv:2504.11900, 15 citations
Fluid Language Model Benchmarking
Valentin Hofmann, David Heineman, Ian Magnusson et al.
COLM 2025, arXiv:2509.11106, 10 citations
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
Minqian Liu, Zhiyang Xu, Xinyi Zhang et al.
COLM 2025, arXiv:2504.10430, 10 citations
NoveltyBench: Evaluating Language Models for Humanlike Diversity
Yiming Zhang, Harshita Diddee, Susan Holm et al.
COLM 2025, arXiv:2504.05228, 28 citations
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Tuhina Tripathi, Manya Wadhwa, Greg Durrett et al.
COLM 2025, arXiv:2504.14716, 9 citations
Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn et al.
COLM 2025
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi et al.
COLM 2025, arXiv:2502.13820, 5 citations
The Negation Bias in Large Language Models: Investigating bias reflected in linguistic markers
Yishan Wang, Pia Sommerauer, Jelke Bloem
COLM 2025, 1 citation
ThoughtTerminator: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Xiao Pu, Michael Saxon, Wenyue Hua et al.
COLM 2025, arXiv:2504.13367, 23 citations