Paper "evaluation" Papers

13 papers found

Breakpoint: Stress-testing systems-level reasoning in LLM agents

Kaivalya Hariharan, Uzay Girit, Zifan Wang et al.

COLM 2025paper

Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal et al.

COLM 2025paperarXiv:2504.11829
6
citations

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis et al.

COLM 2025paperarXiv:2504.07583
8
citations

EvalAgents: Discovering Implicit Evaluation Criteria from the Web

Manya Wadhwa, Zayne Rea Sprague, Chaitanya Malaviya et al.

COLM 2025paperarXiv:2504.15219
4
citations

Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection

Kabir Ahuja, Melanie Sclar, Yulia Tsvetkov

COLM 2025paperarXiv:2504.11900
15
citations

Fluid Language Model Benchmarking

Valentin Hofmann, David Heineman, Ian Magnusson et al.

COLM 2025paperarXiv:2509.11106
10
citations

LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Minqian Liu, Zhiyang Xu, Xinyi Zhang et al.

COLM 2025paperarXiv:2504.10430
10
citations

NoveltyBench: Evaluating Language Models for Humanlike Diversity

Yiming Zhang, Harshita Diddee, Susan Holm et al.

COLM 2025paperarXiv:2504.05228
28
citations

Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation

Tuhina Tripathi, Manya Wadhwa, Greg Durrett et al.

COLM 2025paperarXiv:2504.14716
9
citations

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn et al.

COLM 2025paper

Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi et al.

COLM 2025paperarXiv:2502.13820
5
citations

The Negation Bias in Large Language Models: Investigating bias reflected in linguistic markers

Yishan Wang, Pia Sommerauer, Jelke Bloem

COLM 2025paper
1
citations

ThoughtTerminator: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Xiao Pu, Michael Saxon, Wenyue Hua et al.

COLM 2025paperarXiv:2504.13367
23
citations