Papers matching "benchmark"
4 papers found
BEARCUBS: A benchmark for computer-using web agents
Yixiao Song, Katherine Thai, Chau Minh Pham et al.
COLM 2025, paper, arXiv:2503.07919
14 citations
Hidden in plain sight: VLMs overlook their visual representations
Stephanie Fu, tyler bonnen, Devin Guillory et al.
COLM 2025, paper, arXiv:2506.08008
PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
Lingfeng Zhou, Jialing Zhang, Jin Gao et al.
COLM 2025, paper, arXiv:2508.10014
5 citations
ThoughtTerminator: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Xiao Pu, Michael Saxon, Wenyue Hua et al.
COLM 2025, paper, arXiv:2504.13367
23 citations