NEURIPS 2025 "data contamination" Papers
5 papers found
An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination
Sukanya Patra, Souhaib Ben Taieb
NEURIPS 2025 | spotlight | arXiv:2510.21296
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.
NEURIPS 2025 | poster | arXiv:2504.16074
26 citations
Position: Benchmarking is Broken - Don't Let AI be Its Own Judge
Zerui Cheng, Stella Wohnig, Ruchika Gupta et al.
NEURIPS 2025 | poster | arXiv:2510.07575
1 citation
SWE-bench Goes Live!
Linghao Zhang, Shilin He, Chaoyun Zhang et al.
NEURIPS 2025 | poster | arXiv:2505.23419
22 citations
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Shulin Huang, Linyi Yang, Yan Song et al.
NEURIPS 2025 | poster | arXiv:2502.16268
14 citations