"reasoning robustness" Papers
2 papers found
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.
NeurIPS 2025, poster, arXiv:2504.16074
26 citations
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Xin Xu, Jiaxin ZHANG, Tianhao Chen et al.
ICLR 2025, poster, arXiv:2501.13766
13 citations