Zhoufutu Wen
4 papers
175 total citations
papers (4)
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
NeurIPS 2025, arXiv
118 citations
KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks
ICLR 2025, arXiv
53 citations
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
NeurIPS 2025, arXiv
4 citations
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
ICCV 2025
0 citations