Zhoufutu Wen

4 papers

175 total citations

papers (4)

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

NeurIPS 2025, arXiv

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

NeurIPS 2025, arXiv

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models