NEURIPS 2025 "benchmark design" Papers
3 papers found
EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving
Shihan Dou, Ming Zhang, Chenhao Huang et al.
NEURIPS 2025, poster, arXiv:2506.02672
4 citations
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Bingchen Zhao, Despoina Magka, Minqi Jiang et al.
NEURIPS 2025, poster, arXiv:2506.22419
2 citations
WorldModelBench: Judging Video Generation Models As World Models
Dacheng Li, Yunhao Fang, Yukang Chen et al.
NEURIPS 2025, poster, arXiv:2502.20694
33 citations