Papers by Michael Brenner
3 papers found
CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Hao Cui, Zahra Shamsi, Gowoon Cheon et al.
ICLR 2025, poster, arXiv:2503.13517
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
James Roggeveen, Erik Wang, David Ettel et al.
NeurIPS 2025, poster, arXiv:2505.11774
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Fan, Sarah Martinson, Erik Wang et al.
ICLR 2025, poster, arXiv:2410.09988