Papers by Cozmin Ududec
2 papers found
Establishing Best Practices in Building Rigorous Agentic Benchmarks
Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun et al.
NeurIPS 2025 poster, arXiv:2507.02825
12 citations
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou et al.
NeurIPS 2025 poster, arXiv:2511.04703
8 citations