"llm-as-a-judge evaluation" Papers
3 papers found
AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Sebastian Joseph, Syed M. Husain, Stella Offner et al.
NeurIPS 2025, poster, arXiv:2505.20538
2 citations
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.
ICCV 2025, poster, arXiv:2505.00681
9 citations
To Code or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma et al.
ICLR 2025, poster, arXiv:2408.10914
40 citations