2025 "large language model evaluation" Papers
6 papers found
Auto-Vocabulary Semantic Segmentation
Osman Ülger, Maksymilian Kulicki, Yuki Asano et al.
ICCV 2025 poster arXiv:2312.04539
4 citations
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Anna Sokol, Elizabeth Daly, Michael Hind et al.
NeurIPS 2025 poster arXiv:2410.12974
2 citations
BenTo: Benchmark Reduction with In-Context Transferability
Hongyu Zhao, Ming Li, Lichao Sun et al.
ICLR 2025 poster
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit et al.
ICLR 2025 poster
45 citations
How Benchmark Prediction from Fewer Data Misses the Mark
Guanhua Zhang, Florian E. Dorner, Moritz Hardt
NeurIPS 2025 poster arXiv:2506.07673
4 citations
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Jian Wu, Linyi Yang, Dongyuan Li et al.
ICLR 2025 poster
23 citations