"large language model evaluation" Papers
4 papers found
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Anna Sokol, Elizabeth Daly, Michael Hind et al.
NeurIPS 2025posterarXiv:2410.12974
2
citations
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit et al.
ICLR 2025poster
45
citations
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Jian Wu, Linyi Yang, Dongyuan Li et al.
ICLR 2025poster
23
citations
Dynamic Evaluation of Large Language Models by Meta Probing Agents
Kaijie Zhu, Jindong Wang, Qinlin Zhao et al.
ICML 2024poster