Poster "needle-in-a-haystack tasks" Papers
2 papers found
HELMET: How to Evaluate Long-context Models Effectively and Thoroughly
Howard Yen, Tianyu Gao, Minmin Hou et al.
ICLR 2025 poster
23 citations
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.
NeurIPS 2025 poster, arXiv:2505.15952
4 citations