RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

5 citations

arXiv:2505.02064

Citations: #512 in NeurIPS 2025 (of 5858 papers)

Authors

ShuHang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, LingHao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

Topics

real-time video analysis; multimodal large language models; continuous perception; hierarchical question structures; streaming video processing; dynamic scene understanding; multi-timestamp question answering; video understanding benchmarks

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in perception, understanding, and reasoning, yet existing benchmarks fall short in evaluating these abilities under continuous and dynamic real-world video streams. Such settings require models to maintain coherent understanding and reasoning as visual scenes evolve over time. **We introduce RTV-Bench, a fine-grained benchmark for real-time video analysis with MLLMs**. It is built upon three key principles: multi-timestamp question answering, hierarchical question structures spanning perception and reasoning, and multi-dimensional evaluation of continuous perception, understanding, and reasoning. RTV-Bench comprises 552 diverse videos and 4,608 carefully curated QA pairs covering a wide range of dynamic scenarios. We evaluate a broad range of state-of-the-art MLLMs, including proprietary, open-source offline, and open-source real-time models. Our results show that real-time models generally outperform offline counterparts but still lag behind leading proprietary systems. While scaling model capacity generally yields performance gains, simply increasing the density of sampled input frames does not consistently translate into improved results. These observations suggest inherent limitations in current architectures when handling long-horizon video streams, underscoring the need for models explicitly designed for streaming video processing and analysis.
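To make the evaluation setup more concrete, the following is a minimal sketch of how multi-timestamp, hierarchical QA items might be represented and scored. It is not the RTV-Bench data format or scoring code, which this page does not describe. The `TimedQuestion` schema, the `"perception"` and `"reasoning"` level labels, the helper names, and the toy data are all assumptions made for illustration. The sketch further assumes that a question may be asked at several points in a stream, that the correct answer may change as the scene evolves, and that a model may only see frames up to the query time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedQuestion:
    """Hypothetical schema for one multiple-choice question tied to a stream timestamp."""
    video_id: str
    timestamp_s: float  # point in the stream at which the question is asked
    level: str          # assumed labels: "perception" or "reasoning"
    question: str
    options: tuple
    answer: str         # correct option at this timestamp


def visible_frames(frame_times, timestamp_s):
    """Keep only frame times at or before the query time (no access to future frames)."""
    return [t for t in frame_times if t <= timestamp_s]


def accuracy_by_level(questions, predictions):
    """Compute accuracy separately for each question level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q.level] += 1
        correct[q.level] += int(pred == q.answer)
    return {level: correct[level] / total[level] for level in total}


# Toy data: the same question asked twice, with an answer that changes over time.
questions = [
    TimedQuestion("v1", 10.0, "perception", "Is the door open?", ("A", "B"), "A"),
    TimedQuestion("v1", 40.0, "perception", "Is the door open?", ("A", "B"), "B"),
    TimedQuestion("v1", 40.0, "reasoning", "Why did the person leave?", ("A", "B"), "A"),
]
# A model that never updates its answer gets the second perception item wrong.
predictions = ["A", "A", "A"]

scores = accuracy_by_level(questions, predictions)
frames_at_10s = visible_frames([0.0, 5.0, 10.0, 15.0], 10.0)
print(scores, frames_at_10s)
```

Scoring per level rather than in aggregate is one simple way to see whether a model that perceives the current scene correctly can also reason over how it changed.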

Citation History

[Chart: citation count over time; axis dates Jan 25, 2026, Jan 27, 2026, Jan 31, 2026]