Flexible Frame Selection for Efficient Video Reasoning

Venue: CVPR 2025
Citations: 10
Rank: #417 of 2873 papers
Authors: 4
Data points: 4

Abstract

Video-language models have shown promise on a range of multimodal video reasoning tasks, such as video question-answering. However, the computational cost of processing long videos, together with growing model sizes, limits the number of frames that standard approaches can process. In this work, we propose the Flexible Frame Selector (FFS), a learnable policy model with a new flexible selection operation that helps alleviate input context restrictions by enabling video-language models to focus on the most informative frames for the downstream multimodal task, without adding undue processing cost. Our method differs from prior work in its learnability, efficiency, and flexibility. We verify the efficacy of our method on standard video question-answering and reasoning benchmarks, and observe that our model can maintain or improve the accuracy of the base video-language model while significantly reducing the number of frames processed downstream.
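
To make the idea of frame selection concrete, the sketch below picks a variable number of frames for a given question. It is not the FFS method: the abstract does not specify the policy model or the selection operation, so a fixed cosine-similarity score stands in for the learned policy. The function names, the `max_frames` budget, and the `min_score` threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_feats, query_feat, max_frames, min_score=0.0):
    """Return indices of the frames to pass to the downstream model.

    Assumption: relevance is a plain cosine score against the query
    embedding, standing in for a learned policy. At most `max_frames`
    frames are kept, and fewer are returned when scores fall below
    `min_score`, so the number of selected frames varies per input.
    """
    scores = [cosine(f, query_feat) for f in frame_feats]
    # Rank frame indices from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = [i for i in ranked[:max_frames] if scores[i] >= min_score]
    # Restore temporal order for the downstream video-language model.
    return sorted(kept)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    question = [1.0, 0.0]
    print(select_frames(frames, question, max_frames=3, min_score=0.5))
```

With the toy inputs above, only two of the four frames clear the threshold even though the budget allows three, which shows how a selector can reduce the number of frames processed downstream.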

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 27, 2026: 0
Jan 31, 2026: 10 (+10)