Oral "video understanding" Papers

16 papers found

$F^3Set$: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Zhaoyu Liu, Kan Jiang, Murong Ma et al.

ICLR 2025 oral
3 citations

Accident Anticipation via Temporal Occurrence Prediction

Tianhao Zhao, Yiyang Zou, Zihao Mao et al.

NEURIPS 2025 oral, arXiv:2510.22260

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025 oral, arXiv:2410.03051
105 citations

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Leqi Shen, Guoqiang Gong, Tao He et al.

NEURIPS 2025 oral, arXiv:2503.11187
16 citations

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou et al.

NEURIPS 2025 oral, arXiv:2505.20444
2 citations

Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding

Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.

NEURIPS 2025 oral

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NEURIPS 2025 oral, arXiv:2505.12826
3 citations

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo et al.

ICLR 2025 oral, arXiv:2406.09367
15 citations

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun et al.

NEURIPS 2025 oral, arXiv:2504.13181
129 citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NEURIPS 2025 oral, arXiv:2504.13180
47 citations

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu et al.

NEURIPS 2025 oral, arXiv:2503.13377
49 citations

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.

ICLR 2025 oral, arXiv:2410.23266
37 citations

EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.

ICML 2024 oral, arXiv:2211.10636
12 citations

Parallelized Spatiotemporal Slot Binding for Videos

Gautam Singh, Yue Wang, Jiawei Yang et al.

ICML 2024 oral

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei, Shengqiong Wu, Wei Ji et al.

ICML 2024 oral, arXiv:2501.03230
146 citations

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Guangzhi Sun, Wenyi Yu, Changli Tang et al.

ICML 2024 oral, arXiv:2406.15704
76 citations