Oral "video understanding" Papers
16 papers found
$F^3Set$: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
Zhaoyu Liu, Kan Jiang, Murong Ma et al.
ICLR 2025 oral
3 citations
Accident Anticipation via Temporal Occurrence Prediction
Tianhao Zhao, Yiyang Zou, Zihao Mao et al.
NEURIPS 2025 oral, arXiv:2510.22260
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du et al.
ICLR 2025 oral, arXiv:2410.03051
105 citations
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He et al.
NEURIPS 2025 oral, arXiv:2503.11187
16 citations
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li, Yingjie Qin, Baoyuan Ou et al.
NEURIPS 2025 oral, arXiv:2505.20444
2 citations
Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding
Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.
NEURIPS 2025 oral
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.
NEURIPS 2025 oral, arXiv:2505.12826
3 citations
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao, Haoyu Lu, Yuqi Huo et al.
ICLR 2025 oral, arXiv:2406.09367
15 citations
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun et al.
NEURIPS 2025 oral, arXiv:2504.13181
129 citations
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
NEURIPS 2025 oral, arXiv:2504.13180
47 citations
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu et al.
NEURIPS 2025 oral, arXiv:2503.13377
49 citations
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.
ICLR 2025 oral, arXiv:2410.23266
37 citations
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.
ICML 2024 oral, arXiv:2211.10636
12 citations
Parallelized Spatiotemporal Slot Binding for Videos
Gautam Singh, Yue Wang, Jiawei Yang et al.
ICML 2024 oral
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei, Shengqiong Wu, Wei Ji et al.
ICML 2024 oral, arXiv:2501.03230
146 citations
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang et al.
ICML 2024 oral, arXiv:2406.15704
76 citations