"video understanding" Papers

37 papers found

Accident Anticipation via Temporal Occurrence Prediction

Tianhao Zhao, Yiyang Zou, Zihao Mao et al.

NeurIPS 2025 · oral · arXiv:2510.22260

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025 · oral · arXiv:2410.03051
102 citations

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Benlin Liu, Yuhao Dong, Yiqin Wang et al.

CVPR 2025 · poster · arXiv:2408.00754
9 citations

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Leqi Shen, Guoqiang Gong, Tao He et al.

NeurIPS 2025 · oral · arXiv:2503.11187
16 citations

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin Liu, Yingying Ao et al.

NeurIPS 2025 · poster

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou et al.

NeurIPS 2025 · oral · arXiv:2505.20444
2 citations

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.

CVPR 2025 · poster · arXiv:2503.16036
14 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NeurIPS 2025 · poster · arXiv:2505.24625
24 citations

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NeurIPS 2025 · oral · arXiv:2505.12826
1 citation

Multiple Object Tracking as ID Prediction

Ruopeng Gao, Ji Qi, Limin Wang

CVPR 2025 · poster · arXiv:2403.16848
53 citations

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang et al.

CVPR 2025 · poster · arXiv:2501.08326
9 citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NeurIPS 2025 · oral · arXiv:2504.13180
40 citations

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NeurIPS 2025 · poster · arXiv:2506.01300
28 citations

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

Xinyang Zhou, Fanyue Wei, Lixin Duan et al.

ICCV 2025 · poster · arXiv:2501.07305

TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Zhu Xu, Ting Lei, Zhimin Li et al.

ICCV 2025 · poster · arXiv:2508.04943

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NeurIPS 2025 · poster · arXiv:2505.01481
13 citations

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 · poster · arXiv:2403.06764
343 citations

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024 · paper · arXiv:2401.07567
11 citations

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.

ICML 2024 · poster

DEVIAS: Learning Disentangled Video Representations of Action and Scene

Kyungho Bae, Youngrae Kim, Geo Ahn et al.

ECCV 2024 · poster · arXiv:2312.00826
6 citations

EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.

ICML 2024 · oral

Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang et al.

ECCV 2024 · poster · arXiv:2407.04274
2 citations

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition

Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah

ECCV 2024 · poster · arXiv:2409.01448
5 citations

Learning Video Context as Interleaved Multimodal Sequences

Qinghong Lin, Pengchuan Zhang, Difei Gao et al.

ECCV 2024 · poster · arXiv:2407.21757
12 citations

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He et al.

ECCV 2024 · poster · arXiv:2404.03384
128 citations

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Jingyuan Qi, Minqian Liu, Ying Shen et al.

AAAI 2024 · paper · arXiv:2310.04965
3 citations

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah

AAAI 2024 · paper · arXiv:2312.13008

Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.

ECCV 2024 · poster · arXiv:2407.09073
5 citations

Parallelized Spatiotemporal Slot Binding for Videos

Gautam Singh, Yue Wang, Jiawei Yang et al.

ICML 2024 · oral

Self-Supervised Any-Point Tracking by Contrastive Random Walks

Ayush Shrivastava, Andrew Owens

ECCV 2024 · poster · arXiv:2409.16288
11 citations

Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.

ECCV 2024 · poster · arXiv:2407.02309
6 citations

Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Mengnan Liu, Le Wang, Sanping Zhou et al.

ECCV 2024 · poster
3 citations

Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.

ECCV 2024 · poster · arXiv:2312.11897
24 citations

Vamos: Versatile Action Models for Video Understanding

Shijie Wang, Qi Zhao, Minh Quan et al.

ECCV 2024 · poster · arXiv:2311.13627
36 citations

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei, Shengqiong Wu, Wei Ji et al.

ICML 2024 · oral

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan et al.

ICML 2024 · poster

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Guangzhi Sun, Wenyi Yu, Changli Tang et al.

ICML 2024 · oral