Poster "video understanding" Papers

49 papers found

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois et al.

CVPR 2025 · arXiv:2412.10360 · 55 citations

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Eunseop Yoon, Hee Suk Yoon, Mark Hasegawa-Johnson et al.

ICLR 2025 · arXiv:2507.04976 · 4 citations

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Benlin Liu, Yuhao Dong, Yiqin Wang et al.

CVPR 2025 · arXiv:2408.00754 · 9 citations

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Xiaoyi Bao, Chen-Wei Xie, Hao Tang et al.

ICCV 2025 · arXiv:2507.15569 · 1 citation

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin Liu, Yingying Ao et al.

NeurIPS 2025

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025 · arXiv:2503.08585 · 13 citations

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse et al.

ICCV 2025 · arXiv:2510.11017

Human Motion Instruction Tuning

Lei Li, Sen Jia, Jianhao Wang et al.

CVPR 2025 · arXiv:2411.16805 · 14 citations

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.

CVPR 2025 · arXiv:2503.16036 · 15 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NeurIPS 2025 · arXiv:2505.24625 · 29 citations

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025 · arXiv:2501.12380 · 78 citations

Multiple Object Tracking as ID Prediction

Ruopeng Gao, Ji Qi, Limin Wang

CVPR 2025 · arXiv:2403.16848 · 55 citations

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang et al.

CVPR 2025 · arXiv:2501.08326 · 9 citations

Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang et al.

CVPR 2025 · arXiv:2412.02071 · 7 citations

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NeurIPS 2025 · arXiv:2506.01300 · 29 citations

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie, Tengda Han, Max Bain et al.

ICCV 2025 · arXiv:2504.01020 · 3 citations

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

Xinyang Zhou, Fanyue Wei, Lixin Duan et al.

ICCV 2025 · arXiv:2501.07305

TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Zhu Xu, Ting Lei, Zhimin Li et al.

ICCV 2025 · arXiv:2508.04943

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin et al.

ICCV 2025 · arXiv:2412.09616 · 17 citations

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 · arXiv:2412.10471 · 31 citations

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Ziyang Luo, Haoning Wu, Dongxu Li et al.

CVPR 2025 · arXiv:2411.13281 · 15 citations

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NeurIPS 2025 · arXiv:2505.01481 · 15 citations

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang, Yinheng Li, Dan Zhao et al.

ICLR 2025 · arXiv:2410.19100 · 26 citations

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan et al.

CVPR 2025 · arXiv:2406.04321 · 32 citations

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 · arXiv:2403.06764 · 368 citations

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.

ICML 2024 · arXiv:2402.06119 · 15 citations

DEVIAS: Learning Disentangled Video Representations of Action and Scene

Kyungho Bae, Youngrae Kim, Geo Ahn et al.

ECCV 2024 · arXiv:2312.00826 · 6 citations

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024 · arXiv:2407.03197 · 24 citations

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu, Chenlin Zhang, Chen Zhao et al.

CVPR 2024 · arXiv:2311.17241 · 54 citations

Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang et al.

ECCV 2024 · arXiv:2407.04274 · 2 citations

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition

Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah

ECCV 2024 · arXiv:2409.01448 · 5 citations

HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Sakib Reza, Yuexi Zhang, Mohsen Moghaddam et al.

ECCV 2024 · arXiv:2408.06437 · 5 citations

Learning Object State Changes in Videos: An Open-World Perspective

Zihui Xue, Kumar Ashutosh, Kristen Grauman

CVPR 2024 · arXiv:2312.11782 · 34 citations

Learning Video Context as Interleaved Multimodal Sequences

Qinghong Lin, Pengchuan Zhang, Difei Gao et al.

ECCV 2024 · arXiv:2407.21757 · 12 citations

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He et al.

ECCV 2024 · arXiv:2404.03384 · 131 citations

Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization

Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos

CVPR 2024 · arXiv:2312.17686 · 7 citations

Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.

ECCV 2024 · arXiv:2407.09073 · 5 citations

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian, Shuangrui Ding, Dahua Lin

ECCV 2024 · arXiv:2407.06871 · 8 citations

Self-Supervised Any-Point Tracking by Contrastive Random Walks

Ayush Shrivastava, Andrew Owens

ECCV 2024 · arXiv:2409.16288 · 11 citations

Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.

ECCV 2024 · arXiv:2407.02309 · 7 citations

Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Mengnan Liu, Le Wang, Sanping Zhou et al.

ECCV 2024 · 3 citations

ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu, Chen Li, Haoran Tang et al.

ECCV 2024 · arXiv:2404.00308 · 129 citations

Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.

ECCV 2024 · arXiv:2312.11897 · 24 citations

Towards More Practical Group Activity Detection: A New Benchmark and Model

Dongkeun Kim, Youngkil Song, Minsu Cho et al.

ECCV 2024 · arXiv:2312.02878 · 10 citations

Towards Neuro-Symbolic Video Understanding

Minkyu Choi, Harsh Goel, Mohammad Omama et al.

ECCV 2024 · arXiv:2403.11021 · 19 citations

Vamos: Versatile Action Models for Video Understanding

Shijie Wang, Qi Zhao, Minh Quan et al.

ECCV 2024 · arXiv:2311.13627 · 36 citations

VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li, Xinhao Li, Yi Wang et al.

ECCV 2024 · arXiv:2403.06977 · 407 citations

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan et al.

ICML 2024 · arXiv:2402.13217 · 73 citations

Video Question Answering with Procedural Programs

Rohan Choudhury, Koichiro Niinuma, Kris Kitani et al.

ECCV 2024 · arXiv:2312.00937 · 37 citations