"video understanding" Papers

80 papers found • Page 1 of 2

F³Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Zhaoyu Liu, Kan Jiang, Murong Ma et al.

ICLR 2025 (oral)
3 citations

Accident Anticipation via Temporal Occurrence Prediction

Tianhao Zhao, Yiyang Zou, Zihao Mao et al.

NEURIPS 2025 (oral) · arXiv:2510.22260

Action-Agnostic Point-Level Supervision for Temporal Action Detection

Shuhei M. Yoshida, Takashi Shibata, Makoto Terao et al.

AAAI 2025 · arXiv:2412.21205
5 citations

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois et al.

CVPR 2025 · arXiv:2412.10360
55 citations

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025 (oral) · arXiv:2410.03051
105 citations

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Eunseop Yoon, Hee Suk Yoon, Mark Hasegawa-Johnson et al.

ICLR 2025 · arXiv:2507.04976
4 citations

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Benlin Liu, Yuhao Dong, Yiqin Wang et al.

CVPR 2025 · arXiv:2408.00754
9 citations

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

ICCV 2025 (highlight) · arXiv:2410.11831
223 citations

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Xiaoyi Bao, Chen-Wei Xie, Hao Tang et al.

ICCV 2025 · arXiv:2507.15569
1 citation

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Leqi Shen, Guoqiang Gong, Tao He et al.

NEURIPS 2025 (oral) · arXiv:2503.11187
16 citations

Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression

Juan Chen, Honglin Liu, Yingying Ao et al.

NEURIPS 2025

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025 · arXiv:2503.08585
13 citations

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse et al.

ICCV 2025 · arXiv:2510.11017

HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for History-Driven Video Highlighting

Jeongeun Lee, Youngjae Yu, Dongha Lee

COLM 2025

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou et al.

NEURIPS 2025 (oral) · arXiv:2505.20444
2 citations

Human Motion Instruction Tuning

Lei Li, Sen Jia, Jianhao Wang et al.

CVPR 2025 · arXiv:2411.16805
14 citations

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.

CVPR 2025 · arXiv:2503.16036
15 citations

Improve Temporal Reasoning in Multimodal Large Language Models via Video Contrastive Decoding

Daiqing Qi, Dongliang Guo, Hanzhang Yuan et al.

NEURIPS 2025 (oral)

Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Xiaoyang Liu, Boran Wen, Xinpeng Liu et al.

AAAI 2025 · arXiv:2412.19542
4 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NEURIPS 2025 · arXiv:2505.24625
29 citations

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai, Jiale Hong, Zongmeng Zhang et al.

NEURIPS 2025 (oral) · arXiv:2505.12826
3 citations

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025 · arXiv:2501.12380
78 citations

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

Thong Thanh Nguyen, Xiaobao Wu, Yi Bin et al.

AAAI 2025 · arXiv:2412.07160
7 citations

Multiple Object Tracking as ID Prediction

Ruopeng Gao, Ji Qi, Limin Wang

CVPR 2025 · arXiv:2403.16848
55 citations

Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu et al.

AAAI 2025 · arXiv:2412.07157
3 citations

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Zijia Zhao, Haoyu Lu, Yuqi Huo et al.

ICLR 2025 (oral) · arXiv:2406.09367
15 citations

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang et al.

CVPR 2025 · arXiv:2501.08326
9 citations

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun et al.

NEURIPS 2025 (oral) · arXiv:2504.13181
129 citations

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.

NEURIPS 2025 (oral) · arXiv:2504.13180
47 citations

Prediction-Feedback DETR for Temporal Action Detection

Jihwan Kim, Miso Lee, Cheol-Ho Cho et al.

AAAI 2025 · arXiv:2408.16729
6 citations

Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang et al.

CVPR 2025 · arXiv:2412.02071
7 citations

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NEURIPS 2025 · arXiv:2506.01300
29 citations

Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction

Quan Zhang, Yuxin Qi, Xi Tang et al.

AAAI 2025 · arXiv:2501.11124
8 citations

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie, Tengda Han, Max Bain et al.

ICCV 2025 · arXiv:2504.01020
3 citations

Temporal Action Localization with Cross Layer Task Decoupling and Refinement

Qiang Li, Di Liu, Jun Kong et al.

AAAI 2025 · arXiv:2412.09202
1 citation

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

Xinyang Zhou, Fanyue Wei, Lixin Duan et al.

ICCV 2025 · arXiv:2501.07305

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu et al.

NEURIPS 2025 (oral) · arXiv:2503.13377
49 citations

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.

ICLR 2025 (oral) · arXiv:2410.23266
37 citations

TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Zhu Xu, Ting Lei, Zhimin Li et al.

ICCV 2025 · arXiv:2508.04943

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin et al.

ICCV 2025 · arXiv:2412.09616
17 citations

VCA: Video Curious Agent for Long Video Understanding

Zeyuan Yang, Delin Chen, Xueyang Yu et al.

ICCV 2025 · arXiv:2412.10471
31 citations

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Ziyang Luo, Haoning Wu, Dongxu Li et al.

CVPR 2025 · arXiv:2411.13281
15 citations

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Zongxia Li, Xiyang Wu, Guangyao Shi et al.

NEURIPS 2025 · arXiv:2505.01481
15 citations

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

CVPR 2025 (highlight) · arXiv:2405.21075
917 citations

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang, Yinheng Li, Dan Zhao et al.

ICLR 2025 · arXiv:2410.19100
26 citations

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan et al.

CVPR 2025 · arXiv:2406.04321
32 citations

Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

AAAI 2025 · arXiv:2501.02504
4 citations

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 · arXiv:2403.06764
368 citations

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024 · arXiv:2401.07567
15 citations

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.

ICML 2024 · arXiv:2402.06119
15 citations