"video understanding" Papers

80 papers found • Page 2 of 2

DEVIAS: Learning Disentangled Video Representations of Action and Scene

Kyungho Bae, Youngrae Kim, Geo Ahn et al.

ECCV 2024 • arXiv:2312.00826 • 6 citations

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024 • arXiv:2407.03197 • 24 citations

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu, Chenlin Zhang, Chen Zhao et al.

CVPR 2024 • arXiv:2311.17241 • 54 citations

EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.

ICML 2024 (oral) • arXiv:2211.10636 • 12 citations

Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang et al.

ECCV 2024 • arXiv:2407.04274 • 2 citations

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition

Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah

ECCV 2024 • arXiv:2409.01448 • 5 citations

HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Sakib Reza, Yuexi Zhang, Mohsen Moghaddam et al.

ECCV 2024 • arXiv:2408.06437 • 5 citations

Learning Object State Changes in Videos: An Open-World Perspective

Zihui Xue, Kumar Ashutosh, Kristen Grauman

CVPR 2024 • arXiv:2312.11782 • 34 citations

Learning Video Context as Interleaved Multimodal Sequences

Qinghong Lin, Pengchuan Zhang, Difei Gao et al.

ECCV 2024 • arXiv:2407.21757 • 12 citations

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He et al.

ECCV 2024 • arXiv:2404.03384 • 131 citations

Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization

Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos

CVPR 2024 • arXiv:2312.17686 • 7 citations

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Jingyuan Qi, Minqian Liu, Ying Shen et al.

AAAI 2024 • arXiv:2310.04965 • 3 citations

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah

AAAI 2024 • arXiv:2312.13008 • 12 citations

Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.

ECCV 2024 • arXiv:2407.09073 • 5 citations

Parallelized Spatiotemporal Slot Binding for Videos

Gautam Singh, Yue Wang, Jiawei Yang et al.

ICML 2024 (oral)

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian, Shuangrui Ding, Dahua Lin

ECCV 2024 • arXiv:2407.06871 • 8 citations

Self-Supervised Any-Point Tracking by Contrastive Random Walks

Ayush Shrivastava, Andrew Owens

ECCV 2024 • arXiv:2409.16288 • 11 citations

Self-Supervised Multi-Object Tracking with Path Consistency

Zijia Lu, Bing Shuai, Yanbei Chen et al.

CVPR 2024 (highlight) • arXiv:2404.05136 • 21 citations

Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.

ECCV 2024 • arXiv:2407.02309 • 7 citations

Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization

Mengnan Liu, Le Wang, Sanping Zhou et al.

ECCV 2024 • 3 citations

ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu, Chen Li, Haoran Tang et al.

ECCV 2024 • arXiv:2404.00308 • 129 citations

Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.

ECCV 2024 • arXiv:2312.11897 • 24 citations

Towards More Practical Group Activity Detection: A New Benchmark and Model

Dongkeun Kim, Youngkil Song, Minsu Cho et al.

ECCV 2024 • arXiv:2312.02878 • 10 citations

Towards Neuro-Symbolic Video Understanding

Minkyu Choi, Harsh Goel, Mohammad Omama et al.

ECCV 2024 • arXiv:2403.11021 • 19 citations

Vamos: Versatile Action Models for Video Understanding

Shijie Wang, Qi Zhao, Minh Quan et al.

ECCV 2024 • arXiv:2311.13627 • 36 citations

VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li, Xinhao Li, Yi Wang et al.

ECCV 2024 • arXiv:2403.06977 • 407 citations

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei, Shengqiong Wu, Wei Ji et al.

ICML 2024 (oral) • arXiv:2501.03230 • 146 citations

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan et al.

ICML 2024 • arXiv:2402.13217 • 73 citations

Video Question Answering with Procedural Programs

Rohan Choudhury, Koichiro Niinuma, Kris Kitani et al.

ECCV 2024 • arXiv:2312.00937 • 37 citations

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Guangzhi Sun, Wenyi Yu, Changli Tang et al.

ICML 2024 (oral) • arXiv:2406.15704 • 76 citations