"video understanding" Papers
80 papers found • Page 2 of 2
DEVIAS: Learning Disentangled Video Representations of Action and Scene
Kyungho Bae, Youngrae Kim, Geo Ahn et al.
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection
Le Yang, Ziwei Zheng, Yizeng Han et al.
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu, Chenlin Zhang, Chen Zhao et al.
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.
Fine-grained Dynamic Network for Generic Event Boundary Detection
Ziwei Zheng, Lijun He, Le Yang et al.
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
Sakib Reza, Yuexi Zhang, Mohsen Moghaddam et al.
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue, Kumar Ashutosh, Kristen Grauman
Learning Video Context as Interleaved Multimodal Sequences
Qinghong Lin, Pengchuan Zhang, Difei Gao et al.
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He et al.
Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization
Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos
MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks
Jingyuan Qi, Minqian Liu, Ying Shen et al.
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
Open Vocabulary Multi-Label Video Classification
Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.
Parallelized Spatiotemporal Slot Binding for Videos
Gautam Singh, Yue Wang, Jiawei Yang et al.
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Rui Qian, Shuangrui Ding, Dahua Lin
Self-Supervised Any-Point Tracking by Contrastive Random Walks
Ayush Shrivastava, Andrew Owens
Self-Supervised Multi-Object Tracking with Path Consistency
Zijia Lu, Bing Shuai, Yanbei Chen et al.
Semantically Guided Representation Learning For Action Anticipation
Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.
Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
Mengnan Liu, Le Wang, Sanping Zhou et al.
ST-LLM: Large Language Models Are Effective Temporal Learners
Ruyang Liu, Chen Li, Haoran Tang et al.
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.
Towards More Practical Group Activity Detection: A New Benchmark and Model
Dongkeun Kim, Youngkil Song, Minsu Cho et al.
Towards Neuro-Symbolic Video Understanding
Minkyu Choi, Harsh Goel, Mohammad Omama et al.
Vamos: Versatile Action Models for Video Understanding
Shijie Wang, Qi Zhao, Minh Quan Do et al.
VideoMamba: State Space Model for Efficient Video Understanding
Kunchang Li, Xinhao Li, Yi Wang et al.
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei, Shengqiong Wu, Wei Ji et al.
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan et al.
Video Question Answering with Procedural Programs
Rohan Choudhury, Koichiro Niinuma, Kris Kitani et al.
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang et al.