"video understanding" Papers
37 papers found
Accident Anticipation via Temporal Occurrence Prediction
Tianhao Zhao, Yiyang Zou, Zihao Mao et al.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du et al.
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Benlin Liu, Yuhao Dong, Yiqin Wang et al.
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin liu, Yingying Ao et al.
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li, Yingjie Qin, Baoyuan Ou et al.
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Zhihang Liu, Chen-Wei Xie, Pandeng Li et al.
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, shijia Huang, Yanyang Li et al.
Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
JIANFENG CAI, Jiale Hong, Zongmeng Zhang et al.
Multiple Object Tracking as ID Prediction
Ruopeng Gao, Ji Qi, Limin Wang
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo, Min-Hung Chen, De-An Huang et al.
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
Yiyang Zhou, Yangfan He, Yaofeng Su et al.
The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning
Xinyang Zhou, Fanyue Wei, Lixin Duan et al.
TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu, Ting Lei, Zhimin Li et al.
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li, Xiyang Wu, Guangyao Shi et al.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu et al.
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.
DEVIAS: Learning Disentangled Video Representations of Action and Scene
Kyungho Bae, Youngrae Kim, Geo Ahn et al.
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
Sunil Hwang, Jaehong Yoon, Youngwan Lee et al.
Fine-grained Dynamic Network for Generic Event Boundary Detection
Ziwei Zheng, Lijun He, Le Yang et al.
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Shah Mubarak
Learning Video Context as Interleaved Multimodal Sequences
Qinghong Lin, Pengchuan Zhang, Difei Gao et al.
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng, Mingfei Han, Haoyu He et al.
MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks
Jingyuan Qi, Minqian Liu, Ying Shen et al.
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
Open Vocabulary Multi-Label Video Classification
Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan et al.
Parallelized Spatiotemporal Slot Binding for Videos
Gautam Singh, Yue Wang, Jiawei Yang et al.
Self-Supervised Any-Point Tracking by Contrastive Random Walks
Ayush Shrivastava, Andrew Owens
Semantically Guided Representation Learning For Action Anticipation
Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.
Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
Mengnan Liu, Le Wang, Sanping Zhou et al.
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar, Yongqin Xian, Alessio Tonioni et al.
Vamos: Versatile Action Models for Video Understanding
Shijie Wang, Qi Zhao, Minh Quan et al.
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei, Shengqiong Wu, Wei Ji et al.
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan et al.
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang et al.