Multimodal

Video-Language Understanding

Understanding videos with language

100 papers · 16,492 total citations
Feb '24 – Jan '26 · 1,278 papers
Also includes: video-language, video captioning, video question answering, video understanding

Top Papers

#1

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

ICLR 2025
1,318
citations
#2

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia et al.

ICLR 2024
1,171
citations
#3

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He et al.

CVPR 2024
864
citations
#4

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

CVPR 2025
858
citations
#5

WorldSimBench: Towards Video Generation Models as World Simulators

Yiran Qin, Zhelun Shi, Jiwen Yu et al.

ICML 2025
806
citations
#6

VILA: On Pre-training for Visual Language Models

Ji Lin, Danny Yin, Wei Ping et al.

CVPR 2024
685
citations
#7

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024
550
citations
#8

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang et al.

CVPR 2024
457
citations
#9

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang, Yinan He, Yizhuo Li et al.

ICLR 2024
408
citations
#10

VideoMamba: State Space Model for Efficient Video Understanding

Kunchang Li, Xinhao Li, Yi Wang et al.

ECCV 2024
396
citations
#11

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li et al.

CVPR 2024
356
citations
#12

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024
354
citations
#13

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning et al.

ICLR 2024
343
citations
#14

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025
338
citations
#15

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024
309
citations
#16

VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen et al.

CVPR 2024
244
citations
#17

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li et al.

NeurIPS 2025 · arXiv:2503.21776
rule-based reinforcement learning, multimodal large language models, video reasoning, temporal modeling, +3 more
236
citations
#18

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu et al.

ICLR 2025
230
citations
#19

Listen, Think, and Understand

Yuan Gong, Hongyin Luo, Alexander Liu et al.

ICLR 2024
221
citations
#20

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Linrui Tian, Qi Wang, Bang Zhang et al.

ECCV 2024
218
citations
#21

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Yi Wang, Kunchang Li, Xinhao Li et al.

ECCV 2024
214
citations
#22

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong et al.

ICCV 2025
208
citations
#23

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim et al.

CVPR 2025
203
citations
#24

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li et al.

AAAI 2024 · arXiv:2308.09936
vision language models, visual question answering, multimodal large language models, text-rich image understanding, +4 more
190
citations
#25

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma et al.

ICML 2025
190
citations
#26

Revisiting Feature Prediction for Learning Visual Representations from Video

Quentin Garrido, Yann LeCun, Michael Rabbat et al.

ICLR 2025
178
citations
#27

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Peng Wu, Xuerong Zhou, Guansong Pang et al.

AAAI 2024 · arXiv:2308.11681
vision-language models, video anomaly detection, weakly supervised learning, contrastive language-image pre-training, +4 more
156
citations
#28

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.

CVPR 2025
154
citations
#29

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia et al.

ICLR 2025 · arXiv:2402.08268
long-context understanding, video-language models, language retrieval, long video understanding, +4 more
144
citations
#30

Video Language Planning

Yilun Du, Sherry Yang, Pete Florence et al.

ICLR 2024
144
citations
#31

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025
142
citations
#32

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He et al.

ECCV 2024 · arXiv:2404.03384
video understanding, large language models, hierarchical token merging, long video processing, +4 more
128
citations
#33

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

CVPR 2024
127
citations
#34

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Shengpeng Ji, Ziyue Jiang, Wen Wang et al.

ICLR 2025 · arXiv:2408.16532
acoustic codec tokenizer, audio language modeling, vector quantization, audio reconstruction, +3 more
125
citations
#35

ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu, Chen Li, Haoran Tang et al.

ECCV 2024
124
citations
#36

VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu et al.

CVPR 2024
109
citations
#37

VISA: Reasoning Video Object Segmentation via Large Language Model

Cilin Yan, Haochen Wang, Shilin Yan et al.

ECCV 2024
95
citations
#38

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang et al.

ICLR 2025
89
citations
#39

VidToMe: Video Token Merging for Zero-Shot Video Editing

Xirui Li, Chao Ma, Xiaokang Yang et al.

CVPR 2024
89
citations
#40

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao et al.

CVPR 2025
89
citations
#41

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

CVPR 2024
82
citations
#42

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Jingkang Yang, Yuhao Dong, Shuai Liu et al.

ECCV 2024
81
citations
#43

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024
79
citations
#44

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia et al.

CVPR 2024
78
citations
#45

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Sreyan Ghosh, Arushi Goel, Jaehyeon Kim et al.

NeurIPS 2025
74
citations
#46

Towards 3D Molecule-Text Interpretation in Language Models

Sihang Li, Zhiyuan Liu, Yanchen Luo et al.

ICLR 2024
73
citations
#47

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025
72
citations
#48

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

CVPR 2024
70
citations
#49

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025
70
citations
#50

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Yongdong Luo, Xiawu Zheng, Guilin Li et al.

NeurIPS 2025
69
citations
#51

Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie et al.

CVPR 2025 · arXiv:2502.21271
adaptive keyframe sampling, long video understanding, multimodal large language models, video token selection, +2 more
68
citations
#52

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan, Chaofeng Chen, Yiping Ke et al.

ECCV 2024
68
citations
#53

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz et al.

ICML 2025
66
citations
#54

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Zhen Ye, Peiwen Sun, Jiahe Lei et al.

AAAI 2025
65
citations
#55

Open-Vocabulary Video Anomaly Detection

Peng Wu, Xuerong Zhou, Guansong Pang et al.

CVPR 2024
64
citations
#56

Unifying 3D Vision-Language Understanding via Promptable Queries

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma et al.

ECCV 2024
64
citations
#57

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Yuchao Gu, Yipin Zhou, Bichen Wu et al.

CVPR 2024
63
citations
#58

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang et al.

CVPR 2024
63
citations
#59

Koala: Key Frame-Conditioned Long Video-LLM

Reuben Tan, Ximeng Sun, Ping Hu et al.

CVPR 2024
62
citations
#60

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Hyeonho Jeong, Jong Chul Ye

ICLR 2024
60
citations
#61

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

ICLR 2025
60
citations
#62

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024 · arXiv:2306.01733
visual document understanding, multi-modal transformer, local-feature alignment, document information extraction, +4 more
58
citations
#63

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Shilin Yan, Renrui Zhang, Ziyu Guo et al.

AAAI 2024 · arXiv:2305.16318
video object segmentation, multi-modal reference, temporal transformer, semantic alignment, +4 more
58
citations
#64

Long Context Tuning for Video Generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.

ICCV 2025
56
citations
#65

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 · arXiv:2502.11079
subject-consistent video generation, cross-modal alignment, text-to-video architecture, image-to-video architecture, +4 more
55
citations
#66

Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models

Lvmin Zhang, Shengqu Cai, Muyang Li et al.

NeurIPS 2025
55
citations
#67

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi et al.

CVPR 2025
54
citations
#68

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang, WonJun Moon, Euiyeon Kim et al.

AAAI 2024 · arXiv:2312.16580
zero-shot object counting, semantic-patch embeddings, visual-language representation, semantic-conditioned prompt tuning, +3 more
54
citations
#69

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 · arXiv:2410.07087
vision-language navigation, uav navigation, trajectory generation, multimodal understanding, +4 more
52
citations
#70

Discovering and Mitigating Visual Biases through Keyword Explanation

Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.

CVPR 2024
50
citations
#71

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng et al.

CVPR 2025 · arXiv:2412.06699
3d generation models, multi-view diffusion model, pose-free videos, large-scale video data, +4 more
49
citations
#72

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu et al.

CVPR 2024
49
citations
#73

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Shicheng Li, Lei Li, Yi Liu et al.

ECCV 2024
49
citations
#74

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025
49
citations
#75

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

Chaoyi Zhang, Kevin Lin, Zhengyuan Yang et al.

CVPR 2024
49
citations
#76

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Jitesh Jain, Jianwei Yang, Humphrey Shi

CVPR 2024
48
citations
#77

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yunlong Tang, Chenliang Xu et al.

AAAI 2025
47
citations
#78

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo, Luke Ong, Philip Torr et al.

ICLR 2025
47
citations
#79

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Danny Driess, Jost Springenberg, Brian Ichter et al.

NeurIPS 2025 · arXiv:2505.23705
vision-language-action models, continuous control policies, diffusion action expert, flow matching, +4 more
46
citations
#80

M-LLM Based Video Frame Selection for Efficient Video Understanding

Kai Hu, Feng Gao, Xiaohan Nie et al.

CVPR 2025
46
citations
#81

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Jeremy Irvin, Emily Liu, Joyce Chen et al.

ICLR 2025 · arXiv:2410.06234
vision-language assistant, temporal earth observation, instruction-following dataset, change detection, +4 more
45
citations
#82

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Katrin Renz, Long Chen, Elahe Arani et al.

CVPR 2025
45
citations
#83

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Xingyu Fu, Minqian Liu, Zhengyuan Yang et al.

ICML 2025
44
citations
#84

Towards Surveillance Video-and-Language Understanding: New Dataset Baselines and Challenges

Tongtong Yuan, Xuange Zhang, Kun Liu et al.

CVPR 2024
44
citations
#85

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

CVPR 2025
44
citations
#86

DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis

Pan Wang, Qiang Zhou, Yawen Wu et al.

AAAI 2025
43
citations
#87

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu et al.

NeurIPS 2025
42
citations
#88

Frame-Voyager: Learning to Query Frames for Video Large Language Models

Sicheng Yu, Chengkai Jin, Huanyu Wang et al.

ICLR 2025
42
citations
#89

Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

Shangzhe Di, Zhelun Yu, Guanghao Zhang et al.

ICLR 2025
40
citations
#90

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Zhihang Liu, Jun Li, Hongtao Xie et al.

AAAI 2024 · arXiv:2312.12155
video moment retrieval, cross-modal alignment, modality imbalance, semantic modeling, +4 more
40
citations
#91

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Yuqian Yuan, Hang Zhang, Wentong Li et al.

CVPR 2025
40
citations
#92

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Guo Chen, Yicheng Liu, Yifei Huang et al.

ICLR 2025
39
citations
#93

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak et al.

CVPR 2024
39
citations
#94

DrVideo: Document Retrieval Based Long Video Understanding

Ziyu Ma, Chenhui Gou, Hengcan Shi et al.

CVPR 2025
39
citations
#95

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Zhiyuan Yan, Yandan Zhao, Shen Chen et al.

CVPR 2025
39
citations
#96

EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng et al.

CVPR 2024
38
citations
#97

VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation

Wei Zhao, Pengxiang Ding, Zhang Min et al.

ICLR 2025
37
citations
#98

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Xin Li, Yunfei Wu, Xinghua Jiang et al.

CVPR 2024
37
citations
#99

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Junbo Niu, Yifei Li, Ziyang Miao et al.

CVPR 2025
37
citations
#100

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Kirolos Ataallah, Xiaoqian Shen, Eslam Mohamed Abdelrahman et al.

ECCV 2024
36
citations