Yali Wang

18 Papers
2,060 Total Citations

Papers (18)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024 · 864 citations

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
ICLR 2024 · 408 citations

VideoMamba: State Space Model for Efficient Video Understanding
ECCV 2024 · 396 citations

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
ICLR 2024 · 209 citations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
CVPR 2024 · 84 citations

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
ICLR 2025 · 41 citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025 · 19 citations

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
ICLR 2025 · 11 citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
ICLR 2025 · 9 citations

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025 · 8 citations

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
AAAI 2025 · 6 citations

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
CVPR 2025 · 5 citations

M-BEV: Masked BEV Perception for Robust Autonomous Driving
AAAI 2024 · 0 citations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
AAAI 2025 · 0 citations

Vlogger: Make Your Dream A Vlog
CVPR 2024 · 0 citations

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
ICCV 2025 · 0 citations

WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
CVPR 2025 · 0 citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
ICML 2024 · 0 citations