Yali Wang

37
Papers
2,058
Total Citations

Papers (37)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024
864
citations

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

ICLR 2024
408
citations

VideoMamba: State Space Model for Efficient Video Understanding

ECCV 2024
396
citations

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

ICLR 2024
209
citations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

CVPR 2024
84
citations

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

ICLR 2025
39
citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025
19
citations

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

ICLR 2025
11
citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025
9
citations

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025
8
citations

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

AAAI 2025
6
citations

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

CVPR 2025
5
citations

Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection

CVPR 2022
0
citations

Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition

CVPR 2022
0
citations

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation

CVPR 2022
0
citations

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

CVPR 2023
0
citations

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

CVPR 2023
0
citations

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

CVPR 2023
0
citations

RPAN: An End-To-End Recurrent Pose-Attention Network for Action Recognition in Videos

ICCV 2017
0
citations

Digging Into Uncertainty in Self-Supervised Multi-View Stereo

ICCV 2021
0
citations

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

ICCV 2023
0
citations

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023
0
citations

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

ICCV 2023
0
citations

Mining Inter-Video Proposal Relations for Video Object Detection

ECCV 2020
0
citations

Self-Slimmed Vision Transformer

ECCV 2022
0
citations

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

ECCV 2022
0
citations

WeGen: A Unified Model for Interactive Multimodal Generation as We Chat

CVPR 2025
0
citations

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

ICCV 2025
0
citations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

AAAI 2025
0
citations

M-BEV: Masked BEV Perception for Robust Autonomous Driving

AAAI 2024
0
citations

Vlogger: Make Your Dream A Vlog

CVPR 2024
0
citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024
0
citations

Temporal Hallucinating for Action Recognition With Few Still Images

CVPR 2018
0
citations

MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition

CVPR 2019
0
citations

Adaptive Pyramid Context Network for Semantic Segmentation

CVPR 2019
0
citations

PA3D: Pose-Action 3D Machine for Video Recognition

CVPR 2019
0
citations

SmallBigNet: Integrating Core and Contextual Views for Video Classification

CVPR 2020
0
citations