Yali Wang
37
Papers
2,058
Total Citations
Papers (37)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024
864
citations
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
ICLR 2024
408
citations
VideoMamba: State Space Model for Efficient Video Understanding
ECCV 2024
396
citations
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
ICLR 2024
209
citations
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
CVPR 2024
84
citations
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
ICLR 2025
39
citations
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025, arXiv
19
citations
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
ICLR 2025
11
citations
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
ICLR 2025
9
citations
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
8
citations
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
AAAI 2025
6
citations
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
CVPR 2025
5
citations
Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection
CVPR 2022, arXiv
0
citations
Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition
CVPR 2022
0
citations
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation
CVPR 2022, arXiv
0
citations
VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking
CVPR 2023, arXiv
0
citations
Starting From Non-Parametric Networks for 3D Point Cloud Analysis
CVPR 2023, arXiv
0
citations
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency
CVPR 2023
0
citations
RPAN: An End-To-End Recurrent Pose-Attention Network for Action Recognition in Videos
ICCV 2017
0
citations
Digging Into Uncertainty in Self-Supervised Multi-View Stereo
ICCV 2021, arXiv
0
citations
UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding
ICCV 2023
0
citations
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
ICCV 2023, arXiv
0
citations
HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
ICCV 2023
0
citations
Mining Inter-Video Proposal Relations for Video Object Detection
ECCV 2020
0
citations
Self-Slimmed Vision Transformer
ECCV 2022
0
citations
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
ECCV 2022
0
citations
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
CVPR 2025
0
citations
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
ICCV 2025
0
citations
Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
AAAI 2025
0
citations
M-BEV: Masked BEV Perception for Robust Autonomous Driving
AAAI 2024, arXiv
0
citations
Vlogger: Make Your Dream A Vlog
CVPR 2024
0
citations
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
ICML 2024
0
citations
Temporal Hallucinating for Action Recognition With Few Still Images
CVPR 2018
0
citations
MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
CVPR 2019
0
citations
Adaptive Pyramid Context Network for Semantic Segmentation
CVPR 2019
0
citations
PA3D: Pose-Action 3D Machine for Video Recognition
CVPR 2019
0
citations
SmallBigNet: Integrating Core and Contextual Views for Video Classification
CVPR 2020, arXiv
0
citations