Yizhuo Li

Papers: 10

Total Citations: 1,279

Papers (10)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

PGT: A Progressive Method for Training Models on Long Videos

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model

HOI Analysis: Integrating and Decomposing Human-Object Interaction

Test-Time Personalization with a Transformer for Human Pose Estimation