Mike Zheng Shou

29 Papers

852 Total Citations

Papers (29)

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

VideoLLM-online: Online Video Large Language Model for Streaming Video

Show-o2: Improved Native Unified Multimodal Models

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

AssistGUI: Task-Oriented PC Graphical User Interface Automation

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Impossible Videos

ROICtrl: Boosting Instance Control for Visual Generation

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

Tune-An-Ellipse: CLIP Has Potential to Find What You Want

Balanced Image Stylization with Style Matching Score

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Bootstrapping SparseFormers from Vision Foundation Models

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

ViT-Lens: Towards Omni-modal Representations

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Factorized Learning for Temporally Grounded Video-Language Models

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting