Mike Zheng Shou

64 Papers

852 Total Citations

Papers (64)

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

VideoLLM-online: Online Video Large Language Model for Streaming Video

Show-o2: Improved Native Unified Multimodal Models

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

AssistGUI: Task-Oriented PC Graphical User Interface Automation

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Impossible Videos

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

ROICtrl: Boosting Instance Control for Visual Generation

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Object-Aware Video-Language Pre-Training for Retrieval

Unified Transformer Tracker for Object Tracking

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Position-Guided Text Prompt for Vision-Language Pre-Training

All in One: Exploring Unified Video-Language Pre-Training

Making Vision Transformers Efficient From a Token Sparsification View

Affordance Grounding From Demonstration Video To Target Image

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval

Channel Augmented Joint Learning for Visible-Infrared Recognition

Searching for Two-Stream Models in Multivariate Space for Video Recognition

Generic Event Boundary Detection: A Benchmark for Event Segmentation

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Too Large; Data Reduction for Vision-Language Pre-Training

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition

Unsupervised Open-Vocabulary Object Localization in Videos

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Learning to Learn: How to Continuously Teach Humans and Machines

UniVTG: Towards Unified Video-Language Temporal Grounding

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

Label-Efficient Online Continual Object Detection in Streaming Video

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant

Revisiting Vision Transformer from the View of Path Ensemble

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Factorized Learning for Temporally Grounded Video-Language Models

Balanced Image Stylization with Style Matching Score

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

Tune-An-Ellipse: CLIP Has Potential to Find What You Want

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Bootstrapping SparseFormers from Vision Foundation Models

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

ViT-Lens: Towards Omni-modal Representations

Egocentric Video-Language Pretraining

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

Object-centric Learning with Cyclic Walks between Parts and Whole

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

XAGen: 3D Expressive Human Avatars Generation

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Learning Visual Prior via Generative Pre-Training