Mike Zheng Shou

64
Papers
852
Total Citations

Papers (64)

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

CVPR 2024
318
citations

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

CVPR 2025
123
citations

VideoLLM-online: Online Video Large Language Model for Streaming Video

CVPR 2024
109
citations

Show-o2: Improved Native Unified Multimodal Models

NeurIPS 2025
90
citations

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

CVPR 2024
63
citations

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

CVPR 2025
59
citations

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

ICCV 2025
26
citations

AssistGUI: Task-Oriented PC Graphical User Interface Automation

CVPR 2024
18
citations

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

CVPR 2025
14
citations

Impossible Videos

ICML 2025
7
citations

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

ICCV 2025
7
citations

ROICtrl: Boosting Instance Control for Visual Generation

CVPR 2025
7
citations

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

CVPR 2025
4
citations

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

CVPR 2025
4
citations

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

CVPR 2025
3
citations

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

CVPR 2021
0
citations

Object-Aware Video-Language Pre-Training for Retrieval

CVPR 2022
0
citations

Unified Transformer Tracker for Object Tracking

CVPR 2022
0
citations

Ego4D: Around the World in 3,000 Hours of Egocentric Video

CVPR 2022
0
citations

Position-Guided Text Prompt for Vision-Language Pre-Training

CVPR 2023
0
citations

All in One: Exploring Unified Video-Language Pre-Training

CVPR 2023
0
citations

Making Vision Transformers Efficient From a Token Sparsification View

CVPR 2023
0
citations

Affordance Grounding From Demonstration Video To Target Image

CVPR 2023
0
citations

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering

CVPR 2023
0
citations

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval

CVPR 2023
0
citations

Channel Augmented Joint Learning for Visible-Infrared Recognition

ICCV 2021
0
citations

Searching for Two-Stream Models in Multivariate Space for Video Recognition

ICCV 2021
0
citations

Generic Event Boundary Detection: A Benchmark for Event Segmentation

ICCV 2021
0
citations

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

ICCV 2023
0
citations

Too Large; Data Reduction for Vision-Language Pre-Training

ICCV 2023
0
citations

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition

ICCV 2023
0
citations

Unsupervised Open-Vocabulary Object Localization in Videos

ICCV 2023
0
citations

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

ICCV 2023
0
citations

Learning to Learn: How to Continuously Teach Humans and Machines

ICCV 2023
0
citations

UniVTG: Towards Unified Video-Language Temporal Grounding

ICCV 2023
0
citations

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

ICCV 2023
0
citations

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

ICCV 2023
0
citations

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

ICCV 2023
0
citations

Label-Efficient Online Continual Object Detection in Streaming Video

ICCV 2023
0
citations

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

ECCV 2022
0
citations

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

ECCV 2022
0
citations

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant

ECCV 2022
0
citations

Revisiting Vision Transformer from the View of Path Ensemble

ICCV 2023
0
citations

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

CVPR 2025
0
citations

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

CVPR 2025
0
citations

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

CVPR 2025
0
citations

Factorized Learning for Temporally Grounded Video-Language Models

ICCV 2025
0
citations

Balanced Image Stylization with Style Matching Score

ICCV 2025
0
citations

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

AAAI 2025
0
citations

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

CVPR 2024
0
citations

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

CVPR 2024
0
citations

Tune-An-Ellipse: CLIP Has Potential to Find What You Want

CVPR 2024
0
citations

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

CVPR 2024
0
citations

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

CVPR 2024
0
citations

Bootstrapping SparseFormers from Vision Foundation Models

CVPR 2024
0
citations

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

CVPR 2024
0
citations

ViT-Lens: Towards Omni-modal Representations

CVPR 2024
0
citations

Egocentric Video-Language Pretraining

NeurIPS 2022
0
citations

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

NeurIPS 2022
0
citations

Object-centric Learning with Cyclic Walks between Parts and Whole

NeurIPS 2023
0
citations

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

NeurIPS 2023
0
citations

XAGen: 3D Expressive Human Avatars Generation

NeurIPS 2023
0
citations

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

NeurIPS 2023
0
citations

Learning Visual Prior via Generative Pre-Training

NeurIPS 2023
0
citations