Ziwei Liu

Papers: 154

Total Citations: 3,483

h-index: 10

Papers (154)

VBench: Comprehensive Benchmark Suite for Video Generative Models

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Knowledge Distillation Meets Self-Supervision

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

VideoBooth: Diffusion-based Video Generation with Image Prompts

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

InstructVideo: Instructing Video Diffusion Models with Human Feedback

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Digital Life Project: Autonomous 3D Characters with Social Intelligence

Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Generative Gaussian Splatting for Unbounded 3D City Generation

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Multi-Space Alignments Towards Universal LiDAR Segmentation

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Material Anything: Generating Materials for Any 3D Object via Diffusion

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

Move Anything with Layered Scene Diffusion

EgoLM: Multi-Modal Language Model of Egocentric Motions

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Self-Supervised Scene De-Occlusion

When NAS Meets Robustness: In Search of Robust Architectures Against Adversarial Attacks

Online Deep Clustering for Unsupervised Representation Learning

Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images

MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

Open Compound Domain Adaptation

Visually Informed Binaural Audio Generation without Binaural Audios

Adversarial Robustness Under Long-Tailed Distribution

Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination

LiDAR-Based Panoptic Segmentation via Dynamic Shifting Network

Seesaw Loss for Long-Tailed Instance Segmentation

Variational Relational Point Completion Network

ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis

Deep Animation Video Interpolation in the Wild

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

Robust Reference-Based Super-Resolution via C2-Matching

Delving Deep Into the Generalization of Vision Transformers Under Distribution Shifts

Versatile Multi-Modal Pre-Training for Human-Centric Perception

Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer

TCTrack: Temporal Contexts for Aerial Tracking

Balanced MSE for Imbalanced Visual Regression

Bailando: 3D Dance Generation by Actor-Critic GPT With Choreographic Memory

Conditional Prompt Learning for Vision-Language Models

Full-Range Virtual Try-On With Recurrent Tri-Level Transform

Unsupervised Image-to-Image Translation With Generative Prior

F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator

LaserMix for Semi-Supervised LiDAR Semantic Segmentation

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Panoptic Video Scene Graph Generation

Detecting and Grounding Multi-Modal Media Manipulation

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Semantic Image Segmentation via Deep Parsing Network

Deep Learning Face Attributes in the Wild

Video Frame Synthesis Using Deep Voxel Flow

Vision-Infused Deep Audio Inpainting

CARAFE: Content-Aware ReAssembly of FEatures

Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild

Unsupervised Domain Adaptive 3D Detection With Multi-Level Consistency

Differentiable Dynamic Wirings for Neural Networks

Talk-To-Edit: Fine-Grained Facial Editing via Dialog

Incorporating Convolution Designs Into Visual Transformers

Semantically Coherent Out-of-Distribution Detection

BlockPlanner: City Block Generation With Vectorized Graph Representation

Energy-Based Open-World Uncertainty Modeling for Confidence Calibration

Deep Geometrized Cartoon Line Inbetweening

Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing

SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering

SparseNeRF: Distilling Depth Ranking for Few-shot Novel View Synthesis

DeformToon3D: Deformable Neural Radiance Fields for 3D Toonification

UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation

StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces

ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

Rethinking Range View Representation for LiDAR Segmentation

HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

SHERF: Generalizable Human NeRF from a Single Image

Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets

CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations

Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement

Placepedia: Comprehensive Place Understanding with Multi-Faceted Annotations

UNIF: United Neural Implicit Functions for Clothed Human Reconstruction and Animation

HuMMan: Multi-modal 4D Human Dataset for Versatile Sensing and Modeling

Benchmarking Omni-Vision Representation through the Lens of Visual Realms

CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

Detecting and Recovering Sequential DeepFake Manipulation

Relighting4D: Neural Relightable Human from Videos

StyleSwap: Style-Based Generator Empowers Robust Face Swapping

Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

StyleLight: HDR Panorama Generation for Lighting Estimation and Editing

StyleGAN-Human: A Data-Centric Odyssey of Human Generation

X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

Panoptic Scene Graph Generation

Mind the Gap in Distilling StyleGANs

Text2Performer: Text-Driven Human Video Generation

LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

EgoLife: Towards Egocentric Life Assistant

WildAvatar: Learning In-the-wild 3D Avatars from the Web

GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

SIGMA: Selective Gated Mamba for Sequential Recommendation

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

URHand: Universal Relightable Hands

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Vlogger: Make Your Dream A Vlog

FreeU: Free Lunch in Diffusion U-Net

Link-Context Learning for Multimodal LLMs

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations

Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade

Self-Supervised Learning via Conditional Motion Propagation

Large-Scale Long-Tailed Recognition in an Open World

Hybrid Task Cascade for Instance Segmentation

Few-Shot Object Detection via Association and DIscrimination

Garment4D: Garment Reconstruction from Point Cloud Sequences

Unsupervised Object-Level Representation Learning from Scene Images

Balanced Chamfer Distance as a Comprehensive Metric for Point Cloud Completion

AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies

Audio-Driven Co-Speech Gesture Video Generation

Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond Algorithms

OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

Towards Robust and Expressive Whole-body Human Pose and Shape Estimation

What Makes Good Examples for Visual In-Context Learning?

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

InsActor: Instruction-driven Physics-based Characters

4D Panoptic Scene Graph Generation

Large Language Models are Visual Reasoning Coordinators