Deli Zhao

Papers: 41

Total Citations: 204

Papers (41)

Space Group Constrained Crystal Generation

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Latent Space Editing in Transformer-Based Flow Matching

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Lipschitz Singularities in Diffusion Models

Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner

MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra

Universally Invariant Learning in Equivariant GNNs

RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training

Neural Dependencies Emerging From Learning Massive Categories

Dimensionality-Varying Diffusion Process

LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook

LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis

Space-time Prompting for Video Class-incremental Learning

ViM: Vision Middleware for Unified Downstream Transferring

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-Trained Vision-Language Models

Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval

Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

RLIPv2: Fast Scaling of Relational Language-Image Pre-Training

In-Domain GAN Inversion for Real Image Editing

Self-Organizing Pathway Expansion for Non-Exemplar Class-Incremental Learning

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

AnyDoor: Zero-shot Object-level Image Customization

A New Retraction for Accelerating the Riemannian Three-Factor Low-Rank Matrix Completion Algorithm

Sparse Coding and Dictionary Learning With Linear Dynamical Systems

Weakly Supervised High-Fidelity Clothing Model Generation

MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition

DeepExposure: Learning to Expose Photos with Asynchronously Reinforced Adversarial Learning

Low-Rank Subspaces in GANs

Improving 3D-aware Image Synthesis with A Geometry-aware Discriminator

Improving GANs with A Dynamic Discriminator

Rank Diminishing in Deep Neural Networks

VideoComposer: Compositional Video Synthesis with Motion Controllability

FaceComposer: A Unified Model for Versatile Facial Content Creation

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone

Customizable Image Synthesis with Multiple Subjects

MomentDiff: Generative Video Moment Retrieval from Random to Real