Zicheng Liu

Papers: 58

Total Citations: 283

Papers (58)

MogaNet: Multi-order Gated Aggregation Network

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

SemiReward: A General Reward Model for Semi-supervised Learning

PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction

CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

DaCapo: Score Distillation as Stacked Bridge for Fast and High-quality 3D Editing

Exploring Invariance in Images through One-way Wave Equations

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Large Scale Incremental Learning

Rethinking Classification and Localization for Object Detection

Dynamic Convolution: Attention Over Convolution Kernels

Probabilistic Model Distillation for Semantic Correspondence

End-to-End Human Pose and Mesh Reconstruction with Transformers

Mobile-Former: Bridging MobileNet and Transformer

Lifelong Unsupervised Domain Adaptive Person Re-Identification With Coordinated Anti-Forgetting and Adaptation

Cross-Modal Representation Learning for Zero-Shot Action Recognition

SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Injecting Semantic Concepts Into End-to-End Image Captioning

Scaling Up Vision-Language Pre-Training for Image Captioning

Deep Frequency Filtering for Domain Generalization

Adaptive Human Matting for Dynamic Videos

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

Binary Latent Diffusion

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Compressing Visual-Linguistic Model via Knowledge Distillation

End-to-End Semi-Supervised Object Detection With Soft Teacher

Mesh Graphormer

MicroNet: Improving Image Recognition With Extremely Low FLOPs

Equivariant Similarity for Vision-Language Foundation Models

Dynamic ReLU

A Simple Approach and Benchmark for 21,000-Category Object Detection

AutoMix: Unveiling the Power of Mixup for Stronger Classifiers

Should All Proposals Be Treated Equally in Object Detection?

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

ReCo: Region-Controlled Text-to-Image Generation

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

MyGO: Virtual Reality Locomotion Prediction using Multitask Learning

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

DisCo: Disentangled Control for Realistic Human Dance Generation

Segment and Caption Anything

Completing Visual Objects via Bridging Generation and Segmentation

VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling

PPFLOW: Target-Aware Peptide Design with Torsional Flow Matching

Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences

Stronger NAS with Weaker Predictors

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Towards Reasonable Budget Allocation in Untargeted Graph Structure Attacks via Gradient Debias

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

PaintSeg: Painting Pixels for Training-free Segmentation

Harnessing Hard Mixed Samples with Decoupled Regularizer

OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning