Zhou Zhao

Papers: 21

Total Citations: 333

Papers (21)

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

Dataflow-Guided Neuro-Symbolic Language Models for Type Inference

InstructSpeech: Following Speech Editing Instructions via Large Language Models

Non-confusing Generation of Customized Concepts in Diffusion Models

UniAudio: Towards Universal Audio Generation with Large Language Models

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Open-set Cross Modal Generalization via Multimodal Unified Representation

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Speech Watermarking with Discrete Intermediate Representations