Zhou Zhao

53 Papers

333 Total Citations

Papers (53)

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

UniAudio: Towards Universal Audio Generation with Large Language Models

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Cascaded Prediction Network via Segment Tree for Temporal Video Grounding

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

Fine-Grained Predicates Learning for Scene Graph Generation

MLSLT: Towards Multilingual Sign Language Translation

Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

Cross-Modal Background Suppression for Audio-Visual Event Localization

DATE: Domain Adaptive Product Seeker for E-Commerce

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

Gloss Attention for Gloss-Free Sign Language Translation

Cortical Surface Shape Analysis Based on Alexandrov Polyhedra

Open-Vocabulary Object Detection With an Open Corpus

Exploring Group Video Captioning with Efficient Relational Approximation

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Open-set Cross Modal Generalization via Multimodal Unified Representation

Speech Watermarking with Discrete Intermediate Representations

MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization

Dataflow-Guided Neuro-Symbolic Language Models for Type Inference

InstructSpeech: Following Speech Editing Instructions via Large Language Models

Non-confusing Generation of Customized Concepts in Diffusion Models

MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models

FastSpeech: Fast, Robust and Controllable Text to Speech

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

Generalizable Multi-linear Attention Network

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization

Connecting Multi-modal Contrastive Representations

PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

Achieving Cross Modal Generalization with Multimodal Unified Representation

Almost Unsupervised Text to Speech and Automatic Speech Recognition