Zhou Zhao

53 Papers
333 Total Citations

Papers (53)

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

ICLR 2025, arXiv
125
citations

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

ICLR 2024
74
citations

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

AAAI 2024, arXiv
49
citations

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

ICML 2025
28
citations

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

AAAI 2025
16
citations

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

CVPR 2025, arXiv
15
citations

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

ICLR 2025
10
citations

MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities

AAAI 2025
8
citations

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

CVPR 2025, arXiv
5
citations

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

ICCV 2025
3
citations

UniAudio: Towards Universal Audio Generation with Large Language Models

ICML 2024
0
citations

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

ICML 2024
0
citations

Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

CVPR 2019
0
citations

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

CVPR 2020, arXiv
0
citations

Cascaded Prediction Network via Segment Tree for Temporal Video Grounding

CVPR 2021
0
citations

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

CVPR 2021
0
citations

Fine-Grained Predicates Learning for Scene Graph Generation

CVPR 2022, arXiv
0
citations

MLSLT: Towards Multilingual Sign Language Translation

CVPR 2022
0
citations

Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

CVPR 2022
0
citations

Cross-Modal Background Suppression for Audio-Visual Event Localization

CVPR 2022
0
citations

DATE: Domain Adaptive Product Seeker for E-Commerce

CVPR 2023
0
citations

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

CVPR 2023
0
citations

Gloss Attention for Gloss-Free Sign Language Translation

CVPR 2023
0
citations

Cortical Surface Shape Analysis Based on Alexandrov Polyhedra

ICCV 2021
0
citations

Open-Vocabulary Object Detection With an Open Corpus

ICCV 2023
0
citations

Exploring Group Video Captioning with Efficient Relational Approximation

ICCV 2023
0
citations

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

ICCV 2023, arXiv
0
citations

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

ICCV 2023, arXiv
0
citations

ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos

CVPR 2023, arXiv
0
citations

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

CVPR 2025
0
citations

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

CVPR 2025
0
citations

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

CVPR 2025
0
citations

Open-set Cross Modal Generalization via Multimodal Unified Representation

ICCV 2025
0
citations

Speech Watermarking with Discrete Intermediate Representations

AAAI 2025
0
citations

MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization

CVPR 2024
0
citations

Dataflow-Guided Neuro-Symbolic Language Models for Type Inference

ICML 2025
0
citations

InstructSpeech: Following Speech Editing Instructions via Large Language Models

ICML 2024
0
citations

Non-confusing Generation of Customized Concepts in Diffusion Models

ICML 2024
0
citations

MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models

NeurIPS 2018
0
citations

FastSpeech: Fast, Robust and Controllable Text to Speech

NeurIPS 2019
0
citations

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

NeurIPS 2020
0
citations

Generalizable Multi-linear Attention Network

NeurIPS 2021
0
citations

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

NeurIPS 2021
0
citations

M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus

NeurIPS 2022
0
citations

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

NeurIPS 2022
0
citations

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

NeurIPS 2022
0
citations

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

NeurIPS 2022
0
citations

Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization

NeurIPS 2022
0
citations

Connecting Multi-modal Contrastive Representations

NeurIPS 2023
0
citations

PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios

NeurIPS 2023
0
citations

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

NeurIPS 2023
0
citations

Achieving Cross Modal Generalization with Multimodal Unified Representation

NeurIPS 2023
0
citations

Almost Unsupervised Text to Speech and Automatic Speech Recognition

ICML 2019
0
citations