Zhou Zhao
53 Papers
333 Total Citations
Papers (53)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
ICLR 2025 (arXiv)
125
citations
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
ICLR 2024
74
citations
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
AAAI 2024 (arXiv)
49
citations
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
ICML 2025
28
citations
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
AAAI 2025
16
citations
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
CVPR 2025 (arXiv)
15
citations
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
ICLR 2025
10
citations
MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities
AAAI 2025
8
citations
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
CVPR 2025 (arXiv)
5
citations
Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
ICCV 2025
3
citations
UniAudio: Towards Universal Audio Generation with Large Language Models
ICML 2024
0
citations
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
ICML 2024
0
citations
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
CVPR 2019
0
citations
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
CVPR 2020 (arXiv)
0
citations
Cascaded Prediction Network via Segment Tree for Temporal Video Grounding
CVPR 2021
0
citations
Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
CVPR 2021
0
citations
Fine-Grained Predicates Learning for Scene Graph Generation
CVPR 2022 (arXiv)
0
citations
MLSLT: Towards Multilingual Sign Language Translation
CVPR 2022
0
citations
Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks
CVPR 2022
0
citations
Cross-Modal Background Suppression for Audio-Visual Event Localization
CVPR 2022
0
citations
DATE: Domain Adaptive Product Seeker for E-Commerce
CVPR 2023
0
citations
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
CVPR 2023
0
citations
Gloss Attention for Gloss-Free Sign Language Translation
CVPR 2023
0
citations
Cortical Surface Shape Analysis Based on Alexandrov Polyhedra
ICCV 2021
0
citations
Open-Vocabulary Object Detection With an Open Corpus
ICCV 2023
0
citations
Exploring Group Video Captioning with Efficient Relational Approximation
ICCV 2023
0
citations
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
ICCV 2023 (arXiv)
0
citations
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023 (arXiv)
0
citations
ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos
CVPR 2023 (arXiv)
0
citations
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
CVPR 2025
0
citations
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
CVPR 2025
0
citations
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
CVPR 2025
0
citations
Open-set Cross Modal Generalization via Multimodal Unified Representation
ICCV 2025
0
citations
Speech Watermarking with Discrete Intermediate Representations
AAAI 2025
0
citations
MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization
CVPR 2024
0
citations
Dataflow-Guided Neuro-Symbolic Language Models for Type Inference
ICML 2025
0
citations
InstructSpeech: Following Speech Editing Instructions via Large Language Models
ICML 2024
0
citations
Non-confusing Generation of Customized Concepts in Diffusion Models
ICML 2024
0
citations
MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models
NeurIPS 2018
0
citations
FastSpeech: Fast, Robust and Controllable Text to Speech
NeurIPS 2019
0
citations
Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
NeurIPS 2020
0
citations
Generalizable Multi-linear Attention Network
NeurIPS 2021
0
citations
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
NeurIPS 2021
0
citations
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus
NeurIPS 2022
0
citations
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
NeurIPS 2022
0
citations
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
NeurIPS 2022
0
citations
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
NeurIPS 2022
0
citations
Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization
NeurIPS 2022
0
citations
Connecting Multi-modal Contrastive Representations
NeurIPS 2023
0
citations
PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios
NeurIPS 2023
0
citations
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
NeurIPS 2023
0
citations
Achieving Cross Modal Generalization with Multimodal Unified Representation
NeurIPS 2023
0
citations
Almost Unsupervised Text to Speech and Automatic Speech Recognition
ICML 2019
0
citations