DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors

28 citations · ranked #349 of 3,827 papers in ICLR 2025 · 5 authors
Abstract

Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot performance to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, all without relying on domain-specific factors. Speech samples are available at https://ditto-tts.github.io.
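The abstract highlights two design choices: a Diffusion Transformer (DiT) that denoises speech latents conditioned on text without phoneme or duration labels, and a speech length predictor that enables variable-length generation. The sketch below is an illustrative, minimal rendering of those two ideas, not the authors' implementation; all module names, dimensions, and the use of cross-attention for text conditioning are assumptions, and timestep conditioning (e.g., adaLN in a full DiT) is omitted for brevity.

```python
# Minimal sketch (assumed, not the authors' code) of a DiT-style denoiser over
# variable-length speech latents plus a speech length predictor.
import torch
import torch.nn as nn


class SpeechLengthPredictor(nn.Module):
    """Predicts the number of speech-latent frames from pooled text features
    (replacing explicit phoneme/duration supervision)."""
    def __init__(self, d_model: int = 512, max_frames: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, max_frames))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_len, d_model) -> logits over candidate lengths
        return self.proj(text_emb.mean(dim=1))


class DiTBlock(nn.Module):
    """One transformer block: self-attention over noisy latents,
    cross-attention to text embeddings, then a feed-forward layer."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, text_emb):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        return x + self.ff(self.norm3(x))


if __name__ == "__main__":
    batch, text_len, d_model = 2, 32, 512
    text_emb = torch.randn(batch, text_len, d_model)

    # 1) Variable-length modeling: predict how many latent frames to generate.
    n_frames = SpeechLengthPredictor(d_model)(text_emb).argmax(dim=-1).max().item()

    # 2) Denoise a noisy latent sequence of that length with stacked DiT blocks.
    x = torch.randn(batch, max(n_frames, 1), d_model)
    for blk in [DiTBlock(d_model) for _ in range(4)]:
        x = blk(x, text_emb)
    print(x.shape)  # (batch, n_frames, d_model)
```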

Citation History

Jan 26, 2026: 0
Jan 27, 2026: 0
Feb 1, 2026: 28