SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

ICLR 2024 (5 authors)

Abstract

Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
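
As a rough illustration of the residual vector quantization (RVQ) mentioned in the abstract, the sketch below shows the generic technique in NumPy: each layer quantizes the residual left by the previous layer, so every frame receives one code index per layer. This is not the SpeechTokenizer implementation. The function names `rvq_encode` and `rvq_decode`, the tiny hand-written codebooks, and the 2-dimensional input are all hypothetical and chosen only to keep the example small.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy RVQ: x is (n, d); codebooks is a list of (k, d) arrays (illustrative)."""
    residual = np.asarray(x, dtype=np.float64).copy()
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword: (n, k).
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # The next layer only sees what this layer failed to capture.
        residual = residual - cb[idx]
    return np.stack(codes, axis=0)  # shape (num_layers, n)

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from each layer."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

# Toy codebooks (assumed values, not learned from data).
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),   # layer 1: coarse
    np.array([[0.0, 0.0], [0.5, 0.0]]),   # layer 2: refines the residual
]
x = np.array([[1.4, 1.0]])
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

In this toy setup the first layer captures the coarse position of the input and the second layer refines what remains, which mirrors, in a very loose sense, the hierarchical split across RVQ layers that the abstract describes.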
