PolyVoice: Language Models for Speech to Speech Translation

29citations
Project
29
Citations
#240
in ICLR 2024
of 2297 papers
18
Authors
1
Data Points

Abstract

With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks.Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish language pairs. Experimental results demonstrate that \method outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality.Speech samples are available at https://polyvoice.github.io.

Citation History

Jan 27, 2026
29