Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation

0citations
Project
0
Citations
#595
in ICLR 2024
of 2297 papers
3
Authors
1
Data Points

Abstract

The problem of speech separation, also known as the cocktail party problem,refers to the task of isolating a single speech signal from a mixture of speechsignals. Previous work on source separation derived an upper bound for thesource separation task in the domain of human speech. This bound is derived fordeterministic models. Recent advancements in generative models challenge thisbound. We show how the upper bound can be generalized to the case of randomgenerative models. Applying a diffusion model Vocoder that was pretrained tomodel single-speaker voices on the output of a deterministic separation model leadsto state-of-the-art separation results. It is shown that this requires one to combinethe output of the separation model with that of the diffusion model. In our method,a linear combination is performed, in the frequency domain, using weights that areinferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20speakers on multiple benchmarks. In particular, for two speakers, our method isable to surpass what was previously considered the upper performance bound.

Citation History

Jan 28, 2026
0