MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

ICLR 2025 (#1019 of 3827 papers) · 8 citations · 3 authors

Abstract

We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. The framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy UNet-based models, MDSGen employs denoising masked diffusion transformers, enabling efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves $97.9\%$ alignment accuracy, using $172\times$ fewer parameters and $371\%$ less memory, and offering $36\times$ faster inference than the current 860M-parameter state-of-the-art model ($93.9\%$ accuracy). The larger model (131M parameters) reaches nearly $99\%$ accuracy while requiring $6.5\times$ fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
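The abstract does not spell out how the temporal-aware masking works; as a rough illustration of what such a strategy over a time–frequency grid of spectrogram tokens could look like, here is a minimal PyTorch sketch. The function name, tensor shapes, and frame-wise masking rule are assumptions for illustration, not MDSGen's actual implementation:

```python
import torch

def temporal_aware_mask(batch: int, n_time: int, n_freq: int,
                        mask_ratio: float = 0.7) -> torch.Tensor:
    """Hypothetical sketch: mask whole time frames of a (time x freq)
    spectrogram token grid, rather than masking tokens uniformly at random.

    Returns a boolean mask of shape (batch, n_time * n_freq) where True
    marks tokens hidden from the transformer. Masking entire frames forces
    reconstruction from temporal context instead of from same-frame
    frequency neighbors.
    """
    n_masked_frames = int(n_time * mask_ratio)
    # Independently choose which time frames to hide for each sample.
    scores = torch.rand(batch, n_time)
    masked_frames = scores.argsort(dim=1)[:, :n_masked_frames]  # (B, n_masked)
    frame_mask = torch.zeros(batch, n_time, dtype=torch.bool)
    frame_mask.scatter_(1, masked_frames, True)
    # Broadcast the per-frame decision across all frequency tokens in a frame.
    return frame_mask[:, :, None].expand(-1, -1, n_freq).reshape(batch, -1)

# Example: 4 samples, 16 time frames x 8 frequency bins, ~70% of frames hidden.
mask = temporal_aware_mask(4, 16, 8)
print(mask.shape, mask.float().mean().item())  # torch.Size([4, 128]) 0.6875
```

The design intent sketched here is that frame-level masking aligns the mask structure with the temporal axis of audio, which is one plausible reading of "leveraging temporal context"; the paper's code at the link above is the authoritative reference.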
