Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

#2008 of 3827 papers in ICLR 2025

Authors

Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, CHUANG To, Eunjung Yeo, Kalvin Chang, Chung-Ming Chien, Kwanghee Choi, Cheng-Hsiu Hsieh, Yi-Cheng Lin, Chee-En Yu, I-Hsiang Chiu, Heitor Rodrigues Guimarães, Jionghao Han, Tzu-Quan Lin, Tzu-Yuan Lin, Homu Chang, Ting-Wu Chang, Chun Chen, Shou-Jen Chen, Yu-Hua Chen, Hsi-Chun Cheng, Kunal Dhawan, Jia-Lin Fang, Shi-Xin Fang, KUAN CHIANG, Chi-An Fu, Hsien-Fu Hsiao, Ching Hsu, Shao-Syuan Huang, Lee Wei, Hsi-Che Lin, Hsuan-Hao Lin, Hsuan-Ting Lin, Jian-Ren Lin, Ting-Chun Liu, Li-Chun Lu, Tsung-Min Pai, Ankita Pasad, Shih-Yun Kuan, Suwon Shon, Yuxun Tang, Yun-Shao Tsai, Wei Chiang, Tzu-Chieh Wei, Chengxi Wu, Dien-Ruei Wu, Chao-Han Huck Yang, Chieh-Chi Yang, Jia Qi Yip, Shao-Xiang Yuan, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-yi Lee

Abstract

Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.
