Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

16 citations · #359 of 3340 papers in ICML 2025

Abstract

Activation sparsity denotes the presence of substantial numbers of weakly contributing neurons within the feed-forward networks of large language models (LLMs), offering wide-ranging potential benefits such as computation acceleration. However, existing works lack a thorough quantitative study of this useful property, in terms of both its measurement and its influential factors. In this paper, we address three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? Specifically, we develop a generalizable and performance-friendly metric, named CETT-PPL-1%, to measure activation sparsity. Based on CETT-PPL-1%, we quantitatively study the influence of various factors and observe several important phenomena, such as the convergent power-law relationship between sparsity and the amount of training data, the higher competence of ReLU activation over the mainstream SiLU activation, the potential sparsity benefit of a small width-depth ratio, and the scale insensitivity of activation sparsity. Finally, we provide implications for building sparse and effective LLMs, and demonstrate the reliability of our findings by training a 2.4B model with a sparsity ratio of 93.52%, which achieves a 4.1× speedup compared with its dense counterpart. The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/.
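
To make the notion of a sparsity ratio concrete, the sketch below illustrates how one might count weakly contributing FFN neurons under a magnitude threshold. This is a minimal illustration only, not the paper's implementation: it assumes a toy gated FFN with a ReLU activation, uses an arbitrary threshold, and only indicates (without reproducing) the CETT-PPL-1% calibration step in which the threshold is chosen so that perplexity degrades by at most 1%.

```python
# Illustrative sketch only: estimates an activation sparsity ratio for a gated
# FFN block by counting neuron outputs whose contribution magnitude falls below
# a threshold. The real CETT-PPL-1% metric calibrates this threshold against a
# 1% perplexity increase; that calibration loop is NOT shown here. All tensor
# shapes, the contribution proxy, and the threshold value are assumptions.
import torch
import torch.nn.functional as F

def ffn_neuron_contributions(x, w_gate, w_up, w_down):
    """Per-neuron activations of a gated FFN: relu(x W_gate) * (x W_up),
    with each intermediate neuron feeding one row of W_down."""
    gate = F.relu(x @ w_gate)              # ReLU activation, as favored in the paper
    hidden = gate * (x @ w_up)             # (tokens, d_ff) neuron-wise activations
    # contribution magnitude of neuron j ~= |hidden_j| * ||W_down[j, :]|| (a proxy)
    contrib = hidden.abs() * w_down.norm(dim=1)
    return contrib

def sparsity_ratio(contrib, threshold):
    """Fraction of neuron activations below the threshold, i.e. those that
    could be skipped (treated as inactive) at inference time."""
    return (contrib <= threshold).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_ff, tokens = 64, 256, 32    # toy sizes (assumption)
    x = torch.randn(tokens, d_model)
    w_gate = torch.randn(d_model, d_ff) / d_model**0.5
    w_up = torch.randn(d_model, d_ff) / d_model**0.5
    w_down = torch.randn(d_ff, d_model) / d_ff**0.5
    contrib = ffn_neuron_contributions(x, w_gate, w_up, w_down)
    # Under CETT-PPL-1%, the threshold would be the largest value keeping the
    # perplexity increase within 1%; here an arbitrary value stands in for it.
    print(f"sparsity ratio: {sparsity_ratio(contrib, threshold=0.05):.2%}")
```

In the paper's setting, the reported 93.52% sparsity means that, under the perplexity-calibrated threshold, roughly 93.5% of such neuron activations can be skipped per token, which is what enables the reported inference speedup.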

Citation History

Jan 28, 2026: 0
Feb 13, 2026: 16