Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

16 citations · #359 of 3340 papers in ICML 2025

Abstract

Activation sparsity denotes the presence of substantial numbers of weakly contributing neurons within the feed-forward networks of large language models (LLMs), offering wide-ranging potential benefits such as computation acceleration. However, existing works lack a thorough quantitative study of this useful property, in terms of both its measurement and its influential factors. In this paper, we address three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? Specifically, we develop a generalizable and performance-friendly metric, named CETT-PPL-1%, to measure activation sparsity. Based on CETT-PPL-1%, we quantitatively study the influence of various factors and observe several important phenomena, such as the convergent power-law relationship between sparsity and the amount of training data, the higher competence of ReLU activation over the mainstream SiLU activation, the potential sparsity benefit of a small width-depth ratio, and the scale insensitivity of activation sparsity. Finally, we provide implications for building sparse and effective LLMs, and demonstrate the reliability of our findings by training a 2.4B model with a sparsity ratio of 93.52%, which achieves a 4.1× speedup compared with its dense counterpart. The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/.
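
To make the notion of a sparsity ratio concrete, the sketch below illustrates how one might count weakly contributing FFN neurons under a magnitude threshold. This is a minimal illustration only, not the paper's implementation: it assumes a toy gated FFN with a ReLU activation, uses an arbitrary threshold, and only indicates (without reproducing) the CETT-PPL-1% calibration step in which the threshold is chosen so that perplexity degrades by at most 1%.

```python
# Illustrative sketch only: estimates an activation sparsity ratio for a gated
# FFN block by counting neuron outputs whose contribution magnitude falls below
# a threshold. The real CETT-PPL-1% metric calibrates this threshold against a
# 1% perplexity increase; that calibration loop is NOT shown here. All tensor
# shapes, the contribution proxy, and the threshold value are assumptions.
import torch
import torch.nn.functional as F

def ffn_neuron_contributions(x, w_gate, w_up, w_down):
    """Per-neuron activations of a gated FFN: relu(x W_gate) * (x W_up),
    with each intermediate neuron feeding one row of W_down."""
    gate = F.relu(x @ w_gate)              # ReLU activation, as favored in the paper
    hidden = gate * (x @ w_up)             # (tokens, d_ff) neuron-wise activations
    # contribution magnitude of neuron j ~= |hidden_j| * ||W_down[j, :]|| (a proxy)
    contrib = hidden.abs() * w_down.norm(dim=1)
    return contrib

def sparsity_ratio(contrib, threshold):
    """Fraction of neuron activations below the threshold, i.e. those that
    could be skipped (treated as inactive) at inference time."""
    return (contrib <= threshold).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_ff, tokens = 64, 256, 32    # toy sizes (assumption)
    x = torch.randn(tokens, d_model)
    w_gate = torch.randn(d_model, d_ff) / d_model**0.5
    w_up = torch.randn(d_model, d_ff) / d_model**0.5
    w_down = torch.randn(d_ff, d_model) / d_ff**0.5
    contrib = ffn_neuron_contributions(x, w_gate, w_up, w_down)
    # Under CETT-PPL-1%, the threshold would be the largest value keeping the
    # perplexity increase within 1%; here an arbitrary value stands in for it.
    print(f"sparsity ratio: {sparsity_ratio(contrib, threshold=0.05):.2%}")
```

In the paper's setting, the reported 93.52% sparsity means that, under the perplexity-calibrated threshold, roughly 93.5% of such neuron activations can be skipped per token, which is what enables the reported inference speedup.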

Citation History

Jan 28, 2026: 0
Feb 13, 2026: 16