Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

0 citations · #607 of 2297 papers in ICLR 2024 · 6 authors

Abstract

With the rapid growth of large language models (LLMs), the demand for memory and computation keeps increasing. Recent efforts on post-training pruning of LLMs aim to reduce model size and computation requirements, yet the performance is still sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs. The proposed solution has two innovative components: 1) Relative Importance and Activations (RIA), a new pruning metric that jointly and efficiently considers the weights and activations of LLMs, and 2) Channel Permutation, a new approach to maximally preserve important weights under N:M sparsity. The two components can be readily combined to further enhance N:M semi-structured pruning of LLMs. Our empirical experiments show that RIA alone already surpasses all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA models ranging from 7B to 65B. Furthermore, N:M semi-structured pruning with channel permutation can even outperform the original LLaMA2-70B on zero-shot tasks, together with practical speed-ups on specific hardware. Our code is available at: https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning
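As a concrete illustration of the two components described in the abstract, here is a minimal PyTorch sketch, not the authors' implementation: `ria_scores` follows the description of a metric combining weight magnitudes (normalized per row and per column) with input-activation norms, `nm_mask` keeps the top n weights in every group of m consecutive input channels, and `greedy_channel_permutation` is a simplified round-robin heuristic standing in for the paper's channel allocation. All function names, the `alpha` exponent, and the greedy heuristic are assumptions.

```python
# Hedged sketch of RIA scoring + N:M masking + a channel-permutation heuristic.
# Assumptions (not from the paper's code): function names, alpha=0.5, and the
# round-robin greedy permutation in place of the paper's exact allocation.
import torch

def ria_scores(W: torch.Tensor, X_norm: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-weight RIA score for a linear layer.

    W:      (out_features, in_features) weight matrix.
    X_norm: (in_features,) L2 norm of each input channel's activations,
            collected from a small calibration set (assumed available).
    """
    absW = W.abs()
    row_sum = absW.sum(dim=1, keepdim=True)   # total magnitude per output row
    col_sum = absW.sum(dim=0, keepdim=True)   # total magnitude per input column
    relative_importance = absW / row_sum + absW / col_sum
    return relative_importance * X_norm.pow(alpha)  # scale by activation norms

def nm_mask(scores: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Boolean mask keeping the n best-scoring weights in every group of m
    consecutive input channels (N:M semi-structured sparsity)."""
    out_f, in_f = scores.shape
    assert in_f % m == 0, "input dim must be divisible by m"
    groups = scores.view(out_f, in_f // m, m)
    keep = torch.zeros_like(groups, dtype=torch.bool)
    keep.scatter_(-1, groups.topk(n, dim=-1).indices, True)
    return keep.view(out_f, in_f)

def greedy_channel_permutation(scores: torch.Tensor, m: int = 4) -> torch.Tensor:
    """Round-robin heuristic: deal channels, heaviest first, across groups so
    that important channels do not compete for the same n slots in one group."""
    channel_score = scores.sum(dim=0)               # aggregate score per channel
    order = channel_score.argsort(descending=True)  # heaviest channels first
    n_groups = scores.shape[1] // m
    perm = torch.empty_like(order)
    for rank, ch in enumerate(order.tolist()):
        group, slot = rank % n_groups, rank // n_groups
        perm[group * m + slot] = ch                 # new position -> old channel
    return perm

# Usage on a toy layer: score, permute channels, then apply a 2:4 mask.
torch.manual_seed(0)
W = torch.randn(8, 16)
X_norm = torch.rand(16)                  # stand-in for calibration statistics
scores = ria_scores(W, X_norm)
perm = greedy_channel_permutation(scores)
W_pruned = W[:, perm] * nm_mask(scores[:, perm])
# Note: after permuting columns, the layer's inputs must be permuted the same way.
```

The permutation step matters because N:M sparsity forces each group of m consecutive input channels to keep only n weights; reordering channels before grouping prevents several important channels from competing for the same slots, which is the intuition behind Channel Permutation.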
