Matryoshka Quantization

4 citations · #1154 of 3340 papers in ICML 2025

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models – especially to low precisions like int4 or int2 – requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
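The abstract's key observation is that integer types nest: the most significant bits of an int8 code already form a valid int4 or int2 code. As an illustrative sketch only (not the paper's MatQuant training procedure; the function name, NumPy usage, and truncation-based rounding are assumptions), the snippet below shows how lower-precision weights could be read off the top bits of int8-quantized weights.

```python
import numpy as np

def slice_most_significant_bits(q_int8: np.ndarray, bits: int) -> np.ndarray:
    """Keep only the top `bits` most significant bits of signed int8 codes.

    Illustrates the nested (Matryoshka) structure of integer types: an
    int4 or int2 value is nested within the MSBs of the int8 code, so a
    single quantized model could in principle be served at several
    precisions. Rounding here is plain truncation, which is an assumption.
    """
    assert 1 <= bits <= 8
    shift = 8 - bits
    # Arithmetic right shift drops the low-order bits, leaving a
    # `bits`-wide signed code embedded in the int8 representation.
    return (q_int8.astype(np.int8) >> shift).astype(np.int8)

# Example: one int8-quantized weight row viewed at int4 and int2.
w_int8 = np.array([-128, -37, 0, 55, 127], dtype=np.int8)
w_int4 = slice_most_significant_bits(w_int8, 4)  # values in [-8, 7]
w_int2 = slice_most_significant_bits(w_int8, 2)  # values in [-2, 1]
print(w_int4, w_int2)
```

The dequantization scale would need to be rescaled by the dropped bit width (a factor of 2^(8 - bits)) for the sliced codes to approximate the original weights; that detail is omitted above for brevity.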

Citation History

Jan 28, 2026: 0
Feb 13, 2026: 4