Quartet: Native FP4 Training Can Be Optimal for Large Language Models

0citations

arXiv:2505.14669

citations

#2219

in NEURIPS 2025

of 5858 papers

Top Authors

Data Points

Top Authors

Roberto Castro Andrei Panferov Rush Tabesh Oliver Sieberling Jiale Chen Mahdi Nikdan Saleh Ashkboos Dan Alistarh

Abstract

Training large language models (LLMs) models directly in low-precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet .

Citation History

Jan 25, 2026

Jan 27, 2026

Jan 28, 2026