Beyond Next Token Prediction: Patch-Level Training for Large Language Models

2 Citations · #1742 of 3827 papers in ICLR 2025 · 3 Authors

Abstract

The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a 'patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce the overall training costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
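To make the patch-level objective concrete, the following is a minimal PyTorch sketch of what "predict the next patch" could look like: consecutive token embeddings are mean-pooled into patches, the model runs over the shorter patch sequence, and each patch position is trained to predict every token of the following patch. The interfaces `embed_tokens`, `forward_embeddings`, and `lm_head` are hypothetical hooks assumed here for illustration, and the pooling/prediction details are assumptions rather than the authors' exact implementation (see the released source code for that).

```python
import torch
import torch.nn.functional as F

def patch_level_loss(model, token_ids, patch_size=4):
    """Illustrative patch-level training loss (sketch, not the official code).

    Assumes `model` exposes:
      - model.embed_tokens(ids)       -> [B, T, D] token embeddings
      - model.forward_embeddings(x)   -> [B, P, D] transformer hidden states
      - model.lm_head(h)              -> [B, P, V] vocabulary logits
    These are hypothetical interfaces used only for this sketch.
    """
    B, T = token_ids.shape
    T = (T // patch_size) * patch_size          # drop the ragged tail
    token_ids = token_ids[:, :T]
    P = T // patch_size                          # number of patches

    # Aggregate every `patch_size` consecutive token embeddings into one patch
    # embedding by mean pooling: [B, T, D] -> [B, P, D].
    tok_emb = model.embed_tokens(token_ids)
    D = tok_emb.size(-1)
    patch_emb = tok_emb.view(B, P, patch_size, D).mean(dim=2)

    # Run the transformer on the (patch_size times shorter) patch sequence.
    hidden = model.forward_embeddings(patch_emb)   # [B, P, D]
    logits = model.lm_head(hidden)                 # [B, P, V]

    # Each patch position predicts all tokens of the *next* patch.
    targets = token_ids.view(B, P, patch_size)     # [B, P, patch_size]
    logits = logits[:, :-1]                        # last patch has no target
    targets = targets[:, 1:]                       # shift targets by one patch

    # Broadcast each patch prediction over its patch_size target tokens.
    logits = logits.unsqueeze(2).expand(-1, -1, patch_size, -1)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    return loss
```

After most of the data has been processed with this cheaper objective, training would switch back to the ordinary next-token loss on the remaining data so that the model matches the token-level inference mode.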
