LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

11citations
11
Citations
#152
in ICML 2025
of 3340 papers
5
Authors
1
Data Points

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute.More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization.In this work, we investigate which factors most strongly influence loss-to-loss scaling.Our experiments reveal that the pretraining data determines the scaling trend.In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact.Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Citation History

Jan 28, 2026
11