"pretraining" Papers
7 papers found
2 OLMo 2 Furious (COLM’s Version)
Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld et al.
COLM 2025 · paper
CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
Runlong Zhou, Yi Zhang
COLM 2025 · paper · arXiv:2504.01450
1 citation
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Rosie Zhao, Alexandru Meterez, Sham M. Kakade et al.
COLM 2025 · paper · arXiv:2504.07912
87 citations
FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec et al.
COLM 2025 · paper · arXiv:2506.20920
48 citations
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Thao Nguyen, Yang Li, Olga Golovneva et al.
COLM 2025 · paper · arXiv:2506.04689
13 citations
SmolLM2: When Smol Goes Big — Data-Centric Training of a Fully Open Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch et al.
COLM 2025 · paper
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa et al.
COLM 2025 · paper · arXiv:2504.17562
2 citations