"dataset" Papers
3 papers found
Conference
FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec et al.
COLM 2025paperarXiv:2506.20920
48
citations
SmolLM2: When Smol Goes Big — Data-Centric Training of a Fully Open Small Language Model
Loubna Ben allal, Anton Lozhkov, Elie Bakouch et al.
COLM 2025paper
Yourbench: Dynamic Evaluation Set Generation with LLMs
Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskaya et al.
COLM 2025paper