Pretraining

November 20, 2025

We use the nanotron library to train the SmolLM, SmolLM2 and SmolLM3 base models.

Setup

Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs. For SmolLM3, we use this branch of nanotron and this branch of datatrove.

Below is an example of launching SmolLM3 training on 1 node (you can change the DP value to 4 in the config and adjust the batch size accordingly):

git clone https://github.com/huggingface/nanotron
cd nanotron
# follow installation
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file smollm3/stage1_8T.yaml

If you are working on a slurm cluster, you can modify the launch.slurm and launch the training with:

sbatch launch.slurm

Note

Don't forget to create the logs directory before launching the job.

Continual pre-training

The nanotron checkpoints for the SmolLM2 models are available at https://huggingface.co/HuggingFaceTB/SmolLM2-nanotron-ckpt, and the SmolLM3 checkpoints in transformers format are available at https://huggingface.co/HuggingFaceTB/SmolLM3-3B-checkpoints.

You can find an example of continual pre-training in the continual-pretraining folder.

SmolLM3

SmolLM3 follows a transformer decoder architecture with tied embeddings, similar to SmolLM2, building on the Llama architecture with key modifications optimized for efficiency and long-context performance.

Grouped Query Attention (GQA): We replaced multi-head attention with grouped query attention using 4 groups. Our ablations on a 3B model trained with 100B tokens from FineWeb-Edu showed that GQA matches the performance of multi-head attention while significantly reducing the KV cache size during inference.
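The KV-cache saving from GQA is simple arithmetic: the cache scales with the number of KV heads rather than the number of query heads. A minimal sketch, using illustrative layer/head counts (not the exact SmolLM3 dimensions):

```python
# KV cache size per token: K and V tensors across all layers.
# These model dimensions are assumed for illustration, not SmolLM3's exact config.
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2 accounts for storing both K and V; dtype_bytes=2 assumes bf16.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

n_layers, n_heads, head_dim = 36, 16, 128  # assumed ~3B-scale dimensions

mha = kv_cache_bytes_per_token(n_layers, n_heads, head_dim)  # MHA: cache all 16 heads
gqa = kv_cache_bytes_per_token(n_layers, 4, head_dim)        # GQA: cache 4 KV groups

print(mha // gqa)  # 4: cache shrinks by n_heads / n_groups
```

With 4 groups, the cache shrinks by a factor of `n_heads / n_groups`, which is what makes GQA attractive at inference time.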

NoPE: We implemented NoPE from "RoPE to NoPE and Back Again: A New Hybrid Attention Strategy" (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long-context performance without affecting short-context capabilities, as confirmed by our ablations.
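The "every 4th layer" selection can be sketched as a small predicate over the layer index. The exact offset (which of the four positions skips RoPE) is an assumption here, not taken from the SmolLM3 config:

```python
# Hypothetical helper: decide per layer whether rotary embeddings are applied.
# Assumption: layers 3, 7, 11, ... (0-indexed) are the NoPE layers.
def uses_rope(layer_idx: int) -> bool:
    return (layer_idx + 1) % 4 != 0  # every 4th layer skips RoPE

rope_layers = [i for i in range(12) if uses_rope(i)]
print(rope_layers)  # [0, 1, 2, 4, 5, 6, 8, 9, 10]
```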

Intra-Document Masking: During training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
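Conceptually, intra-document masking intersects the usual causal mask with a "same document" mask derived from document boundaries in the packed sequence. A minimal sketch, assuming per-token document IDs are available (e.g. computed from EOS positions):

```python
import numpy as np

# Sketch: combine a causal mask with a document mask so each token attends
# only to earlier tokens from its own document.
def intra_document_causal_mask(doc_ids: np.ndarray) -> np.ndarray:
    seq_len = doc_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower-triangular
    same_doc = doc_ids[:, None] == doc_ids[None, :]            # block-diagonal
    return causal & same_doc

# Packed sequence holding two documents of lengths 3 and 2.
mask = intra_document_causal_mask(np.array([0, 0, 0, 1, 1]))
print(mask.astype(int))
```

Token 3 (the first token of document 1) cannot attend to tokens 0-2, even though they precede it in the packed sequence.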

Training Stability: Following OLMo 2, we remove weight decay from embedding layers to improve training stability. In our ablations, this led to more stable training dynamics, with embedding norms settling at healthier values during training, without impacting overall performance.
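Excluding embeddings from weight decay is typically done with optimizer parameter groups. A minimal sketch of the idea, with illustrative parameter names (not nanotron's actual naming); in practice the groups would be passed to AdamW:

```python
# Sketch: split parameters into decay / no-decay groups by name, so embedding
# weights get weight_decay=0.0. The "embed" keyword and names are assumptions.
def split_decay_groups(named_params, no_decay_keywords=("embed",)):
    decay, no_decay = [], []
    for name, _param in named_params:
        if any(k in name for k in no_decay_keywords):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": 0.1},   # regular weights
        {"params": no_decay, "weight_decay": 0.0} # embeddings: no decay
    ]

params = [("model.embed_tokens.weight", None), ("model.layers.0.mlp.weight", None)]
groups = split_decay_groups(params)
print(groups[1]["params"])  # ['model.embed_tokens.weight']
```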

All these changes were validated through ablations using the same 3B architecture trained on 100B tokens from FineWeb-Edu, ensuring each modification either improved performance or maintained it while offering other benefits.

Training Configuration: We use a global batch size of 2.36M tokens with a 4096 sequence length, a learning rate of 2e-4, and the AdamW optimizer (beta1: 0.9, beta2: 0.95) with a weight decay of 0.1 and gradient clipping of 1.0. We use the WSD (Warmup-Stable-Decay) scheduler, with 2000 warmup steps and a linear decay to 0 over the final 10% of training steps. We use the nanotron framework for training, datatrove for data processing, and lighteval for evaluation. The model was trained on 384 H100 GPUs for 24 days. You can see the distributed training setup in the following figure.
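The WSD schedule described above can be sketched as a piecewise function: linear warmup for 2000 steps, a constant plateau at the peak learning rate, then linear decay to 0 over the final 10% of steps. The total step count here is illustrative, not SmolLM3's actual step count:

```python
# Sketch of a Warmup-Stable-Decay learning-rate schedule, using the values
# from the text (peak 2e-4, 2000 warmup steps, decay over the last 10%).
def wsd_lr(step, total_steps, peak_lr=2e-4, warmup=2000):
    decay_start = int(total_steps * 0.9)  # final 10% of steps decay linearly
    if step < warmup:
        return peak_lr * step / warmup            # linear warmup
    if step < decay_start:
        return peak_lr                            # stable plateau
    return peak_lr * (total_steps - step) / (total_steps - decay_start)  # decay to 0

total = 100_000  # assumed total steps for illustration
print(wsd_lr(1_000, total))   # mid-warmup: half of peak
print(wsd_lr(50_000, total))  # plateau: 2e-4
print(wsd_lr(95_000, total))  # halfway through decay: 1e-4
```

One practical appeal of WSD is that the plateau lets you branch decay runs off any stable-phase checkpoint, rather than fixing the total step count up front.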

You can find our full training logs here: https://wandb.ai/huggingface/SmolLM3-training-logs?nw=nwusereliebak