Pretrain a Transformer on Language Modeling
October 17, 2025
A minimal yet efficient implementation of causal language modeling in PyTorch.
It features a custom torch-compilable Transformer model implementation supporting RoPE, GLU, and RMSNorm. It supports distributed training via Distributed Data Parallel (DDP).
A dedicated script is included for downloading, tokenizing, and chunking data, making data preparation seamless.
Installation
We recommend running plainLM in a dedicated Python environment. To install dependencies in an Anaconda environment, execute:
conda create --name plainLM python=3.12 -y && conda activate plainLM && cd plainLM
pip install .
Data
We provide a script for downloading, tokenizing, chunking and saving Hugging Face datasets: data/datasets/prepare.py.
You can specify any HF dataset and tokenizer. To avoid downloading the entire corpus, we support streaming, tokenizing, and chunking data on-the-fly. We provide an example for FineWebEdu-100BT in data/datasets/prepare_finewebedu_100BT.sh.
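The streaming, tokenizing, and chunking flow can be pictured with a small pure-Python sketch. This is not the code in data/datasets/prepare.py: the function name chunk_token_stream, the eos_id separator, and the toy whitespace tokenizer are all hypothetical stand-ins, and dropping the final partial chunk is an assumption made only to keep the example short.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def chunk_token_stream(
    docs: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Lazily yield fixed-length token chunks from a stream of documents.

    Documents are tokenized one at a time, separated by `eos_id`, and packed
    back to back, so the whole corpus is never held in memory.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(tokenize(doc))
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Trailing tokens that do not fill a chunk are dropped in this sketch.


# Toy stand-in for a real tokenizer: one integer id per whitespace-separated word.
vocab: Dict[str, int] = {}


def toy_tokenize(text: str) -> List[int]:
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


docs = ["the cat sat", "on the mat", "the end"]
chunks = list(chunk_token_stream(docs, toy_tokenize, seq_len=3, eos_id=0))
for chunk in chunks:
    print(chunk)
```

Because the function is a generator over an iterable, it could in principle be fed by a streamed dataset rather than an in-memory list; chunks may span document boundaries, with eos_id marking where one document ends.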
Usage
Specify hyperparameters in config.yaml and launch training as follows:
Single GPU/CPU:
python train.py --config=config/config.yaml
Multiple GPUs:
torchrun --nnodes=1 --nproc_per_node=4 train.py --config=config/config.yaml
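For context, torchrun tells each worker process its place in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below shows one way a script could read them and fall back to a single-process layout under a plain python launch. DistInfo and read_dist_info are hypothetical names for illustration; plainLM's own handling lives in torch_utils.py and may differ.

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (often the GPU index)
    world_size: int  # total number of processes

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1


def read_dist_info(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read the process layout from the environment variables torchrun sets.

    Falls back to a single-process layout when they are absent.
    """
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Plain `python train.py ...` launch: no torchrun variables present.
single = read_dist_info({})
# One of four workers started with --nproc_per_node=4.
worker = read_dist_info({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "4"})
print(single, single.is_distributed)
print(worker, worker.is_distributed)
```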
Run a sweep in parallel on a SLURM or Condor HPC cluster:
- Define hyperparameter sweep:
  create a single YAML file with lists of hyperparameter values. Each combination of values (the Cartesian product of the lists) defines a different configuration, e.g.:
  lr: [0.1, 0.01]
  wd: [0.1, 0.2, 0.5]
  beta1: 0.9
  ...
- Submit the sweep:
  Submit a job array, where each job executes the same Python script and reads the same configuration, but with a different job_idx. We use job_idx to map a job to its hyperparameters; job_idx should range from 0 to n-1, where n is the number of configurations in the Cartesian product of the lists in the YAML. This is done automatically by cluster/slurm.sh and cluster/condor.sub. Python takes care of assigning the corresponding configuration to each job based on the value of job_idx.
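The mapping from job_idx to a configuration can be sketched with itertools.product. This is an illustration under stated assumptions, not plainLM's actual code: expand_sweep and config_for_job are hypothetical helpers, the sweep is written as the dict a YAML loader would return, and the enumeration order (last key varying fastest) is an assumption that the real implementation may not share.

```python
import itertools
from typing import Any, Dict, List


def expand_sweep(sweep: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand a sweep dict into the Cartesian product of its list-valued entries."""
    keys = list(sweep)
    # Wrap scalars in a one-element list so they stay fixed across configurations.
    value_lists = [v if isinstance(v, list) else [v] for v in sweep.values()]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


def config_for_job(sweep: Dict[str, Any], job_idx: int) -> Dict[str, Any]:
    """Return the configuration assigned to `job_idx` (0 <= job_idx < n)."""
    configs = expand_sweep(sweep)
    if not 0 <= job_idx < len(configs):
        raise IndexError(
            f"job_idx {job_idx} out of range for {len(configs)} configurations"
        )
    return configs[job_idx]


# The example sweep: two learning rates, three weight decays, one fixed beta1.
sweep = {"lr": [0.1, 0.01], "wd": [0.1, 0.2, 0.5], "beta1": 0.9}
n_configs = len(expand_sweep(sweep))  # 2 * 3 * 1 configurations
first = config_for_job(sweep, 0)
last = config_for_job(sweep, n_configs - 1)
print(n_configs, first, last)
```

With this sweep the job array would need six jobs, indexed 0 through 5, each resolving to a distinct (lr, wd) pair with beta1 held at 0.9.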
Structure
plainLM/
├── cluster/              # HPC scripts (SLURM & Condor)
├── config/               # Configuration files for training and model setup
├── data/                 # Everything regarding data preparation and data stream
│   ├── datasets/         # Data preprocessing files to download, tokenize, chunk and save data
│   ├── dataloaders.py    # Dataloader utilities
│   └── datasamplers.py   # Custom stateful distributed samplers
├── engine/               # Core implementation of the model engine: a torch.nn.Module implementing training steps and evaluations
├── models/               # Model architectures
├── optim/                # Optimization utilities
├── checkpoint_utils.py   # Checkpoint utilities
├── torch_utils.py        # PyTorch utilities (DDP, seed, TF32...)
├── train.py              # Main training script
└── utils.py              # Miscellaneous helper functions
TODO
- FSDP2 support, ZeRO-2 and tensor parallel compatibility
- dummy data
- unit tests
- add seed to DistributedSampler
Citation
@misc{ajroldi2024plainlm,
author = {Niccolò Ajroldi},
title = {plainLM: Language Model Pretraining in PyTorch},
year = {2024},
howpublished = {\url{https://github.com/Niccolo-Ajroldi/plainLM}}
}
Credits
This project was inspired by:
Huge thanks to these projects for their contributions to open-source language model pretraining!
Published works using plainLM
Some recent projects using plainLM:
- Orvieto, A., & Gower, R. (2025). In search of Adam's secret sauce. ArXiv.
- Ajroldi, N., Orvieto, A., & Geiping, J. (2025). When, where and why to average weights? In Proceedings of ICML 2025.
- Srećković, T., Geiping, J., & Orvieto, A. (2025). Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling. ArXiv.
- Belloni, A., Noci, L., & Orvieto, A. (2025). Universal Dynamics of Warmup Stable Decay: Understanding WSD Beyond Transformers. MOSS Workshop, ICML 2025. https://icml.cc/virtual/2025/47679