FOAM: Blocked State Folding for Memory-Efficient LLM Training

May 11, 2026

This repo contains the official implementation of the ICML 2026 paper FOAM: Blocked State Folding for Memory-Efficient LLM Training.

Figure (FOAM preview): End-to-end memory estimate and BF16 perplexity (PPL) learning curves for pre-training LLaMA-1.3B on C4.

Reproducibility

All pre-training experiments were conducted on 1 to 32 NVIDIA RTX 3090 GPUs and 4 NVIDIA H100 GPUs with PyTorch 2.3.0+cu118. To set up the environment, run:

conda create -n $yourname python=3.11.9
conda activate $yourname
pip install -r requirements
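
After installation, it can help to confirm that the active environment matches the versions listed above. The following is a minimal sketch rather than part of the repository: `version_matches` and `report_env` are hypothetical helper names, and the prefix comparison is an assumption about what counts as a match (for example, `2.3.0` is treated as matching `2.3.0+cu118`).

```shell
#!/bin/bash
# Hypothetical helper: succeed if the actual version string starts with
# the expected prefix (so "2.3.0" matches "2.3.0+cu118").
version_matches() {
    local expected="$1" actual="$2"
    case "$actual" in
        "$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the Python and PyTorch versions found in the
# active environment and flag any mismatch with the versions stated above.
report_env() {
    local py torch
    py="$(python --version 2>&1 | awk '{print $2}')"
    torch="$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"
    version_matches "3.11.9" "$py" || echo "warning: Python is '$py', expected 3.11.9"
    version_matches "2.3.0" "$torch" || echo "warning: PyTorch is '${torch:-not installed}', expected 2.3.0"
}
```

Calling `report_env` inside the activated conda environment prints a warning for each mismatch and prints nothing when both versions match.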

We provide the scripts for pre-training LLaMA models here. To fine-tune RoBERTa models on GLUE, run:

#!/bin/bash

# Model, task, and training settings.
export model_name_or_path=roberta-large
export warm_up=0.1
export task_name=cola
export epoch=3
export max_length=256
export level=8
export scale=0.25
export scheduler=cosine
# Set your_dir to the output directory before running this script.

for lr in 2.0e-4
do
    for task_name in cola
    do
        python run_glue.py \
            --model_name_or_path $model_name_or_path \
            --task_name $task_name \
            --scale $scale \
            --enable_fold \
            --level $level \
            --lora_all_modules \
            --max_length $max_length \
            --seed 42 \
            --lr_scheduler_type $scheduler \
            --num_warmup_steps $warm_up \
            --per_device_train_batch_size 32 \
            --learning_rate $lr \
            --num_train_epochs $epoch \
            --output_dir $your_dir
        wait
    done
done
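
The nested loops in the script above are structured so that more learning rates or tasks can be added to a sweep. Before launching a long sweep, it may be useful to print the commands first. The sketch below is a dry run under stated assumptions: `glue_cmd` and `out_root` are hypothetical names, the per-run output directory layout is illustrative, the extra task (`mrpc`) and learning rate (`1.0e-4`) are examples only, and only a reduced subset of the flags from the script above is shown.

```shell
#!/bin/bash
# Dry-run sketch: print one run_glue.py command per (task, lr) pair
# instead of executing it. out_root is an assumed output root directory.
out_root="${out_root:-./outputs}"

# Hypothetical helper: print the command for a single task and learning rate.
# Flags are a subset of those used in the fine-tuning script.
glue_cmd() {
    local task="$1" lr="$2"
    printf '%s ' python run_glue.py \
        --model_name_or_path roberta-large \
        --task_name "$task" \
        --enable_fold --level 8 --scale 0.25 \
        --learning_rate "$lr" \
        --output_dir "$out_root/$task/lr_$lr"
    printf '\n'
}

# Illustrative sweep values; replace with the tasks and rates of interest.
for lr in 1.0e-4 2.0e-4; do
    for task in cola mrpc; do
        glue_cmd "$task" "$lr"
    done
done
```

Giving each run its own output directory keeps results from different settings from overwriting one another; once the printed commands look right, they can be executed directly.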

For the MMLU fine-tuning tasks, we adopt the LLaMA-Factory implementation; see here.

Citation

@article{wen2025foam,
  title={FOAM: Blocked State Folding for Memory-Efficient LLM Training},
  author={Wen, Ziqing and Wang, Jiahuan and Luo, Ping and Li, Dongsheng and Sun, Tao},
  journal={arXiv preprint arXiv:2512.07112},
  year={2025}
}