MobileLLM-R1

April 30, 2026

🤗 Hugging Face | 📑 Paper | 💻 Code

MobileLLM-R1 is a new series of efficient reasoning models within the MobileLLM family. Alongside the models, we provide comprehensive training recipes and data sources to ensure reproducibility and facilitate further research.

This repository includes demonstration code to reproduce the pretraining, mid-training, and SFT stages of MobileLLM-R1, as well as the corresponding intermediate checkpoints and data mix weights.

Highlights

Remarkably, MobileLLM-R1 950M, pre-trained on only ~2T high-quality tokens and trained on fewer than 5T tokens in total, achieves performance comparable or superior to Qwen3 0.6B (trained on 36T tokens) on the MATH, GSM8K, MMLU, and LiveCodeBench benchmarks.

Among existing fully open-source models, MobileLLM-R1 950M achieves ~5× higher accuracy on MATH than the Olmo 1.24B model and ~2× higher accuracy than the SmolLM2 1.7B model, despite having substantially fewer parameters. MobileLLM-R1 950M also outperforms both Olmo 1.24B and SmolLM2 1.7B by a wide margin on coding benchmarks, establishing a new state of the art among fully open-source models.

Pretrained Model

[Figure: Token efficiency comparison across pretrained models]

Post-trained Model

Model Architecture

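| Model | # Layers | # Attention Heads | # KV Heads | Dim | Hidden Dim | Params |
|---|---|---|---|---|---|---|
| MobileLLM-R1-140M | 15 | 9 | 3 | 576 | 2048 | 140M |
| MobileLLM-R1-360M | 15 | 16 | 4 | 1024 | 4096 | 359M |
| MobileLLM-R1-950M | 22 | 24 | 6 | 1536 | 6144 | 949M |

| Model | Input modalities | Output modalities | Context Length | Vocabulary Size | Shared Embeddings |
|---|---|---|---|---|---|
| MobileLLM-R1-140M-base | Text | Text | 4k | 128k | Yes |
| MobileLLM-R1-360M-base | Text | Text | 4k | 128k | Yes |
| MobileLLM-R1-950M-base | Text | Text | 4k | 128k | Yes |
| MobileLLM-R1-140M | Text | Text | 32k | 128k | Yes |
| MobileLLM-R1-360M | Text | Text | 32k | 128k | Yes |
| MobileLLM-R1-950M | Text | Text | 32k | 128k | Yes |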
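
As a rough sanity check on the architecture table, the parameter counts can be approximated from the listed dimensions. The sketch below is not taken from the repository and rests on several assumptions: a vocabulary of 128,256 entries (a stand-in for the "128k" in the table), a head dimension of Dim / # Attention Heads, grouped-query attention, a gated MLP with three weight matrices, and shared input/output embeddings counted once. Normalization weights are ignored. Under these assumptions the estimates land within about 1M of the Params column.

```python
from dataclasses import dataclass

VOCAB_SIZE = 128_256  # assumption: stand-in for the "128k" vocabulary in the table


@dataclass
class Arch:
    layers: int
    heads: int
    kv_heads: int
    dim: int
    hidden_dim: int


def approx_params(a: Arch, vocab_size: int = VOCAB_SIZE) -> int:
    """Rough weight count; ignores normalization weights and any biases."""
    head_dim = a.dim // a.heads                  # assumption: dim split evenly across heads
    q_and_o = 2 * a.dim * a.dim                  # query and output projections
    k_and_v = 2 * a.dim * a.kv_heads * head_dim  # grouped-query key/value projections
    mlp = 3 * a.dim * a.hidden_dim               # assumption: gated MLP (gate, up, down)
    embeddings = vocab_size * a.dim              # counted once because embeddings are shared
    return embeddings + a.layers * (q_and_o + k_and_v + mlp)


# Values copied from the architecture table.
ARCHS = {
    "MobileLLM-R1-140M": Arch(15, 9, 3, 576, 2048),
    "MobileLLM-R1-360M": Arch(15, 16, 4, 1024, 4096),
    "MobileLLM-R1-950M": Arch(22, 24, 6, 1536, 6144),
}

for name, arch in ARCHS.items():
    print(f"{name}: ~{approx_params(arch) / 1e6:.1f}M parameters")
```

The estimate also shows why shared embeddings matter at this scale: for the 140M model, the embedding matrix alone accounts for roughly half of the weights.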

Training

Training Process

Training stages and hyperparameter details

In the pretraining phase, MobileLLM-R1 models are randomly initialized and optimized using the Adam optimizer with hyperparameters (β_1, β_2, ε) = (0.9, 0.95, 1e-8), coupled with a weight decay coefficient of 0.1. The learning rate follows a 2k-step warmup schedule and then decays linearly from its peak to 10% of the maximum.
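
The pretraining schedule described above can be sketched as a plain function of the step index. This is an illustration only, not the repository's scheduler: it assumes the warmup is linear from zero and that the 500k total steps include the warmup. The peak learning rate (4e-3) and step count are taken from the hyperparameter table in this section; the function name `pretrain_lr` is hypothetical.

```python
def pretrain_lr(step, peak_lr=4e-3, warmup_steps=2_000,
                total_steps=500_000, final_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        # Assumption: warmup rises linearly from zero.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


for s in (0, 1_000, 2_000, 251_000, 500_000):
    print(s, pretrain_lr(s))
```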

In the mid-training phase, we use the Adam optimizer with a learning rate that decays linearly from its maximum value to zero. We employ knowledge distillation with the Llama-3.1-8B-Instruct model as the teacher: the student is trained by minimizing the KL divergence between its output logits and the teacher's logits.
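
To make the distillation objective concrete, here is a minimal per-token sketch in plain Python. It is not the training code: it assumes the student and teacher produce logits over the same vocabulary, and it assumes the forward direction KL(teacher || student), since the text above does not state the direction. A real implementation would use batched tensor operations in a deep learning framework; `kd_kl_loss` is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_kl_loss(student_logits, teacher_logits):
    """KL(teacher || student) for a single token position (assumed direction)."""
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


teacher = [2.0, 0.5, -1.0]                  # toy logits over a 3-token vocabulary
print(kd_kl_loss([2.0, 0.5, -1.0], teacher))  # student matches the teacher
print(kd_kl_loss([0.0, 0.0, 0.0], teacher))   # uniform student, larger loss
```

The loss is zero when the two distributions match and grows as the student's distribution drifts from the teacher's, which is what drives the student toward the teacher's predictions.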

In the post-training phase, we use the Adam optimizer with zero weight decay. The learning rate warmup ratio is set to 0.03 for general-purpose SFT and 0.1 for reasoning-specific SFT, and it linearly decays from its maximum value to zero. Full training hyperparameters are provided in the table below.
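
A warmup ratio translates into a step count once the total number of optimizer steps is known. The sketch below illustrates this under the assumption that warmup is linear from zero; the total of 1,000 steps is a made-up value for illustration, and `sft_lr` is a hypothetical helper rather than the scheduler used by the training scripts.

```python
def sft_lr(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup over warmup_ratio * total_steps, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)  # avoid division by zero
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# General SFT: ratio 0.03, peak 5e-6; reasoning SFT: ratio 0.1, peak 8e-5.
print(sft_lr(30, 1_000, 5e-6, 0.03))   # end of warmup for general SFT
print(sft_lr(100, 1_000, 8e-5, 0.1))   # end of warmup for reasoning SFT
```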

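| Stage | Phase | Tokens / Samples | BS | Sequence Length | Steps | LR | #GPUs | Training Time |
|---|---|---|---|---|---|---|---|---|
| Pre-training | Phase1 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Pre-training | Phase2 | 2T tokens | 16 | 2k | 500k | 4.00E-03 | 16 x 8 | 4-5 days |
| Mid-training | Phase1 | 100B tokens | 4 | 4k | 50K | 3.60E-04 | 16 x 8 | 1-2 days |
| Mid-training | Phase2 | 100B tokens | 4 | 4k | 50K | 3.60E-04 | 16 x 8 | 1-2 days |
| Post-training | General SFT | 866K samples | 4 | 4k | 2 epochs | 5.00E-06 | 16 x 8 | ~2h |
| Post-training | Reasoning SFT | 6.2M samples | 8 | 32k | 4 epochs | 8.00E-05 | 16 x 8 | ~2.5 days |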
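
The token budgets in the table can be cross-checked with simple arithmetic. The sketch below assumes that BS is the per-GPU batch size in sequences, that "2k" and "4k" mean 2048 and 4096 tokens, and that 16 x 8 means 128 GPUs in total. None of this is stated explicitly above, but this reading reproduces the ~2T and ~100B figures.

```python
GPUS = 16 * 8  # assumption: 16 nodes with 8 GPUs each


def tokens_seen(per_gpu_batch, seq_len, steps, gpus=GPUS):
    """Total tokens processed, assuming every sequence is filled to seq_len."""
    return per_gpu_batch * seq_len * steps * gpus


pretrain_phase = tokens_seen(16, 2048, 500_000)  # one pre-training phase
midtrain_phase = tokens_seen(4, 4096, 50_000)    # one mid-training phase

print(f"pre-training phase: {pretrain_phase / 1e12:.2f}T tokens")
print(f"mid-training phase: {midtrain_phase / 1e9:.1f}B tokens")
```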

Quick Start

To load the pretrained model for further fine-tuning or evaluation:

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")

Pretraining

Training code

The pretraining code that includes support for data mixing is available under ./pretraining. Example code:

cd ./pretraining
bash run_pretrain.sh

Note that our implementation does not include efficiency optimizations; for training speedups, you may refer to open-source efficiency implementations such as torchtitan or torchtune.

Intermediate checkpoints

| Model Size | Pretraining Stage1 | Pretraining Stage2 | Intermediate Checkpoints* |
|---|---|---|---|
| 950M | MobileLLM-R1-950M-base | MobileLLM-R1-950M-base | Coming Soon* |
| 360M | MobileLLM-R1-360M-base | MobileLLM-R1-360M-base | Coming Soon* |
| 140M | MobileLLM-R1-140M-base | MobileLLM-R1-140M-base | Coming Soon* |

*Links to intermediate checkpoints will be made available soon.

Mid-training

Training code

Mid-training differs from pretraining only in the dataset selection and the number of training steps. The training code structure is identical to that of pretraining; please refer to the Pretraining section (./pretraining) for details on the training code.

Intermediate checkpoints

| Model Size | Mid-training Stage1 | Mid-training Stage2 | Intermediate Checkpoints* |
|---|---|---|---|
| 950M | MobileLLM-R1-950M-base | MobileLLM-R1-950M-base | Coming Soon* |
| 360M | MobileLLM-R1-360M-base | MobileLLM-R1-360M-base | Coming Soon* |
| 140M | MobileLLM-R1-140M-base | MobileLLM-R1-140M-base | Coming Soon* |

*Links to intermediate checkpoints will be made available soon.

Post-training

Training code

Training code for the post-training stage is available under ./sft. The code is based on TRL. Example code:

cd ./sft
bash run_general_sft.sh
bash run_reasoning_sft.sh

Intermediate checkpoints

| Model Size | General SFT | Reasoning SFT |
|---|---|---|
| 950M | MobileLLM-R1-950M | MobileLLM-R1-950M |
| 360M | MobileLLM-R1-360M | MobileLLM-R1-360M |
| 140M | MobileLLM-R1-140M | MobileLLM-R1-140M |

Inference

Inference examples

Transformers

from transformers import pipeline
import torch

model_id = "facebook/MobileLLM-R1-950M"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Choose ONE of the scenarios below: each assignment to `messages` overwrites the previous one.

# Math problem / default scenario
messages = [
    {
        "role": "system",
        "content": "Please reason step by step, and put your final answer within \\boxed{}."
    },
    {"role": "user", "content": "Compute: $1-2+3-4+5- \\dots +99-100$."},
]

# C++ coding scenario
messages = [
    {
        "role": "system",
        "content": (
            "\nYou are a helpful and harmless assistant. You should think step-by-step before responding to the instruction below.\n\n"
            "Please use c++ programming language only.\n"
            "You must use ```cpp for just the final solution code block with the following format:\n"
            "```cpp\n# Your code here\n```\n"
        )
    },
    {"role": "user", "content": "Write a C++ program that prints 'Hello, World!'."},
]

# Python coding scenario
messages = [
    {
        "role": "system",
        "content": (
            "\nYou are a helpful and harmless assistant. You should think step-by-step before responding to the instruction below.\n\n"
            "Please use python programming language only.\n"
            "You must use ```python for just the final solution code block with the following format:\n"
            "```python\n# Your code here\n```\n"
        )
    },
    {"role": "user", "content": "Write a Python function that returns the square of a number."},
]

outputs = pipe(
    messages,
    max_new_tokens=8192,
)
print(outputs[0]["generated_text"][-1])
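
Because the math system prompt asks for the final answer inside \boxed{}, a small parser is handy for pulling that answer out of the generated text. The helper below is a hypothetical utility, not part of the repository; it matches braces so that nested LaTeX such as \frac{1}{2} is returned whole. The sample string is hand-written for illustration and is not real model output.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None        # braces never closed


# Hand-written example text (assumption), not an actual model response.
sample = "Pairing terms gives 50 pairs of -1, so the sum is \\boxed{-50}."
print(extract_boxed(sample))  # -50
```

With the chat pipeline above, the last element of `generated_text` should be the assistant message, so something like `extract_boxed(outputs[0]["generated_text"][-1]["content"])` would apply the helper to the model's reply, assuming the usual role/content message format.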

You can also run inference with vLLM. You only need to register the model architecture Llama4ForCausalLM with the vLLM ModelRegistry.

from vllm.model_executor.models.llama4 import Llama4ForCausalLM
from vllm.model_executor.models.registry import ModelRegistry
ModelRegistry.register_model("Llama4ForCausalLM", Llama4ForCausalLM)

Data mix

Pretraining

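| Dataset | Rows | Tokens (B) | Phase1 Mix Ratio | Phase2 Mix Ratio |
|---|---|---|---|---|
| StarCoder | 206,640,114 | 263.8 | 10.66% | 0.52% |
| OpenWebMath | 6,117,786 | 12.6 | 6.93% | 23.33% |
| FineWeb-Edu | 1,279,107,432 | 1300 | 63.75% | 54.83% |
| Wiki | 7,222,303 | 3.7 | 5.03% | 0.14% |
| Arxiv | 1,533,917 | 28 | 6.36% | 1.32% |
| StackExchange | 29,249,120 | 19.6 | 5.03% | 0.86% |
| Algebraic stack | 3,404,331 | 12.6 | 2.25% | 1.26% |
| Nemotron science | 708,920 | 2 | -- | 0.03% |
| Nemotron code | 10,108,883 | 16 | -- | 0.72% |
| Nemotron math | 22,066,397 | 15 | -- | 3.01% |
| Cosmopedia | 31,064,744 | 25 | -- | 2.70% |
| Facebook natural reasoning | 1,145,824 | 1.8 | -- | 3.18% |
| FineMath | 48,283,984 | 34 | -- | 8.01% |
| peS2o | 38,800,000 | 50 | -- | 0.08% |
| Total | | | 100% | 100% |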
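
One common way to apply mix ratios like these is to treat them as sampling probabilities when choosing which source the next training example comes from. The sketch below illustrates that idea with the Phase1 ratios; it is an assumption about how such weights can be used, not a description of the data loader under ./pretraining, and `sample_sources` is a hypothetical name.

```python
import random

# Phase1 mix ratios (in percent) copied from the pretraining table.
PHASE1_MIX = {
    "StarCoder": 10.66,
    "OpenWebMath": 6.93,
    "FineWeb-Edu": 63.75,
    "Wiki": 5.03,
    "Arxiv": 6.36,
    "StackExchange": 5.03,
    "Algebraic stack": 2.25,
}


def normalize(mix):
    """Rescale weights so they sum to exactly 1."""
    total = sum(mix.values())
    return {name: weight / total for name, weight in mix.items()}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to their mix ratio."""
    rng = random.Random(seed)
    probs = normalize(mix)
    names = list(probs)
    return rng.choices(names, weights=[probs[k] for k in names], k=n)


draws = sample_sources(PHASE1_MIX, 10_000)
for name in PHASE1_MIX:
    print(f"{name}: {draws.count(name) / len(draws):.3f}")
```

Over many draws the empirical shares approach the table's ratios, so small but heavily weighted sources are revisited more often than their size alone would suggest.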

Mid-training

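| Dataset | Subset | Rows (M) | Phase1 Mix Ratio | Phase2 Mix Ratio |
|---|---|---|---|---|
| Dolmino | DCLM Baseline | 606 | 37.03% | 6.51% |
| Dolmino | FLAN | 57.3 | 4.10% | 0.72% |
| Dolmino | peS2o | 38.8 | 11.41% | 2.01% |
| Dolmino | Wiki | 6.17 | 2.66% | 0.47% |
| Dolmino | StackExchange | 2.48 | 2.12% | 2.00% |
| Dolmino | Math | 21 | 11.63% | 29.10% |
| Nemotron | Nemotron-Pretraining-Code-v1 | 882 | 20.69% | 29.10% |
| Nemotron | Nemotron-CC-Math-v1 | 144 | 3.45% | 19.40% |
| StarCoder | StarCoder | 206 | 6.90% | 9.70% |
| Benchmark training set | TriviaQA (train), OBQA (train), NaturalQuestions (train), PIQA (train), GSM8K (train), BoolQ (train), ARC-Easy (train), ARC-Challenge (train) | ~0.01 | 0 | 0.97% |
| Total | | | 100.00% | 100.00% |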

Post-training

| Phase | Dataset | Rows |
|---|---|---|
| General SFT | Tulu-3-sft-olmo-2-mixture-0225 | 866K samples |
| Reasoning SFT | OpenMathReasoning | 3.2M samples |
| Reasoning SFT | OpenScienceReasoning-2 | 803K samples |
| Reasoning SFT | OpenCodeReasoning-2 | 2.16M samples |

Evaluation

Evaluation code

We provide the evaluation code necessary to reproduce the MobileLLM-R1 evaluation results in the evaluation folder. The evaluation results are summarized in the following table.

MobileLLM-R1 base model

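| Model | Size | MATH500 (4-shot, em) | GSM8K (8-shot, em) | MBPP (3-shot, pass@1) | HumanEval (0-shot, pass@1) | CommonSense Avg. (0-shot, accuracy) | MMLU (5-shot, accuracy) |
|---|---|---|---|---|---|---|---|
| <150M | | | | | | | |
| SmolLM2-135M-base | 135M | 0.4 | 1.8 | 3.8 | 0.0 | 50.7 | -- |
| MobileLLM-R1-140M-base | 140M | 4.6 | 16.3 | 5.4 | 15.9 | 44.3 | -- |
| 150M - 400M | | | | | | | |
| Gemma-3-270M-pt | 268M | 0.6 | 1.1 | 2.0 | 3.1 | 48.4 | 26.5 |
| SmolLM2-360M-base | 362M | 1.8 | 5.0 | 19.4 | 0.0 | 56.6 | 24.7 |
| MobileLLM-R1-360M-base | 359M | 13.4 | 39.4 | 20.8 | 32.9 | 51.0 | 26.8 |
| 400M - 1B | | | | | | | |
| Qwen2.5-0.5B-base | 494M | 14.8 | 41.8 | 29.6 | 28.1 | 52.3 | 47.5 |
| Qwen3-0.6B-base | 596M | 29.8 | 60.9 | 39.0 | 30.5 | 55.3 | 52.4 |
| MobileLLM-R1-950M-base | 949M | 26.8 | 61.6 | 39.2 | 46.3 | 58.6 | 47.4 |
| > 1B | | | | | | | |
| Gemma-3-1B-pt | 1.0B | 0.6 | 2.4 | 9.4 | 6.1 | 57.3 | 26.1 |
| LLaMA3.2-1B-base | 1.24B | 1.6 | 6.8 | 26.6 | 17.1 | 58.4 | 32.0 |
| OLMo-2-0425-1B-base | 1.48B | 5.2 | 39.8 | 7.8 | 6.7 | 61.0 | 42.4 |
| Qwen2.5-1.5B-base | 1.54B | 31.0 | 68.4 | 44.6 | 36.6 | 58.7 | 61.2 |
| SmolLM2-1.7B-base | 1.71B | 11.6 | 31.8 | 35.4 | 0.6 | 62.9 | 50.0 |
| Qwen3-1.7B-base | 2.03B | 38.5 | 76.2 | 56.4 | 47.6 | 60.9 | 62.1 |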

Here, CommonSense Avg. denotes the average over 8 commonsense reasoning benchmarks: ARC-easy, ARC-challenge, BoolQ, PIQA, SIQA, HellaSwag, OBQA, and WinoGrande. Models with fewer than 150M parameters do not yield reliable MMLU scores, so their MMLU entries are marked '--'.

MobileLLM-R1 post-trained model

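| Model | Size | MATH500 (0-shot, pass@1) | GSM8K (0-shot, pass@1) | AIME'24 (0-shot, pass@1, n=64) | AIME'25 (0-shot, pass@1, n=64) | LiveCodeBench-v6 (0-shot, pass@1, n=16) |
|---|---|---|---|---|---|---|
| <150M | | | | | | |
| SmolLM2-135M-Instruct | 135M | 3.0 | 2.4 | -- | -- | 0.0 |
| MobileLLM-R1-140M | 140M | 6.2 | 4.1 | -- | -- | 1.7 |
| 150M - 400M | | | | | | |
| Gemma-3-270m-it | 268M | 6.8 | 8.4 | -- | -- | 0.0 |
| SmolLM2-360M-Instruct | 362M | 3.4 | 8.1 | -- | -- | 0.7 |
| MobileLLM-R1-360M | 359M | 28.4 | 24.5 | -- | -- | 5.1 |
| 400M - 1B | | | | | | |
| Qwen2.5-0.5B-Instruct | 494M | 31.2 | 48.1 | 0.1 | 0.3 | 3.6 |
| Qwen3-0.6B | 596M | 73.0 | 79.2 | 11.3 | 17.0 | 14.9 |
| MobileLLM-R1-950M | 949M | 74.0 | 67.5 | 15.5 | 16.3 | 19.9 |
| > 1B | | | | | | |
| Gemma-3-1B-it | 1.0B | 45.4 | 62.9 | 0.9 | 0.0 | 2.0 |
| LLaMA3.2-1B-Instruct | 1.24B | 24.8 | 38.8 | 1.1 | 0.2 | 4.1 |
| OLMo-2-0425-1B-Instruct | 1.48B | 19.2 | 69.7 | 0.6 | 0.1 | 0.0 |
| OpenReasoning-Nemotron-1.5B | 1.54B | 83.4 | 76.7 | 49.7 | 40.4 | 28.3 |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.54B | 83.2 | 77.3 | 29.1 | 23.4 | 19.9 |
| Qwen2.5-1.5B-Instruct | 1.54B | 54.0 | 70.0 | 2.5 | 0.9 | 7.9 |
| SmolLM2-1.7B-Instruct | 1.71B | 19.2 | 41.8 | 0.3 | 0.1 | 4.4 |
| Qwen3-1.7B | 2.03B | 89.4 | 90.3 | 47.0 | 37.0 | 29.8 |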

For AIME, we evaluate each model across 64 runs and report the average accuracy. For LiveCodeBench, results are reported as the average accuracy across 16 runs. Models with fewer than 400M parameters do not produce reliable AIME scores, so their AIME entries are marked '--'.
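
The averaging described above can be illustrated with a short sketch. It assumes that every problem is attempted the same number of times and that each attempt is scored as correct or incorrect; the toy results are made up, and `mean_pass_at_1` is a hypothetical helper rather than a function from the evaluation folder.

```python
def mean_pass_at_1(results):
    """results[p][r] is True if run r solved problem p; returns accuracy in percent."""
    # Fraction of runs that solved each problem, then averaged over problems.
    per_problem = [sum(runs) / len(runs) for runs in results]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data (assumption): 3 problems, 4 runs each.
toy = [
    [True, True, False, True],
    [False, False, False, False],
    [True, False, False, False],
]
print(f"{mean_pass_at_1(toy):.1f}")  # 33.3
```

Averaging over many runs reduces the variance that sampling introduces on small benchmarks such as AIME, where a single solved problem shifts the score by several points.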

Citation

If you find our model useful for your research, please consider citing:

@article{zhao2025mobilellm-r1,
  title={MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes},
  author={Zhao, Changsheng and Chang, Ernie and Liu, Zechun and Chang, Chia-Jung and Wen, Wei and Lai, Chen and Cao, Sheng and Tian, Yuandong and Krishnamoorthi, Raghuraman and Shi, Yangyang and Chandra, Vikas},
  journal={arXiv preprint arXiv:2509.24945},
  year={2025}
}

Contact

Changsheng Zhao, Meta Inc (cszhao at meta dot com)

Ernie Chang, Meta Inc (erniecyc at meta dot com)

Zechun Liu, Meta Inc (zechunliu at meta dot com)

License

MobileLLM-R1 is currently released under the FAIR NC license.