Nemotron 3 Super
March 11, 2026 · View on GitHub
Nemotron 3 Super is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture with interleaved Mamba-2 and MoE layers, along with select attention layers. Unlike the Nano model, the Super model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained with NVFP4 quantization to maximize compute efficiency. The model has 12B active parameters and 120B total parameters.
NeMo Megatron Bridge supports pretraining, full-parameter finetuning, and LoRA finetuning for this model. The finetuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.
Please use the custom container `nvcr.io/nvidia/nemo:26.02.nemotron_3_super` when working with this model.
Run all commands from `/opt/Megatron-Bridge` (e.g. `docker run -w /opt/Megatron-Bridge ...`)
Getting the Latest Code
For the best experience, it is recommended to use the latest code from the `super-v3` branch. There are two ways to do this:
Option 1: Update the Code Inside the Container
Launch the container and update the code in-place:
# Pull the latest changes from the super-v3 branch
cd /opt/Megatron-Bridge
git pull origin super-v3
Option 2: Mount the Repo from Host
This approach lets you work with the code on your host machine and mount it into the container at runtime.
Step 1: Pull the latest super-v3 branch on the host:
git checkout super-v3 && git pull origin super-v3
Step 2: Mount the repo when launching the container:
MEGATRON_BRIDGE_PATH=/path/to/Megatron-Bridge # set this to your local clone
docker run --rm -it \
-v $MEGATRON_BRIDGE_PATH:/opt/Megatron-Bridge \
-w /opt/Megatron-Bridge \
nvcr.io/nvidia/nemo:26.02.nemotron_3_super \
bash
Conversion with 🤗 Hugging Face
Import HF → Megatron
To import the HF model to your desired `$MEGATRON_PATH`, run the following command.
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/output/megatron/ckpt
torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL \
--megatron-path $MEGATRON_PATH \
--tp 1 \
--ep 8
Notes:
- The default parallelism is TP=1, EP=8 (Expert Parallel)
- Adjust `--nproc-per-node` based on your available GPUs
Export Megatron → HF
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/trained/megatron/ckpt
OUTPUT_PATH=/path/to/output/hf/ckpt
torchrun --nproc-per-node=8 examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL \
--megatron-path $MEGATRON_PATH \
--hf-path $OUTPUT_PATH \
--tp 1 \
--ep 8
Roundtrip Testing
To verify the correctness of import/export conversions:
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt
torchrun --nproc-per-node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
--hf-model-id $HF_MODEL \
--megatron-load-path $MEGATRON_PATH \
--tp 1 \
--ep 8 \
--trust-remote-code
Compare HF and Megatron Outputs
To compare outputs between HF and Megatron models:
HF_MODEL=/path/to/hf/model
MEGATRON_PATH=/path/to/megatron/ckpt
torchrun --nproc-per-node=8 examples/conversion/compare_hf_and_megatron/compare.py \
--hf_model_path $HF_MODEL \
--megatron_model_path $MEGATRON_PATH \
--prompt "Hello who are " \
--tp 8 \
--ep 8 \
--trust_remote_code
Pretraining Examples
Pretraining with Real Data
BLEND_PATH=/path/to/dataset/blend.json
CHECKPOINT_DIR=/path/to/checkpoints
torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
--per-split-data-args-path=${BLEND_PATH} \
logger.wandb_project=your_project \
logger.wandb_entity=nvidia \
logger.log_interval=5 \
checkpoint.load=${CHECKPOINT_DIR} \
checkpoint.save=${CHECKPOINT_DIR} \
checkpoint.save_interval=100 \
train.global_batch_size=8 \
train.micro_batch_size=1 \
train.train_iters=1280 \
scheduler.lr_warmup_iters=128 \
scheduler.lr_decay_iters=1152 \
scheduler.lr_wsd_decay_iters=1152 \
model.tensor_model_parallel_size=4 \
model.context_parallel_size=1 \
model.expert_model_parallel_size=64 \
model.sequence_parallel=True
Notes:
- GPU requirements: B200 GPUs are required for NVFP4 support; a minimum of 8 nodes (64 GPUs) is needed
- The default parallelism settings are TP=4, EP=64, PP=1, CP=1 with sequence parallel enabled
- Expert parallelism (EP) is set to 64 for the MoE architecture
- Adjust batch sizes and iteration counts based on your training requirements
- Make sure to set up WandB credentials if using WandB logging
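As a quick sanity check on the example schedule above (a sketch; exact semantics depend on the WSD scheduler configuration), the warmup and decay iterations should together cover the total training iterations:

```shell
# Sanity-check the example schedule: warmup + decay should equal train_iters.
TRAIN_ITERS=1280
LR_WARMUP_ITERS=128
LR_DECAY_ITERS=1152
if [ $(( LR_WARMUP_ITERS + LR_DECAY_ITERS )) -eq "$TRAIN_ITERS" ]; then
  echo "schedule OK: ${LR_WARMUP_ITERS} warmup + ${LR_DECAY_ITERS} decay = ${TRAIN_ITERS} iters"
else
  echo "schedule mismatch" >&2
  exit 1
fi
```

Rerun this check whenever you scale `train.train_iters` so the scheduler arguments stay consistent.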
Pretraining with Mock Data
For quick testing without a dataset:
CHECKPOINT_DIR=/path/to/checkpoints
torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_super.py \
logger.wandb_project=your_project \
logger.wandb_entity=nvidia \
checkpoint.load=${CHECKPOINT_DIR} \
checkpoint.save=${CHECKPOINT_DIR} \
checkpoint.save_interval=100 \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
model.hybrid_override_pattern="MEME*ME" \
model.num_layers=7
Notes:
- If `BLEND_PATH` is not specified, a mock dataset is used
- The `hybrid_override_pattern` option can be used to customize the MoE layer pattern
- Useful for debugging and testing the training pipeline
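The pattern string and layer count must agree: assuming each character of `hybrid_override_pattern` denotes one layer (e.g. M for Mamba-2, E for MoE, * for attention, which is an interpretation, not confirmed above), the string length should equal `num_layers`. A quick shell check:

```shell
# The override pattern should have exactly num_layers characters
# (assumed mapping: M = Mamba-2, E = MoE, * = attention).
PATTERN="MEME*ME"
NUM_LAYERS=7
if [ "${#PATTERN}" -eq "$NUM_LAYERS" ]; then
  echo "pattern length matches num_layers (${NUM_LAYERS})"
else
  echo "pattern/num_layers mismatch" >&2
  exit 1
fi
```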
Finetuning Recipes
Full Parameter Fine-Tuning
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/finetuned/checkpoints
torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
logger.wandb_project=your_project \
logger.wandb_entity=nvidia \
logger.log_interval=5 \
checkpoint.load=${CHECKPOINT_DIR} \
checkpoint.save=${CHECKPOINT_DIR} \
checkpoint.save_interval=50 \
train.global_batch_size=16 \
train.train_iters=200 \
scheduler.lr_warmup_iters=10 \
model.tensor_model_parallel_size=4 \
model.sequence_parallel=True \
checkpoint.pretrained_checkpoint=$MEGATRON_PATH
Notes:
- Default parallelism TP=4, EP=8, PP=1, CP=1 with sequence parallel enabled
- By default, the SQuAD dataset is used.
- Fine-tuning requires a pretrained Megatron checkpoint, which can be obtained from the "Import HF → Megatron" section above
- Adjust `global_batch_size` and parallelism settings based on your GPU memory and requirements
LoRA Fine-Tuning
To enable LoRA fine-tuning, pass `--peft lora` to the script:
MEGATRON_PATH=/path/to/pretrained/megatron/ckpt
CHECKPOINT_DIR=/path/to/lora/checkpoints
torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_super.py \
--peft lora \
logger.wandb_project=your_project \
logger.wandb_entity=nvidia \
logger.log_interval=5 \
checkpoint.load=${CHECKPOINT_DIR} \
checkpoint.save=${CHECKPOINT_DIR} \
checkpoint.save_interval=100 \
train.global_batch_size=4 \
train.train_iters=200 \
model.tensor_model_parallel_size=4 \
model.context_parallel_size=2 \
model.sequence_parallel=True \
scheduler.lr_warmup_iters=30 \
checkpoint.pretrained_checkpoint=$MEGATRON_PATH
Notes:
- By default, the target modules are the linear layers `["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2", "in_proj", "out_proj"]` in the model
- LoRA fine-tuning uses less memory and can work with smaller batch sizes
- Consider using Context Parallel (CP) for longer sequences
Quantization (PTQ and QAT)
Quantization support requires the latest code from the `super-v3` branch. See [Getting the Latest Code](#getting-the-latest-code) for instructions.
Nemotron 3 Super supports four quantization configurations:
| Config Name | Format | Description |
|---|---|---|
| `mamba_moe_fp8_aggressive` | FP8 | Aggressive FP8 quantization for Mamba-MoE |
| `mamba_moe_fp8_conservative` | FP8 | Conservative FP8 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_aggressive` | NVFP4 | Aggressive NVFP4 quantization for Mamba-MoE |
| `mamba_moe_nvfp4_conservative` | NVFP4 | Conservative NVFP4 quantization for Mamba-MoE |
Pass the desired config name via `--export-quant-cfg` to `quantize.py`.
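A small guard before launching can catch typos in the config name; this sketch checks against the four names from the table above:

```shell
# Validate the chosen quantization config against the supported names.
QUANT_CFG=mamba_moe_nvfp4_conservative
case "$QUANT_CFG" in
  mamba_moe_fp8_aggressive|mamba_moe_fp8_conservative|\
  mamba_moe_nvfp4_aggressive|mamba_moe_nvfp4_conservative)
    echo "using quantization config: $QUANT_CFG" ;;
  *)
    echo "unknown quantization config: $QUANT_CFG" >&2
    exit 1 ;;
esac
```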
Quantize
export HF_MODEL=/path/to/hf/model
export MEGATRON_SAVE_PATH=/path/to/quantized/megatron/ckpt
torchrun --nproc_per_node=8 examples/quantization/quantize.py \
--hf-model-id $HF_MODEL \
--export-quant-cfg mamba_moe_nvfp4_conservative \
--megatron-save-path $MEGATRON_SAVE_PATH \
--pp 1 \
--tp 8 \
--ep 8 \
--trust-remote-code
Verify with PTQ Generate
torchrun --nproc_per_node=8 examples/quantization/ptq_generate.py \
--hf-model-id $HF_MODEL \
--megatron-load-path $MEGATRON_SAVE_PATH \
--pp 1 \
--tp 8 \
--ep 8 \
--trust-remote-code
Notes:
- For multi-node setups (e.g. 2 nodes with 8× H100), increase `--pp` accordingly (e.g. `--pp 2`) and use a job scheduler like SLURM to launch across nodes.
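One way to size `--pp` for the multi-node case (a sketch, assuming TP stays at 8 and the world size is split as TP × PP, matching the single-node command above):

```shell
# Derive pipeline parallel size from the available GPUs
# (assumes world size = TP * PP, with TP fixed at 8 as above).
NNODES=2
GPUS_PER_NODE=8
TP=8
WORLD_SIZE=$(( NNODES * GPUS_PER_NODE ))
PP=$(( WORLD_SIZE / TP ))
echo "use --pp $PP for $WORLD_SIZE GPUs"
```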
Export Quantized Megatron Checkpoint → HF
After quantization, export the Megatron checkpoint back to Hugging Face format:
HF_MODEL=/path/to/hf/model
MEGATRON_LOAD_PATH=/path/to/quantized/megatron/ckpt
EXPORT_DIR=/path/to/output/hf/ckpt
torchrun --nproc_per_node=8 examples/quantization/export.py \
--hf-model-id $HF_MODEL \
--megatron-load-path $MEGATRON_LOAD_PATH \
--export-dir $EXPORT_DIR \
--pp 8 \
--dtype bfloat16 \
--trust-remote-code
Quantization-Aware Training (QAT)
After quantization, further improve model quality with QAT by continuing training from a quantized Megatron checkpoint.
MEGATRON_PATH=/path/to/quantized/megatron/ckpt
CHECKPOINT_DIR=/path/to/qat/checkpoints
torchrun --nproc-per-node=8 examples/models/nemotron_3/qat_nemotron_3_super.py \
--megatron-load-path=${MEGATRON_PATH} \
--seq-length=8192 \
--packed-sequence \
logger.wandb_project=your_project \
logger.wandb_entity=nvidia \
logger.log_interval=5 \
checkpoint.load=${CHECKPOINT_DIR} \
checkpoint.save=${CHECKPOINT_DIR} \
checkpoint.save_interval=50 \
train.global_batch_size=16 \
train.train_iters=200 \
scheduler.lr_warmup_iters=10 \
model.tensor_model_parallel_size=4 \
model.sequence_parallel=True