Megatron Bridge


This directory contains examples of using Model Optimizer with the NeMo Megatron-Bridge framework for pruning, distillation, quantization, and more.

| Section | Description | Link |
| --- | --- | --- |
| Pre-Requisites | Development environment setup | [Link] |
| Pruning | Examples of pruning a model using the Minitron algorithm | [Link] |
| Distillation | Examples of distilling a pruned or quantized model | [Link] |
| Post-Training Quantization | Examples of quantizing a model | [Link] |
| Resources | Extra links to relevant resources | [Link] |

Pre-Requisites

Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., nvcr.io/nvidia/nemo:26.04), which has all the dependencies pre-installed.

To get the ModelOpt example scripts, mount your Model-Optimizer repo into the container as follows:

export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
  git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi

export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.04
docker run \
  --gpus all \
  --shm-size=16GB \
  --net=host \
  --ulimit memlock=-1 \
  --rm -it \
  -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
  -w /opt/Model-Optimizer/examples/megatron_bridge \
  ${DOCKER_IMAGE} bash

Once inside the container, log in with your HuggingFace token to download gated datasets and models. Note that the default dataset for pruning and quantization, nemotron-post-training-dataset-v2, is gated.

hf auth login --token <your token>

Warning

Use python -m pip instead of pip to avoid conflicts with the system-wide packages installed in the NeMo containers. You may also refer to this doc on how to correctly install packages in the NeMo containers without breaking the existing torch installation.

Also install the additional dependencies from the requirements.txt file:

python -m pip install -r requirements.txt

Pruning

This section shows how to prune a HuggingFace model using the Minitron algorithm in the Megatron-Bridge framework. Check out other available pruning algorithms, supported frameworks and models, and the general pruning getting-started guide in the pruning README.

The script supports three NAS-based pruning targets and one manual export mode:

| Mode | Flag | Description |
| --- | --- | --- |
| NAS | --prune_target_params | Prune to a target total parameter count |
| NAS | --prune_target_active_params | Prune to a target active parameter count (useful for MoE models). For non-MoE models, this is equivalent to --prune_target_params. |
| NAS | --prune_target_memory_mb | Prune to a target memory footprint in MB (weights + KV-cache) for a given batch size and sequence length, assuming BF16 precision |
| Manual | --prune_export_config | Prune directly to a specified architecture config (no NAS). Useful if you want to take the top K candidates and do a short knowledge distillation before selecting the best model. |

Multiple NAS targets can be combined, e.g. --prune_target_params 6e9 --prune_target_memory_mb 12288 finds the best model with under 6B params and under 12 GB memory footprint at the default batch size of 1 and sequence length of 4096, assuming BF16 precision.
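
For illustration, a combined-target run might look like the following. This is a sketch based on the flags above and the command pattern used in the examples below; the output path is a placeholder:

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --prune_target_memory_mb 12288 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B-12GB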

Prune by total parameter count — prune Qwen3-8B to 6B on 2 GPUs (pipeline parallelism = 2) while skipping pruning of num_attention_heads, using the following defaults: 1024 calibration samples from nemotron-post-training-dataset-v2, at most 20% depth (num_layers) and 40% width pruned per prunable hparam (hidden_size, ffn_hidden_size, ...), and the top 10 candidates evaluated on MMLU score (5% sampled data) to select the best model.

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --hparams_to_skip num_attention_heads \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B

Prune by active parameter count — useful for MoE models where most experts are inactive per token (e.g. prune Nemotron-3-Nano-30B-A3B-BF16 (3.6B active params) to 3B active params):

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --prune_target_active_params 3e9 \
    --output_hf_path /tmp/Nemotron-3-Nano-30B-A3B-BF16-Pruned-3B-Active

Prune by memory footprint — prune to fit a target GPU memory budget (weights + KV-cache at the given sequence length and batch size, assuming BF16):

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_memory_mb 12288 \
    --seq_length 4096 \
    --calib_mbs 1 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-12GB

Manual pruning — prune directly to a specified architecture (no NAS, no score evaluation):

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual

To see the full usage for advanced configurations, run:

torchrun --nproc_per_node 1 prune_minitron.py --help

Tip

If the number of layers in the model is not divisible by the number of GPUs (i.e., the pipeline parallel (PP) size), you can configure uneven PP by setting --num_layers_in_first_pipeline_stage and --num_layers_in_last_pipeline_stage. E.g. for Qwen3-8B with 36 layers on 8 GPUs, you can set both to 3 to get a 3-5-5-5-5-5-5-3 layer split across GPUs.
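
As a sketch, an 8-GPU run using this uneven split could look like the following (flags other than the two pipeline-stage options are taken from the earlier examples; the output path is a placeholder):

torchrun --nproc_per_node 8 prune_minitron.py \
    --pp_size 8 \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --num_layers_in_first_pipeline_stage 3 \
    --num_layers_in_last_pipeline_stage 3 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B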

Distillation

This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.

This can be used stand-alone or after Pruning / Post-Training Quantization to recover the model's accuracy by distilling from the original model (teacher).

The distill.py script supports both standard HuggingFace checkpoints and Puzzletron AnyModel checkpoints as student/teacher inputs. Just pass the checkpoint path via --student_hf_path / --teacher_hf_path. The distilled model is saved to <output_dir>/checkpoints in Megatron distributed checkpoint format.

Data Preparation

The distillation script expects pre-tokenized data in Megatron's binary format (.bin / .idx files).

See the Dataset Preparation README for full instructions on tokenizing JSONL files and Hugging Face datasets, and for obtaining the list of output prefixes that you can pass to the --data_paths argument.
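
For example, each prefix passed to --data_paths refers to a .bin/.idx pair without the file extension, and the number preceding each prefix is that dataset's relative weight (Megatron's blended-data convention). An illustrative layout:

tokenized_qwen3/data1_text_document.bin
tokenized_qwen3/data1_text_document.idx
tokenized_qwen3/data2_text_document.bin
tokenized_qwen3/data2_text_document.idx

--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document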

Distillation with Real Data

Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):

torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
    --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
    --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
    --seq_length 8192 \
    --mbs 1 \
    --gbs 768 \
    --train_iters 15000 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --lr_warmup_iters 50 \
    --eval_interval 100 \
    --eval_iters 32 \
    --log_interval 10 \
    --output_dir /output/qwen3_8b_to_4b_distill

Tensorboard logging is enabled by default and logs are saved to the <output_dir>/tensorboard directory. To use Weights & Biases for logging, set the WANDB_API_KEY environment variable and pass the --wandb_project argument. Optionally, you can also pass the --wandb_entity and --wandb_exp_name arguments to group runs under an entity and experiment name.
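
For example, a run with W&B logging enabled might be launched as follows; the project, entity, and experiment names are placeholders:

export WANDB_API_KEY=<your W&B API key>
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    ... \
    --wandb_project qwen3_distill \
    --wandb_entity my-team \
    --wandb_exp_name qwen3_8b_to_4b \
    --output_dir /output/qwen3_8b_to_4b_distill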

To see all available arguments:

torchrun --nproc_per_node 1 distill.py --help

Quick Test with Mock Data

Example usage with mock data for quick testing (no pre-tokenized data needed):

torchrun --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-0.6B \
    --student_hf_path Qwen/Qwen3-0.6B \
    --use_mock_data \
    --seq_length 512 \
    --mbs 1 \
    --gbs 8 \
    --train_iters 100 \
    --eval_interval 10 \
    --eval_iters 4 \
    --output_dir /tmp/test_distill

Slurm Usage

To run the distillation script on a Slurm cluster for multi-node training, simply use python instead of torchrun and set the number of nodes with an #SBATCH --nodes=<num_nodes> directive in your Slurm script.
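
A minimal Slurm batch script might look like the following sketch; the resource directives and the use of srun for per-rank launch are placeholders for your cluster setup, and the distill.py arguments are the same as in the examples above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # one task per GPU (adjust for your cluster)
#SBATCH --gpus-per-node=8

srun python distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
    ... \
    --output_dir /output/qwen3_8b_to_4b_distill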

Converting to Hugging Face format (optional)

The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:

Inline -- add --hf_export_path and --student_hf_model to the distill.py command to automatically convert the final checkpoint after distillation:

torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    ... \
    --hf_export_path /path/to/save/distilled_hf_ckpt \
    --student_hf_model Qwen/Qwen3-4B

--student_hf_model should match the base architecture of the student (it is used as a template for export). For non-Puzzletron (i.e., standard) models, it should be the same as --student_hf_path.

Separate conversion -- convert any saved iteration using the Megatron-Bridge conversion script:

uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
    --hf-model <path_to_pruned_hf_ckpt> \
    --megatron-path <distill_output_dir>/checkpoints/iter_<iter_number> \
    --hf-path <path_to_save_distilled_hf_ckpt>

For more details, see the Megatron-Bridge conversion README.

Distillation Results

See examples/pruning/ for distillation experiment results covering Minitron and Puzzletron pruning algorithms.

Post-Training Quantization

Check out the quantization scripts for LLMs and VLMs in the Megatron-Bridge repository here.

Resources