Performance Summary

April 17, 2026

This document provides performance benchmarks for various large language models using NeMo AutoModel with the PyTorch backend.

Pre-Training Performance

The table below shows training performance for full sequences with no padding across different model architectures and scales.

System: DGX-H100, Precision: BF16

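| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron V3 Super 120B (26.02) | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | 64 | - | 64 | TE + DeepEP + TorchSDPA | 7.286 | 334 | 4,497 |
| Nemotron V3 Nano 30B (26.02) | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP + TorchSDPA | 15.614 | 328 | 16,789 |
| DeepSeek V3 671B | 1024 | 8192 | 1 | 8 | 4 | 4096 | 1 | 4 | 1 | 64 | 8 | 256 | TE + DeepEP | 37.87 | 216 | 865 |
| DeepSeek V3 671B | 256 | 512 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | 64 | 8 | 64 | TE + DeepEP | 8.18 | 250 | 1,002 |
| Kimi K2 | 256 | 512 | 1 | 8 | 2 | 4096 | 1 | 8 | 1 | 32 | 4 | 32 | TE + DeepEP | 8.86 | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP | 21.773 | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | - | 8 | TE + DeepEP + FlexAttn | 10.04 | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | - | - | 64 | TE + DeepEP + FlexAttn | 4.30 | 231 | 7,626 |
| Llama3 70B | 64 | 128 | 1 | 1 | 4 | 8192 | 1 | 1 | 2 | - | - | 32 | TE + fsdp2_prefetch | 18.90 | 389 | 866.77 |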
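The throughput column can be cross-checked against the batch and timing columns. The sketch below assumes that Tokens/sec/GPU equals GBS × Seq Length divided by (step time × #GPUs). This relation is not stated in the document, but it reproduces the reported values to within rounding of the step time. The function name is hypothetical.

```python
def tokens_per_sec_per_gpu(gbs: int, seq_len: int, step_time_s: float, num_gpus: int) -> float:
    """Tokens processed per second per GPU over one global step (assumed relation)."""
    return gbs * seq_len / (step_time_s * num_gpus)


# Llama3 70B pre-training row: 64 GPUs, GBS 128, Seq Length 8192, 18.90 s per step.
llama3_70b = tokens_per_sec_per_gpu(gbs=128, seq_len=8192, step_time_s=18.90, num_gpus=64)

# DeepSeek V3 671B row on 1024 GPUs: GBS 8192, Seq Length 4096, 37.87 s per step.
deepseek_v3 = tokens_per_sec_per_gpu(gbs=8192, seq_len=4096, step_time_s=37.87, num_gpus=1024)

# Both results land within about one token/sec of the reported 866.77 and 865.
print(f"Llama3 70B: {llama3_70b:.1f}, DeepSeek V3 671B: {deepseek_v3:.1f}")
```

The small residual difference is expected because the table rounds the step time to two or three decimals.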

Fine-Tuning (LoRA) Performance

The table below shows fine-tuning (LoRA) performance for full sequences with no padding across different model architectures and scales.

System: DGX-H100, Precision: BF16

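| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | TE + triton | 10.51 | 402 | 12472.87 |
| Qwen2.5 7B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | TE + triton | 9.29 | 423 | 14110.05 |
| Llama3 70B | 8 | 32 | 2 | 2 | 4 | 4096 | 2 | 1 | 1 | - | 1 | 4 | TE + triton + fsdp2_prefetch | 15.00 | 316 | 1091.85 |
| Qwen2.5 32B | 8 | 32 | 2 | 2 | 4 | 4096 | 2 | 1 | 1 | - | 1 | 4 | TE + triton + fsdp2_prefetch | 7.28 | 301 | 2250.31 |
| Llama3 70B 2-node | 16 | 32 | 2 | 2 | 2 | 4096 | 2 | 1 | 1 | - | 1 | 8 | TE + triton + fsdp2_prefetch | 8.32 | 285 | 984.85 |
| Qwen2.5 32B 2-node | 16 | 32 | 2 | 2 | 2 | 4096 | 2 | 1 | 1 | - | 1 | 8 | TE + triton + fsdp2_prefetch | 3.95 | 277 | 2072.89 |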
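One way to read the single-node and 2-node rows together is to ask how much per-GPU throughput is retained when the GPU count doubles at a fixed GBS of 32. The sketch below computes that ratio from the Tokens/sec/GPU values in the table. The "retention" metric and the helper name are illustrative and do not come from the benchmark itself.

```python
def per_gpu_retention(tokens_per_gpu_one_node: float, tokens_per_gpu_two_node: float) -> float:
    """Fraction of single-node per-GPU throughput kept on two nodes."""
    return tokens_per_gpu_two_node / tokens_per_gpu_one_node


# (single-node, 2-node) Tokens/sec/GPU values copied from the LoRA table.
lora_rows = {
    "Llama3 70B": (1091.85, 984.85),
    "Qwen2.5 32B": (2250.31, 2072.89),
}

retention = {
    model: per_gpu_retention(one_node, two_node)
    for model, (one_node, two_node) in lora_rows.items()
}

for model, ratio in retention.items():
    # Roughly 90% for Llama3 70B and 92% for Qwen2.5 32B.
    print(f"{model}: {ratio:.1%} of single-node per-GPU throughput")
```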

Glossary

  • MFU: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
  • TP: Tensor Parallelism - splits individual layers across GPUs
  • PP: Pipeline Parallelism - splits model layers into stages
  • EP: Expert Parallelism - distributes MoE experts across GPUs
  • DP: Data Parallelism - replicates model and splits data
  • VP: Virtual Pipeline - number of pipeline stages per GPU for interleaving
  • MBS: Micro-Batch Size - size of one forward pass in pipeline
  • LBS: Local Batch Size - size of one step per GPU
  • GBS: Global Batch Size - total batch size across all GPUs
  • GA: Gradient Accumulation - number of local batches accumulated before each optimizer step
  • TE: Transformer Engine kernel optimizations - RMSNorm, Linear and DotProductAttention
  • DeepEP: DeepSeek's expert-parallel communication library - optimized dispatch and combine kernels for EP routing in MoE models
  • FlexAttn: PyTorch's Flex Attention
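The batch-size and parallelism terms above are linked by simple arithmetic. The sketch below assumes two relations that the document does not state but that hold for every row of the benchmark tables: GBS = LBS × GA × DP, and #GPUs = TP × PP × CP × DP, with the FSDP column taken as the data-parallel size DP. EP is left out of the GPU product on the assumption that expert parallelism reuses the data-parallel ranks. Both function names are hypothetical.

```python
def global_batch_size(lbs: int, ga: int, dp: int) -> int:
    """GBS as local batch size x gradient accumulation steps x data-parallel size."""
    return lbs * ga * dp


def gpus_required(tp: int, pp: int, cp: int, dp: int) -> int:
    """GPU count as the product of the tensor, pipeline, context and data-parallel sizes."""
    return tp * pp * cp * dp


# DeepSeek V3 671B on 1024 GPUs: LBS 8, GA 4, TP 1, PP 4, CP 1, FSDP 256.
assert global_batch_size(lbs=8, ga=4, dp=256) == 8192
assert gpus_required(tp=1, pp=4, cp=1, dp=256) == 1024

# Llama3 70B LoRA on 8 GPUs: LBS 2, GA 4, TP 2, PP 1, CP 1, FSDP 4.
assert global_batch_size(lbs=2, ga=4, dp=4) == 32
assert gpus_required(tp=2, pp=1, cp=1, dp=4) == 8
```

A check like this is a quick way to catch an inconsistent configuration before launching a run.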

Configuration Files

Pre-training and fine-tuning (LoRA) benchmark configurations are available in examples/llm_benchmark/.

:::{note}

  • All benchmarks use mock data for consistent performance measurement.
  • Fake balanced gate is enabled to simulate ideal expert routing.
  • No gradient clipping is applied, so the numbers reflect pure training performance.
  • MFU is calculated using the peak TFLOPs for the system (989 for BF16 on H100).
  • Step times include the forward and backward passes plus the optimizer step for the global batch.

:::
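As a worked example of the MFU note, the sketch below divides a reported Model TFLOPs/sec/GPU value by the 989 TFLOPs BF16 peak quoted above. The helper name and constant name are illustrative, and the tables themselves report TFLOPs rather than MFU, so the resulting percentage is derived here rather than taken from the benchmark.

```python
H100_BF16_PEAK_TFLOPS = 989.0  # peak value quoted in the note above


def mfu(model_tflops_per_gpu: float, peak_tflops: float = H100_BF16_PEAK_TFLOPS) -> float:
    """Model FLOPs Utilization: achieved TFLOPs/sec/GPU divided by the hardware peak."""
    return model_tflops_per_gpu / peak_tflops


# Llama3 70B pre-training row reports 389 Model TFLOPs/sec/GPU.
llama3_70b_mfu = mfu(389)
print(f"Llama3 70B pre-training MFU: {llama3_70b_mfu:.1%}")  # a little over 39%
```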

Version Information

  • Last Updated: 2025-10-02
  • NeMo AutoModel Version: main Branch