diffusers-torchao

February 19, 2025 · View on GitHub

Optimize image and video generation with diffusers and torchao, combining torch.compile() 🔥

We provide end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts**.

*The experiments were run on a single H100, 80 GB GPU. **The experiments were run on a single A100, 80 GB GPU. For a single H100, the speedup is 33.04%.

Updates

  • torchao is now officially supported as a quantization backend in diffusers. Check out the docs for more details (a short sketch follows this list).
  • torchao is being integrated into diffusers as an official quantization backend. Be on the lookout for this PR to get merged.
  • torchao will soon be added as a quantization backend in diffusers, making it even easier to use with diffusers.
  • Check out our new AoT compilation and serialization guide to reduce framework overheads.
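For reference, here is a minimal sketch of what the official backend looks like with a recent diffusers release. The exact quant_type strings and supported components depend on your diffusers/torchao versions, so treat this as illustrative rather than definitive:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

# Quantize the transformer with int8 weight-only quantization while loading it.
quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")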

No-frills code:

from diffusers import FluxPipeline
+ from torchao.quantization import autoquant
import torch 

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
+ pipeline.transformer = autoquant(pipeline.transformer, error_on_unseen=False)
image = pipeline(
    "a dog surfing on moon", guidance_scale=3.5, num_inference_steps=50
).images[0]

Throw in torch.compile() to make it go brrr:

# If you are using "autoquant" then you should compile first and then
# apply autoquant.
+ pipeline.transformer.to(memory_format=torch.channels_last)
+ pipeline.transformer = torch.compile(
+    pipeline.transformer, mode="max-autotune", fullgraph=True
+)

This, alone, is sufficient to cut down inference time for Flux.1-Dev from 6.431 seconds to 3.483 seconds on an H100. Check out the inference directory for the code.
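Putting the two snippets above together in the order the comment suggests (compile first, then apply autoquant), a full script would look roughly like this:

import torch
from diffusers import FluxPipeline
from torchao.quantization import autoquant

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the transformer first ...
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)
# ... and then apply autoquant.
pipeline.transformer = autoquant(pipeline.transformer, error_on_unseen=False)

image = pipeline(
    "a dog surfing on moon", guidance_scale=3.5, num_inference_steps=50
).images[0]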

Note

Quantizing to a supported datatype while using fp16 as the base precision can lead to overflows. The recommended base precision for CogVideoX-2b is fp16, while that of CogVideoX-5b is bf16. If comparisons were to be made in fp16, the speedup gains would be ~23% and ~32%, respectively.

Environment

We conducted all our experiments on single A100 (80 GB) and H100 GPUs. Since we wanted to benefit from torch.compile(), we used relatively modern cards. For older cards, the same memory savings (demonstrated below) can be obtained.

We always default to using the PyTorch nightlies and the latest diffusers and torchao codebases. We used CUDA 12.2.

Benchmarking results

We benchmark two models (Flux.1-Dev and CogVideoX) using different supported quantization datatypes in torchao. The results are as follows:

Flux.1 Dev Benchmarks

Additional Results
| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 4 | True | True | False | fp8wo | False | 22.368 | 35.616 | 16.204 |
| black-forest-labs/FLUX.1-dev | 8 | False | False | False | None | False | 31.438 | 47.509 | 49.438 |
| black-forest-labs/FLUX.1-dev | 8 | False | True | False | None | False | 31.439 | 47.506 | 31.685 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | int8dq | False | 20.386 | 31.608 | 3.406 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | int8wo | False | 20.387 | 31.609 | 16.08 |
| black-forest-labs/FLUX.1-dev | 8 | False | True | False | fp8dq | False | 20.357 | 36.425 | 23.393 |
| black-forest-labs/FLUX.1-dev | 8 | True | True | False | int8dq | False | 22.397 | 38.464 | 24.696 |
| black-forest-labs/FLUX.1-dev | 8 | False | False | False | int8dq | False | 20.386 | 36.458 | 333.567 |
| black-forest-labs/FLUX.1-dev | 4 | True | False | False | fp8dq | False | 22.361 | 35.826 | 26.259 |
| black-forest-labs/FLUX.1-dev | 8 | False | True | False | int8dq | False | 20.386 | 36.453 | 24.725 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | False | int8wo | False | 22.396 | 35.616 | 4.574 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | fp8wo | False | 20.363 | 31.607 | 4.395 |
| black-forest-labs/FLUX.1-dev | 8 | True | False | False | int8wo | False | 22.397 | 38.468 | 57.274 |
| black-forest-labs/FLUX.1-dev | 4 | True | False | False | int8dq | False | 22.396 | 35.616 | 219.687 |
| black-forest-labs/FLUX.1-dev | 4 | False | False | False | None | False | 31.438 | 39.49 | 24.828 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | False | fp8dq | False | 22.363 | 35.827 | 3.192 |
| black-forest-labs/FLUX.1-dev | 1 | False | False | False | fp8dq | False | 20.356 | 31.817 | 8.622 |
| black-forest-labs/FLUX.1-dev | 8 | False | False | False | fp8dq | False | 20.357 | 36.428 | 55.097 |
| black-forest-labs/FLUX.1-dev | 4 | False | False | False | int8wo | False | 20.384 | 31.606 | 29.414 |
| black-forest-labs/FLUX.1-dev | 1 | True | False | False | fp8wo | False | 22.371 | 35.618 | 8.33 |
| black-forest-labs/FLUX.1-dev | 1 | False | False | False | int8dq | False | 20.386 | 31.608 | 130.498 |
| black-forest-labs/FLUX.1-dev | 8 | True | True | False | fp8wo | False | 22.369 | 38.436 | 31.718 |
| black-forest-labs/FLUX.1-dev | 4 | False | False | False | fp8wo | False | 20.363 | 31.607 | 26.61 |
| black-forest-labs/FLUX.1-dev | 1 | True | False | False | int8wo | False | 22.397 | 35.616 | 8.49 |
| black-forest-labs/FLUX.1-dev | 8 | True | False | False | fp8dq | False | 22.363 | 38.433 | 51.547 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | fp8dq | False | 20.359 | 31.82 | 11.919 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | None | False | 31.438 | 39.488 | 15.948 |
| black-forest-labs/FLUX.1-dev | 4 | True | True | False | int8dq | False | 22.397 | 35.616 | 12.594 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | False | fp8wo | False | 22.369 | 35.616 | 4.326 |
| black-forest-labs/FLUX.1-dev | 4 | True | False | False | int8wo | False | 22.397 | 35.617 | 29.394 |
| black-forest-labs/FLUX.1-dev | 1 | False | False | False | fp8wo | False | 20.362 | 31.607 | 8.402 |
| black-forest-labs/FLUX.1-dev | 8 | True | False | False | int8dq | False | 22.397 | 38.468 | 322.688 |
| black-forest-labs/FLUX.1-dev | 1 | False | False | False | int8wo | False | 20.385 | 31.607 | 8.551 |
| black-forest-labs/FLUX.1-dev | 8 | True | True | False | fp8dq | False | 22.363 | 38.43 | 23.261 |
| black-forest-labs/FLUX.1-dev | 4 | False | False | False | fp8dq | False | 20.356 | 31.817 | 28.154 |
| black-forest-labs/FLUX.1-dev | 1 | True | False | False | int8dq | False | 22.397 | 35.616 | 119.736 |
| black-forest-labs/FLUX.1-dev | 8 | True | False | False | fp8wo | False | 22.369 | 38.441 | 51.311 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | fp8wo | False | 20.363 | 31.607 | 16.232 |
| black-forest-labs/FLUX.1-dev | 4 | True | True | False | int8wo | False | 22.399 | 35.619 | 16.158 |
| black-forest-labs/FLUX.1-dev | 8 | False | False | False | fp8wo | False | 20.363 | 36.434 | 51.223 |
| black-forest-labs/FLUX.1-dev | 4 | False | False | False | int8dq | False | 20.385 | 31.607 | 221.588 |
| black-forest-labs/FLUX.1-dev | 1 | True | False | False | fp8dq | False | 22.364 | 35.829 | 7.34 |
| black-forest-labs/FLUX.1-dev | 1 | False | False | False | None | False | 31.438 | 33.851 | 6.573 |
| black-forest-labs/FLUX.1-dev | 4 | True | True | False | fp8dq | False | 22.363 | 35.827 | 11.885 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | int8wo | False | 20.384 | 31.606 | 4.615 |
| black-forest-labs/FLUX.1-dev | 8 | False | True | False | int8wo | False | 20.386 | 36.453 | 31.159 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | False | int8dq | False | 22.397 | 35.617 | 3.357 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | fp8dq | False | 20.357 | 31.818 | 3.243 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | int8dq | False | 20.384 | 31.606 | 12.513 |
| black-forest-labs/FLUX.1-dev | 8 | False | True | False | fp8wo | False | 20.363 | 36.43 | 31.783 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | None | False | 31.438 | 33.851 | 4.209 |
| black-forest-labs/FLUX.1-dev | 8 | False | False | False | int8wo | False | 20.386 | 36.457 | 57.026 |
| black-forest-labs/FLUX.1-dev | 8 | True | True | False | int8wo | False | 22.397 | 38.464 | 31.216 |
| black-forest-labs/FLUX.1-dev | 4 | True | False | False | fp8wo | False | 22.368 | 35.616 | 26.716 |

With the newly added fp8dqrow scheme, we can bring the inference latency down to 2.966 seconds for Flux.1 Dev (batch size: 1, steps: 28, resolution: 1024) on an H100. fp8dqrow uses more scales per tensor (one per row), which lowers the quantization error. Additional results:
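For reference, fp8dqrow corresponds to fp8 dynamic quantization with per-row scales. With a recent torchao version it can be applied roughly as follows; import paths and argument names may differ across releases, so treat this as a sketch:

import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight
from torchao.quantization.granularity import PerRow

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# fp8 dynamic activation + fp8 weight quantization with one scale per row.
quantize_(
    pipeline.transformer,
    float8_dynamic_activation_float8_weight(granularity=PerRow()),
)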

Additional `fp8dqrow` results
| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 4 | True | True | True | fp8dqrow | False | 22.377 | 35.83 | 11.441 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | True | fp8dqrow | False | 20.368 | 31.818 | 2.981 |
| black-forest-labs/FLUX.1-dev | 4 | True | True | False | fp8dqrow | False | 22.378 | 35.829 | 11.682 |
| black-forest-labs/FLUX.1-dev | 1 | False | True | False | fp8dqrow | False | 20.37 | 31.82 | 3.039 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | False | fp8dqrow | False | 20.369 | 31.818 | 11.692 |
| black-forest-labs/FLUX.1-dev | 4 | False | True | True | fp8dqrow | False | 20.367 | 31.817 | 11.421 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | True | fp8dqrow | False | 22.379 | 35.83 | 2.966 |
| black-forest-labs/FLUX.1-dev | 1 | True | True | False | fp8dqrow | False | 22.376 | 35.827 | 3.03 |

Trade-offs, trade-offs, and more trade-offs

We know that the table included above is hard to parse, so here are a couple of points worth noting.

  • Select the quantization technique that gives you the best trade-off between memory and latency.
  • A quantization technique may exhibit different optimal settings for a given batch size. For example, for a batch size of 4, int8dq gives the best time without any QKV fusion, but that is not the case for other batch sizes.

The section below drives this point home.

Higher batch sizes like 16 and above

This is what the top-5 latencies look like:

| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 16 | False | True | True | fp8dq | False | 20.356 | 52.704 | 45.004 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | True | fp8dqrow | False | 20.368 | 52.715 | 45.521 |
| black-forest-labs/FLUX.1-dev | 16 | True | True | False | fp8dq | False | 22.363 | 52.464 | 45.614 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | False | fp8dq | False | 20.356 | 50.458 | 45.865 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | False | fp8dqrow | False | 20.367 | 50.469 | 46.392 |

But interestingly, if we use an exotic fpx scheme for quantization, we can get away with less memory at the cost of increased latency:

| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 16 | False | True | True | fp6_e3m2 | False | 17.591 | 49.938 | 61.649 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | True | fp4_e2m1 | False | 14.823 | 47.173 | 61.75 |
| black-forest-labs/FLUX.1-dev | 16 | True | True | False | fp6_e3m2 | False | 19.104 | 49.206 | 62.244 |
| black-forest-labs/FLUX.1-dev | 16 | True | True | False | fp4_e2m1 | False | 15.827 | 45.929 | 62.296 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | False | fp6_e3m2 | False | 17.598 | 47.7 | 62.551 |

As a reference, with just torch.bfloat16 and SDPA, for a batch size of 16, we get:

| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 16 | False | False | False | None | False | 31.438 | 61.548 | 97.545 |

Warning


Using fp4_e2m1 on the VAE significantly degrades image quality.

Semi-structured sparsity + dynamic int8 quant

In our inference/benchmark_image.py script, there's an option to enable semi-structured sparsity with dynamic int8 quantization, which is particularly suitable for larger batch sizes. You can enable it through the --sparsify flag (a sketch of doing this directly with torchao follows the list below). However, at the time of this writing, we found that it significantly degrades image quality.

Things to note:

  • Only CUDA 12.4 with H100 or A100 devices supports this option. You can use this Docker container: spsayakpaul/torchao-exps:latest. It has CUDA 12.4, torch nightlies, and other libraries installed to run the sparsity benchmark.
  • Running with semi-structured sparsity and int8 dynamic quantization allows a batch size of 16.
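For reference, the --sparsify flag roughly corresponds to applying torchao's int8 dynamic quantization with a semi-sparse (2:4) weight layout to the transformer. A minimal sketch, noting that the layout argument name has changed across torchao versions:

import torch
from diffusers import FluxPipeline
from torchao.dtypes import SemiSparseLayout
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# int8 dynamic activation + int8 weight quantization with 2:4 semi-structured sparsity.
quantize_(
    pipeline.transformer,
    int8_dynamic_activation_int8_weight(layout=SemiSparseLayout()),
)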

The table below provides some benchmarks:

Sparsity Benchmarks
| ckpt_id | batch_size | fuse | compile | compile_vae | sparsify | time (s) |
|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | 16 | True | True | True | True | 50.62 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | True | True | 51.167 |
| black-forest-labs/FLUX.1-dev | 16 | True | True | False | True | 51.418 |
| black-forest-labs/FLUX.1-dev | 16 | False | True | False | True | 51.941 |

Note

We can additionally compile the VAE, and it should work with most of the quantization schemes: pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True). But for the sake of simplicity, we decided not to include it.

CogVideoX Benchmarks


A100

| model_type | compile | fuse_qkv | quantize_vae | quantization | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|
| 5B | False | False | False | fp16 | 19.764 | 31.746 | 258.962 |
| 5B | False | True | False | fp16 | 21.979 | 33.961 | 257.761 |
| 5B | True | False | False | fp16 | 19.763 | 31.742 | 225.998 |
| 5B | True | True | False | fp16 | 21.979 | 33.961 | 225.814 |
| 5B | False | False | False | bf16 | 19.764 | 31.746 | 243.312 |
| 5B | False | True | False | bf16 | 21.979 | 33.96 | 242.519 |
| 5B | True | False | False | bf16 | 19.763 | 31.742 | 212.022 |
| 5B | True | True | False | bf16 | 21.979 | 33.961 | 211.377 |
| 5B | False | False | False | int8wo | 10.302 | 22.288 | 260.036 |
| 5B | False | True | False | int8wo | 11.414 | 23.396 | 271.627 |
| 5B | True | False | False | int8wo | 10.301 | 22.282 | 205.899 |
| 5B | True | True | False | int8wo | 11.412 | 23.397 | 209.640 |
| 5B | False | False | False | int8dq | 10.3 | 22.287 | 550.239 |
| 5B | False | True | False | int8dq | 11.414 | 23.399 | 530.113 |
| 5B | True | False | False | int8dq | 10.3 | 22.286 | 177.256 |
| 5B | True | True | False | int8dq | 11.414 | 23.399 | 177.666 |
| 5B | False | False | False | int4wo | 6.237 | 18.221 | 1130.86 |
| 5B | False | True | False | int4wo | 6.824 | 18.806 | 1127.56 |
| 5B | True | False | False | int4wo | 6.235 | 18.217 | 1068.31 |
| 5B | True | True | False | int4wo | 6.825 | 18.809 | 1067.26 |
| 5B | False | False | False | int4dq | 11.48 | 23.463 | 340.204 |
| 5B | False | True | False | int4dq | 12.785 | 24.771 | 323.873 |
| 5B | True | False | False | int4dq | 11.48 | 23.466 | 219.393 |
| 5B | True | True | False | int4dq | 12.785 | 24.774 | 218.592 |
| 5B | False | False | False | fp6 | 7.902 | 19.886 | 283.478 |
| 5B | False | True | False | fp6 | 8.734 | 20.718 | 281.083 |
| 5B | True | False | False | fp6 | 7.9 | 19.885 | 205.123 |
| 5B | True | True | False | fp6 | 8.734 | 20.719 | 204.564 |
| 5B | False | False | False | autoquant | 19.763 | 24.938 | 540.621 |
| 5B | False | True | False | autoquant | 21.978 | 27.1 | 504.031 |
| 5B | True | False | False | autoquant | 19.763 | 24.73 | 176.794 |
| 5B | True | True | False | autoquant | 21.978 | 26.948 | 177.122 |
| 5B | False | False | False | sparsify | 6.743 | 18.727 | 308.767 |
| 5B | False | True | False | sparsify | 7.439 | 19.433 | 300.013 |
| 2B | False | False | False | fp16 | 12.535 | 24.511 | 96.918 |
| 2B | False | True | False | fp16 | 13.169 | 25.142 | 96.610 |
| 2B | True | False | False | fp16 | 12.524 | 24.498 | 83.938 |
| 2B | True | True | False | fp16 | 13.169 | 25.143 | 84.694 |
| 2B | False | False | False | bf16 | 12.55 | 24.528 | 93.896 |
| 2B | False | True | False | bf16 | 13.194 | 25.171 | 93.396 |
| 2B | True | False | False | bf16 | 12.486 | 24.526 | 81.224 |
| 2B | True | True | False | bf16 | 13.13 | 25.171 | 81.520 |
| 2B | False | False | False | fp6 | 6.125 | 18.164 | 95.684 |
| 2B | False | True | False | fp6 | 6.769 | 18.808 | 91.698 |
| 2B | True | False | False | fp6 | 6.125 | 18.164 | 72.261 |
| 2B | True | True | False | fp6 | 6.767 | 18.808 | 90.585 |
| 2B | False | False | False | int8wo | 6.58 | 18.621 | 102.941 |
| 2B | False | True | False | int8wo | 6.894 | 18.936 | 102.403 |
| 2B | True | False | False | int8wo | 6.577 | 18.618 | 81.389 |
| 2B | True | True | False | int8wo | 6.891 | 18.93 | 83.079 |
| 2B | False | False | False | int8dq | 6.58 | 18.621 | 197.254 |
| 2B | False | True | False | int8dq | 6.894 | 18.936 | 190.125 |
| 2B | True | False | False | int8dq | 6.58 | 18.621 | 75.16 |
| 2B | True | True | False | int8dq | 6.891 | 18.933 | 74.981 |
| 2B | False | False | False | int4dq | 7.344 | 19.385 | 132.155 |
| 2B | False | True | False | int4dq | 7.762 | 19.743 | 122.657 |
| 2B | True | False | False | int4dq | 7.395 | 19.374 | 83.103 |
| 2B | True | True | False | int4dq | 7.762 | 19.741 | 82.642 |
| 2B | False | False | False | int4wo | 4.155 | 16.138 | 363.792 |
| 2B | False | True | False | int4wo | 4.345 | 16.328 | 361.839 |
| 2B | True | False | False | int4wo | 4.155 | 16.139 | 342.817 |
| 2B | True | True | False | int4wo | 4.354 | 16.339 | 341.48 |
| 2B | False | False | False | autoquant | 12.55 | 19.734 | 185.023 |
| 2B | False | True | False | autoquant | 13.194 | 20.319 | 177.602 |
| 2B | True | False | False | autoquant | 12.55 | 19.565 | 75.005 |
| 2B | True | True | False | autoquant | 13.195 | 20.191 | 74.807 |
| 2B | False | False | False | sparsify | 4.445 | 16.431 | 125.59 |
| 2B | False | True | False | sparsify | 4.652 | 16.635 | 121.357 |

H100

| model_type | compile | fuse_qkv | quantize_vae | quantization | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|
| 5B | False | True | False | fp16 | 21.978 | 33.988 | 113.945 |
| 5B | True | True | False | fp16 | 21.979 | 33.99 | 87.155 |
| 5B | False | True | False | bf16 | 21.979 | 33.988 | 112.398 |
| 5B | True | True | False | bf16 | 21.979 | 33.987 | 87.455 |
| 5B | False | True | False | fp8 | 11.374 | 23.383 | 113.167 |
| 5B | True | True | False | fp8 | 11.374 | 23.383 | 75.255 |
| 5B | False | True | False | int8wo | 11.414 | 23.422 | 123.144 |
| 5B | True | True | False | int8wo | 11.414 | 23.423 | 87.026 |
| 5B | True | True | False | int8dq | 11.412 | 59.355 | 78.945 |
| 5B | False | True | False | int4dq | 12.785 | 24.793 | 151.242 |
| 5B | True | True | False | int4dq | 12.785 | 24.795 | 87.403 |
| 5B | False | True | False | int4wo | 6.824 | 18.829 | 667.125 |

Through visual inspection of various outputs, we identified that the best results were achieved with int8 weight-only quantization, int8 dynamic quantization, fp8 (currently supported only on Hopper architecture), and autoquant. While the outputs sometimes differed visually from their standard fp16/bf16 counterparts, they maintained the expected quality. Additionally, we observed that int4 dynamic quantization generally produced satisfactory results in most cases, but showed greater deviation in structure, color, composition and motion.

With the newly added fp8dqrow scheme, the inference latency is 76.70 seconds for CogVideoX-5b (batch size: 1, steps: 50, frames: 49, resolution: 720x480) on an H100. fp8dqrow uses more scales per tensor (one per row), which lowers the quantization error. The quality, from visual inspection, is very close to fp16/bf16 and better than int8 in many cases.

TorchAO also supports arbitrary exponent and mantissa bits for floating-point types, which provides experimental freedom to find the best settings for your models. Here, we also share results with fp6_e3m2, fp5_e2m2 and fp4_e2m1. We find that fp6 and fp5 quantization can preserve good generation quality and match the expectation from fp16 precision most of the time. To achieve a balance between speed and quality, the recommended quantization dtypes for lower-VRAM GPUs are int8dq, fp8dqrow, fp6_e3m2 and autoquant, which, when compiled, are faster than or close in performance to their bf16 counterparts.
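These fpX dtypes can be applied with torchao's fpx_weight_only(ebits, mbits); for example, fp6_e3m2 corresponds to 3 exponent bits and 2 mantissa bits. A minimal sketch, assuming a recent torchao version:

import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, fpx_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
# fp6_e3m2: 3 exponent bits and 2 mantissa bits (plus the sign bit).
quantize_(transformer, fpx_weight_only(3, 2))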

Additional `fp8dqrow`, `fp6_e3m2`, `fp5_e2m2` and `fp4_e2m1` benchmarks

H100

| model_type | compile | fuse_qkv | quantize_vae | quantization | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|
| 5B | False | False | False | fp8dqrow | 10.28 | 22.291 | 122.99 |
| 5B | False | True | False | fp8dqrow | 11.389 | 23.399 | 118.205 |
| 5B | True | False | False | fp8dqrow | 10.282 | 22.292 | 76.777 |
| 5B | True | True | False | fp8dqrow | 11.391 | 23.4 | 76.705 |

A100

| model_type | compile | fuse_qkv | quantize_vae | quantization | model_memory (GB) | inference_memory (GB) | time (s) |
|---|---|---|---|---|---|---|---|
| 5B | False | False | False | fp6_e3m2 | 7.798 | 21.028 | 287.842 |
| 5B | True | False | False | fp6_e3m2 | 7.8 | 21.028 | 208.499 |
| 5B | False | True | False | fp6_e3m2 | 8.63 | 23.243 | 285.294 |
| 5B | True | True | False | fp6_e3m2 | 8.631 | 23.243 | 208.513 |
| 5B | False | False | False | fp5_e2m2 | 6.619 | 21.02 | 305.401 |
| 5B | True | False | False | fp5_e2m2 | 6.622 | 21.021 | 217.707 |
| 5B | False | True | False | fp5_e2m2 | 7.312 | 23.237 | 304.725 |
| 5B | True | True | False | fp5_e2m2 | 7.312 | 23.237 | 213.837 |
| 5B | False | False | False | fp4_e2m1 | 5.423 | 21.012 | 282.835 |
| 5B | True | False | False | fp4_e2m1 | 5.422 | 21.013 | 207.719 |
| 5B | False | True | False | fp4_e2m1 | 5.978 | 23.228 | 280.262 |
| 5B | True | True | False | fp4_e2m1 | 5.977 | 23.227 | 207.520 |

Note

From our testing, and from feedback from various folks who tried out torchao quantization after the CogVideoX release, we found that Ampere and newer architectures have the best support for the quantization dtypes. On older architectures such as Turing or Volta, quantizing the models either did not save memory or the inference errored out. It was particularly pointed out to be erroneous with the Apple mps backend. Support for other architectures will only get better with time.

CogVideoX memory savings

  • From the table, it can be seen that loading the standard bf16 model into memory requires about 19.7 GB, and running inference requires about 31.7 GB. To keep the quality on par, let's quantize using int8 weight-only. This requires about 10.3 GB to load the model into memory, and 22.2 GB to run inference:
Code
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

model_id = "THUDM/CogVideoX-5b"

text_encoder = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, int8_weight_only())

transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())

vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, int8_weight_only())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
).to("cuda")

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(prompt=prompt, num_inference_steps=1).frames[0]
export_to_video(video, "output.mp4", fps=8)
  • Let's enable CPU offloading for models as described in diffusers-specific optimizations. Initially, no models are loaded onto the GPU and everything resides on the CPU. It requires about 10.3 GB to keep all components on the CPU. However, the peak memory used during inference drops to 12.4 GB. Note that inference will be slightly slower due to the time required to move the different modeling components between the CPU and GPU.
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
- ).to("cuda")
+ )

+ pipe.enable_model_cpu_offload()
On top of CPU offloading, the VAE can additionally encode/decode in tiles to reduce peak memory further:

pipe = ...
pipe.enable_model_cpu_offload()

+ pipe.vae.enable_tiling()
  • Instead of pipe.enable_model_cpu_offload(), one can use pipe.enable_sequential_cpu_offload(), which brings memory usage down to 4.8 GB without quantization and 3.1 GB with quantization. Note that sequential CPU offloading trades much longer inference times for these memory savings. You are required to install accelerate from source until the next release for this to work without any errors.
pipe = ...
- pipe.enable_model_cpu_offload()
+ pipe.enable_sequential_cpu_offload()

+ pipe.vae.enable_tiling()

Note

We use torch.cuda.max_memory_allocated() to report the peak memory values.
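For instance, a minimal way to reproduce such a measurement around a single inference call, reusing the pipe and prompt from the snippet above:

import torch

torch.cuda.reset_peak_memory_stats()
video = pipe(prompt=prompt, num_inference_steps=50).frames[0]
# Peak memory allocated by tensors during the call, reported in GB.
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")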

Diffusers-specific optimizations

For supported architectures, memory requirements could further be brought down using Diffusers-supported functionality:

  • pipe.enable_model_cpu_offload(): Only keeps the active Diffusers-used models (text encoder, transformer/unet, vae) on device
  • pipe.enable_sequential_cpu_offload(): Similar to above, but performs cpu offloading more aggressively by only keeping active torch modules on device
  • pipe.vae.enable_tiling(): Enables tiled encoding/decoding by breaking up the latents into smaller tiles and performing the respective operation on each tile
  • pipe.vae.enable_slicing(): Helps keep memory usage constant when generating more than one image/video at a time

Autoquant and autotuning

Given so many options around quantization, which one should you choose for your model? Enter "autoquant". It tries to quickly and accurately quantize your model. By the end of the process, it creates a "quantization plan", which can be accessed through AUTOQUANT_CACHE and reused.

So, after performing quantization with autoquant and benchmarking, we would essentially do:

from torchao.quantization.autoquant import AUTOQUANT_CACHE
import pickle 

with open("quantization-cache.pkl", "wb") as f:
    pickle.dump(AUTOQUANT_CACHE, f)

And then, to reuse the plan in our final codebase, we would do:

from torchao.quantization.autoquant import AUTOQUANT_CACHE
with open("quantization-cache.pkl", "rb") as f:
    AUTOQUANT_CACHE.update(pickle.load(f))

Learn more about "autoquant" here.

Another useful (but time-consuming) feature of torchao is "autotuning". It tunes the int_scaled_matmul kernel for int8 dynamic + int8 weight quantization for the shape at runtime (given the shape of tensor passed to int_scaled_matmul op). Through this process, it tries to identify the most efficient kernel configurations for a given model and inputs.

To launch quantization benchmarking with autotuning, we need to set the TORCHAO_AUTOTUNER_ENABLE environment variable. So, essentially: TORCHAO_AUTOTUNER_ENABLE=1 TORCHAO_AUTOTUNER_DATA_PATH=my_data.pkl python my_script.py. And when it's done, we can simply reuse the configs it found by doing: TORCHAO_AUTOTUNER_DATA_PATH=my_data.pkl python my_script.py.

If you're using autotuning, keep in mind that it only works for intX quantization for now, and it is quite time-consuming.

Note

Autoquant and autotuning are two different features.

Reducing quantization time and peak memory

If we keep the model on the CPU and quantize it, the process takes a long time but keeps peak memory to a minimum; quantizing on the GPU is fast but uses much more memory. How about we get both, i.e., quantize fast while keeping peak memory to a bare minimum?

It is possible to pass a device argument to the quantize_() method of torchao. It basically moves the model to CUDA and quantizes each parameter individually:

quantize_(model, int8_weight_only(), device="cuda")

Here's a comparison:

Quantize on CPU:
  - Time taken: 10.48 s
  - Peak memory: 6.99 GiB
Quantize on CUDA:
  - Time taken: 1.96 s
  - Peak memory: 14.50 GiB
Move to CUDA and quantize each param individually:
  - Time taken: 1.94 s
  - Peak memory: 8.29 GiB

Check out this pull request for more details.

Training with FP8

Check out the training directory.

Serialization and loading quantized models

Check out our serialization and loading guide here.

Things to keep in mind when benchmarking

In this section, we provide a non-exhaustive overview of the things we learned during the benchmarking process.

  • Expected gains and their ceiling depend on the hardware being used. For example, the compute density of the operations dispatched to the GPU has an effect on the speedup. For the same code, you may see better numbers on an A100 than on an H100, simply because the operations weren't compute-dense enough for the H100. In these situations, bigger batch sizes might make the effect of using a better GPU like the H100 more pronounced.

  • Shapes matter. Not all models are created equal. Certain shapes are friendlier to quantization than others. Usually, bigger shapes benefit from quantization, resulting in speedups. The thinner the dimensions, the less pronounced the effects of quantization, especially for precisions like int8. In our case, using quantization on smaller models like PixArt-Sigma wasn't particularly beneficial. This is why torchao provides an "autoquant" option that filters out smaller layers and excludes them from quantization.

  • Small matmuls. If the underlying matmuls are small enough, or the performance without quantization isn't bottlenecked by weight load time, these techniques may reduce performance.

  • Cache compilation results. torch.compile() can take a long time, just like any other deep-learning compiler, so it is always recommended to cache the compilation results. Refer to the official guide to know more. Additionally, we can configure the ENABLE_AOT_AUTOGRAD_CACHE flag for faster compilation times (see the sketch after this list).

  • Compilation is a time-consuming process. The first time we compile, it takes a lot of time because a lot of things are getting figured out under the hood (best kernel configs, fusion strategies, etc.). The subsequent runs will be significantly faster, though. Also, for the benchmarking scripts provided in inference/, we run a couple of warmup runs to reduce the variance in our numbers as much as possible. So, if you are running the benchmarks, do expect them to take a while.
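As a concrete example of caching, with a sufficiently recent PyTorch the inductor FX graph cache can be enabled from Python or through environment variables; treat the exact flag names as version-dependent and consult the official caching guide:

import torch._inductor.config as inductor_config

# Persist and reuse compiled graphs across process restarts.
inductor_config.fx_graph_cache = True

# Roughly equivalent environment-variable form (names may vary across PyTorch releases):
#   TORCHINDUCTOR_FX_GRAPH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=/path/to/cache python my_script.py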

Benefitting from torch.compile()

In this section, we provide a rundown of the scenarios that may prevent your model from optimally benefiting from torch.compile(). This is very specific to torch.compile() and the FluxPipeline.

  • Ensure there are no graph-breaks when torch.compile() is applied on the model. Briefly, graph-breaks introduce unnecessary overheads and prevent torch.compile() from obtaining a full and dense graph of your model. In the case of Flux, we identified that they came from the position embeddings, which was fixed in the following PRs: #9307 and #9321. Thanks to Yiyi.

  • Use torch.profiler.profile() to get a kernel trace and identify whether there are any graph breaks. You could use a script like this (a minimal profiling sketch also follows this list). This will give you a JSON file which you can upload to https://ui.perfetto.dev/ to view the trace. Additionally, use this guide to validate the memory wins when using torchao for quantization and combining it with torch.compile().

  • Finally, this torch.compile() manual is a gem of a read to get an idea of how to approach the profiling process.
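As a starting point, a minimal profiling harness around a pipeline call might look like this; the resulting trace JSON can then be uploaded to https://ui.perfetto.dev/:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    image = pipeline(
        "a dog surfing on moon", guidance_scale=3.5, num_inference_steps=50
    ).images[0]

# Export a Chrome/Perfetto-compatible trace for inspection.
prof.export_chrome_trace("flux_trace.json")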

Acknowledgement

We acknowledge the generous help and guidance provided by the PyTorch team throughout the development of this project:

  • Christian Puhrsch for guidance on removing graph-breaks and general torch.compile() stuff
  • Jerry Zhang for different torchao stuff (microbenchmarks, serialization, misc discussions)
  • Driss Guessous for all things FP8
  • Jesse Cai for help on int8_dynamic_activation_int8_weight(layout=SemiSparseLayout())
  • Mark Saroufim for reviews, discussions, and navigation