MMaDA

April 3, 2026 ยท View on GitHub

Masked Diffusion Adaptation (MMaDA) is an 8B multimodal foundation model from Gen-Verse that unifies text generation, image generation, and image understanding through a masked diffusion framework.

Architecture

Unlike autoregressive models, MMaDA uses discrete masked diffusion for all modalities --- both text and image tokens are generated via iterative demasking. Key components:

  • LLM base: LLaDA (8B), a masked diffusion language model built on Llama3 architecture
  • Visual tokenizer: MagVITv2 (discrete, codebook size 8192, produces 1024 tokens per 512x512 image)
  • Image generation: MaskGIT-style parallel decoding with classifier-free guidance
  • Image understanding: VQ-encoded image tokens concatenated with text, demasked to produce text response
  • Max resolution: 512x512

Model Variants

VariantHuggingFace IDDescription
MMaDA-8B-BaseGen-Verse/MMaDA-8B-BaseBase model
MMaDA-8B-MixCoTGen-Verse/MMaDA-8B-MixCoTChain-of-thought reasoning variant

Both variants share the same architecture and loading procedure. Switch between them via model_path in config.

Dependencies

The model environment is managed via the mmada image defined in modal/images.py (Python 3.10, PyTorch 2.5.1, CUDA 12.4, transformers 4.46.0). For local setup, install the dependencies listed in model/MMaDA/requirements.txt.

MMaDA benefits from Flash Attention for faster inference. The Modal image includes flash-attn 2.7.4 (pre-compiled wheel). For local installation:

# Check your environment first
python -c "import torch; print(torch.__version__)"  # e.g., 2.5.1
nvcc -V                                               # e.g., CUDA 12.x

# Download matching wheel from https://github.com/Dao-AILab/flash-attention/releases
pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Inference

Python API

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(backbone_name="mmada", backbone_cfg={
    "model_path": "/path/to/MMaDA-8B-Base",
    "mmada_root": "/path/to/model/MMaDA",
    "vq_model_path": "/path/to/magvitv2",
    "seed": 42,
})

# Text-to-image generation
result = pipeline.run(InferenceRequest(
    backbone="mmada", task="generation",
    prompt="A cat sitting on a rainbow",
    params={"guidance_scale": 1.5, "timesteps": 12},
))

# Image understanding
result = pipeline.run(InferenceRequest(
    backbone="mmada", task="understanding",
    prompt="Describe this image in detail.",
    images=["path/to/image.jpg"],
    params={"max_new_tokens": 512},
))
# Download model weights
modal run modal/download.py --model mmada

# Run generation evaluation
modal run modal/run.py --model mmada --eval-config modal_geneval_mmada

# Run understanding evaluation
modal run modal/run.py --model mmada --eval-config modal_ueval_mmada

Supported Benchmarks

Generation

BenchmarkConfig
GenEvalconfigs/eval/geneval/modal_geneval_mmada.yaml
WISEconfigs/eval/wise/modal_wise_mmada.yaml
DPG Benchconfigs/eval/dpg_bench/modal_dpg_bench_mmada.yaml

Understanding

BenchmarkConfig
UEvalconfigs/eval/ueval/modal_ueval_mmada.yaml
MMBenchconfigs/eval/mmbench/modal_mmbench_mmada.yaml
MMEconfigs/eval/mme/modal_mme_mmada.yaml
MMMUconfigs/eval/mmmu/modal_mmmu_mmada.yaml
MM-Vetconfigs/eval/mmvet/modal_mmvet_mmada.yaml
MathVistaconfigs/eval/mathvista/modal_mathvista_mmada.yaml

Configuration Reference

Generation Parameters

ParameterDefaultDescription
guidance_scale1.5Classifier-free guidance scale (0 = no guidance)
temperature1.0Sampling temperature for Gumbel noise
timesteps12Number of MaskGIT decoding steps
num_vq_tokens1024Number of VQ tokens per image
codebook_size8192VQ codebook size
mask_schedulecosineMask schedule (cosine, linear, sigmoid)

Understanding Parameters

ParameterDefaultDescription
max_new_tokens512Maximum tokens to generate
steps256Demasking steps (typically max_new_tokens / 2)
block_length128Block length for semi-autoregressive generation
temperature0.0Sampling temperature (0 = greedy)
remaskinglow_confidenceRemasking strategy (low_confidence, random)