Pipeline: How All the Components Work Together

June 3, 2026 · View on GitHub

This document explains the end-to-end Ideogram 4 inference pipeline conceptually. For the architecture spec and code pointers, see model_architecture.md.

Overview

Ideogram 4 is a flow-matching text-to-image model built on a single-stream DiT (Diffusion Transformer). The pipeline has four main components:

 ┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
 │  Qwen3-VL   │   │  Ideogram4          │   │  KL VAE      │   │           │
 │  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
 │  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
 └─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
     frozen              trainable                 frozen

1. Text Encoder — Qwen3-VL-8B-Instruct

The text encoder is a frozen Qwen3-VL-8B-Instruct vision-language model, used in text-only mode (no vision inputs).

What it does:

  • Tokenizes the prompt using the Qwen3 chat template.
  • Runs a forward pass through the 36-layer transformer.
  • Extracts hidden states from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35.
  • Concatenates these hidden states along the feature dimension, producing a multi-scale text representation.

Why multi-layer extraction? Different layers capture different levels of abstraction — early layers encode surface-level token information, while later layers encode deeper semantic meaning. Concatenating them gives the DiT access to the full spectrum.

Output: A tensor of shape (batch, num_text_tokens, hidden_dim * 13).

2. DiT Backbone — Ideogram4Transformer

The core generative model is a 34-layer single-stream Diffusion Transformer.

Sequence layout

Text tokens and image latent tokens are concatenated into one sequence and processed through the same self-attention layers.

``$ \text{Sequence} \text{layout} (\text{per} \text{sample}):

┌───────────────────┬────────────────────────┐ │ \text{text} \text{tokens} │ \text{image} \text{latent} \text{tokens} │ │ (\text{up} \text{to} 2048) │ (\text{grid_h} \times \text{grid_w}) │ └───────────────────┴────────────────────────┘ ▲ ▲ \text{Qwen3}-\text{VL} \text{features} \text{noisy} \text{latents} \text{z_t} $``

Key components per block

  • Self-attention with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The positional encoding is 3-dimensional: for text tokens it uses a 1D position broadcast to 3 axes; for image tokens it uses (temporal, height, width) coordinates. This lets text and image tokens coexist in a unified positional space.
  • SwiGLU MLP — the feed-forward layer uses a gated linear unit with SiLU activation.
  • Adaptive Layer Norm (AdaLN) — the timestep t is embedded as a scalar and generates per-block scale and gate parameters. This conditions every layer on the current noise level.

Flow matching

The model is trained with a flow-matching objective. Instead of predicting noise (as in DDPM), the model predicts a velocity field v(z_t, t) that defines the ODE:

dz/dt = v(z_t, t)

At inference time, we start from pure Gaussian noise z_1 and integrate backward to z_0 (the clean image) using the Euler method:

z_{t-dt} = z_t + v(z_t, t) * dt

Noise schedule

The timestep distribution follows a logit-normal schedule parameterized by (mu, sigma). The mean mu controls how much time the sampler spends at different noise levels — higher mu shifts more steps toward higher noise (important for high-resolution images). The schedule auto-adjusts for resolution:

mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)

where base_pixels = 512 * 512.

3. Classifier-Free Guidance (CFG)

At each sampling step, two forward passes are run through the DiT:

  1. Conditional (positive): full text features + noisy image latents.
  2. Unconditional (negative): zeroed text features + noisy image latents (image-only tokens, asymmetric CFG).

The guided velocity is a weighted combination:

v_guided = gw * v_conditional + (1 - gw) * v_unconditional

where gw is the per-step guidance weight. With gw > 1, the model amplifies the text-conditional signal and suppresses the unconditional prediction, producing images that follow the prompt more faithfully.

Asymmetric CFG: The unconditional branch only processes image tokens (no text padding), making it computationally cheaper than a full-sequence negative pass.

Per-step schedules: The guidance weight can vary across steps. The V4_QUALITY_48 preset, for example, uses gw=7 for the first 45 steps and gw=3 for the final 3 "polish" steps near t=0.

4. VAE Decoder — KL Autoencoder

The denoised latent z_0 is decoded to pixel space using a frozen KL autoencoder.

What it does:

  • Unpatching: The DiT works with 2×2 patches of latent pixels. The decoder input is reshaped from (batch, grid_h * grid_w, channels * 4) to (batch, channels, grid_h * 2, grid_w * 2).
  • Denormalization: Per-channel shift and scale are applied to undo the latent normalization used during training.
  • Decoding: The VAE decoder maps latents to RGB pixels.
  • Clipping: Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.

Compression factor: The autoencoder provides 8× spatial compression on each axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image is represented as a 64×64 grid of latent tokens, each with 128 channels (32 base channels × 2² patch).

Putting it all together

# Pseudocode for one generation call:

# 1. Encode text
text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)

# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1

# 3. Euler integration from t=1 to t=0
for step in reversed(range(num_steps)):
    t = schedule(step)
    s = schedule(step - 1)

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)

    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(zeros, z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step
    z = z + v * (s - t)

# 4. Decode to pixels
image = vae.decode(z)