Pipeline: How All the Components Work Together
June 3, 2026 · View on GitHub
This document explains the end-to-end Ideogram 4 inference pipeline conceptually. For the architecture spec and code pointers, see model_architecture.md.
Overview
Ideogram 4 is a flow-matching text-to-image model built on a single-stream DiT (Diffusion Transformer). The pipeline has four main components:
┌─────────────┐ ┌──────────────────────┐ ┌──────────────┐ ┌───────────┐
│ Qwen3-VL │ │ Ideogram4 │ │ KL VAE │ │ │
│ Text ├──►│ Transformer (DiT) ├──►│ VAE ├──►│ Image │
│ Encoder │ │ + Euler Sampler │ │ Decoder │ │ │
└─────────────┘ └──────────────────────┘ └──────────────┘ └───────────┘
frozen trainable frozen
1. Text Encoder — Qwen3-VL-8B-Instruct
The text encoder is a frozen Qwen3-VL-8B-Instruct vision-language model, used in text-only mode (no vision inputs).
What it does:
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- Extracts hidden states from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a multi-scale text representation.
Why multi-layer extraction? Different layers capture different levels of abstraction — early layers encode surface-level token information, while later layers encode deeper semantic meaning. Concatenating them gives the DiT access to the full spectrum.
Output: A tensor of shape (batch, num_text_tokens, hidden_dim * 13).
2. DiT Backbone — Ideogram4Transformer
The core generative model is a 34-layer single-stream Diffusion Transformer.
Sequence layout
Text tokens and image latent tokens are concatenated into one sequence and processed through the same self-attention layers.
``$ \text{Sequence} \text{layout} (\text{per} \text{sample}):
┌───────────────────┬────────────────────────┐ │ \text{text} \text{tokens} │ \text{image} \text{latent} \text{tokens} │ │ (\text{up} \text{to} 2048) │ (\text{grid_h} \times \text{grid_w}) │ └───────────────────┴────────────────────────┘ ▲ ▲ \text{Qwen3}-\text{VL} \text{features} \text{noisy} \text{latents} \text{z_t} $``
Key components per block
- Self-attention with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The positional encoding is 3-dimensional: for text tokens it uses a 1D position broadcast to 3 axes; for image tokens it uses (temporal, height, width) coordinates. This lets text and image tokens coexist in a unified positional space.
- SwiGLU MLP — the feed-forward layer uses a gated linear unit with SiLU activation.
- Adaptive Layer Norm (AdaLN) — the timestep
tis embedded as a scalar and generates per-block scale and gate parameters. This conditions every layer on the current noise level.
Flow matching
The model is trained with a flow-matching objective. Instead of predicting
noise (as in DDPM), the model predicts a velocity field v(z_t, t) that
defines the ODE:
dz/dt = v(z_t, t)
At inference time, we start from pure Gaussian noise z_1 and integrate
backward to z_0 (the clean image) using the Euler method:
z_{t-dt} = z_t + v(z_t, t) * dt
Noise schedule
The timestep distribution follows a logit-normal schedule parameterized by
(mu, sigma). The mean mu controls how much time the sampler spends at
different noise levels — higher mu shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
where base_pixels = 512 * 512.
3. Classifier-Free Guidance (CFG)
At each sampling step, two forward passes are run through the DiT:
- Conditional (positive): full text features + noisy image latents.
- Unconditional (negative): zeroed text features + noisy image latents (image-only tokens, asymmetric CFG).
The guided velocity is a weighted combination:
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
where gw is the per-step guidance weight. With
gw > 1, the model amplifies the text-conditional signal and suppresses the
unconditional prediction, producing images that follow the prompt more
faithfully.
Asymmetric CFG: The unconditional branch only processes image tokens (no text padding), making it computationally cheaper than a full-sequence negative pass.
Per-step schedules: The guidance weight can vary across steps. The
V4_QUALITY_48 preset, for example, uses gw=7 for the first 45 steps and
gw=3 for the final 3 "polish" steps near t=0.
4. VAE Decoder — KL Autoencoder
The denoised latent z_0 is decoded to pixel space using a frozen KL
autoencoder.
What it does:
- Unpatching: The DiT works with 2×2 patches of latent pixels. The decoder
input is reshaped from
(batch, grid_h * grid_w, channels * 4)to(batch, channels, grid_h * 2, grid_w * 2). - Denormalization: Per-channel shift and scale are applied to undo the latent normalization used during training.
- Decoding: The VAE decoder maps latents to RGB pixels.
- Clipping: Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
Compression factor: The autoencoder provides 8× spatial compression on each axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image is represented as a 64×64 grid of latent tokens, each with 128 channels (32 base channels × 2² patch).
Putting it all together
# Pseudocode for one generation call:
# 1. Encode text
text_features = qwen3_vl.encode(prompt) # (B, L_text, D)
# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128) # pure noise at t=1
# 3. Euler integration from t=1 to t=0
for step in reversed(range(num_steps)):
t = schedule(step)
s = schedule(step - 1)
# Conditional pass (text + image)
v_cond = dit(text_features, z, t)
# Unconditional pass (image only, zeroed text)
v_uncond = dit(zeros, z, t)
# CFG combination
v = gw[step] * v_cond + (1 - gw[step]) * v_uncond
# Euler step
z = z + v * (s - t)
# 4. Decode to pixels
image = vae.decode(z)