Model Architecture

June 3, 2026

prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │   
            ▼
    ┌──────────────────────────────────────────────────┐
    │  Ideogram4Transformer                            │
    │  • 34 × Ideogram4TransformerBlock                │
    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)    │
    │      – Ideogram4MLP (SwiGLU)                     │
    │      – AdaLN scale/gate from t-embedding         │
    │  • Ideogram4FinalLayer                           │
    └──────────────────────────────────────────────────┘
            │  velocity prediction
            ▼
    Euler flow-matching sampler with asymmetric CFG
            │  denoised image latents
            ▼
    VAE decode
            │
            ▼
            PIL.Image
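
The sampling stage of the pipeline above can be sketched as a plain Euler integration of the predicted velocity. This is a minimal illustration, not the repository's implementation. It assumes the common rectified-flow convention (t = 1 is pure noise, t = 0 is data), a uniform time grid instead of the logit-normal schedule, and standard classifier-free guidance, because the asymmetric CFG variant is not defined here. `velocity_fn`, `cfg_velocity`, and `euler_sample` are hypothetical names.

```python
import numpy as np

def cfg_velocity(velocity_fn, x, t, cond, uncond, guidance_scale):
    """Standard classifier-free guidance applied to the predicted velocity."""
    v_cond = velocity_fn(x, t, cond)
    v_uncond = velocity_fn(x, t, uncond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, noise, cond, uncond, num_steps=28, guidance_scale=1.0):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)  # uniform grid: an assumption
    x = noise.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = cfg_velocity(velocity_fn, x, t_cur, cond, uncond, guidance_scale)
        x = x + (t_next - t_cur) * v  # t_next < t_cur, so this moves toward data
    return x

# Toy check: for the straight path x_t = (1 - t) * x0 + t * noise the true
# velocity is (noise - x0), so Euler integration should recover x0.
target = np.full((4, 128), 0.5)
noise = np.random.default_rng(0).standard_normal((4, 128))
toy_velocity = lambda x, t, c: noise - target  # ignores conditioning
out = euler_sample(toy_velocity, noise, cond=None, uncond=None,
                   num_steps=8, guidance_scale=3.0)
```

In the real pipeline `velocity_fn` would be the transformer's forward pass, and the result would go to the VAE decoder rather than being compared against a known target.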

The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states taken from the selected layers) and image latent tokens are concatenated into one sequence. Each block is modulated by AdaLN scale and gate values computed from the flow-matching timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE, so text and image tokens share a unified positional space.
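
To make the single-stream idea concrete, here is a small NumPy sketch of one modulated residual sub-layer acting on the concatenated sequence. It is an illustration under several assumptions: text tokens come first and are already projected to the model width, the pre-modulation norm is an RMS norm without learned weights, and the modulation takes the form `x + gate * f(norm(x) * (1 + scale))`. All function and variable names are hypothetical, and the sizes are toy values rather than the real model dimensions.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS-normalize along the feature axis (no learned weight)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def modulated_block(tokens, t_emb, w_scale, w_gate, sublayer):
    """One residual sub-layer with AdaLN scale/gate from the timestep embedding."""
    scale = t_emb @ w_scale  # shape (emb_dim,)
    gate = t_emb @ w_gate    # shape (emb_dim,)
    h = rms_norm(tokens) * (1.0 + scale)
    return tokens + gate * sublayer(h)

rng = np.random.default_rng(0)
emb_dim, adaln_dim = 64, 16  # toy sizes for illustration only

text_tokens = rng.standard_normal((5, emb_dim))    # stand-in for projected text states
image_tokens = rng.standard_normal((12, emb_dim))  # stand-in for patchified latents
tokens = np.concatenate([text_tokens, image_tokens], axis=0)  # one joint sequence

t_emb = rng.standard_normal(adaln_dim)
w_scale = rng.standard_normal((adaln_dim, emb_dim)) * 0.02
w_gate = np.zeros((adaln_dim, emb_dim))  # zero gate: the block acts as the identity

out = modulated_block(tokens, t_emb, w_scale, w_gate, sublayer=lambda h: h)
```

Because both modalities live in one sequence, the same attention and MLP weights process text and image tokens, and the timestep reaches every token through the shared scale and gate vectors.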

Model spec:

| Field | Value |
| --- | --- |
| `emb_dim` | 4608 |
| `num_layers` | 34 |
| `num_heads` | 18 |
| `intermediate` | 12288 |
| `adanln_dim` | 512 |
| `rope_theta` | 5_000_000 |
| `mrope_section` | (24, 20, 20) |
| latent channels | 32 × 2² = 128 |
| max text tokens | 2048 |
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
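
As a quick sanity check on the numbers above, the spec can be written as a small dataclass with two derived quantities. This is a hypothetical container, not a class from the repository. `Ideogram4Spec`, `vae_channels`, `patch_size`, `head_dim`, and `latent_token_dim` are illustrative names. It assumes the width is split evenly across heads, and that "32 × 2²" means 32 latent channels folded over a 2 × 2 patch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ideogram4Spec:  # hypothetical container for the table values
    emb_dim: int = 4608
    num_layers: int = 34
    num_heads: int = 18
    intermediate: int = 12288
    adanln_dim: int = 512
    rope_theta: float = 5_000_000.0
    mrope_section: tuple = (24, 20, 20)
    vae_channels: int = 32       # assumed reading of "32 × 2²"
    patch_size: int = 2          # assumed reading of "32 × 2²"
    max_text_tokens: int = 2048

    @property
    def head_dim(self) -> int:
        # Assumes emb_dim is divided evenly across attention heads.
        assert self.emb_dim % self.num_heads == 0
        return self.emb_dim // self.num_heads

    @property
    def latent_token_dim(self) -> int:
        # Channels per image token after folding a patch_size x patch_size patch.
        return self.vae_channels * self.patch_size ** 2

spec = Ideogram4Spec()
print(spec.head_dim, spec.latent_token_dim)
```

Under those assumptions each attention head is 4608 / 18 = 256 wide, and each image token carries 128 latent values, matching the "latent channels" row.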