Show-o

July 2, 2026

Show-o is the original unified multimodal model from Show Lab. It combines autoregressive and discrete diffusion modeling in one network for multimodal understanding and generation.

Architecture

Show-o uses a single transformer that processes text tokens autoregressively with causal attention and image tokens via discrete denoising diffusion with full attention. Key components:

  • LLM base: Phi-1.5
  • Visual tokenizer: MagVITv2 (discrete)
  • Image generation: Discrete denoising diffusion
  • Max resolution: 512x512
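The mixed attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch, not the Show-o implementation: the function name `build_attention_mask` and the `"text"`/`"image"` labels are hypothetical, and it assumes that image tokens see every token in their own contiguous image block plus everything before it, while text tokens see only earlier positions.

```python
from typing import List, Optional


def build_attention_mask(modalities: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption for this sketch: text positions attend causally, while
    positions inside one contiguous run of image tokens also see each other.
    """
    n = len(modalities)

    # Give each contiguous run of image tokens its own block id; text gets None.
    block_ids: List[Optional[int]] = []
    current = -1
    for idx, kind in enumerate(modalities):
        if kind == "image":
            if idx == 0 or modalities[idx - 1] != "image":
                current += 1
            block_ids.append(current)
        else:
            block_ids.append(None)

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_image_block = (
                block_ids[i] is not None and block_ids[i] == block_ids[j]
            )
            mask[i][j] = causal or same_image_block
    return mask


sequence = ["text", "text", "image", "image", "image", "text"]
mask = build_attention_mask(sequence)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the two leading text rows form the usual lower triangle, while the three image rows share one identical pattern because each image token can see the whole image block.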

Dependencies

The model environment is managed via the show_o image defined in modal/images.py (Python 3.10, PyTorch 2.2.1, CUDA 12.1, xformers). For local setup, install the dependencies listed in model/Show-o/requirements.txt.

Relationship to Show-o2

Show-o and Show-o2 share the same backbone adapter (ShowOBackbone), which branches on the model version. The version is auto-detected from the model's config.json, or it can be set explicitly with version: 1 in the backbone config.
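The precedence between an explicit version and auto-detection might look like the sketch below. Everything here is an assumption made for illustration: `resolve_version` and `detect_version_from_config` are hypothetical names, and the detection heuristic (looking for "qwen" anywhere in the parsed config.json) is a stand-in, since the actual detection logic in ShowOBackbone is not described here.

```python
import json
from typing import Any, Dict


def detect_version_from_config(config_json: Dict[str, Any]) -> int:
    """Hypothetical heuristic: infer the version from the LLM base it mentions."""
    text = json.dumps(config_json).lower()
    if "qwen" in text:
        return 2  # Show-o2 builds on Qwen2.5
    return 1  # otherwise assume Show-o v1 (Phi-1.5 based)


def resolve_version(backbone_cfg: Dict[str, Any], config_json: Dict[str, Any]) -> int:
    """Prefer an explicit 'version' key; fall back to auto-detection."""
    explicit = backbone_cfg.get("version")
    if explicit is not None:
        if explicit not in (1, 2):
            raise ValueError(f"unsupported version: {explicit!r}")
        return int(explicit)
    return detect_version_from_config(config_json)


# An explicit setting wins even if config.json would suggest otherwise.
print(resolve_version({"version": 1}, {"llm_model_path": "Qwen/Qwen2.5-7B"}))
# With no explicit setting, the (assumed) heuristic decides.
print(resolve_version({}, {"llm_model_path": "microsoft/phi-1_5"}))
```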

Key differences from Show-o2:

| Aspect | Show-o (v1) | Show-o2 (v2) |
| --- | --- | --- |
| LLM base | Phi-1.5 | Qwen2.5 |
| Visual tokenizer | MagVITv2 (discrete) | Wan2.1 3D Causal VAE |
| Generation | Discrete diffusion | Flow matching |
| Video support | No | Yes |
| Text-only understanding | No | Yes |

Inference

CLI

# Generation
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_show_o_generation.yaml

Python API

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(backbone_name="show_o", backbone_cfg={
    "model_path": "/path/to/show-o-w-clip-vit-512x512",
    "show_o_root": "/path/to/model/Show-o",
    "vq_model_path": "/path/to/magvitv2",
    "version": 1,
    "seed": 42,
})

# Generation
result = pipeline.run(InferenceRequest(
    backbone="show_o", task="generation",
    prompt="A cat sitting on a rainbow",
))

# Understanding (requires image input)
result = pipeline.run(InferenceRequest(
    backbone="show_o", task="understanding",
    prompt="Describe this image",
    images=["path/to/image.jpg"],
))
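Before constructing the pipeline, it can help to sanity-check the backbone_cfg dictionary. The helper below is a sketch only: `check_backbone_cfg` is a hypothetical function that is not part of umm, and treating the three path keys as required is an assumption based on the example above.

```python
from typing import Any, Dict, List

# Assumed to be required, based on the keys used in the example config.
REQUIRED_KEYS = ("model_path", "vq_model_path", "show_o_root")


def check_backbone_cfg(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    for key in REQUIRED_KEYS:
        value = cfg.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty '{key}'")
    # 'version' may be omitted (auto-detected), but for Show-o v1 it should be 1.
    version = cfg.get("version")
    if version is not None and version != 1:
        problems.append(f"'version' should be 1 for Show-o v1, got {version!r}")
    return problems


cfg = {
    "model_path": "/path/to/show-o-w-clip-vit-512x512",
    "show_o_root": "/path/to/model/Show-o",
    "vq_model_path": "/path/to/magvitv2",
    "version": 1,
    "seed": 42,
}
print(check_backbone_cfg(cfg))
print(check_backbone_cfg({"model_path": "", "version": 2}))
```

The first call reports no problems; the second flags all three path keys plus the version mismatch.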

Supported Benchmarks

| Benchmark | Config |
| --- | --- |
| DPG Bench | configs/eval/dpg_bench/modal_dpg_bench_show_o.yaml |
| GenEval | configs/eval/geneval/modal_geneval_show_o.yaml |
| WISE | configs/eval/wise/modal_wise_show_o.yaml |
| UEval | configs/eval/ueval/modal_ueval_show_o.yaml |
| Uni-MMMU | configs/eval/uni_mmmu/modal_uni_mmmu_show_o.yaml |
| MME | configs/eval/mme/modal_mme_show_o.yaml |
| MMMU | configs/eval/mmmu/modal_mmmu_show_o.yaml |
| MMBench | configs/eval/mmbench/modal_mmbench_show_o.yaml |
| MMStar | configs/eval/mmstar/mmstar_show_o.yaml |
| MM-Vet | configs/eval/mmvet/modal_mmvet_show_o.yaml |
| MathVista | configs/eval/mathvista/modal_mathvista_show_o.yaml |

Key Configuration Parameters

  • model_path: Path to Show-o model weights (e.g., showlab/show-o-w-clip-vit-512x512)
  • vq_model_path: Path to MagVITv2 discrete tokenizer (e.g., showlab/magvitv2)
  • show_o_root: Path to the Show-o repository root
  • version: Set to 1 for Show-o v1 (auto-detected if omitted)

# Download model weights
modal run modal/download.py --model show_o

# Run GenEval
modal run modal/run.py --model show_o --eval-config modal_geneval_show_o

# Run MME
modal run modal/run.py --model show_o --eval-config modal_mme_show_o
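To run several benchmarks in a row, the commands above can be generated from the config paths. This sketch assumes that --eval-config takes the YAML file name without its extension, which matches the GenEval and MME commands shown but is not confirmed for every benchmark. `eval_command` is a hypothetical helper, and nothing is executed here; the commands are only printed.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def eval_command(config_path: str, model: str = "show_o") -> List[str]:
    """Build the argv for one Modal eval run from a config file path."""
    # Assumption: the --eval-config value is the YAML file stem.
    config_name = PurePosixPath(config_path).stem
    return [
        "modal", "run", "modal/run.py",
        "--model", model,
        "--eval-config", config_name,
    ]


configs = [
    "configs/eval/geneval/modal_geneval_show_o.yaml",
    "configs/eval/mme/modal_mme_show_o.yaml",
]
for path in configs:
    # Print the shell-quoted command; pass the list to subprocess.run to execute.
    print(shlex.join(eval_command(path)))
```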