Text-to-image generation
April 14, 2026
TorchUMM: Unified Multimodal Model Toolkit
A unified framework for unified multimodal model inference, evaluation, and post-training.
Paper | Post-Training Weights
Table of Contents
- Table of Contents
- Introduction
- Supported Models
- Repository Structure
- Installation
- Data Preparation
- Usage
- Evaluation Results
- Extending TorchUMM
- Post-Training Methods
- Disclaimers
- Citation
Introduction
TorchUMM is a unified toolkit for running, evaluating, and fine-tuning state-of-the-art multimodal models under a single interface. It is designed to make fair, reproducible comparisons across diverse multimodal architectures easy.
Key features:
- Pluggable backbone architecture: 14 multimodal model adapters with a unified inference interface
- Comprehensive evaluation: 10+ benchmarks covering generation, understanding, and editing
- Post-training support: SFT, IRG, recA, UniCot, UniGame
- Cloud-native: seamless scaling to cloud GPUs via Modal (details in modal/README.md)
- Config-driven: all behavior is controlled through YAML configs; no code changes are needed to switch models or benchmarks
Supported Models
| Model | Parameters | Understand | Generate | Edit | Docs |
|---|---|---|---|---|---|
| Bagel | 7B | โ | โ | โ | guide |
| DeepGen | 5B | โ | โ | โ | guide |
| OmniGen2 | 7B | โ | โ | โ | guide |
| Emu3 | 8B | โ | โ | โ | guide |
| Emu3.5 | 34B | โ | โ | โ | guide |
| MMaDA | 8B | โ | โ | โ | guide |
| Janus | 1.3B | โ | โ | โ | guide |
| Janus-Pro | 1B, 7B | โ | โ | โ | guide |
| JanusFlow | 1.3B | โ | โ | โ | guide |
| Show-o | 1.3B | โ | โ | โ | guide |
| Show-o2 | 1.5B, 7B | โ | โ | โ | guide |
| BLIP3-o | 4B | โ | โ | โ | guide |
| TokenFlow | โ | โ | โ | guide | |
| Ovis-U1 | 3B | โ | โ | โ | guide |
See each model's guide for detailed usage instructions, configuration examples, and supported benchmarks.
Emu3.5 note: Emu3.5 is the only model in TorchUMM that uses native vLLM integration via BAAI's official patches (20 patches applied at image build time). Unlike other models, which use the standard `TransformersForCausalLM` wrapper, Emu3.5 runs on vLLM's optimized attention kernels with a custom batch scheduler for classifier-free guidance, achieving ~74 tokens/s on 2×A100-80GB. See the Emu3.5 guide for details.
Flash Attention note: Most models require or benefit from Flash Attention. Do not build it from source with `pip install flash-attn` (extremely slow and error-prone). Instead, download a pre-compiled wheel from the flash-attention releases that matches your Python, CUDA, PyTorch, and C++ ABI versions. All Modal images already include the correct wheel. See each model's guide for the exact wheel command:
| Model | flash-attn | Status | Guide |
|---|---|---|---|
| Bagel | 2.5.8 | Required | guide |
| BLIP3-o | 2.6.2 | Required | guide |
| Emu3 | 2.5.7 | Required | guide |
| Emu3.5 | 2.8.3 | Required | guide |
| Janus-Pro | 2.7.4 | Required | guide |
| MMaDA | 2.7.4 | Recommended | guide |
| Show-o2 | 2.7.4 | Required | guide |
| OmniGen2 | 2.7.4 | Recommended | guide |
| DeepGen | latest | Recommended | guide |
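Picking a matching wheel means knowing four values from the target environment. The sketch below is one way to print them; `wheel_tags` is a hypothetical helper name, not part of TorchUMM, and it assumes a CPython interpreter (the `cp` prefix) with PyTorch optionally installed.

```python
import platform
import sys


def wheel_tags() -> dict:
    """Collect the values a pre-compiled flash-attn wheel has to match."""
    tags = {
        # CPython tag as used in wheel filenames, e.g. "cp310" for Python 3.10.
        "python": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "platform": f"{platform.system().lower()}_{platform.machine()}",
    }
    try:
        import torch  # optional: only needed for the torch/CUDA/ABI fields
    except ImportError:
        tags.update(torch=None, cuda=None, cxx11_abi=None)
        return tags
    tags["torch"] = torch.__version__.split("+")[0]  # drop local suffix like "+cu121"
    tags["cuda"] = torch.version.cuda  # None on CPU-only or ROCm builds
    tags["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    return tags


if __name__ == "__main__":
    for key, value in wheel_tags().items():
        print(f"{key}: {value}")
```

Compare the printed values against the wheel filenames on the releases page before downloading.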
Repository Structure
umm_codebase/
├── src/umm/                  # Core framework
│   ├── backbones/            # Model adapters (Bagel, BLIP3-o, DeepGen, Emu3, Emu3.5, Janus, Janus-Pro, JanusFlow, MMaDA, OmniGen2, Show-o, Show-o2, TokenFlow)
│   ├── cli/                  # CLI entry points (infer, eval, train)
│   ├── core/                 # Config, registry, interfaces
│   ├── data/                 # Datasets, collators, transforms
│   ├── evaluation/           # Evaluation runners and metrics
│   ├── inference/            # Inference pipeline (batching, generation)
│   ├── models/               # Model builders, heads, processors
│   ├── post_training/        # Post-training methods (SFT, IRG, recA, UniCot)
│   └── serving/              # Serving APIs
│
├── model/                    # External model repos & evaluation toolkits (submodules)
│   ├── Bagel/, BLIP3o/, deepgen/, Emu3/, Emu3.5/, MMaDA/, OmniGen2/, Show-o/, TokenFlow/
│   └── UEval/, Uni-MMMU/, WISE/, geneval/, Step1X-Edit/
│
├── configs/                  # YAML configurations
│   ├── inference/            # Per-model inference configs
│   ├── eval/                 # Benchmark evaluation configs (modal_*, amd_*, and local)
│   └── posttrain/            # Post-training configs
│
├── modal/                    # Modal cloud infrastructure (see modal/README.md)
├── docs/                     # Per-model usage documentation
├── eval/                     # Evaluation runner scripts
├── scripts/                  # Utility scripts
└── output/                   # Evaluation results
Installation
# Clone the repository
git clone --recursive https://github.com/AIFrontierLab/TorchUMM.git
cd TorchUMM
# Install the package
pip install -e .
# Install model-specific dependencies (example: Bagel)
pip install -r model/Bagel/requirements.txt
Note: Each backbone model has its own dependencies and may require different Python/PyTorch versions. Install only the requirements for the model(s) you plan to use. For cloud execution via Modal, each model runs in an isolated container image with the correct environment; see modal/README.md for details.
Data Preparation
Understanding Benchmarks Data
Data for the understanding benchmarks is prepared following the InternVL evaluation data preparation guide. All data is stored under data/ at the repository root. Below is a quick-start summary; see eval/vlm/README.md for full details.
MME
mkdir -p data/mme
cd data/mme
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/MME_Benchmark_release_version.zip
unzip MME_Benchmark_release_version.zip
cd -
MMBench
mkdir -p data/mmbench
cd data/mmbench
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv
cd -
MM-Vet
mkdir -p data/mm-vet
cd data/mm-vet
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip mm-vet.zip
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/llava-mm-vet.jsonl
cd -
MathVista
mkdir -p data/MathVista
cd data/MathVista
wget https://huggingface.co/datasets/AI4Math/MathVista/raw/main/annot_testmini.json
cd -
MMMU: auto-downloaded from HuggingFace (MMMU/MMMU) at evaluation time and cached in data/MMMU/. No manual download is needed.
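Before launching an understanding benchmark, it can help to confirm that the manual downloads above actually landed where expected. The following is a minimal sketch; `missing_benchmark_dirs` and `EXPECTED_DIRS` are hypothetical names, and the directory list is taken from the `mkdir` commands in this section.

```python
from pathlib import Path

# Directory names come from the mkdir commands above. MMMU is omitted
# because it is downloaded automatically at evaluation time.
EXPECTED_DIRS = ("mme", "mmbench", "mm-vet", "MathVista")


def missing_benchmark_dirs(data_root: str = "data") -> list[str]:
    """Return the expected benchmark directories that are absent or empty."""
    root = Path(data_root)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing or empty:", missing_benchmark_dirs())
```

This only checks that each directory exists and is non-empty; it does not validate file contents.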
Generation Benchmarks Data
These benchmarks include their data in the repository:
- DPG Bench: Prompts in `eval/generation/dpg_bench/prompts/` (100 prompt files)
- GenEval: Metadata and prompts in `model/geneval/`
- WISE: Benchmark data in `model/WISE/`
Other Benchmarks Data
- UEval: Auto-downloaded from HuggingFace (`primerL/UEval-all`) at evaluation time. For Modal, run `modal run modal/download.py --dataset ueval`.
- Uni-MMMU: Requires the dataset, scoring models (Qwen2.5-VL-72B-Instruct + Qwen3-32B), and DreamSim (auto-downloaded). For Modal, run `modal run modal/download.py --dataset uni_mmmu` and `modal run modal/download.py --model evaluator`. See eval/generation/uni_mmmu/README.md for full setup.
- GEdit-Bench: Auto-downloaded from HuggingFace (`stepfun-ai/GEdit-Bench`) at evaluation time. For Modal, run `modal run modal/download.py --dataset gedit`. Scoring uses Qwen2.5-VL-72B-Instruct (same as WISE).
Usage
Local Execution (CLI)
Inference
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_bagel_generation.yaml
Evaluation
# DPG Bench on Bagel
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/dpg_bench/dpg_bench_bagel.yaml
# GenEval on Bagel (full pipeline: generation + scoring)
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/geneval/geneval_bagel.yaml
# UEval on Bagel
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/ueval/ueval_bagel.yaml
# MME on Bagel
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/mme/mme_bagel.yaml
Post-Training
PYTHONPATH=src python -m umm.cli.main train --config configs/posttrain/bagel_sft.yaml
For cloud GPU execution via Modal, see modal/README.md.
AMD HPC Execution
For AMD ROCm clusters, use the `amd_`-prefixed configs, which contain AMD HPC absolute paths:
# Using local_run.sh (recommended)
bash scripts/amd_migration/local_run.sh bagel --eval-config amd_ueval_bagel
# Or directly with CLI
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/ueval/amd_ueval_bagel.yaml
Config naming convention:
- `modal_*.yaml`: Modal cloud (container mount paths like `/model_cache/...`)
- `amd_*.yaml`: AMD HPC (absolute paths like `/work1/jwang/yinyil/model_cache/...`)
- `*.yaml` (no prefix): Legacy local configs (may have outdated paths)
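The naming convention is simple enough to check mechanically, for example to catch a Modal config being passed to a local run. A minimal sketch, where `config_target` is a hypothetical helper and not part of the TorchUMM CLI:

```python
from pathlib import Path


def config_target(config_path: str) -> str:
    """Classify a config file by its filename prefix."""
    name = Path(config_path).name
    if name.startswith("modal_"):
        return "modal"
    if name.startswith("amd_"):
        return "amd"
    return "local"  # no prefix: legacy local config


if __name__ == "__main__":
    print(config_target("configs/eval/ueval/amd_ueval_bagel.yaml"))
```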
To regenerate AMD configs after modifying modal configs:
python scripts/generate_amd_configs.py
Upload Outputs to HuggingFace
Evaluation outputs live on Modal's umm-outputs Volume. To upload them to HuggingFace (directly from Modal, no local download):
# Upload everything (resumable; re-run if interrupted)
modal run modal/upload_outputs.py --repo-id wenwenw945/umm_outputs
# Upload a specific subdirectory only
modal run modal/upload_outputs.py --repo-id wenwenw945/umm_outputs --subdir geneval
# Force overwrite: clear remote first, then upload
modal run modal/upload_outputs.py --clear --repo-id wenwenw945/umm_outputs
modal run modal/upload_outputs.py --repo-id wenwenw945/umm_outputs
# Dry run: list what would be uploaded
modal run modal/upload_outputs.py --repo-id wenwenw945/umm_outputs --dry-run
Requires a `huggingface-secret` Modal secret with your `HF_TOKEN`.
Python API
You can also use TorchUMM programmatically:
from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest
# Initialize the pipeline with a backbone model
pipeline = InferencePipeline(
    backbone_name="bagel",
    backbone_cfg={
        "model_path": "/path/to/BAGEL-7B-MoT",
        "max_mem_per_gpu": "80GiB",
        "seed": 42,
    },
)
# Text-to-image generation
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="generation",
    prompt="A cat sitting on a rainbow",
    params={"num_timesteps": 50},
))
# Image understanding
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="understanding",
    prompt="Describe this image in detail.",
    images=["path/to/image.jpg"],
    params={"max_think_token_n": 500, "do_sample": False},
))
# Image editing
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="editing",
    prompt="Make the sky purple",
    images=["path/to/image.jpg"],
    params={"num_timesteps": 25},
))
# Batch inference (request1..request3 are InferenceRequest objects built as above)
results = pipeline.run_many(
    [request1, request2, request3],
    batch_size=2,
)
The InferenceRequest dataclass accepts:
| Field | Type | Description |
|---|---|---|
| `backbone` | `str` | Backbone model name (must match pipeline) |
| `task` | `str` | `"generation"`, `"understanding"`, or `"editing"` |
| `prompt` | `str` | Text prompt |
| `images` | `list[str]` | Input image paths (for understanding/editing) |
| `videos` | `list[str]` | Input video paths |
| `params` | `dict` | Task-specific parameters |
| `output_path` | `str` | Path to save output |
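To make the field table concrete, here is a stand-in dataclass with the same field names. It is a sketch only: `RequestSketch` is hypothetical, the defaults and the validation rules (including "editing needs an image") are assumptions, and the real `InferenceRequest` may behave differently.

```python
from dataclasses import dataclass, field
from typing import Optional

VALID_TASKS = ("generation", "understanding", "editing")


@dataclass
class RequestSketch:
    """Hypothetical stand-in mirroring the fields in the table above."""
    backbone: str
    task: str
    prompt: str
    images: list = field(default_factory=list)
    videos: list = field(default_factory=list)
    params: dict = field(default_factory=dict)
    output_path: Optional[str] = None

    def __post_init__(self):
        # Assumed checks; the real dataclass may not validate at all.
        if self.task not in VALID_TASKS:
            raise ValueError(f"unknown task {self.task!r}; expected one of {VALID_TASKS}")
        if self.task == "editing" and not self.images:
            raise ValueError("editing needs at least one input image")


if __name__ == "__main__":
    # Building a small batch of requests, as one might pass to run_many().
    prompts = ["A cat sitting on a rainbow", "A red bicycle in the snow"]
    batch = [RequestSketch("bagel", "generation", p, params={"num_timesteps": 50})
             for p in prompts]
    print(len(batch), "requests prepared")
```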
Evaluation Results
All results below are independently reproduced using TorchUMM. See Disclaimers.
Generation Benchmarks
| Model | DPG Bench | GenEval | WISE |
|---|---|---|---|
| Bagel(14B) | 84.11 | 78.81 | 0.3989 |
| DeepGen(5B) | 87.44 | 86.59 | 0.5470 |
| Janus-Pro(7B) | 83.73 | 78.92 | 0.3811 |
| Janus(1.3B) | 73.526 | 40.04 | 0.2222 |
| Janus-Flow(1.3B) | 72.03 | 49.99 | 0.2964 |
| Show-o2(7B) | 82.81 | 59.87 | 0.3595 |
| Show-o2(1.5B) | 82.78 | 55.49 | 0.3349 |
| Show-o(1.3B) | 78.74 | 65.06 | 0.3037 |
| Emu3(8B) | 80.31 | 45.76 | 0.3373 |
| Emu3.5(34B) | 72.51 | 81.83 | 0.6331 |
| OmniGen2(7B) | 84.51 | 78.53 | 0.4029 |
| BLIP3-o(3B) | 61.47 | 81.36 | 0.4138 |
| TokenFlow | 71.29 | 52.21 | 0.3056 |
| MMaDA | 64.55 | 46.12 | 0.6560 |
DeepGen evaluation parameters follow the official DeepGen repository (`EVAL.md`): all benchmarks use 512×512 resolution, 50 inference steps, guidance scale 4.0 (7.5 for DPG-Bench), and seed 42.
WISE evaluator note: All WISE scores in this table use Qwen2.5-VL-72B-Instruct as the VLM judge, rather than the GPT-4o judge used in the original WISE benchmark and most published papers. This leads to systematically lower absolute scores than paper-reported numbers (e.g., the DeepGen paper reports 0.72 with GPT-4o vs. our 0.5470 with Qwen2.5-VL-72B). The gap is primarily due to two factors: (1) different scoring VLMs have different evaluation biases, and Qwen2.5-VL-72B tends to score more strictly than GPT-4o, especially on the Consistency dimension (weight 0.7 in WiScore); (2) we use the diffusers-format pipeline rather than DeepGen's native pipeline, which may introduce minor generation quality differences. Since all models are evaluated with the same evaluator, relative rankings remain valid for fair comparison.
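The weight on Consistency explains why a stricter judge moves the final number so much. The sketch below illustrates this under stated assumptions: only the 0.7 Consistency weight comes from the note above, while the 0.2 (Realism) and 0.1 (Aesthetic Quality) weights and the 0 to 2 per-dimension scale are recalled from the WISE paper and should be checked against it.

```python
def wiscore(consistency: float, realism: float, aesthetic: float) -> float:
    """Weighted WISE score normalized to [0, 1].

    Assumption: each dimension is judged on a 0-2 scale, and the
    0.2 / 0.1 weights are taken from the WISE paper, not from this README.
    """
    weighted = 0.7 * consistency + 0.2 * realism + 0.1 * aesthetic
    return weighted / 2.0


if __name__ == "__main__":
    lenient = wiscore(consistency=2, realism=2, aesthetic=2)
    strict = wiscore(consistency=1, realism=2, aesthetic=2)
    # A one-point drop in Consistency alone costs 0.35 of the final score.
    print(f"lenient judge: {lenient:.2f}, stricter on consistency: {strict:.2f}")
```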
Understanding Benchmarks
| Model | MME (Perception) | MME (Cognition) | MMMU | MMBench | MM-Vet | MathVista |
|---|---|---|---|---|---|---|
| Bagel (14B) | 1691.5 | 695.4 | 0.519 | 0.843 | 65.9 | 71.6 |
| Janus-Pro (7B) | 1547.9 | 293.2 | 0.407 | 0.699 | 33.7 | 42.8 |
| JanusFlow (1.3B) | 1305.6 | 251.1 | 0.290 | 0.6486 | 31.8 | 34.8 |
| Janus (1.3B) | 1221.4 | 264.3 | 0.273 | 0.4691 | 27.0 | 26.6 |
| Show-o2 (7B) | 1619.8 | 387.5 | 0.479 | 0.430 | 47.1 | 51.5 |
| Show-o2 (1.5B) | 1413.3 | 291.8 | 0.368 | 0.6813 | 46.1 | 37.9 |
| Show-o (1.3B) | 1188.5 | 244.6 | 0.261 | 0.469 | 23.3 | 29.0 |
| Emu3 (8B) | 1176.0 | 213.2 | 0.314 | - | 30.0 | 44.9 |
| Emu3.5 (34B) | 781.1 | 324.6 | 0.292 | 0.183 | 28.0 | 41.7 |
| OmniGen2 (7B) | 1584.4 | 614.6 | 0.460 | 0.782 | 62.7 | 38.9 |
| MMaDA (8B) | 939.0 | 241.4 | 0.289 | 0.330 | 11.4 | 24.9 |
MathVista evaluator note: All MathVista scores use Qwen3-32B for answer extraction from model responses, with rule-based normalization for scoring. Answer extraction is performed locally (no OpenAI API required).
† OmniGen2 and Show-o produce empty responses on the MathVista benchmark.
UEval notes: Emu3 uses separate models for understanding and generation, making it incompatible with UEval's unified evaluation protocol.
Emu3.5 MMBench note ‡: Emu3.5's MMBench score (18.3%) is far below its naive accuracy (43.7%) due to severe option position bias under MMBench's CircularEval protocol. CircularEval shuffles option order across variants and requires the model to answer correctly on all variants. Emu3.5 picks the same letter regardless of content 23.5% of the time (vs. Emu3's 7.1%), indicating that it selects by position rather than by understanding. This is an inherent limitation of the unified model architecture, not a code bug.
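The gap between naive accuracy and the CircularEval score can be reproduced with a few lines. This is an illustrative sketch, not the MMBench scorer: `naive_and_circular_accuracy` is a hypothetical helper, and the input format (one `(question_id, is_correct)` pair per option-order variant) is an assumption.

```python
from collections import defaultdict


def naive_and_circular_accuracy(records):
    """Return (naive, circular) accuracy.

    records: iterable of (question_id, is_correct) pairs, one pair per
    option-order variant of a question.
    """
    per_question = defaultdict(list)
    for question_id, is_correct in records:
        per_question[question_id].append(bool(is_correct))
    if not per_question:
        raise ValueError("no records given")
    total_variants = sum(len(v) for v in per_question.values())
    # Naive: every variant counts on its own.
    naive = sum(sum(v) for v in per_question.values()) / total_variants
    # Circular: a question counts only if all of its variants are correct.
    circular = sum(all(v) for v in per_question.values()) / len(per_question)
    return naive, circular


if __name__ == "__main__":
    records = [("q1", True), ("q1", True), ("q1", False), ("q1", True),
               ("q2", True), ("q2", True), ("q2", True), ("q2", True)]
    print(naive_and_circular_accuracy(records))
```

A model that answers by position gets some variants right by chance, which lifts naive accuracy but almost never satisfies the all-variants requirement.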
Emu3.5 MME note: Emu3.5 uses `temperature=1.0` sampling for understanding, making scores hardware-dependent.
Editing Benchmarks
GEdit-Bench
| Model | EN SC | EN PQ | EN O | CN SC | CN PQ | CN O |
|---|---|---|---|---|---|---|
| DeepGen | 7.44 | 7.54 | 7.33 | 7.41 | 7.59 | 7.36 |
| Bagel | 6.68 | 7.04 | 6.35 | 6.83 | 7.06 | 6.52 |
| OmniGen2 | 6.49 | 7.18 | 6.27 | 6.25 | 7.18 | 6.03 |
| Emu3.5 | 7.64 | 7.48 | 7.56 | 7.62 | 7.50 | 7.56 |
"Intersection" = samples where both EN and CN instructions exist for the same source image.
ImgEdit-Bench
ImgEdit-Bench (Overall)
| Model | ST | MT | UGE |
|---|---|---|---|
| DeepGen | 4.07 | 4.37 | 4.81 |
| Bagel | 3.71 | 4.45 | 4.18 |
| OmniGen2 | 3.88 | 3.27 | 4.06 |
| Emu3.5 | 4.24 | 4.89 | 4.88 |
ImgEdit-Bench evaluates image editing across three suites: Singleturn (9 edit types, 736 samples), UGE (unguided editing, 50 samples), and Multiturn (multi-round editing, 88 samples). All scores use Qwen2.5-VL-72B-Instruct as evaluator (scale 1–5).
Uni-MMMU Benchmark
| Model | Jig. I | Jig. T | Maze I | Maze T | Slid. I | Slid. T | Geo I | Geo T | Sci. R | Sci. T | Sci. I | Code T | Code S | Code P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel | 0.660 | 0.553 | 0.004 | 0.101 | 0.000 | 0.050 | 0.050 | 0.143 | 0.592 | 0.522 | 0.185 | 0.115 | 0.375 | 0.275 |
| Janus-Pro | - | - | - | - | - | - | - | - | 29.3 | 25.5 | 0.0 | 1.5 | 3.7 | 3.4 |
Note: DeepGen, BLIP3-o, and TokenFlow are excluded from Uni-MMMU as they do not support image understanding. Janus-Pro cannot perform editing tasks.
Post-Training Benchmarks
Generation
| Model | DPG | GenEval | WISE | UEval |
|---|---|---|---|---|
| Bagel (base) | 84.11 | 78.81 | 0.399 | 30.9 |
| Bagel + RecA | 85.20 | 83.05 | 0.423 | 31.0 |
| Bagel + UniCot | 83.52 | 77.91 | 0.404 | 31.8 |
| Bagel + SFT | 83.02 | 78.03 | 0.227 | 31.4 |
| Bagel + IRG | 81.82 | 72.06 | 0.384 | 9.1 |
| Bagel + UniGame | 65.77 | 85.80 | 0.403 | 31.0 |
| Janus-Pro + UniGame | 83.92 | 78.65 | 0.373 | 20.65 |
| Janus-Pro + SFT | 83.93 | 77.61 | 0.370 | 20.61 |
| OmniGen2 + SFT | 84.78 | 77.84 | 0.405 | 25.91 |
| BLIP3-o + SFT | 61.01 | 78.41 | 0.399 | - |
| TokenFlow + SFT | 22.16 | 51.96 | 0.328 | - |
| Show-o2 (7B) + SFT | 80.58 | 52.13 | 0.322 | 25.7 |
Understanding
| Model | MME (P) | MME (C) | MMMU | MMBench | MM-Vet | MathVista |
|---|---|---|---|---|---|---|
| Bagel (base) | 1691.5 | 695.4 | 0.519 | 0.843 | 65.9 | 71.6 |
| Bagel + RecA | 1689.1 | 695.4 | 0.523 | 0.842 | 66.1 | 51.6 |
| Bagel + UniCot | 1690.7 | 678.2 | 0.531 | 0.845 | 64.5 | 73.0 |
| Bagel + SFT | 1680.7 | 678.9 | 0.526 | 0.820 | 61.2 | 73.1 |
| Bagel + IRG | 1647.5 | 650.4 | 0.480 | 0.778 | 40.7 | 68.0 |
| Bagel + UniGame | 1692.1 | 695.4 | 0.524 | 0.843 | 60.7 | 72.2 |
| Janus-Pro + UniGame | 1554.0 | 288.9 | 0.409 | 0.698 | 32.4 | 43.9 |
| Janus-Pro + SFT | 1549.9 | 292.9 | 0.400 | 0.700 | 33.0 | 35.4 |
| OmniGen2 + SFT | 1573.6 | 610.0 | 0.469 | 0.782 | 62.2 | 63.5 |
Editing
| Model | GEdit-EN (I/F) | GEdit-CN (I/F) | ImgEdit (S) | ImgEdit (M) | ImgEdit (U) |
|---|---|---|---|---|---|
| Bagel (base) | 6.38 / 6.35 | 6.68 / 6.52 | 3.71 | 4.45 | 4.18 |
| Bagel + RecA | 6.89 / 6.80 | 6.87 / 6.75 | 3.89 | 4.28 | 4.15 |
| Bagel + UniCot | 7.04 / 6.92 | 6.90 / 6.81 | 3.77 | 4.22 | 4.34 |
| Bagel + SFT | 6.62 / 6.49 | 6.71 / 6.54 | 3.73 | 4.48 | 4.12 |
| Bagel + IRG | 6.52 / 6.44 | 6.51 / 6.41 | 3.79 | 3.89 | 4.54 |
| Bagel + UniGame | 6.48 / 6.48 | 6.55 / 6.38 | 3.72 | 4.46 | 4.31 |
| OmniGen2 + SFT | 6.37 / 6.31 | 6.14 / 6.06 | 3.88 | 3.26 | 4.06 |
Reproducing Results
Benchmarks with two-stage evaluation (GenEval, WISE, UEval, Uni-MMMU) provide separate _generate and _score configs. You can also use the base config (mode: full) to run both stages in one command.
GenEval on Bagel
# Step 1: Generate images
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/geneval/geneval_bagel_generate.yaml
# Step 2: Score generated images
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/geneval/geneval_bagel_score.yaml
WISE on Bagel
# Step 1: Generate images
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/wise/wise_bagel_generate.yaml
# Step 2: Score with Qwen models
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/wise/wise_bagel_score.yaml
UEval on Bagel
# Step 1: Generate text + image answers
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/ueval/ueval_bagel_generate.yaml
# Step 2: Score with Qwen models
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/ueval/ueval_bagel_score.yaml
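The two-stage commands above follow a regular path pattern, so they can be generated rather than typed. A minimal sketch: `eval_commands` is a hypothetical helper, and the `configs/eval/<benchmark>/<benchmark>_<model>_<stage>.yaml` pattern is inferred from the GenEval, WISE, and UEval examples, so it may not hold for every benchmark.

```python
def eval_commands(benchmark: str, model: str, two_stage: bool = True) -> list[list[str]]:
    """Build the CLI argument lists for one benchmark/model pair."""
    base = ["python", "-m", "umm.cli.main", "eval", "--config"]
    stem = f"configs/eval/{benchmark}/{benchmark}_{model}"
    # Two-stage benchmarks use *_generate and *_score; otherwise the base config.
    suffixes = ["_generate", "_score"] if two_stage else [""]
    return [base + [f"{stem}{suffix}.yaml"] for suffix in suffixes]


if __name__ == "__main__":
    for cmd in eval_commands("wise", "bagel"):
        print("PYTHONPATH=src", " ".join(cmd))
```

To execute the stages in order, each list can be passed to `subprocess.run(cmd, env={**os.environ, "PYTHONPATH": "src"}, check=True)`, so that a failed generation step stops the scoring step.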
Single-stage benchmarks (DPG Bench, MME, MMMU, MMBench, MM-Vet) run generation and scoring in one step:
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/mme/mme_bagel.yaml
MathVista is a two-stage benchmark: generation runs in the model environment, and scoring (Qwen3-32B answer extraction) runs in the wise environment which has transformers>=4.51:
# Step 1: Generate (model env)
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/mathvista/mathvista_bagel.yaml
# Step 2: Score (wise env; requires transformers>=4.51 for Qwen3)
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/mathvista/mathvista_bagel_score.yaml
Extending TorchUMM
TorchUMM is designed for extensibility. Below are guides for adding new models, benchmarks, and post-training methods.
Adding a New Model
1. Implement the backbone adapter. Create a new directory `src/umm/backbones/<model_name>/` with an adapter class. Your adapter must implement:
   - `load(cfg: dict)`: load model weights and initialize
   - `generation(batch, params)`: text-to-image generation
   - `understanding(batch, params)`: image understanding / VQA
   - `editing(batch, params)`: image editing (optional)

   Reference implementation: `src/umm/backbones/bagel/adapter.py`

   Adapter design guidelines:
   - Do not catch pipeline exceptions in `editing()`. The evaluation pipeline (`generate_image_from_context`) relies on exceptions to fall back from editing to text-to-image generation. If your adapter catches and wraps errors into a return dict, the fallback is silently skipped. Only the final `generation()` method should catch exceptions.
   - Share model components across pipelines. If your model uses separate pipeline objects for different tasks (e.g., one for generation and one for understanding), construct them from shared component references to avoid duplicating large model weights in GPU memory.
   - Use a task-appropriate system prompt for understanding. If your model's default prompt biases toward image generation (common for unified models), override it with a text-focused prompt when handling understanding tasks. See the OmniGen2 adapter for an example.
2. Register the backbone. Add a lazy-loading entry to `register_builtin_backbones()` in `src/umm/inference/pipeline.py`:

       if "my_model" not in registry.list_registered("backbone"):
           from umm.backbones.my_model import MyModelBackbone
           registry.register("backbone", "my_model", MyModelBackbone)

3. Create inference configs. Add YAML files in `configs/inference/`:

       inference:
         backbone: my_model
         backbone_cfg:
           model_path: /path/to/weights
           seed: 42
         request:
           task: generation
           prompt: "A test prompt"

4. Create evaluation configs. Add per-benchmark configs in `configs/eval/<benchmark>/my_model.yaml`.
5. (Optional) Add Modal support. Define a container image in `modal/images.py` and add the repo directory mapping in `modal/run.py`. See modal/README.md.
6. Write documentation. Create `docs/models/my_model.md` with usage instructions, supported benchmarks, and config examples.
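The adapter contract and the "do not catch exceptions in `editing()`" guideline can be sketched together. Everything below is illustrative: the dict-shaped `batch`, the return values, and `generate_with_fallback` (a stand-in for the fallback behavior attributed to `generate_image_from_context`) are assumptions, not TorchUMM's actual types or code.

```python
class MyModelBackbone:
    """Hypothetical adapter skeleton; method names follow the list above."""

    def load(self, cfg: dict) -> None:
        self.cfg = cfg  # a real adapter would load model weights here

    def generation(self, batch, params):
        return {"task": "generation", "prompt": batch["prompt"]}

    def understanding(self, batch, params):
        return {"task": "understanding", "prompt": batch["prompt"]}

    def editing(self, batch, params):
        if not batch.get("images"):
            # Raise instead of returning an error dict, so the caller can fall back.
            raise ValueError("editing requires an input image")
        return {"task": "editing", "prompt": batch["prompt"]}


def generate_with_fallback(adapter, batch, params):
    """Stand-in for the pipeline's editing-to-generation fallback."""
    try:
        return adapter.editing(batch, params)
    except Exception:
        # Only reached because editing() let the exception propagate.
        return adapter.generation(batch, params)


if __name__ == "__main__":
    adapter = MyModelBackbone()
    adapter.load({"model_path": "/path/to/weights"})
    print(generate_with_fallback(adapter, {"prompt": "a red car"}, {}))
```

If `editing()` instead returned something like `{"error": ...}`, the `try` block would succeed and the fallback branch would never run, which is the silent failure the guideline warns about.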
Adding a New Benchmark
1. Create evaluation scripts. Add a new directory under `eval/` (e.g., `eval/generation/my_benchmark/`) with the evaluation logic.
2. Create per-model configs. Add YAML configs in `configs/eval/my_benchmark/`:

       eval:
         benchmark: my_benchmark
       inference:
         backbone: bagel
         backbone_cfg: { ... }
       my_benchmark:
         data_root: /path/to/data
         out_dir: output/my_benchmark/bagel

3. Register in the eval router. Add a routing entry in `src/umm/cli/eval.py`:

       if benchmark == "my_benchmark" or "my_benchmark" in raw_cfg:
           from umm.cli.my_benchmark import run_eval_command as _fn
           return _fn(args)

4. Write a data preparation README. Create `eval/<category>/my_benchmark/README.md` with download and setup instructions.

Reference: `eval/generation/geneval/`
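The routing condition in the eval router accepts a benchmark either by name or by the presence of its config section. The sketch below isolates that rule; `resolve_benchmark` is a hypothetical function, and the assumption that the name lives at `eval.benchmark` with the benchmark section at the top level follows the example config, not the actual router code.

```python
def resolve_benchmark(raw_cfg: dict, known: tuple) -> str:
    """Pick the benchmark a parsed YAML config refers to."""
    declared = (raw_cfg.get("eval") or {}).get("benchmark")
    if declared in known:
        return declared
    # Fall back to the first known benchmark that has its own top-level section.
    for name in known:
        if name in raw_cfg:
            return name
    raise ValueError(f"no known benchmark found in config (known: {known})")


if __name__ == "__main__":
    cfg = {
        "eval": {"benchmark": "my_benchmark"},
        "inference": {"backbone": "bagel"},
        "my_benchmark": {"data_root": "/path/to/data"},
    }
    print(resolve_benchmark(cfg, known=("geneval", "my_benchmark")))
```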
Adding a New Post-Training Method
1. Implement training logic. Create `src/umm/post_training/<method>/` with your training pipeline.
2. Create a config. Add `configs/posttrain/<method>.yaml`:

       train:
         pipeline: bagel
         cwd: src/umm/post_training/<method>/
         entrypoint: torchrun
         script: train.py
         args:
           learning_rate: 1e-5

3. Run training:

       PYTHONPATH=src python -m umm.cli.main train --config configs/posttrain/<method>.yaml

Reference: `src/umm/post_training/sft/`
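One plausible reading of the `train` config is that `entrypoint`, `script`, and `args` are combined into a launch command run from `cwd`. The sketch below shows that reading only; `build_train_command` is hypothetical, and how the TorchUMM CLI actually consumes these fields (flag naming, extra torchrun options) is an assumption.

```python
def build_train_command(train_cfg: dict) -> list[str]:
    """Turn the `train` section of a post-training config into an argv list."""
    cmd = [train_cfg["entrypoint"], train_cfg["script"]]
    for key, value in (train_cfg.get("args") or {}).items():
        # Assumed convention: each args entry becomes a --key value pair.
        cmd += [f"--{key}", str(value)]
    return cmd


if __name__ == "__main__":
    train_cfg = {
        "pipeline": "bagel",
        "cwd": "src/umm/post_training/my_method/",
        "entrypoint": "torchrun",
        "script": "train.py",
        "args": {"learning_rate": "1e-5"},
    }
    print(" ".join(build_train_command(train_cfg)), "(run from", train_cfg["cwd"] + ")")
```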
Post-Training Methods
TorchUMM supports multiple post-training strategies (currently targeting Bagel):
| Method | Description | Config |
|---|---|---|
| SFT | Supervised fine-tuning | configs/posttrain/bagel_sft.yaml |
| IRG | Interleaved Reasoning Generation (2-stage) | configs/posttrain/irg_stage1.yaml / irg_stage2.yaml |
| recA | Reconstruction Alignment | configs/posttrain/recA.yaml |
| UniCot | Unified Chain-of-Thought training (LoRA) | configs/posttrain/unicot.yaml |
| UniGame | Self-adversarial consistency training | configs/posttrain/unigame.yaml |
# Example: SFT on Bagel (local)
PYTHONPATH=src python -m umm.cli.main train --config configs/posttrain/bagel_sft.yaml
For cloud-based post-training, see modal/README.md.
Disclaimers
Important: Please read before using or citing evaluation results.
- Unofficial results. All evaluation results in this repository are independently reproduced by the TorchUMM team. They do NOT represent official results from the original model authors. Differences from published numbers may arise due to variations in inference settings, hardware, random seeds, or evaluation protocols.
- Active development. TorchUMM is under active development. We are continuously adding support for new models, benchmarks, and post-training methods. Some results may be updated as we refine our evaluation pipelines.
- Contributions welcome. We welcome bug reports, corrections, and contributions from the community. If you find discrepancies in our results or want to add support for a new model/benchmark, please open an issue or pull request.
- Community usage. You are welcome to use TorchUMM for your own research and evaluation. If you do, we appreciate a citation (see Citation).
Citation
If you find TorchUMM useful in your research, please consider citing:
@misc{luo2026torchummunifiedmultimodalmodel,
title={TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training},
author={Yinyi Luo and Wenwen Wang and Hayes Bai and Hongyu Zhu and Hao Chen and Pan He and Marios Savvides and Sharon Li and Jindong Wang},
year={2026},
eprint={2604.10784},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.10784},
}