README.md
March 2, 2026
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Shanghai Innovation Institute, DeepGen Team
News
- Feb 13, 2026: We released DeepGen 1.0. The Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints are available on Hugging Face and support both T2I generation and image editing.
- Feb 13, 2026: We released the training code, which supports Pre-training, Supervised Fine-Tuning, and Reinforcement Learning, along with evaluation code that supports a wide range of benchmarks.
- Feb 13, 2026: We released the DeepGen 1.0 technical report on arXiv.
Introduction
Broader Scenario and Dimension Coverage: We propose DeepGen 1.0, a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
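To make the third stage more concrete, here is a minimal sketch of how a mixture of rewards could feed a GRPO-style group-relative advantage. It is an illustration under stated assumptions, not the MR-GRPO implementation: the reward names, the weights, and the choice to normalize after mixing are all hypothetical.

```python
from statistics import mean, pstdev


def mix_rewards(reward_scores, weights):
    """Weighted sum of the per-reward scores for one generated sample."""
    return sum(weights[name] * score for name, score in reward_scores.items())


def group_relative_advantages(mixed_rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(mixed_rewards)
    sigma = pstdev(mixed_rewards)
    return [(r - mu) / (sigma + eps) for r in mixed_rewards]


# Hypothetical reward names and weights, for illustration only.
weights = {"preference": 0.5, "text_alignment": 0.3, "ocr": 0.2}

# One group: four samples generated for the same prompt, each scored by every reward.
group = [
    {"preference": 0.9, "text_alignment": 0.8, "ocr": 1.0},
    {"preference": 0.4, "text_alignment": 0.7, "ocr": 0.0},
    {"preference": 0.6, "text_alignment": 0.5, "ocr": 1.0},
    {"preference": 0.2, "text_alignment": 0.3, "ocr": 0.0},
]

mixed = [mix_rewards(scores, weights) for scores in group]
advantages = group_relative_advantages(mixed)

for reward, advantage in zip(mixed, advantages):
    print(f"mixed reward {reward:.2f} -> advantage {advantage:+.2f}")
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update depends on the ranking within the group and not on the absolute reward scale.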

Train & Eval
Set up environment
git clone https://github.com/deepgenteam/deepgen.git
cd deepgen
conda create -n deepgen python=3.12 -y
conda activate deepgen
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
pip install xtuner==0.2.0
pip install transformers==4.56.1
pip install triton==2.3.0
pip install -U opencv-python-headless
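After installation, it can help to confirm that the pinned packages resolved to the expected versions. The following is a small sketch using only the standard library; the helper name `check_pins` is ours, and it assumes the installed distribution names match the names used in the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
PINNED = {
    "flash_attn": "2.8.3",
    "xtuner": "0.2.0",
    "transformers": "4.56.1",
    "triton": "2.3.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every mismatched or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is a strict string match, so a build with a local version suffix would be reported as a mismatch even if it is functionally equivalent.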
Data Preparation
Please see DATA for more details. We provide a detailed description of the data download and usage procedures for both the Pre-training stage and the Supervised Fine-Tuning stage.
Inference
Diffusers
We provide a diffusers-compatible checkpoint at deepgenteam/DeepGen-1.0-diffusers on Hugging Face.
Text-to-Image:
import torch
from diffusers import DiffusionPipeline

# trust_remote_code=True lets diffusers load the custom pipeline code
# from the model repository.
pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Generate a 512x512 image from a text prompt.
result = pipe(
    prompt="a photo of a blue pizza and a yellow baseball glove",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
Image Editing:
from PIL import Image

# Reuses the `pipe` object created in the Text-to-Image example.
result = pipe(
    prompt="Place this guitar on a sandy beach with the sunset in the background.",
    image=Image.open("guitar.png"),
    negative_prompt="blurry, low quality, low resolution, distorted, deformed, broken content, missing parts, damaged details, artifacts, glitch, noise, pixelated, grainy, compression artifacts, bad composition, wrong proportion, incomplete editing, unfinished, unedited areas.",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
Please refer to INFERENCE for more details, including the native pipeline usage.
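The two calls above share most of their arguments, so a small wrapper can keep them in one place. This is only a convenience sketch: `build_call_kwargs` and `DEFAULTS` are hypothetical names, and the parameter names are the ones used in the examples above.

```python
# Defaults copied from the Text-to-Image and Image Editing examples above.
DEFAULTS = {
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "seed": 42,
}


def build_call_kwargs(prompt, image=None, negative_prompt=None, **overrides):
    """Assemble keyword arguments for pipe(...); passing an image selects editing."""
    kwargs = {"prompt": prompt, **DEFAULTS, **overrides}
    if image is not None:
        kwargs["image"] = image
    if negative_prompt is not None:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs


# Stand-in for a PIL image so the sketch runs without extra dependencies.
placeholder_image = object()

t2i = build_call_kwargs("a photo of a blue pizza and a yellow baseball glove")
edit = build_call_kwargs(
    "Place this guitar on a sandy beach with the sunset in the background.",
    image=placeholder_image,
    negative_prompt="blurry, low quality",
    seed=7,
)

print(sorted(t2i))
print(sorted(edit))
```

With a loaded pipeline, the dictionaries would be passed as `pipe(**t2i)` or `pipe(**edit)`, with `placeholder_image` replaced by `Image.open("guitar.png")`.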
Train
See TRAIN for more details. We provide a detailed description of the model and training configs for both the Pre-training stage and the Supervised Fine-Tuning stage.
Eval
We provide scripts for evaluating a wide range of T2I and editing benchmarks. Please see EVAL for more details.
Benchmarks
1. General Image Generation
| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |
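As a quick arithmetic check on model size, the Params column above can be summed and divided by DeepGen 1.0's 5B total. This is a small sketch; the helper name `total_params_b` is ours and the values are copied from the table.

```python
def total_params_b(spec):
    """Sum a Params cell such as '7B + 20B' into billions of parameters."""
    return sum(float(part.strip().rstrip("B")) for part in spec.split("+"))


# Params column of the General Image Generation table.
params = {
    "OmniGen2": "3B + 4B",
    "BAGEL": "14B",
    "X-Omni": "7B + 12B",
    "Lumina-DiMOO": "8B",
    "Hunyuan-Image-3.0": "80B",
    "Qwen-Image": "7B + 20B",
    "LongCat-Image": "7B + 6B",
    "Z-Image-Turbo": "4B + 6B",
    "GLM-Image": "9B + 7B",
}

deepgen = total_params_b("3B + 2B")
ratios = {name: total_params_b(spec) / deepgen for name, spec in params.items()}

# Print the baselines from smallest to largest relative size.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:18s} {ratio:4.1f}x")
```

Hunyuan-Image-3.0 comes out at 16 times the size of DeepGen 1.0, which is the upper end of the range quoted in the introduction.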
2. General Image Editing
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |
3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |
4. Reasoning Image Editing
| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |
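To see where the RL stage helps and where it does not, the SFT and RL rows of the four tables can be compared directly. The sketch below only does the subtraction; the scores are copied from the tables above, and the variable names are ours.

```python
# (SFT score, RL score) for DeepGen 1.0, copied from the four benchmark tables.
scores = {
    "Geneval": (0.86, 0.87),
    "DPGBench": (87.05, 87.90),
    "UniGenBench": (74.18, 75.74),
    "GEdit-EN": (7.12, 7.17),
    "ImgEdit": (4.09, 4.14),
    "WISE": (0.72, 0.73),
    "T2I-CoREBench": (45.7, 46.5),
    "RISE": (13.3, 10.8),
    "UniREditBench": (77.5, 75.7),
}

# Positive delta: RL scored higher than SFT on that benchmark.
deltas = {name: round(rl - sft, 2) for name, (sft, rl) in scores.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.2f}")
print("Lower after RL:", ", ".join(regressions))
```

On these numbers, the RL checkpoint scores higher on every generation and general-editing benchmark, while the SFT checkpoint scores higher on the two reasoning-editing benchmarks (RISE and UniREditBench).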
Contact
dywang24@m.fudan.edu.cn, wjqdev@gmail.com
Quantitative results

Citation
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}
Acknowledgement
The project builds upon the following pioneering works:
- OpenUni: We thank the OpenUni team for releasing the elegant and concise code and pretraining dataset.
- UniPic2-SD3.5M-Kontext-2B: We use UniPic2-SD3.5M-Kontext-2B as our diffusion module, considering its efficiency and strong performance on both T2I and editing.
- UnifiedReward-Think: We use UnifiedReward-Think as our reward model for RL, considering its strong performance.
- Qwen2.5 VL: We use Qwen2.5 VL-3B as our VLM module, considering its efficiency and strong multimodal understanding.
- BLIP3-o: We thank the BLIP3-o team for releasing the precious high-quality tuning dataset.
- OpenGPT-4o-Image: We thank the OpenGPT-4o-Image team for releasing the precious high-quality tuning dataset.
- ShareGPT-4o-Image: We thank the ShareGPT-4o-Image team for releasing the precious high-quality tuning dataset.
- Echo-4o: We thank the Echo-4o team for releasing the precious high-quality tuning dataset.
- OmniGen2: We thank the OmniGen2 team for releasing the precious high-quality editing tuning dataset and code.
- Uniworld-V1: We thank the Uniworld team for releasing the precious high-quality tuning dataset and code.
- Picobanana: We thank the Picobanana team for releasing the precious high-quality editing tuning dataset.
- Nano-consist: We thank the Nano-consist team for releasing the precious high-quality editing tuning dataset.
- NHR-edit: We thank the NHR-edit team for releasing the precious high-quality editing tuning dataset.
- UniREditBench: We thank the UniREditBench team for releasing the precious high-quality reason-based editing tuning dataset.