March 2, 2026

DeepGen

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Shanghai Innovation Institute, DeepGen Team

Paper PDF Project Page DeepGen RL Model CkPT Data

🔥 News

  • Feb 13, 2026: We released DeepGen 1.0. The Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints are available on Hugging Face and support both T2I generation and image editing.
  • Feb 13, 2026: We released the training code, which supports Pre-training, Supervised Fine-Tuning, and Reinforcement Learning, and the evaluation code, which supports a wide range of benchmarks.
  • Feb 13, 2026: We released the DeepGen 1.0 technical report on arXiv.

✨ Introduction

Broader Scenario and Dimension Coverage

We propose DeepGen 1.0, a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
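
Stage (3) builds on GRPO, where several images are sampled per prompt and each one is scored relative to the rest of its group. The snippet below is only a minimal sketch of that generic idea combined with a weighted mixture of rewards. The exact MR-GRPO formulation is given in the technical report, not here; the reward names, the weights, and the helper names `mix_rewards` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def mix_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-reward scores for one generated image."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward names and weights, for illustration only.
    weights = {"preference": 0.6, "text_alignment": 0.4}
    group = [
        {"preference": 0.9, "text_alignment": 0.7},
        {"preference": 0.4, "text_alignment": 0.8},
        {"preference": 0.6, "text_alignment": 0.2},
    ]
    mixed = [mix_rewards(scores, weights) for scores in group]
    for reward, advantage in zip(mixed, group_relative_advantages(mixed)):
        print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

Samples that score above their group's mean get a positive advantage and are reinforced; samples below the mean get a negative one.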

💻 Train & Eval

Set up environment

git clone https://github.com/deepgenteam/deepgen.git
cd deepgen
conda create -n deepgen python=3.12 -y
conda activate deepgen
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
pip install xtuner==0.2.0
pip install transformers==4.56.1
pip install triton==2.3.0
pip install -U opencv-python-headless
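
Because several packages are pinned to exact versions, it can help to confirm that the active environment matches before training starts. The script below is a small sketch using only the standard library. The helper names `installed_version` and `find_mismatches` are hypothetical and not part of this repository, and the pins are copied from the commands above.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Versions pinned by the pip commands above.
PINNED: Dict[str, str] = {
    "flash_attn": "2.8.3",
    "xtuner": "0.2.0",
    "transformers": "4.56.1",
    "triton": "2.3.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore a local build tag such as "+cu12" when comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```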

Data Preparation

Please see DATA for more details. It describes the data download and usage procedures for both the Pre-training stage and the Supervised Fine-Tuning stage.

Inference

Diffusers

We provide a diffusers-compatible format at 🤗 deepgenteam/DeepGen-1.0-diffusers.

Text-to-Image:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

result = pipe(
    prompt="a photo of a blue pizza and a yellow baseball glove",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")

Image Editing:

from PIL import Image

result = pipe(
    prompt="Place this guitar on a sandy beach with the sunset in the background.",
    image=Image.open("guitar.png"),
    negative_prompt="blurry, low quality, low resolution, distorted, deformed, broken content, missing parts, damaged details, artifacts, glitch, noise, pixelated, grainy, compression artifacts, bad composition, wrong proportion, incomplete editing, unfinished, unedited areas.",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
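
Both examples above fix `seed=42`. To compare several samples for the same prompt, a thin wrapper can loop over seeds. This is a sketch only: `sweep_seeds` is a hypothetical helper, and it assumes the pipeline accepts a `seed` keyword and returns an object with an `images` list, as in the calls above.

```python
from typing import Any, Dict, Iterable


def sweep_seeds(pipe: Any, seeds: Iterable[int], **call_kwargs: Any) -> Dict[int, Any]:
    """Call `pipe` once per seed and keep the first image of each result."""
    images = {}
    for seed in seeds:
        # All other arguments (prompt, image, height, ...) are passed through unchanged.
        result = pipe(seed=seed, **call_kwargs)
        images[seed] = result.images[0]
    return images


# Usage with the pipeline loaded above:
#
# images = sweep_seeds(
#     pipe,
#     seeds=[0, 1, 2],
#     prompt="a photo of a blue pizza and a yellow baseball glove",
#     height=512, width=512,
#     num_inference_steps=50,
#     guidance_scale=4.0,
# )
# for seed, image in images.items():
#     image.save(f"output_seed{seed}.png")
```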

Please refer to INFERENCE for more details, including the native pipeline usage.

Train

See TRAIN for more details. It describes the model and training configs for both the Pre-training stage and the Supervised Fine-Tuning stage.

Eval

We provide scripts for evaluating a wide range of T2I and editing benchmarks. Please see EVAL for more details.

📊 Benchmarks

1. General Image Generation

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --- | --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| --- | --- | --- | --- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |
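
The size comparison in the Introduction can be reproduced from the Params columns of the tables above. The snippet below is a small worked calculation, assuming that multi-part entries such as "7B + 20B" are simply summed. The helper name `size_ratios` is hypothetical.

```python
from typing import Dict

DEEPGEN_TOTAL_B = 3 + 2  # 3B VLM + 2B DiT

# Total parameters in billions, summed from the Params columns above.
BASELINES_B: Dict[str, int] = {
    "OmniGen2": 3 + 4,
    "BAGEL": 14,
    "X-Omni": 7 + 12,
    "Lumina-DiMOO": 8,
    "Hunyuan-Image-3.0": 80,
    "Qwen-Image": 7 + 20,
    "LongCat-Image": 7 + 6,
    "Z-Image-Turbo": 4 + 6,
    "GLM-Image": 9 + 7,
    "Mammoth2": 8 + 3 + 2,
}


def size_ratios(baselines: Dict[str, int], reference: int) -> Dict[str, float]:
    """Return each baseline's total parameter count divided by `reference`."""
    return {name: total / reference for name, total in baselines.items()}


if __name__ == "__main__":
    ratios = size_ratios(BASELINES_B, DEEPGEN_TOTAL_B)
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name}: {ratio:.1f}x the size of DeepGen 1.0")
```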

📧 Contact

dywang24@m.fudan.edu.cn, wjqdev@gmail.com

🎨 Quantitative results

⭐ Citation

@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}

πŸ™ Acknowledgement

The project builds upon the following pioneering works:

  • OpenUni: We thank the OpenUni team for releasing their elegant and concise code and pre-training dataset.
  • UniPic2-SD3.5M-Kontext-2B: We use UniPic2-SD3.5M-Kontext-2B as our diffusion module for its efficiency and strong performance on both T2I and editing.
  • UnifiedReward-Think: We use UnifiedReward-Think as our reward model for RL for its strong performance.
  • Qwen2.5 VL: We use Qwen2.5 VL-3B as our VLM module for its efficiency and strong multimodal understanding.
  • BLIP3-o: We thank the BLIP3-o team for releasing their valuable high-quality tuning dataset.
  • OpenGPT-4o-Image: We thank the OpenGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
  • ShareGPT-4o-Image: We thank the ShareGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
  • Echo-4o: We thank the Echo-4o team for releasing their valuable high-quality tuning dataset.
  • OmniGen2: We thank the OmniGen2 team for releasing their valuable high-quality editing tuning dataset and code.
  • Uniworld-V1: We thank the Uniworld team for releasing their valuable high-quality tuning dataset and code.
  • Picobanana: We thank the Picobanana team for releasing their valuable high-quality editing tuning dataset.
  • Nano-consist: We thank the Nano-consist team for releasing their valuable high-quality editing tuning dataset.
  • NHR-edit: We thank the NHR-edit team for releasing their valuable high-quality editing tuning dataset.
  • UniREditBench: We thank the UniREditBench team for releasing their valuable high-quality reasoning-based editing tuning dataset.