DiffSynth-Studio

November 17, 2025

Guidance Buffer Conditioned Video Generation for InfiniCube's Stage 2

We provide comprehensive support for training guidance buffer conditioned video generation for InfiniCube's stage 2 in this repo, using Wan2.1 1.3B and Wan2.1 14B as the base models. Please refer to InfiniCube for data preparation.

Figure: Buffer generation result, showing the guidance buffer and the generated video (center).

WAN-2.1 14B Buffer Conditioned Training

We train both T2V (Text-to-Video) and I2V (Image-to-Video) models for buffer conditioned video generation. Training currently runs on a single node only, but typically achieves good results within 1000 steps. On an 8×A100 GPU node, each iteration takes about 80 seconds, so 1000 steps take about 1 day.
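
As a quick sanity check on the timing figures quoted above, the step count and per-step time can be multiplied out. This is plain arithmetic on the numbers in this section; the function name is only illustrative.

```python
def training_hours(steps: int, seconds_per_step: float) -> float:
    """Wall-clock training time in hours for a fixed per-step cost."""
    return steps * seconds_per_step / 3600.0


hours = training_hours(steps=1000, seconds_per_step=80)
print(f"{hours:.1f} hours")  # 22.2 hours, i.e. roughly one day
```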

T2V Training (Text-to-Video):

bash slurm/singlenode/train_wan_14B_deepspeed.sh

I2V Training (Image-to-Video):

bash slurm/singlenode/train_wan_14B_deepspeed_i2v.sh

Remember to update --dataset_base_path and --dataset_metadata_path to point to your own data.

WAN-2.1 1.3B Buffer Conditioned Training

The 1.3B model supports both multi-GPU and multi-node training for scalable performance. However, our implementation, which simply adds the buffer condition, does not work well at the 1.3B model size, so we use Wan2.1 14B instead.

Quick Start - Single Node:

bash slurm/singlenode/train_wan_1.3B_with_buffer_differential_lr.sh

Multi-Node Training:

bash slurm/launchers/slurm_srun_repeat.sh <config_file> -n <nodes> -a <account>

Remember to update --dataset_base_path and --dataset_metadata_path to point to your own data. Use zero initialization (zero-init), not random initialization.
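
One common reason to prefer zero-init for an additive conditioning branch is that the pretrained model's inputs are left unchanged at the first training step. The following is a minimal pure-Python sketch of that idea, not the repository's implementation: the bias-free linear embedder, the function names, and the toy dimensions are all assumptions made for illustration.

```python
import random


def make_embedder(in_dim, out_dim, zero_init=True, seed=0):
    """Weights (out_dim rows of in_dim columns) for a toy, bias-free buffer embedder."""
    if zero_init:
        return [[0.0] * in_dim for _ in range(out_dim)]
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def add_buffer_condition(token, buffer_feat, weights):
    """Additive conditioning: token + W @ buffer_feat."""
    delta = [sum(w * b for w, b in zip(row, buffer_feat)) for row in weights]
    return [t + d for t, d in zip(token, delta)]


token = [0.5, -1.0, 2.0]   # stand-in for one pretrained latent token
buffer_feat = [1.0, 3.0]   # stand-in for one guidance buffer feature

zero_out = add_buffer_condition(token, buffer_feat, make_embedder(2, 3, zero_init=True))
rand_out = add_buffer_condition(token, buffer_feat, make_embedder(2, 3, zero_init=False))

print(zero_out == token)  # True: the base model's input is untouched at step 0
print(rand_out == token)  # False: random weights perturb the input immediately
```

With zero-init, the conditioning signal only starts to influence the model once gradient updates move the embedder weights away from zero.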

WAN-2.1 Buffer Conditioned Inference

Wan2.1 1.3B Buffer Conditioned Inference:

python examples/wanvideo/model_inference/Wan2.1-1.3B-Buffer-Control.py --checkpoint_path <checkpoint_path>

Wan2.1 14B Buffer Conditioned Inference:

python examples/wanvideo/model_inference/Wan2.1-T2V-14B-Buffer-Control.py --checkpoint_path <checkpoint_path>
python examples/wanvideo/model_inference/Wan2.1-I2V-14B-Buffer-Control.py --checkpoint_path <checkpoint_path>

Key Features

  • Multi-Node Training: Supports multi-node training for scalable performance (not yet working together with DeepSpeed)
  • Differential Learning Rates: Separate learning rates for base model and buffer embedder
  • Auto-Resume: Automatic checkpoint resumption for interrupted training
  • Visualization: Visualization steps run during training to monitor progress
  • Buffer Conditioning: Semantic and coordinate buffer inputs for precise control
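
The differential learning rate feature listed above can be pictured as splitting parameters into two optimizer groups. The sketch below is an illustration only: the `buffer_embedder.` name prefix, the parameter names, and the learning rate values are hypothetical and are not taken from this repo's training scripts.

```python
def build_param_groups(named_params, base_lr, embedder_lr,
                       embedder_prefix="buffer_embedder."):
    """Split (name, param) pairs into two optimizer groups with their own learning rates."""
    base, embedder = [], []
    for name, param in named_params:
        target = embedder if name.startswith(embedder_prefix) else base
        target.append(param)
    return [
        {"params": base, "lr": base_lr},
        {"params": embedder, "lr": embedder_lr},
    ]


# Placeholder strings stand in for real parameter tensors.
named_params = [
    ("blocks.0.attn.weight", "p0"),
    ("buffer_embedder.proj.weight", "p1"),
    ("buffer_embedder.proj.bias", "p2"),
]
groups = build_param_groups(named_params, base_lr=1e-5, embedder_lr=1e-4)
for group in groups:
    print(len(group["params"]), group["lr"])
```

With PyTorch, a list of this shape can be passed to an optimizer such as torch.optim.AdamW, with model.named_parameters() supplying the (name, param) pairs.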

Introduction

Welcome to the magic world of Diffusion models! DiffSynth-Studio is an open-source Diffusion model engine developed and maintained by the ModelScope team. We aim to foster technical innovation through framework development, bring together the power of the open-source community, and explore the limits of generative models!

DiffSynth currently includes two open-source projects:

  • DiffSynth-Studio: Focused on aggressive technical exploration, for academia, providing support for more cutting-edge model capabilities.
  • DiffSynth-Engine: Focused on stable model deployment, for industry, offering higher computing performance and more stable features.

DiffSynth-Studio and DiffSynth-Engine are the core projects behind ModelScope AIGC zone, offering powerful AI content generation abilities. Come and try our carefully designed features and start your AI creation journey!

Installation

Install from source (recommended):

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .
Other installation methods

Install from PyPI (version updates may be delayed; for latest features, install from source)

pip install diffsynth

If you run into problems during installation, they may be caused by upstream dependencies. Please check the documentation of the packages involved.

Basic Framework

DiffSynth-Studio redesigns the inference and training pipelines for mainstream Diffusion models (including FLUX, Wan, etc.), enabling efficient memory management and flexible model training.

Qwen-Image Series (🔥New Model)

Details: ./examples/qwen_image/

Image

Quick Start
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
prompt = "A detailed portrait of a girl underwater, wearing a blue flowing dress, hair gently floating, clear light and shadow, surrounded by bubbles, calm expression, fine details, dreamy and beautiful."
image = pipe(
    prompt, seed=0, num_inference_steps=40,
    # edit_image=Image.open("xxx.jpg").resize((1328, 1328)) # For Qwen-Image-Edit
)
image.save("image.jpg")
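
The origin_file_pattern values above include a `*` wildcard, which is useful when a model's weights are split across several shard files. The sketch below shows how such a pattern would select files under shell-style wildcard matching. That matching rule is an assumption made for illustration (the library's own resolution logic may differ), and the file listing is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical repository listing; real file names may differ.
repo_files = [
    "transformer/diffusion_pytorch_model-00001-of-00002.safetensors",
    "transformer/diffusion_pytorch_model-00002-of-00002.safetensors",
    "text_encoder/model-00001-of-00001.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
]


def select(pattern, files):
    """Return the files matching a shell-style wildcard pattern."""
    return [f for f in files if fnmatchcase(f, pattern)]


print(select("transformer/diffusion_pytorch_model*.safetensors", repo_files))
print(select("vae/diffusion_pytorch_model.safetensors", repo_files))
```

The first pattern picks up both transformer shards, while the second, which has no wildcard, matches only the single VAE file.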
Model Overview

| Model ID | Inference | Low VRAM Inference | Full Training | Validation after Full Training | LoRA Training | Validation after LoRA Training |
| - | - | - | - | - | - | - |
| Qwen/Qwen-Image | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit-2509 | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-V2 | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-Poster | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-Full | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-LoRA | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-In-Context-Control-Union | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix | code | code | - | - | - | - |

FLUX Series

Detail page: ./examples/flux/

Image

Quick Start
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig

pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)

image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
Model Overview

| Model ID | Extra Parameters | Inference | Low VRAM Inference | Full Training | Validate After Full Training | LoRA Training | Validate After LoRA Training |
| - | - | - | - | - | - | - | - |
| FLUX.1-dev | | code | code | code | code | code | code |
| FLUX.1-Krea-dev | | code | code | code | code | code | code |
| FLUX.1-Kontext-dev | kontext_images | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Inpainting-Beta | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Union-alpha | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Upscaler | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-IP-Adapter | ipadapter_images, ipadapter_scale | code | code | code | code | code | code |
| FLUX.1-dev-InfiniteYou | infinityou_id_image, infinityou_guidance, controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-EliGen | eligen_entity_prompts, eligen_entity_masks, eligen_enable_on_negative, eligen_enable_inpaint | code | code | - | - | code | code |
| FLUX.1-dev-LoRA-Encoder | lora_encoder_inputs, lora_encoder_scale | code | code | code | code | - | - |
| FLUX.1-dev-LoRA-Fusion-Preview | | code | - | - | - | - | - |
| Step1X-Edit | step1x_reference_image | code | code | code | code | code | code |
| FLEX.2-preview | flex_inpaint_image, flex_inpaint_mask, flex_control_image, flex_control_strength, flex_control_stop | code | code | code | code | code | code |
| Nexus-Gen | nexus_gen_reference_image | code | code | code | code | code | code |

Wan Series

Detail page: ./examples/wanvideo/

https://github.com/user-attachments/assets/1d66ae74-3b02-40a9-acc3-ea95fc039314

Quick Start
import torch
from diffsynth import save_video
from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", offload_device="cpu"),
    ],
)
pipe.enable_vram_management()

video = pipe(
    prompt="A documentary photography style scene: a lively puppy rapidly running on green grass. The puppy has brown-yellow fur, upright ears, and looks focused and joyful. Sunlight shines on its body, making the fur appear soft and shiny. The background is an open field with occasional wildflowers, and faint blue sky and clouds in the distance. Strong sense of perspective captures the motion of the puppy and the vitality of the surrounding grass. Mid-shot side-moving view.",
    negative_prompt="Bright colors, overexposed, static, blurry details, subtitles, style, artwork, image, still, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, deformed, extra fingers, poorly drawn hands, poorly drawn face, malformed limbs, fused fingers, still frame, messy background, three legs, crowded background people, walking backwards",
    seed=0, tiled=True,
)
save_video(video, "video1.mp4", fps=15, quality=5)
Model Overview

| Model ID | Extra Parameters | Inference | Full Training | Validate After Full Training | LoRA Training | Validate After LoRA Training |
| - | - | - | - | - | - | - |
| Wan-AI/Wan2.2-Animate-14B | input_image, animate_pose_video, animate_face_video, animate_inpaint_video, animate_mask_video | code | code | code | code | code |
| Wan-AI/Wan2.2-S2V-14B | input_image, input_audio, audio_sample_rate, s2v_pose_video | code | code | code | code | code |
| Wan-AI/Wan2.2-I2V-A14B | input_image | code | code | code | code | code |
| Wan-AI/Wan2.2-T2V-A14B | | code | code | code | code | code |
| Wan-AI/Wan2.2-TI2V-5B | input_image | code | code | code | code | code |
| Wan-AI/Wan2.2-VACE-Fun-A14B | vace_control_video, vace_reference_image | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-Control | control_video, reference_image | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-Control-Camera | control_camera_video, input_image | code | code | code | code | code |
| Wan-AI/Wan2.1-T2V-1.3B | | code | code | code | code | code |
| Wan-AI/Wan2.1-T2V-14B | | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-480P | input_image | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-720P | input_image | code | code | code | code | code |
| Wan-AI/Wan2.1-FLF2V-14B-720P | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-Control | control_video | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-Control | control_video | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control | control_video, reference_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control | control_video, reference_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera | control_camera_video, input_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control-Camera | control_camera_video, input_image | code | code | code | code | code |
| iic/VACE-Wan2.1-1.3B-Preview | vace_control_video, vace_reference_image | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-1.3B | vace_control_video, vace_reference_image | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-14B | vace_control_video, vace_reference_image | code | code | code | code | code |
| DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1 | motion_bucket_id | code | code | code | code | code |
| krea/krea-realtime-video | | code | code | code | code | code |
| meituan-longcat/LongCat-Video | longcat_video | code | code | code | code | code |
| ByteDance/Video-As-Prompt-Wan2.1-14B | vap_video, vap_prompt | code | code | code | code | code |

More Models

Image Generation Models

Detail page: ./examples/image_synthesis/

Example images: FLUX, Stable Diffusion 3, Kolors, Hunyuan-DiT, Stable Diffusion, Stable Diffusion XL.

Video Generation Models

https://github.com/user-attachments/assets/48dd24bb-0cc6-40d2-88c3-10feed3267e9

https://github.com/user-attachments/assets/5954fdaa-a3cf-45a3-bd35-886e3cc4581b

https://github.com/user-attachments/assets/26b044c1-4a60-44a4-842f-627ff289d006

Image Quality Assessment Models

We have integrated a series of image quality assessment models. These models can be used for evaluating image generation models, alignment training, and similar tasks.

Detail page: ./examples/image_quality_metric/

Innovative Achievements

DiffSynth-Studio is not just an engineering model framework, but also a platform for incubating innovative results.

Nexus-Gen: Unified Architecture for Image Understanding, Generation, and Editing

ArtAug: Aesthetic Enhancement for Image Generation Models

Image comparison: FLUX.1-dev and FLUX.1-dev + ArtAug LoRA.

EliGen: Precise Image Region Control

Image comparison: Entity Control Mask and Generated Image.

ExVideo: Extended Training for Video Generation Models

https://github.com/modelscope/DiffSynth-Studio/assets/35051019/d97f6aa9-8064-4b5b-9d49-ed6001bb9acc

Diffutoon: High-Resolution Anime-Style Video Rendering

https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/b54c05c5-d747-4709-be5e-b39af82404dd

DiffSynth: The Initial Version of This Project

https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-4481-b79f-0c3a7361a1ea

Update History

  • June 15, 2025 ModelScope's official evaluation framework, EvalScope, now supports text-to-image generation evaluation. Try it with the Best Practices guide.

  • March 25, 2025 Our new open-source project, DiffSynth-Engine, is now released! It focuses on stable model deployment for industry, offering better engineering support, higher computational performance, and more stable functionality.

  • March 31, 2025 We support InfiniteYou, an identity preserving method for FLUX. Please refer to ./examples/InfiniteYou/ for more details.

  • March 13, 2025 We support HunyuanVideo-I2V, the image-to-video generation version of HunyuanVideo open-sourced by Tencent. Please refer to ./examples/HunyuanVideo/ for more details.

  • February 25, 2025 We support Wan-Video, a collection of SOTA video synthesis models open-sourced by Alibaba. See ./examples/wanvideo/.

  • February 17, 2025 We support StepVideo, a state-of-the-art video synthesis model! See ./examples/stepvideo.

  • December 31, 2024 We propose EliGen, a novel framework for precise entity-level controlled text-to-image generation, complemented by an inpainting fusion pipeline to extend its capabilities to image inpainting tasks. EliGen seamlessly integrates with existing community models, such as IP-Adapter and In-Context LoRA, enhancing its versatility. For more details, see ./examples/EntityControl.

  • December 19, 2024 We implement advanced VRAM management for HunyuanVideo, making it possible to generate videos at a resolution of 129x720x1280 using 24GB of VRAM, or at 129x512x384 resolution with just 6GB of VRAM. Please refer to ./examples/HunyuanVideo/ for more details.

  • December 18, 2024 We propose ArtAug, an approach designed to improve text-to-image synthesis models through synthesis-understanding interactions. We have trained an ArtAug enhancement module for FLUX.1-dev in the format of LoRA. This model integrates the aesthetic understanding of Qwen2-VL-72B into FLUX.1-dev, leading to an improvement in the quality of generated images.

  • October 25, 2024 We provide extensive FLUX ControlNet support. This project supports many different ControlNet models that can be freely combined, even if their structures differ. Additionally, ControlNet models are compatible with high-resolution refinement and partition control techniques, enabling very powerful controllable image generation. See ./examples/ControlNet/.

  • October 8, 2024. We release the extended LoRA based on CogVideoX-5B and ExVideo. You can download this model from ModelScope or HuggingFace.

  • August 22, 2024. CogVideoX-5B is supported in this project. See here. We provide several interesting features for this text-to-video model, including

    • Text to video
    • Video editing
    • Self-upscaling
    • Video interpolation
  • August 22, 2024. We have implemented an interesting painter that supports all text-to-image models. Now you can create stunning images using the painter, with assistance from AI!

  • August 21, 2024. FLUX is supported in DiffSynth-Studio.

    • Enable CFG and highres-fix to improve visual quality. See here
    • LoRA, ControlNet, and additional models will be available soon.
  • June 21, 2024. We propose ExVideo, a post-tuning technique aimed at enhancing the capability of video generation models. We have extended Stable Video Diffusion to achieve the generation of long videos up to 128 frames.

  • June 13, 2024. DiffSynth Studio is transferred to ModelScope. The developers have transitioned from "I" to "we". Of course, I will still participate in development and maintenance.

  • Jan 29, 2024. We propose Diffutoon, a fantastic solution for toon shading.

    • Project Page
    • The source code is released in this project.
    • The technical report (IJCAI 2024) is released on arXiv.
  • Dec 8, 2023. We decided to develop a new project, aiming to unlock the potential of diffusion models, especially in video synthesis. Development of this project has started.

  • Nov 15, 2023. We propose FastBlend, a powerful video deflickering algorithm.

  • Oct 1, 2023. We release an early version of this project, namely FastSDXL, a first attempt at building a diffusion engine.

    • The source code is released on GitHub.
    • FastSDXL includes a trainable OLSS scheduler for efficiency improvement.
      • The original repo of OLSS is here.
      • The technical report (CIKM 2023) is released on arXiv.
      • A demo video is shown on Bilibili.
      • Since OLSS requires additional training, we don't implement it in this project.
  • Aug 29, 2023. We propose DiffSynth, a video synthesis framework.