DiffSynth-Studio

November 17, 2025

Guidance Buffer Conditioned Video Generation for InfiniCube's Stage 2

This repo provides comprehensive support for training guidance buffer conditioned video generation for InfiniCube's Stage 2. We use Wan2.1 1.3B and Wan2.1 14B as the base models. Please refer to InfiniCube for data preparation.

Figure: guidance buffer and generated video (center)

WAN-2.1 14B Buffer Conditioned Training

We train both T2V (Text-to-Video) and I2V (Image-to-Video) models for buffer conditioned video generation. Training is limited to a single node but typically achieves good results within 1000 steps. On an 8×A100 GPU node, each iteration takes about 80 seconds, so 1000 steps take about 1 day.
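
As a quick sanity check on the figures above, the wall-clock estimate can be reproduced with a few lines of Python. This is only back-of-the-envelope arithmetic on the quoted numbers; the function name is made up for illustration, and real runs also spend time on checkpointing and data loading.

```python
# Back-of-the-envelope wall-clock estimate from the figures quoted above.
SECONDS_PER_ITERATION = 80  # approximate, on one 8xA100 node
TRAINING_STEPS = 1000


def estimate_training_hours(steps: int, seconds_per_step: float) -> float:
    """Return the estimated wall-clock training time in hours."""
    return steps * seconds_per_step / 3600


hours = estimate_training_hours(TRAINING_STEPS, SECONDS_PER_ITERATION)
print(f"{hours:.1f} hours")  # prints "22.2 hours"
```

That is roughly 22 hours, which matches the "about 1 day" figure.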

T2V Training (Text-to-Video):

bash slurm/singlenode/train_wan_14B_deepspeed.sh

I2V Training (Image-to-Video):

bash slurm/singlenode/train_wan_14B_deepspeed_i2v.sh

Remember to update --dataset_base_path and --dataset_metadata_path to point to your own data.

WAN-2.1 1.3B Buffer Conditioned Training

The 1.3B model supports both multi-GPU and multi-node training for scalable performance. However, our implementation, which simply adds the buffer condition, does not work well at the 1.3B model size, so we use Wan2.1 14B instead.

Quick Start - Single Node:

bash slurm/singlenode/train_wan_1.3B_with_buffer_differential_lr.sh
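
The script name refers to differential learning rates, meaning the base model and the buffer embedder are optimized with separate learning rates. A minimal PyTorch sketch of that idea uses optimizer parameter groups. The toy model, the attribute names `base_model` and `buffer_embedder`, and both learning-rate values are assumptions for illustration; they are not taken from the training script.

```python
import torch
import torch.nn as nn


class ToyBufferConditionedModel(nn.Module):
    """Toy stand-in for a pretrained backbone plus a buffer embedder."""

    def __init__(self):
        super().__init__()
        self.base_model = nn.Linear(64, 64)      # placeholder for the pretrained model
        self.buffer_embedder = nn.Linear(8, 64)  # placeholder for the buffer embedder

    def forward(self, tokens: torch.Tensor, buffer: torch.Tensor) -> torch.Tensor:
        return self.base_model(tokens + self.buffer_embedder(buffer))


model = ToyBufferConditionedModel()

# Hypothetical values: a small LR for pretrained weights, a larger one for the embedder.
BASE_LR = 1e-5
BUFFER_EMBEDDER_LR = 1e-4

# Each dict is one parameter group with its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": model.base_model.parameters(), "lr": BASE_LR},
    {"params": model.buffer_embedder.parameters(), "lr": BUFFER_EMBEDDER_LR},
])

for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"])
```

Keeping the two groups separate lets a scheduler or a config file adjust either rate without touching the other.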

Multi-Node Training:

bash slurm/launchers/slurm_srun_repeat.sh <config_file> -n <nodes> -a <account>

Remember to update --dataset_base_path and --dataset_metadata_path to point to your own data. Use zero initialization (zero-init), not random initialization.
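
To illustrate why zero-init matters when a buffer condition is added to a pretrained model, here is a minimal PyTorch sketch. It assumes the condition is injected by projecting the buffer and adding the result to the model's tokens. The class `AdditiveBufferEmbedder`, its dimensions, and the tensor shapes are hypothetical and not taken from this repo.

```python
import torch
import torch.nn as nn


class AdditiveBufferEmbedder(nn.Module):
    """Hypothetical embedder: projects buffer features and adds them to the tokens."""

    def __init__(self, buffer_dim: int, hidden_dim: int, zero_init: bool = True):
        super().__init__()
        self.proj = nn.Linear(buffer_dim, hidden_dim)
        if zero_init:
            # Zero weight and bias: the branch outputs exactly 0 at step 0,
            # so the pretrained model's behavior is unchanged when training starts.
            nn.init.zeros_(self.proj.weight)
            nn.init.zeros_(self.proj.bias)

    def forward(self, tokens: torch.Tensor, buffer: torch.Tensor) -> torch.Tensor:
        return tokens + self.proj(buffer)


torch.manual_seed(0)
tokens = torch.randn(2, 16, 64)  # (batch, sequence, hidden)
buffer = torch.randn(2, 16, 8)   # (batch, sequence, buffer channels)

zero_branch = AdditiveBufferEmbedder(8, 64, zero_init=True)
random_branch = AdditiveBufferEmbedder(8, 64, zero_init=False)

print(torch.equal(zero_branch(tokens, buffer), tokens))    # True: identity at initialization
print(torch.equal(random_branch(tokens, buffer), tokens))  # False: tokens are perturbed
```

With zero-init the projection still receives non-zero gradients, because the gradient of its weight depends on the buffer input rather than on the weight itself, so the branch can learn from the first step without disturbing the pretrained features.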

WAN-2.1 Buffer Conditioned Inference

Wan2.1 1.3B Buffer Conditioned Inference:

python examples/wanvideo/model_inference/Wan2.1-1.3B-Buffer-Control.py --checkpoint_path <checkpoint_path>

Wan2.1 14B Buffer Conditioned Inference:

python examples/wanvideo/model_inference/Wan2.1-T2V-14B-Buffer-Control.py --checkpoint_path <checkpoint_path>
python examples/wanvideo/model_inference/Wan2.1-I2V-14B-Buffer-Control.py --checkpoint_path <checkpoint_path>

Key Features

Multi-Node Training: Supports multi-node training for scalable performance (not yet compatible with DeepSpeed)
Differential Learning Rates: Separate learning rates for base model and buffer embedder
Auto-Resume: Automatic checkpoint resumption for interrupted training
Visualization: A visualization step runs during training to track progress
Buffer Conditioning: Semantic and coordinate buffer inputs for precise control
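
The auto-resume feature listed above can be pictured as "find the newest checkpoint in the output directory and continue from it". The sketch below shows one way to do the lookup with the standard library only. The `step-<N>.safetensors` naming scheme and the helper name `find_latest_checkpoint` are assumptions for illustration; the training scripts in this repo may name and locate checkpoints differently.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoints are saved as "step-<N>.safetensors".
CHECKPOINT_PATTERN = re.compile(r"^step-(\d+)\.safetensors$")


def find_latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    candidates = []
    for path in output_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare step numbers as integers so that step 1000 ranks above step 200.
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory with three fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for step in (100, 200, 1000):
        (run_dir / f"step-{step}.safetensors").touch()
    latest = find_latest_checkpoint(run_dir)
    print(latest.name)  # prints "step-1000.safetensors"
```

A launcher would call such a helper before training and fall back to a fresh start when it returns None.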

Introduction

Welcome to the magical world of diffusion models! DiffSynth-Studio is an open-source diffusion model engine developed and maintained by the ModelScope team. We aim to foster technical innovation through framework development, bring together the power of the open-source community, and explore the limits of generative models!

DiffSynth currently includes two open-source projects:

DiffSynth-Studio: Focused on aggressive technical exploration, for academia, providing support for more cutting-edge model capabilities.
DiffSynth-Engine: Focused on stable model deployment, for industry, offering higher computing performance and more stable features.

DiffSynth-Studio and DiffSynth-Engine are the core projects behind the ModelScope AIGC zone, offering powerful AI content generation abilities. Come and try our carefully designed features and start your AI creation journey!

Installation

Install from source (recommended):

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

Other installation methods

Install from PyPI (version updates may be delayed; for latest features, install from source)

pip install diffsynth

If you encounter problems during installation, they may be caused by upstream dependencies. Please check the documentation of the affected packages.

Basic Framework

DiffSynth-Studio redesigns the inference and training pipelines for mainstream Diffusion models (including FLUX, Wan, etc.), enabling efficient memory management and flexible model training.

Qwen-Image Series (🔥New Model)

Detail page: ./examples/qwen_image/

Quick Start

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
prompt = "A detailed portrait of a girl underwater, wearing a blue flowing dress, hair gently floating, clear light and shadow, surrounded by bubbles, calm expression, fine details, dreamy and beautiful."
image = pipe(
    prompt, seed=0, num_inference_steps=40,
    # edit_image=Image.open("xxx.jpg").resize((1328, 1328)) # For Qwen-Image-Edit
)
image.save("image.jpg")

Model Overview

Model ID	Inference	Low VRAM Inference	Full Training	Validation after Full Training	LoRA Training	Validation after LoRA Training
Qwen/Qwen-Image	code	code	code	code	code	code
Qwen/Qwen-Image-Edit	code	code	code	code	code	code
Qwen/Qwen-Image-Edit-2509	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-EliGen-V2	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-EliGen-Poster	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Distill-Full	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Distill-LoRA	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-EliGen	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-In-Context-Control-Union	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix	code	code	-	-	-	-

FLUX Series

Detail page: ./examples/flux/

Quick Start

import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig

pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)

image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")

Model Overview

Model ID	Extra Parameters	Inference	Low VRAM Inference	Full Training	Validate After Full Training	LoRA Training	Validate After LoRA Training
FLUX.1-dev		code	code	code	code	code	code
FLUX.1-Krea-dev		code	code	code	code	code	code
FLUX.1-Kontext-dev	`kontext_images`	code	code	code	code	code	code
FLUX.1-dev-Controlnet-Inpainting-Beta	`controlnet_inputs`	code	code	code	code	code	code
FLUX.1-dev-Controlnet-Union-alpha	`controlnet_inputs`	code	code	code	code	code	code
FLUX.1-dev-Controlnet-Upscaler	`controlnet_inputs`	code	code	code	code	code	code
FLUX.1-dev-IP-Adapter	`ipadapter_images`, `ipadapter_scale`	code	code	code	code	code	code
FLUX.1-dev-InfiniteYou	`infinityou_id_image`, `infinityou_guidance`, `controlnet_inputs`	code	code	code	code	code	code
FLUX.1-dev-EliGen	`eligen_entity_prompts`, `eligen_entity_masks`, `eligen_enable_on_negative`, `eligen_enable_inpaint`	code	code	-	-	code	code
FLUX.1-dev-LoRA-Encoder	`lora_encoder_inputs`, `lora_encoder_scale`	code	code	code	code	-	-
FLUX.1-dev-LoRA-Fusion-Preview		code	-	-	-	-	-
Step1X-Edit	`step1x_reference_image`	code	code	code	code	code	code
FLEX.2-preview	`flex_inpaint_image`, `flex_inpaint_mask`, `flex_control_image`, `flex_control_strength`, `flex_control_stop`	code	code	code	code	code	code
Nexus-Gen	`nexus_gen_reference_image`	code	code	code	code	code	code

Wan Series

Detail page: ./examples/wanvideo/

https://github.com/user-attachments/assets/1d66ae74-3b02-40a9-acc3-ea95fc039314

Quick Start

import torch
from diffsynth import save_video
from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", offload_device="cpu"),
    ],
)
pipe.enable_vram_management()

video = pipe(
    prompt="A documentary photography style scene: a lively puppy rapidly running on green grass. The puppy has brown-yellow fur, upright ears, and looks focused and joyful. Sunlight shines on its body, making the fur appear soft and shiny. The background is an open field with occasional wildflowers, and faint blue sky and clouds in the distance. Strong sense of perspective captures the motion of the puppy and the vitality of the surrounding grass. Mid-shot side-moving view.",
    negative_prompt="Bright colors, overexposed, static, blurry details, subtitles, style, artwork, image, still, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, deformed, extra fingers, poorly drawn hands, poorly drawn face, malformed limbs, fused fingers, still frame, messy background, three legs, crowded background people, walking backwards",
    seed=0, tiled=True,
)
save_video(video, "video1.mp4", fps=15, quality=5)

Model Overview

Model ID	Extra Parameters	Inference	Full Training	Validate After Full Training	LoRA Training	Validate After LoRA Training
Wan-AI/Wan2.2-Animate-14B	`input_image`, `animate_pose_video`, `animate_face_video`, `animate_inpaint_video`, `animate_mask_video`	code	code	code	code	code
Wan-AI/Wan2.2-S2V-14B	`input_image`, `input_audio`, `audio_sample_rate`, `s2v_pose_video`	code	code	code	code	code
Wan-AI/Wan2.2-I2V-A14B	`input_image`	code	code	code	code	code
Wan-AI/Wan2.2-T2V-A14B		code	code	code	code	code
Wan-AI/Wan2.2-TI2V-5B	`input_image`	code	code	code	code	code
Wan-AI/Wan2.2-VACE-Fun-A14B	`vace_control_video`, `vace_reference_image`	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-InP	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-Control	`control_video`, `reference_image`	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-Control-Camera	`control_camera_video`, `input_image`	code	code	code	code	code
Wan-AI/Wan2.1-T2V-1.3B		code	code	code	code	code
Wan-AI/Wan2.1-T2V-14B		code	code	code	code	code
Wan-AI/Wan2.1-I2V-14B-480P	`input_image`	code	code	code	code	code
Wan-AI/Wan2.1-I2V-14B-720P	`input_image`	code	code	code	code	code
Wan-AI/Wan2.1-FLF2V-14B-720P	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.1-Fun-1.3B-InP	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.1-Fun-1.3B-Control	`control_video`	code	code	code	code	code
PAI/Wan2.1-Fun-14B-InP	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.1-Fun-14B-Control	`control_video`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-Control	`control_video`, `reference_image`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-Control	`control_video`, `reference_image`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-InP	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-InP	`input_image`, `end_image`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera	`control_camera_video`, `input_image`	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-Control-Camera	`control_camera_video`, `input_image`	code	code	code	code	code
iic/VACE-Wan2.1-1.3B-Preview	`vace_control_video`, `vace_reference_image`	code	code	code	code	code
Wan-AI/Wan2.1-VACE-1.3B	`vace_control_video`, `vace_reference_image`	code	code	code	code	code
Wan-AI/Wan2.1-VACE-14B	`vace_control_video`, `vace_reference_image`	code	code	code	code	code
DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1	`motion_bucket_id`	code	code	code	code	code
krea/krea-realtime-video		code	code	code	code	code
meituan-longcat/LongCat-Video	`longcat_video`	code	code	code	code	code
ByteDance/Video-As-Prompt-Wan2.1-14B	`vap_video`, `vap_prompt`	code	code	code	code	code

More Models

Image Generation Models

Detail page: ./examples/image_synthesis/

FLUX	Stable Diffusion 3

Kolors	Hunyuan-DiT

Stable Diffusion	Stable Diffusion XL

Video Generation Models

HunyuanVideo: ./examples/HunyuanVideo/

https://github.com/user-attachments/assets/48dd24bb-0cc6-40d2-88c3-10feed3267e9

StepVideo: ./examples/stepvideo/

https://github.com/user-attachments/assets/5954fdaa-a3cf-45a3-bd35-886e3cc4581b

CogVideoX: ./examples/CogVideoX/

https://github.com/user-attachments/assets/26b044c1-4a60-44a4-842f-627ff289d006

Image Quality Assessment Models

We have integrated a series of image quality assessment models. These models can be used for evaluating image generation models, alignment training, and similar tasks.

Detail page: ./examples/image_quality_metric/

Innovative Achievements

DiffSynth-Studio is not just an engineering model framework, but also a platform for incubating innovative results.

Nexus-Gen: Unified Architecture for Image Understanding, Generation, and Editing

Detail page: https://github.com/modelscope/Nexus-Gen
Paper: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Model: ModelScope, HuggingFace
Dataset: ModelScope Dataset
Online Demo: ModelScope Nexus-Gen Studio

ArtAug: Aesthetic Enhancement for Image Generation Models

Detail page: ./examples/ArtAug/
Paper: ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
Model: ModelScope, HuggingFace
Online Demo: ModelScope AIGC Tab

FLUX.1-dev	FLUX.1-dev + ArtAug LoRA

EliGen: Precise Image Region Control

Detail page: ./examples/EntityControl/
Paper: EliGen: Entity-Level Controlled Image Generation with Regional Attention
Model: ModelScope, HuggingFace
Online Demo: ModelScope EliGen Studio
Dataset: EliGen Train Set

Entity Control Mask	Generated Image

ExVideo: Extended Training for Video Generation Models

Project Page: Project Page
Paper: ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Code Example: ./examples/ExVideo/
Model: ModelScope, HuggingFace

https://github.com/modelscope/DiffSynth-Studio/assets/35051019/d97f6aa9-8064-4b5b-9d49-ed6001bb9acc

Diffutoon: High-Resolution Anime-Style Video Rendering

Project Page: Project Page
Paper: Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
Code Example: ./examples/Diffutoon/

https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/b54c05c5-d747-4709-be5e-b39af82404dd

DiffSynth: The Initial Version of This Project

Project Page: Project Page
Paper: DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
Code Example: ./examples/diffsynth/

https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-4481-b79f-0c3a7361a1ea

Update History

November 4, 2025: We support ByteDance/Video-As-Prompt-Wan2.1-14B model, which is trained on Wan 2.1 and enables motion generation conditioned on reference videos.
October 30, 2025: We support meituan-longcat/LongCat-Video model, which enables text-to-video, image-to-video, and video continuation capabilities. This model adopts Wan's framework for both inference and training in this project.
October 27, 2025: We support krea/krea-realtime-video model, further expanding Wan's ecosystem.
September 23, 2025: DiffSynth-Studio/Qwen-Image-EliGen-Poster is released! This model is jointly developed and open-sourced by us and the Taobao Design Team. The model is built upon Qwen-Image, specifically designed for e-commerce poster scenarios, and supports precise partition layout control. Please refer to our example code.
September 9, 2025: Our training framework now supports multiple training modes and has been adapted for Qwen-Image. In addition to the standard SFT training mode, Direct Distill is now also supported; please refer to our example code. This feature is experimental, and we will continue to improve it to support comprehensive model training capabilities.
August 28, 2025: We support Wan2.2-S2V, an audio-driven cinematic video generation model open-sourced by Alibaba. See ./examples/wanvideo/.
August 21, 2025: DiffSynth-Studio/Qwen-Image-EliGen-V2 is released! Compared to the V1 version, the training dataset has been updated to the Qwen-Image-Self-Generated-Dataset, enabling generated images to better align with the inherent image distribution and style of Qwen-Image. Please refer to our sample code.
August 21, 2025: We open-sourced the DiffSynth-Studio/Qwen-Image-In-Context-Control-Union structure control LoRA model. Following "In Context" routine, it supports various types of structural control conditions, including canny, depth, lineart, softedge, normal, and openpose. Please refer to our sample code.
August 20, 2025: We open-sourced DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix, which improves the editing performance of Qwen-Image-Edit on low-resolution image inputs. Please refer to our example code.
August 19, 2025: 🔥 Qwen-Image-Edit is now open source. Welcome the new member to the image editing model family!
August 18, 2025: We trained and open-sourced the Inpaint ControlNet model for Qwen-Image, DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint, which adopts a lightweight architectural design. Please refer to our sample code.
August 15, 2025: We open-sourced the Qwen-Image-Self-Generated-Dataset. This is an image dataset generated using the Qwen-Image model, with a total of 160,000 1024 x 1024 images. It includes the general, English text rendering, and Chinese text rendering subsets. We provide caption, entity, and control image annotations for each image. Developers can use this dataset to train models such as ControlNet and EliGen for the Qwen-Image model. We aim to promote technological development through open-source contributions!
August 13, 2025: We trained and open-sourced the ControlNet model for Qwen-Image, DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth, which adopts a lightweight architectural design. Please refer to our sample code.
August 12, 2025: We trained and open-sourced the ControlNet model for Qwen-Image, DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny, which adopts a lightweight architectural design. Please refer to our sample code.
August 11, 2025: We released another distilled acceleration model for Qwen-Image, DiffSynth-Studio/Qwen-Image-Distill-LoRA. It uses the same training process as DiffSynth-Studio/Qwen-Image-Distill-Full, but the model structure is changed to LoRA. This makes it work better with other open-source models.
August 7, 2025: We open-sourced the entity control LoRA of Qwen-Image, DiffSynth-Studio/Qwen-Image-EliGen. Qwen-Image-EliGen is able to achieve entity-level controlled text-to-image generation. See the paper for technical details. Training dataset: EliGenTrainSet.
August 5, 2025: We open-sourced the distilled acceleration model of Qwen-Image, DiffSynth-Studio/Qwen-Image-Distill-Full, achieving approximately 5x speedup.
August 4, 2025: 🔥 Qwen-Image is now open source. Welcome the new member to the image generation model family!
August 1, 2025: FLUX.1-Krea-dev with a focus on aesthetic photography is comprehensively supported, including low-GPU-memory layer-by-layer offload, LoRA training and full training. See ./examples/flux/.
July 28, 2025: With the open-sourcing of Wan 2.2, we immediately provided comprehensive support, including low-GPU-memory layer-by-layer offload, FP8 quantization, sequence parallelism, LoRA training, and full training. See ./examples/wanvideo/.
July 11, 2025: We propose Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. This framework enables seamless image understanding, generation, and editing tasks.
- Paper: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
- Github Repo: https://github.com/modelscope/Nexus-Gen
- Model: ModelScope, HuggingFace
- Training Dataset: ModelScope Dataset
- Online Demo: ModelScope Nexus-Gen Studio

June 15, 2025: ModelScope's official evaluation framework, EvalScope, now supports text-to-image generation evaluation. Try it with the Best Practices guide.
March 31, 2025: We support InfiniteYou, an identity preserving method for FLUX. Please refer to ./examples/InfiniteYou/ for more details.
March 25, 2025: Our new open-source project, DiffSynth-Engine, is now available! It focuses on stable model deployment for industry, offering better engineering support, higher computational performance, and more stable functionality.
March 13, 2025: We support HunyuanVideo-I2V, the image-to-video generation version of HunyuanVideo open-sourced by Tencent. Please refer to ./examples/HunyuanVideo/ for more details.
February 25, 2025: We support Wan-Video, a collection of SOTA video synthesis models open-sourced by Alibaba. See ./examples/wanvideo/.
February 17, 2025: We support StepVideo, a state-of-the-art video synthesis model! See ./examples/stepvideo.
December 31, 2024: We propose EliGen, a novel framework for precise entity-level controlled text-to-image generation, complemented by an inpainting fusion pipeline to extend its capabilities to image inpainting tasks. EliGen seamlessly integrates with existing community models, such as IP-Adapter and In-Context LoRA, enhancing its versatility. For more details, see ./examples/EntityControl.
- Paper: EliGen: Entity-Level Controlled Image Generation with Regional Attention
- Model: ModelScope, HuggingFace
- Online Demo: ModelScope EliGen Studio
- Training Dataset: EliGen Train Set
December 19, 2024: We implement advanced VRAM management for HunyuanVideo, making it possible to generate videos at a resolution of 129x720x1280 using 24GB of VRAM, or at 129x512x384 resolution with just 6GB of VRAM. Please refer to ./examples/HunyuanVideo/ for more details.
December 18, 2024: We propose ArtAug, an approach designed to improve text-to-image synthesis models through synthesis-understanding interactions. We have trained an ArtAug enhancement module for FLUX.1-dev in the format of LoRA. This model integrates the aesthetic understanding of Qwen2-VL-72B into FLUX.1-dev, leading to an improvement in the quality of generated images.
- Paper: https://arxiv.org/abs/2412.12888
- Examples: https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/ArtAug
- Model: ModelScope, HuggingFace
- Demo: ModelScope, HuggingFace (Coming soon)
October 25, 2024: We provide extensive FLUX ControlNet support. This project supports many different ControlNet models that can be freely combined, even if their structures differ. Additionally, ControlNet models are compatible with high-resolution refinement and partition control techniques, enabling very powerful controllable image generation. See ./examples/ControlNet/.
October 8, 2024. We release the extended LoRA based on CogVideoX-5B and ExVideo. You can download this model from ModelScope or HuggingFace.
August 22, 2024. CogVideoX-5B is supported in this project. See here. We provide several interesting features for this text-to-video model, including
- Text to video
- Video editing
- Self-upscaling
- Video interpolation
August 22, 2024. We have implemented an interesting painter that supports all text-to-image models. Now you can create stunning images using the painter, with assistance from AI!
- Use it in our WebUI.
August 21, 2024. FLUX is supported in DiffSynth-Studio.
- Enable CFG and highres-fix to improve visual quality. See here
- LoRA, ControlNet, and additional models will be available soon.
June 21, 2024. We propose ExVideo, a post-tuning technique aimed at enhancing the capability of video generation models. We have extended Stable Video Diffusion to achieve the generation of long videos up to 128 frames.
- Project Page
- Source code is released in this repo. See examples/ExVideo.
- Models are released on HuggingFace and ModelScope.
- Technical report is released on arXiv.
- You can try ExVideo in this Demo!
June 13, 2024. DiffSynth Studio is transferred to ModelScope. The developers have transitioned from "I" to "we". Of course, I will still participate in development and maintenance.
Jan 29, 2024. We propose Diffutoon, a fantastic solution for toon shading.
- Project Page
- The source code is released in this project.
- The technical report (IJCAI 2024) is released on arXiv.
Dec 8, 2023. We decided to develop a new project aimed at unlocking the potential of diffusion models, especially in video synthesis. Development of this project has started.
Nov 15, 2023. We propose FastBlend, a powerful video deflickering algorithm.
- The sd-webui extension is released on GitHub.
- Demo videos are shown on Bilibili, including three tasks.
- The technical report is released on arXiv.
- An unofficial ComfyUI extension developed by other users is released on GitHub.
Oct 1, 2023. We release an early version of this project, namely FastSDXL, an attempt at building a diffusion engine.
- The source code is released on GitHub.
- FastSDXL includes a trainable OLSS scheduler for efficiency improvement.
  - The original repo of OLSS is here.
  - The technical report (CIKM 2023) is released on arXiv.
  - A demo video is shown on Bilibili.
  - Since OLSS requires additional training, we don't implement it in this project.
Aug 29, 2023. We propose DiffSynth, a video synthesis framework.
- Project Page.
- The source code is released in EasyNLP.
- The technical report (ECML PKDD 2024) is released on arXiv.