November 17, 2025

ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Jay Zhangjie Wu*    Xuanchi Ren*    Tianchang Shen    Tianshi Cao    Kai He
Yifan Lu    Ruiyuan Gao    Enze Xie    Shiyi Lan    Jose M. Alvarez
Jun Gao    Sanja Fidler    Zian Wang    Huan Ling*†

* Equal contribution    † Corresponding author

📖 Project Page | 🤗 ChronoEdit-14B | 📑 Arxiv

TL;DR: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

🫨 News

2025/11/16: 👋 ChronoEdit-14B-Diffusers-Paint-Brush-Lora is released on 🤗 HuggingFace. Thanks to @AK for hosting the 🤗 demo.
2025/11/10: 👋 ChronoEdit-14B-Diffusers-Upscaler-Lora is released on 🤗 HuggingFace. Thanks to @AK for hosting the 🤗 demo.
2025/11/10: 👋 ChronoEdit is officially merged into diffusers Pipeline.
2025/10/29: 👋 ChronoEdit-14B is released on 🤗 HuggingFace!
2025/10/04: 👋 ChronoEdit paper is released.

🤗 Open Source Plan

ChronoEdit
- Inference with Diffusers
- LoRA Training with DiffSynth-Studio
- ChronoEdit-14B Checkpoints
- ChronoEdit-14B 8-Steps Distilled LoRA Checkpoints
- ChronoEdit-2B Checkpoints
- ChronoEdit-2B 4-Steps Distilled LoRA Checkpoints
- Full Model Training Infrastructure

📑 Quick Start

Installation

Clone the repo:

git clone https://github.com/nv-tlabs/ChronoEdit
cd ChronoEdit

This repo runs only on Linux and requires Python 3.10. Create the environment and install the dependencies:

conda env create -f environment.yml -n chronoedit_mini
conda activate chronoedit_mini
pip install torch==2.7.1 torchvision==0.22.1
pip install -r requirements_minimal.txt
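Before installing, it can help to confirm the two requirements stated above (Linux and Python 3.10). The following is a minimal sketch and not part of this repo; the function name `environment_problems` is hypothetical.

```python
import platform
import sys


def environment_problems(system=None, version=None):
    """Return a list of human-readable problems; an empty list means OK.

    `system` and `version` default to the running interpreter's values.
    They are parameters only so the check can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])

    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (this README states Linux only)")
    if version != (3, 10):
        problems.append(
            f"Python {version[0]}.{version[1]} found (this README requires 3.10)"
        )
    return problems


# Report anything that does not match the stated requirements.
for problem in environment_problems():
    print("WARNING:", problem)
```

An empty result only means the OS and interpreter match; it does not verify CUDA or the installed package versions.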

Optional: Install flash attention with cudatoolkit if you want faster inference. The model can run without it.

# You may need to set a limit to the number of threads used during compilation to prevent OOM Errors.
export MAX_JOBS=16
pip install flash-attn==2.6.3

Download diffusers checkpoint from HuggingFace:

hf download nvidia/ChronoEdit-14B-Diffusers --local-dir checkpoints/ChronoEdit-14B-Diffusers
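The same download can be scripted from Python with `huggingface_hub.snapshot_download`, the library that ships the `hf` CLI. This is a minimal sketch, assuming `huggingface_hub` is installed; the helper names `local_dir_for` and `download` are hypothetical and not part of this repo.

```python
from pathlib import Path

# Repositories named in this README.
REPOS = [
    "nvidia/ChronoEdit-14B-Diffusers",
    "nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora",
    "nvidia/ChronoEdit-14B-Diffusers-Paint-Brush-Lora",
]


def local_dir_for(repo_id, root="checkpoints"):
    """Map 'nvidia/<name>' to '<root>/<name>', matching the --local-dir layout."""
    return Path(root) / repo_id.split("/", 1)[1]


def download(repo_id, root="checkpoints"):
    """Download one repository snapshot (requires network access)."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Example (not executed here): download(REPOS[0])
```

Keeping the directory layout identical to the CLI commands means the `--model-path` and `--lora-path` values used in the rest of this README work unchanged.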

Diffusers Inference 🤗

Note

2025/11/10 Update: ChronoEdit is officially merged into diffusers. Check out the official pipeline at LINK

(1) Single GPU Inference

Run inference with default hyperparameters.

PYTHONPATH=$(pwd) python scripts/run_inference_diffusers.py \
--input assets/images/input_2.png --offload_model --use-prompt-enhancer \
--prompt "Add a sunglasses to the cat's face"  \
--output output.mp4 \
--model-path ./checkpoints/ChronoEdit-14B-Diffusers

Append the flag --enable-temporal-reasoning to enable temporal reasoning for better consistency.

Note

Inference requires about 34 GB of GPU memory with the --offload_model flag turned on.

In temporal reasoning mode, the GPU memory requirement increases to about 38 GB.
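The memory figures above can be turned into a quick pre-flight check. This is a minimal sketch that treats the stated numbers as approximate thresholds; the names `REQUIRED_GB`, `fits_on_gpu`, and `detected_gpu_memory_gb` are hypothetical and not part of this repo.

```python
# Approximate peak GPU memory from the notes above, in GB, with --offload_model on.
REQUIRED_GB = {"default": 34, "temporal_reasoning": 38}


def fits_on_gpu(total_gb, temporal_reasoning=False):
    """Return True if a GPU with `total_gb` of memory meets the stated figure."""
    mode = "temporal_reasoning" if temporal_reasoning else "default"
    return total_gb >= REQUIRED_GB[mode]


def detected_gpu_memory_gb(device=0):
    """Return the total memory of a CUDA device in GB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # total_memory is reported in bytes.
    return torch.cuda.get_device_properties(device).total_memory / 1e9


total = detected_gpu_memory_gb()
if total is not None and not fits_on_gpu(total, temporal_reasoning=True):
    print(f"GPU has {total:.0f} GB; temporal reasoning may not fit.")
```

The check compares total device memory only, so other processes already using the GPU can still cause an out-of-memory error.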

(2) Inference with Prompt Enhancer

Append the flag --use-prompt-enhancer to turn on the automatic prompt enhancer.

You can adjust the --prompt_enhancer_model flag to select a different model. By default, we recommend Qwen/Qwen3-VL-30B-A3B-Instruct, which delivers the best results but requires up to 60 GB of peak memory. Smaller vision-language models are supported as well, though they may produce lower-quality outputs.

Note

We strongly suggest that users read the Prompt Guidance and use the provided prompt enhancer for best results.

Note

If you prefer not to host the prompt enhancer locally, you can use the provided System prompt with any modern online LLM chat agent.

(3) Inference with 8-Step Distillation LoRA

With the distillation LoRA, we recommend setting the hyperparameters to --flow-shift 2.0, --guidance-scale 1.0, and --num-inference-steps 8.

# Advanced usage with lora settings
PYTHONPATH=$(pwd) python scripts/run_inference_diffusers.py --use-prompt-enhancer --offload_model \
--input assets/images/input_2.png \
--prompt "Add a sunglasses to the cat's face"  \
--output output_lora.mp4 \
--num-inference-steps 8 \
--guidance-scale 1.0 \
--flow-shift 2.0 \
--lora-scale 1.0 \
--seed 42 \
--lora-path ./checkpoints/ChronoEdit-14B-Diffusers/lora/chronoedit_distill_lora.safetensors \
--model-path ./checkpoints/ChronoEdit-14B-Diffusers

(4) Inference with other LoRAs

ChronoEdit-14B-Diffusers-Upscaler-Lora 🤗

Trigger Prompt: The user want to enhance image clarity and resolution while keeping the content identical. super-resolution, high detail, 4K clarity, same composition, natural texture.

hf download nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora --local-dir checkpoints/ChronoEdit-14B-Diffusers-Upscaler-Lora

The model has been tested at resolutions up to 2K.

PYTHONPATH=$(pwd) python scripts/run_inference_diffusers.py \
    --input assets/images/lr.png --width 1584 --height 1056 \
    --prompt "The user want to enhance image clarity and resolution while keeping the content identical. super-resolution, high detail, 4K clarity, same composition, natural texture."  \
    --output output_sr_lora.mp4 \
    --lora-scale 1.0 \
    --seed 42 \
    --lora-path ./checkpoints/ChronoEdit-14B-Diffusers-Upscaler-Lora/upsample_lora_diffusers.safetensors \
    --model-path ./checkpoints/ChronoEdit-14B-Diffusers

ChronoEdit-14B-Diffusers-Paint-Brush-Lora 🤗

Trigger Prompt: Turn the pencil sketch in the image into an actual object that is consistent with the image’s content. The user wants to change the sketch to a {}

Fill {} with a simple description of what you are drawing, e.g. a crown and hat that matches the original image’s style.
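Filling the {} placeholder in the trigger prompt can be scripted. This is a minimal sketch; the names `PAINTBRUSH_TEMPLATE` and `paintbrush_prompt` are hypothetical, and the trailing period follows the example prompt used in the inference command in this section.

```python
# Trigger prompt for the paintbrush LoRA, with {} as the description slot.
PAINTBRUSH_TEMPLATE = (
    "Turn the pencil sketch in the image into an actual object that is "
    "consistent with the image’s content. The user wants to change the "
    "sketch to a {}."
)


def paintbrush_prompt(description):
    """Insert a short description of the drawn object into the trigger prompt."""
    # Drop surrounding whitespace and a trailing period to avoid "..".
    description = description.strip().rstrip(".")
    if not description:
        raise ValueError("description must not be empty")
    return PAINTBRUSH_TEMPLATE.format(description)


print(paintbrush_prompt("crown and a hat"))
```

The result can be passed directly as the `--prompt` value.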

Note

The LoRA was trained with a black paintbrush. Sketches in other colors may also work, but results are worse than with black.

hf download nvidia/ChronoEdit-14B-Diffusers-Paint-Brush-Lora --local-dir checkpoints/ChronoEdit-14B-Diffusers-Paint-Brush-Lora

Note

We recommend using the paintbrush LoRA together with the 8-step distillation LoRA. In our test cases, this works better than the paintbrush LoRA alone.

PYTHONPATH=$(pwd) python scripts/run_inference_diffusers.py \
    --input assets/images/input_paintbrush.png \
    --prompt "Turn the pencil sketch in the image into an actual object that is consistent with the image’s content. The user wants to change the sketch to a crown and a hat."  \
    --output output_paintbrush_lora.png \
    --num-inference-steps 8 \
    --guidance-scale 1.0 \
    --flow-shift 2.0 \
    --lora-scale 1.0 \
    --seed 42 \
    --lora-path ./checkpoints/ChronoEdit-14B-Diffusers/lora/chronoedit_distill_lora.safetensors ./checkpoints/ChronoEdit-14B-Diffusers-Paint-Brush-Lora/paintbrush_lora_diffusers.safetensors \
    --model-path ./checkpoints/ChronoEdit-14B-Diffusers

Gradio Demo with interactive brush:

PYTHONPATH=$(pwd) python scripts/gradio_paintbrush.py

(5) Inference with multiple LoRAs

For example, to use both distill LoRA and paintbrush LoRA:

PYTHONPATH=$(pwd) python scripts/run_inference_diffusers.py \
    --input assets/images/input_paintbrush.png \
    --prompt "Turn the pencil sketch in the image into an actual object that is consistent with the image’s content. The user wants to change the sketch to a crown and a hat."  \
    --output output_paintbrush_lora.png \
    --num-inference-steps 8 \
    --guidance-scale 1.0 \
    --flow-shift 2.0 \
    --lora-scale 1.0 \
    --seed 42 \
    --lora-path ./checkpoints/ChronoEdit-14B-Diffusers/lora/chronoedit_distill_lora.safetensors ./checkpoints/ChronoEdit-14B-Diffusers-Paint-Brush-Lora/paintbrush_lora_diffusers.safetensors \
    --model-path ./checkpoints/ChronoEdit-14B-Diffusers
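When switching between LoRA combinations often, the argument list can be assembled in Python. This is a minimal sketch that only uses flags appearing in the commands of this README; the function name `build_inference_command` and its parameters are hypothetical.

```python
import shlex


def build_inference_command(
    input_path,
    prompt,
    output_path,
    lora_paths=(),
    model_path="./checkpoints/ChronoEdit-14B-Diffusers",
    distilled=True,
    lora_scale=1.0,
    seed=42,
):
    """Build the argv list for scripts/run_inference_diffusers.py."""
    cmd = [
        "python", "scripts/run_inference_diffusers.py",
        "--input", input_path,
        "--prompt", prompt,
        "--output", output_path,
    ]
    if distilled:
        # Settings recommended in this README for the 8-step distillation LoRA.
        cmd += ["--num-inference-steps", "8",
                "--guidance-scale", "1.0",
                "--flow-shift", "2.0"]
    cmd += ["--seed", str(seed)]
    if lora_paths:
        # Several LoRAs are passed as space-separated values of one --lora-path.
        cmd += ["--lora-scale", str(lora_scale), "--lora-path", *lora_paths]
    cmd += ["--model-path", model_path]
    return cmd


cmd = build_inference_command(
    "assets/images/input_paintbrush.png",
    "Turn the pencil sketch in the image into an actual object that is "
    "consistent with the image’s content. The user wants to change the "
    "sketch to a crown and a hat.",
    "output_paintbrush_lora.png",
    lora_paths=[
        "./checkpoints/ChronoEdit-14B-Diffusers/lora/chronoedit_distill_lora.safetensors",
        "./checkpoints/ChronoEdit-14B-Diffusers-Paint-Brush-Lora/paintbrush_lora_diffusers.safetensors",
    ],
)
print(shlex.join(cmd))  # Shell-quoted form, ready to paste after PYTHONPATH=$(pwd)
```

To run the command from Python instead of printing it, pass `cmd` to `subprocess.run` with `PYTHONPATH` set to the repo root in the environment.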

📑 LoRA Fine-Tuning with DiffSynth-Studio

Install DiffSynth-Studio:

pip install git+https://github.com/modelscope/DiffSynth-Studio.git

Train a LoRA with DiffSynth-Studio. See the Dataset Doc for dataset preparation guidance:

PYTHONPATH=$(pwd) accelerate launch scripts/train_diffsynth.py \
    --dataset_base_path data/example_dataset \
    --dataset_metadata_path data/example_dataset/metadata.csv \
    --height 1024 \
    --width 1024 \
    --num_frames 5 \
    --dataset_repeat 1 \
    --model_paths '[["checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00001-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00002-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00003-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00004-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00005-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00006-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00007-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00008-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00009-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00010-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00011-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00012-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00013-of-00014.safetensors","checkpoints/ChronoEdit-14B-Diffusers/transformer/diffusion_pytorch_model-00014-of-00014.safetensors"]]' \
    --model_id_with_origin_paths "Wan-AI/Wan2.1-I2V-14B-720P:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-I2V-14B-720P:Wan2.1_VAE.pth,Wan-AI/Wan2.1-I2V-14B-720P:models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
    --learning_rate 1e-4 \
    --num_epochs 5 \
    --remove_prefix_in_ckpt "pipe.dit." \
    --output_path "./models/train/ChronoEdit-14B_lora" \
    --lora_base_model "dit" \
    --lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
    --lora_rank 32 \
    --extra_inputs "input_image" \
    --use_gradient_checkpointing_offload
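The long `--model_paths` value in the command above is a JSON list containing one group of 14 transformer shard paths. It can be generated instead of typed by hand. This is a minimal sketch; the function names `shard_paths` and `model_paths_argument` are hypothetical, and the shard count is taken from the file names in the command.

```python
import json


def shard_paths(transformer_dir, num_shards=14):
    """List shard files named diffusion_pytorch_model-0000i-of-000NN.safetensors."""
    return [
        f"{transformer_dir}/diffusion_pytorch_model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]


def model_paths_argument(
    transformer_dir="checkpoints/ChronoEdit-14B-Diffusers/transformer",
    num_shards=14,
):
    """Return the JSON string for --model_paths: a list holding one shard group."""
    # Compact separators reproduce the spacing used in the command above.
    return json.dumps([shard_paths(transformer_dir, num_shards)], separators=(",", ":"))


print(model_paths_argument())
```

In a shell, wrap the printed string in single quotes when passing it to `--model_paths`, as in the command above.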

Inference with DiffSynth-Studio:

PYTHONPATH=$(pwd) python scripts/run_inference_diffsynth.py

Inference with DiffSynth-Studio (multi-GPU):

PYTHONPATH=$(pwd) torchrun --standalone --nproc_per_node=8 scripts/run_inference_diffsynth.py

📑 Full Model Training Framework

We release ChronoEdit’s full training infrastructure and codebase, enabling distributed inference and large-scale fine-tuning of pretrained video diffusion models. See Training Doc for details.

📑 Create Your Own Training Dataset

We provide an automated editing labeling script to generate high-quality editing instructions from pairs of images (before and after editing). The script uses state-of-the-art vision-language models to analyze image pairs and generate precise editing prompts with Chain-of-Thought (CoT) reasoning. See dataset guidance doc for details.

Acknowledgments

The authors would like to thank Product Managers Aditya Mahajan and Matt Cragun for their valuable guidance and support. We further acknowledge the Cosmos Team at NVIDIA, especially Qinsheng Zhang and Hanzi Mao, for their consultation on Cosmos-Pred2.5-2B. We also thank Yuyang Zhao, Junsong Chen, and Jincheng Yu for their insightful discussions. Finally, we are grateful to Ben Cashman, Yuting Yang, and Amanda Moran for their infrastructure support.

Also shout-out to Wiedemer et al., Video Models are Zero-Shot Learners and Reasoners (2025) — while the two projects were developed concurrently, several of our examples were inspired by this excellent work.

Citation

@article{wu2025chronoedit,
    title={ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation},
    author={Wu, Jay Zhangjie and Ren, Xuanchi and Shen, Tianchang and Cao, Tianshi and He, Kai and Lu, Yifan and Gao, Ruiyuan and Xie, Enze and Lan, Shiyi and Alvarez, Jose M. and Gao, Jun and Fidler, Sanja and Wang, Zian and Ling, Huan},
    journal={arXiv preprint arXiv:2510.04290},
    year={2025}
}