ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
April 8, 2026
Yiyang Chen1, Xuanhua He2*, Xiujun Ma1*, Jack Ma2*
1State Key Laboratory of General Artificial Intelligence, Peking University
2The Hong Kong University of Science and Technology
Abstract
ContextFlow is a novel training-free framework for DiT-based video object editing, supporting object insertion, swapping, and deletion. Built upon Wan2.1-I2V-14B-480P, our method introduces three key innovations:
- High-Fidelity Inversion via RF-Solver: A second-order Rectified Flow solver replaces lossy DDIM Inversion, establishing a near-lossless and highly reversible noise anchor for editing.
- Adaptive Context Enrichment (ACE): Instead of crude "hard" feature replacement, ACE concatenates Key-Value pairs from parallel reconstruction and editing paths, enabling the self-attention to dynamically fuse source and target information without contextual conflicts.
- Vital Layer Analysis via Guidance Responsiveness: A data-driven metric identifies the most influential DiT blocks for each task, enabling targeted and efficient guidance injection.
Extensive experiments demonstrate that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches.
News
- [2026.03] Paper and code released!
Method Overview
1. High-Fidelity Inversion
We adopt RF-Solver, a second-order Rectified Flow solver, to map the source video to a noise latent z_T. Unlike first-order DDIM Inversion, RF-Solver provides near-lossless reconstruction, creating an unambiguous anchor for editing.
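To illustrate why a second-order solver gives a more reversible noise anchor, here is a minimal sketch on a toy problem. It is not the RF-Solver implementation shipped with this repository: `toy_velocity` is a hypothetical stand-in for the DiT velocity prediction, and the second-order update (a Taylor step whose time derivative is estimated by a half-step finite difference) is an assumed form chosen for illustration. The sketch inverts a latent from t=0 to t=1, integrates back, and measures the round-trip error.

```python
import numpy as np

def toy_velocity(z, t):
    # Hypothetical stand-in for the model's velocity prediction v(z, t).
    return np.sin(z) + t

def euler_step(v, z, t, dt):
    # First-order update: z + dt * v(z, t).
    return z + dt * v(z, t)

def second_order_step(v, z, t, dt):
    # Second-order Taylor update: z + dt*v + 0.5*dt^2 * dv/dt, where dv/dt is
    # estimated by a finite difference against a half-step look-ahead.
    v0 = v(z, t)
    half = 0.5 * dt
    v_half = v(z + half * v0, t + half)
    dv_dt = (v_half - v0) / half
    return z + dt * v0 + 0.5 * dt**2 * dv_dt

def integrate(step, v, z, t_start, t_end, num_steps):
    # Apply `step` along a uniform time grid from t_start to t_end.
    ts = np.linspace(t_start, t_end, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        z = step(v, z, t0, t1 - t0)
    return z

def round_trip_error(step, z0, num_steps=50):
    # Invert (t: 0 -> 1), then reconstruct (t: 1 -> 0) with the same solver.
    z_T = integrate(step, toy_velocity, z0, 0.0, 1.0, num_steps)
    z_rec = integrate(step, toy_velocity, z_T, 1.0, 0.0, num_steps)
    return float(np.max(np.abs(z_rec - z0)))

z0 = np.linspace(-1.0, 1.0, 8)
print("first-order round-trip error: ", round_trip_error(euler_step, z0))
print("second-order round-trip error:", round_trip_error(second_order_step, z0))
```

On this toy field the second-order round trip lands much closer to the starting latent than the first-order one at the same step count, which is the property the inversion stage relies on.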
2. Adaptive Context Enrichment (ACE)
During denoising, we maintain two parallel paths from the shared z_T:
- Reconstruction Path: Conditioned on the original first frame and null prompt, preserving source context.
- Editing Path: Conditioned on the edited first frame and target prompt, synthesizing the desired edit.
At selected layers, we concatenate Key-Value pairs from both paths:
K_aug = Concat([K_edit, K_recon])
V_aug = Concat([V_edit, V_recon])
Attention = softmax(Q_edit · K_aug^T / √d) · V_aug
This "soft guidance" empowers the model to dynamically balance content preservation and object synthesis on a per-token basis, without destructive feature replacement.
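The concatenation above can be sketched in a few lines of NumPy. This is a single-head illustration under assumed shapes (every tensor is `(tokens, d)`), not the repository's attention code; the function name `ace_attention` and the helper `softmax` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ace_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Single-head attention over concatenated editing and reconstruction K/V.

    All inputs are assumed to have shape (tokens, d).
    """
    d = q_edit.shape[-1]
    k_aug = np.concatenate([k_edit, k_recon], axis=0)  # (2 * tokens, d)
    v_aug = np.concatenate([v_edit, v_recon], axis=0)  # (2 * tokens, d)
    weights = softmax(q_edit @ k_aug.T / np.sqrt(d))   # (tokens, 2 * tokens)
    return weights @ v_aug, weights

rng = np.random.default_rng(0)
q, k_e, v_e, k_r, v_r = (rng.normal(size=(5, 8)) for _ in range(5))
out, weights = ace_attention(q, k_e, v_e, k_r, v_r)
# Per query token: share of attention mass placed on reconstruction-path tokens.
print(weights[:, 5:].sum(axis=1))
```

Because the softmax runs over both sets of keys at once, each query token decides for itself how much attention mass goes to source context versus edited content.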
3. Vital Layer Analysis
We propose a Guidance Responsiveness (GR) metric to identify the most influential layers for each task:
| Task | Dominant Responsive Layers | Interpretation |
|---|---|---|
| Insertion | Early layers (1–10) | Spatial layout establishment |
| Swapping | Deep layers (26–32) | Semantic concept replacement |
| Deletion | Middle + Deep layers (15–21, 26–32) | Dual semantic operation |
Only the top-k (k=4) most responsive layers receive context enrichment, ensuring both precision and efficiency.
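The top-k selection step can be sketched as follows. How the GR score itself is computed is not shown here; the sketch assumes one precomputed score per DiT block, and both the function name `select_vital_layers` and the example scores are hypothetical. Layer indices are 0-based in this sketch.

```python
def select_vital_layers(gr_scores, k=4):
    """Return the indices of the k highest-scoring layers, in ascending order."""
    ranked = sorted(range(len(gr_scores)), key=lambda i: gr_scores[i], reverse=True)
    return sorted(ranked[:k])

# Made-up scores for a 40-block model, for illustration only.
example_scores = [0.01 * i for i in range(40)]
example_scores[2] = 0.95
example_scores[5] = 0.90
print(select_vital_layers(example_scores, k=4))
```

Only the returned layers would then receive the concatenated Key-Value context; all other layers run unmodified.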
Installation
ContextFlow is built on top of Wan2.1. Please follow the official Wan2.1 environment setup first.
1. Clone this repository
git clone https://github.com/yychen233/ContextFlow.git
cd ContextFlow
2. Install Wan2.1 dependencies
# Ensure PyTorch >= 2.4.0
pip install -r requirements.txt
For detailed environment setup and troubleshooting, please refer to the Wan2.1 Official Repository.
Model Preparation
Base Model
Download the Wan2.1-I2V-14B-480P checkpoint, which serves as our base model:
| Model | Download | Notes |
|---|---|---|
| Wan2.1-I2V-14B-480P | HuggingFace / ModelScope | Required for ContextFlow |
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./Wan2.1-I2V-14B-480P
First-Frame Editing Tools
ContextFlow uses off-the-shelf image editors to prepare the edited first frame. Depending on the task, you may need:
| Task | Tool | Link |
|---|---|---|
| Object Insertion | AnyDoor | GitHub |
| Object Swapping | InsertAnything | GitHub |
| Object Deletion | MagicQuill | GitHub |
Inference
ContextFlow follows a simple two-stage pipeline. Since our method is entirely training-free, no fine-tuning or optimization is required.
Step 1: Video Inversion
First, invert the source video into a noise latent using RF-Solver:
bash inversion.sh
Key parameters:
- Inversion steps: 50
- Solver: RF-Solver (second-order)
Please modify the paths in inversion.sh to point to your source video and model checkpoint.
Pre-inverted noise available: We provide pre-computed noise latents for demo videos. Download from: PKU Cloud Drive
If using the pre-inverted noise, you can skip this step and proceed directly to Step 2.
Step 2: Object Editing
With the inverted noise latent ready, run the corresponding demo script for your desired task:
Object Insertion
bash demo_insert.sh
Insert a new object into the video while preserving the original background and motion.
Object Swapping
bash demo_swap.sh
Replace an existing object with a new one, maintaining spatial layout and temporal coherence.
Object Deletion
bash demo_delete.sh
Remove an object from the video and seamlessly inpaint the background.
Important: Please update the file paths in each script (e.g., model checkpoint, source video, edited first frame, inverted noise, and target prompt) before running.
Default Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| `num_inference_steps` | 50 | Denoising sampling steps |
| `guidance_scale` | 3.0 | Classifier-free guidance scale |
| τ (timestep threshold) | 0.5 | ACE active for the first 50% of timesteps |
| k (vital layers) | 4 | Top-k layers selected by GR metric (~10% of 40 layers) |
| Base model | Wan2.1-I2V-14B-480P | 40-layer Diffusion Transformer |
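As a rough illustration of how these defaults interact, the sketch below collects them in a config object and derives which sampling steps would have ACE enabled. The class `EditConfig`, its field names, and the helper `ace_active` are hypothetical and do not mirror the demo scripts. It also assumes that steps are indexed from 0 and that the threshold is exclusive, which the table does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditConfig:
    num_inference_steps: int = 50   # denoising sampling steps
    guidance_scale: float = 3.0     # classifier-free guidance scale
    tau: float = 0.5                # fraction of the schedule with ACE enabled
    top_k_layers: int = 4           # vital layers receiving context enrichment
    num_layers: int = 40            # DiT blocks in the base model

def ace_active(step_index, cfg):
    """True while the step lies in the first `tau` fraction of the schedule."""
    return step_index < cfg.tau * cfg.num_inference_steps

cfg = EditConfig()
active_steps = [i for i in range(cfg.num_inference_steps) if ace_active(i, cfg)]
print(f"ACE enabled on {len(active_steps)} of {cfg.num_inference_steps} steps")
print(f"Enriched layers: {cfg.top_k_layers / cfg.num_layers:.0%} of the model")
```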
Results
Quantitative Comparison
| Task | Method | CLIP-I ↑ | DINO-I ↑ | CLIP-score ↑ | Smoothness ↑ | Dynamic ↑ | Aesthetic ↑ |
|---|---|---|---|---|---|---|---|
| Insert | AnyV2V | 0.5943 | 0.4053 | 0.2776 | 0.9804 | 0.3077 | 0.5287 |
| Insert | VACE | 0.5683 | 0.3967 | 0.2569 | 0.9921 | 0.3077 | 0.5724 |
| Insert | I2VEdit | 0.6710 | 0.4595 | 0.3124 | 0.9827 | 0.3077 | 0.5846 |
| Insert | Ours | 0.6504 | 0.4566 | 0.3107 | 0.9918 | 0.4231 | 0.6227 |
| Swap | AnyV2V | 0.6046 | 0.5641 | 0.3210 | 0.9848 | 0.0769 | 0.5739 |
| Swap | VACE | 0.6080 | 0.5917 | 0.3226 | 0.9926 | 0.1538 | 0.6144 |
| Swap | I2VEdit | 0.6683 | 0.6003 | 0.3282 | 0.9819 | 0.0769 | 0.5995 |
| Swap | Ours | 0.6644 | 0.6004 | 0.3391 | 0.9924 | 0.0769 | 0.6176 |
| Delete | AnyV2V | - | - | 0.2891 | 0.9781 | 0.1500 | 0.5378 |
| Delete | VACE | - | - | 0.2794 | 0.9889 | 0.3000 | 0.5645 |
| Delete | Ours | - | - | 0.2854 | 0.9900 | 0.3500 | 0.5405 |
Evaluated on the UNIC-Benchmark. Video quality metrics from VBench.
Qualitative Results
For more video results, please visit our Project Page.
Project Structure
ContextFlow/
├── inversion.sh        # Video inversion script (RF-Solver)
├── demo_insert.sh      # Object insertion demo
├── demo_swap.sh        # Object swapping demo
├── demo_delete.sh      # Object deletion demo
├── requirements.txt    # Dependencies (Wan2.1 base)
├── assets/             # README images
├── examples/           # Example inputs (videos, edited frames)
└── ...                 # Core source code
Citation
If you find ContextFlow useful in your research, please consider citing:
@inproceedings{chen2026contextflow,
title={ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment},
author={Chen, Yiyang and He, Xuanhua and Ma, Xiujun and Ma, Jack},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026}
}
Also consider citing the base model:
@article{wan2025,
title={Wan: Open and Advanced Large-Scale Video Generative Models},
author={Team Wan and Ang Wang and Baole Ai and others},
journal={arXiv preprint arXiv:2503.20314},
year={2025}
}
Acknowledgements
We gratefully acknowledge the following projects that made this work possible:
- Wan2.1: Our base I2V Diffusion Transformer model
- RF-Solver: High-fidelity Rectified Flow inversion
- AnyDoor: First-frame editing for object insertion
- InsertAnything: First-frame editing for object swapping
- MagicQuill: First-frame editing for object deletion
- AnyV2V: Training-free video editing baseline
- VBench: Video generation evaluation benchmark
License
This project is released under the Apache 2.0 License. The base model Wan2.1 is also licensed under Apache 2.0. Please ensure your usage complies with all applicable licenses and legal regulations.
If you have any questions, feel free to open an issue or contact us at chenyy@stu.pku.edu.cn.