ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

April 8, 2026

arXiv Project Page Wan2.1

Yiyang Chen1, Xuanhua He2*, Xiujun Ma1*, Jack Ma2*
1State Key Laboratory of General Artificial Intelligence, Peking University
2The Hong Kong University of Science and Technology

ContextFlow Showcase

📖 Abstract

ContextFlow is a novel training-free framework for DiT-based video object editing, supporting object insertion, swapping, and deletion. Built upon Wan2.1-I2V-14B-480P, our method introduces three key innovations:

  1. High-Fidelity Inversion via RF-Solver: A second-order Rectified Flow solver replaces lossy DDIM Inversion, establishing a near-lossless and highly reversible noise anchor for editing.
  2. Adaptive Context Enrichment (ACE): Instead of crude "hard" feature replacement, ACE concatenates Key-Value pairs from parallel reconstruction and editing paths, enabling the self-attention to dynamically fuse source and target information without contextual conflicts.
  3. Vital Layer Analysis via Guidance Responsiveness: A data-driven metric identifies the most influential DiT blocks for each task, enabling targeted and efficient guidance injection.

Extensive experiments demonstrate that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches.


🔥 News

  • [2026.03] 🎉 Paper and code released!


🔬 Method Overview

ContextFlow Pipeline

1. High-Fidelity Inversion

We adopt RF-Solver, a second-order Rectified Flow solver, to map the source video to a noise latent z_T. Unlike first-order DDIM Inversion, RF-Solver provides near-lossless reconstruction, creating an unambiguous anchor for editing.
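To make the first-order versus second-order distinction concrete, here is a minimal NumPy sketch of a round trip (inversion, then reconstruction) on a toy ODE. It is not the repository's solver: `velocity` is a hypothetical stand-in for the DiT velocity prediction, and the second-order step is a generic Taylor-style update with a finite-difference derivative estimate, assumed here to mirror the idea behind RF-Solver.

```python
import numpy as np

def velocity(z, t):
    # Hypothetical toy velocity field standing in for the model's prediction v(z, t).
    return (t - 1.0) * z

def euler_step(z, t, dt):
    # First-order update: z + dt * v.
    return z + dt * velocity(z, t)

def second_order_step(z, t, dt):
    # Second-order Taylor update: z + dt * v + 0.5 * dt^2 * dv/dt,
    # with dv/dt estimated by a finite difference at the half step.
    v0 = velocity(z, t)
    z_mid = z + 0.5 * dt * v0
    v_mid = velocity(z_mid, t + 0.5 * dt)
    dv_dt = (v_mid - v0) / (0.5 * dt)
    return z + dt * v0 + 0.5 * dt**2 * dv_dt

def integrate(z, ts, step):
    # Walk through consecutive time pairs; dt is negative when ts is descending.
    for t0, t1 in zip(ts[:-1], ts[1:]):
        z = step(z, t0, t1 - t0)
    return z

def round_trip_error(step, num_steps=50):
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    z0 = np.array([1.0, -2.0, 0.5])
    z_T = integrate(z0, ts, step)            # inversion: data -> noise anchor
    z0_rec = integrate(z_T, ts[::-1], step)  # reconstruction: noise anchor -> data
    return float(np.max(np.abs(z0_rec - z0)))

err_euler = round_trip_error(euler_step)
err_second = round_trip_error(second_order_step)
print(f"first-order round-trip error:  {err_euler:.2e}")
print(f"second-order round-trip error: {err_second:.2e}")
```

On this toy problem the second-order round trip returns much closer to the starting point than the first-order one at the same step count, which is the property the editing stage relies on.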

2. Adaptive Context Enrichment (ACE)

During denoising, we maintain two parallel paths from the shared z_T:

  • Reconstruction Path: Conditioned on the original first frame and a null prompt, preserving source context.
  • Editing Path: Conditioned on the edited first frame and target prompt, synthesizing the desired edit.

At selected layers, we concatenate Key-Value pairs from both paths:

K_aug = Concat([K_edit, K_recon])
V_aug = Concat([V_edit, V_recon])
Attention = softmax(Q_edit · K_aug^T / √d) · V_aug

This "soft guidance" empowers the model to dynamically balance content preservation and object synthesis on a per-token basis, without destructive feature replacement.
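The concatenation above can be sketched in a few lines of NumPy. This is an illustrative single-head, unbatched version under assumed shapes (`n_tokens` x `d`), not the code used in the repository; the function name `ace_attention` and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def ace_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    # Queries come from the editing path only; keys and values are
    # concatenated along the token axis: [edit tokens, reconstruction tokens].
    d = q_edit.shape[-1]
    k_aug = np.concatenate([k_edit, k_recon], axis=0)
    v_aug = np.concatenate([v_edit, v_recon], axis=0)
    weights = softmax(q_edit @ k_aug.T / np.sqrt(d), axis=-1)
    return weights @ v_aug, weights

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
q_edit, k_edit, v_edit, k_recon, v_recon = (
    rng.standard_normal((n_tokens, d)) for _ in range(5)
)

out, weights = ace_attention(q_edit, k_edit, v_edit, k_recon, v_recon)

# Share of each query's attention mass that lands on reconstruction-path tokens.
recon_share = weights[:, n_tokens:].sum(axis=-1)
print(out.shape, weights.shape)
print(recon_share)
```

Because the softmax runs over both halves jointly, each edit-path query decides for itself how much weight to give source tokens, which is what "per-token" balancing refers to.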

3. Vital Layer Analysis

We propose a Guidance Responsiveness (GR) metric to identify the most influential layers for each task:

| Task | Dominant Responsive Layers | Interpretation |
| --- | --- | --- |
| Insertion | Early layers (1–10) | Spatial layout establishment |
| Swapping | Deep layers (26–32) | Semantic concept replacement |
| Deletion | Middle + Deep layers (15–21, 26–32) | Dual semantic operation |

Only the top-k (k=4) most responsive layers receive context enrichment, ensuring both precision and efficiency.
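As a rough illustration of the top-k selection, the sketch below ranks per-layer scores and keeps the four largest. The GR computation itself is not shown in this README, so the scores here are made-up placeholder values, and `select_vital_layers` is a hypothetical helper name. Indices in the sketch are 0-based.

```python
import numpy as np

def select_vital_layers(gr_scores, k=4):
    # Rank layers by score (descending) and return the k best, sorted by index.
    scores = np.asarray(gr_scores, dtype=float)
    top_k = np.argsort(scores)[::-1][:k]
    return sorted(int(i) for i in top_k)

num_layers = 40
# Placeholder scores for illustration only, not measured GR values.
gr_scores = np.zeros(num_layers)
gr_scores[[26, 28, 30, 31]] = [0.90, 0.80, 0.70, 0.95]
gr_scores[5] = 0.30

vital_layers = select_vital_layers(gr_scores, k=4)
fraction = len(vital_layers) / num_layers
print(vital_layers, f"{fraction:.0%} of layers")
```

With k=4 and 40 blocks, only a tenth of the layers carry the extra Key-Value tokens, which keeps the added attention cost small.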


🛠 Installation

ContextFlow is built on top of Wan2.1. Please follow the official Wan2.1 environment setup first.

1. Clone this repository

git clone https://github.com/yychen233/ContextFlow.git
cd ContextFlow

2. Install Wan2.1 dependencies

# Ensure PyTorch >= 2.4.0
pip install -r requirements.txt
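If you want to confirm the PyTorch requirement programmatically, a small check along these lines may help. The helper names (`parse_version`, `meets_minimum`, `check_torch`) are hypothetical and not part of the repository; the parser only looks at the leading numeric part of the version string.

```python
import re

def parse_version(version):
    # Keep only the leading numeric release, e.g. "2.4.0+cu121" -> (2, 4, 0).
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", str(version))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())

def meets_minimum(installed, minimum="2.4.0"):
    # Tuple comparison handles multi-digit components such as 2.10.0.
    return parse_version(installed) >= parse_version(minimum)

def check_torch(minimum="2.4.0"):
    # Returns False when PyTorch is missing or older than the minimum.
    try:
        import torch
    except ImportError:
        return False
    return meets_minimum(torch.__version__, minimum)

print("PyTorch requirement satisfied:", check_torch())
```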

💡 For detailed environment setup and troubleshooting, please refer to the Wan2.1 Official Repository.

📦 Model Preparation

Base Model

Download the Wan2.1-I2V-14B-480P checkpoint, which serves as our base model:

| Model | Download | Notes |
| --- | --- | --- |
| Wan2.1-I2V-14B-480P | 🤗 HuggingFace / 🤖 ModelScope | Required for ContextFlow |

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./Wan2.1-I2V-14B-480P
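The same download can be done from Python with `snapshot_download` from `huggingface_hub`. The wrapper below is a hypothetical convenience function, not part of the repository; it only mirrors the repo ID and target directory used in the CLI command above.

```python
def download_base_model(local_dir="./Wan2.1-I2V-14B-480P"):
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    # Fetches the full checkpoint; this is a large download.
    return snapshot_download(
        repo_id="Wan-AI/Wan2.1-I2V-14B-480P",
        local_dir=local_dir,
    )

# Example (not executed here):
# checkpoint_dir = download_base_model()
```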

First-Frame Editing Tools

ContextFlow uses off-the-shelf image editors to prepare the edited first frame. Depending on the task, you may need:

| Task | Tool | Link |
| --- | --- | --- |
| Object Insertion | AnyDoor | GitHub |
| Object Swapping | InsertAnything | GitHub |
| Object Deletion | MagicQuill | GitHub |

🚀 Inference

ContextFlow follows a simple two-stage pipeline. Since our method is entirely training-free, no fine-tuning or optimization is required.

Step 1: Video Inversion

First, invert the source video into a noise latent using RF-Solver:

bash inversion.sh

⚙️ Key parameters:

  • Inversion steps: 50
  • Solver: RF-Solver (second-order)

Please modify the paths in inversion.sh to point to your source video and model checkpoint.

💡 Pre-inverted noise available: We provide pre-computed noise latents for the demo videos. Download from: 📁 PKU Cloud Drive

If using the pre-inverted noise, you can skip this step and proceed directly to Step 2.

Step 2: Object Editing

With the inverted noise latent ready, run the corresponding demo script for your desired task:

🟢 Object Insertion

bash demo_insert.sh

Insert a new object into the video while preserving the original background and motion.

🔵 Object Swapping

bash demo_swap.sh

Replace an existing object with a new one, maintaining spatial layout and temporal coherence.

🔴 Object Deletion

bash demo_delete.sh

Remove an object from the video and seamlessly inpaint the background.

⚠️ Important: Please update the file paths in each script (e.g., model checkpoint, source video, edited first frame, inverted noise, and target prompt) before running.

Default Hyperparameters

| Parameter | Value | Description |
| --- | --- | --- |
| num_inference_steps | 50 | Denoising sampling steps |
| guidance_scale | 3.0 | Classifier-free guidance scale |
| τ (timestep threshold) | 0.5 | ACE active for the first 50% of timesteps |
| k (vital layers) | 4 | Top-k layers selected by GR metric (~10% of 40 layers) |
| Base model | Wan2.1-I2V-14B-480P | 40-layer Diffusion Transformer |
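One way to read the τ and k settings together is as a gate on (step, layer) pairs. The sketch below assumes "first 50% of timesteps" means the first half of the denoising step indices; the function names and the example layer set are hypothetical and the actual scripts may implement the gate differently.

```python
def ace_is_active(step_index, num_inference_steps=50, tau=0.5):
    # ACE is applied only while the step index is within the first tau fraction.
    return step_index < tau * num_inference_steps

def should_enrich(step_index, layer_index, vital_layers,
                  num_inference_steps=50, tau=0.5):
    # Context enrichment needs both an active step and a vital layer.
    return layer_index in vital_layers and ace_is_active(
        step_index, num_inference_steps, tau
    )

active_steps = [i for i in range(50) if ace_is_active(i)]
example_vital_layers = {26, 28, 30, 31}  # placeholder layer indices

print(len(active_steps), "of 50 steps use ACE")
print(should_enrich(10, 30, example_vital_layers))
print(should_enrich(40, 30, example_vital_layers))
```

Under these defaults, 25 of the 50 denoising steps apply enrichment, and only in the four selected blocks.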

📊 Results

Quantitative Comparison

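| Task | Method | CLIP-I ↑ | DINO-I ↑ | CLIP-score ↑ | Smoothness ↑ | Dynamic ↑ | Aesthetic ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Insert | AnyV2V | 0.5943 | 0.4053 | 0.2776 | 0.9804 | 0.3077 | 0.5287 |
| Insert | VACE | 0.5683 | 0.3967 | 0.2569 | 0.9921 | 0.3077 | 0.5724 |
| Insert | I2VEdit | 0.6710 | 0.4595 | 0.3124 | 0.9827 | 0.3077 | 0.5846 |
| Insert | Ours | 0.6504 | 0.4566 | 0.3107 | 0.9918 | 0.4231 | 0.6227 |
| Swap | AnyV2V | 0.6046 | 0.5641 | 0.3210 | 0.9848 | 0.0769 | 0.5739 |
| Swap | VACE | 0.6080 | 0.5917 | 0.3226 | 0.9926 | 0.1538 | 0.6144 |
| Swap | I2VEdit | 0.6683 | 0.6003 | 0.3282 | 0.9819 | 0.0769 | 0.5995 |
| Swap | Ours | 0.6644 | 0.6004 | 0.3391 | 0.9924 | 0.0769 | 0.6176 |
| Delete | AnyV2V | – | – | 0.2891 | 0.9781 | 0.1500 | 0.5378 |
| Delete | VACE | – | – | 0.2794 | 0.9889 | 0.3000 | 0.5645 |
| Delete | Ours | – | – | 0.2854 | 0.9900 | 0.3500 | 0.5405 |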

Evaluated on the UNIC-Benchmark. Video quality metrics from VBench.

Qualitative Results

Qualitative Comparison

For more video results, please visit our 🌐 Project Page.


📂 Project Structure

ContextFlow/
├── inversion.sh              # Video inversion script (RF-Solver)
├── demo_insert.sh            # Object insertion demo
├── demo_swap.sh              # Object swapping demo
├── demo_delete.sh            # Object deletion demo
├── requirements.txt          # Dependencies (Wan2.1 base)
├── assets/                   # README images
├── examples/                 # Example inputs (videos, edited frames)
└── ...                       # Core source code

📜 Citation

If you find ContextFlow useful in your research, please consider citing:

@inproceedings{chen2026contextflow,
  title={ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment},
  author={Chen, Yiyang and He, Xuanhua and Ma, Xiujun and Ma, Jack},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}

Also consider citing the base model:

@article{wan2025,
  title={Wan: Open and Advanced Large-Scale Video Generative Models},
  author={Team Wan and Ang Wang and Baole Ai and others},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}

🙏 Acknowledgements

We gratefully acknowledge the following projects that made this work possible:

  • Wan2.1: Our base I2V Diffusion Transformer model
  • RF-Solver: High-fidelity Rectified Flow inversion
  • AnyDoor: First-frame editing for object insertion
  • InsertAnything: First-frame editing for object swapping
  • MagicQuill: First-frame editing for object deletion
  • AnyV2V: Training-free video editing baseline
  • VBench: Video generation evaluation benchmark

📄 License

This project is released under the Apache 2.0 License. The base model Wan2.1 is also licensed under Apache 2.0. Please ensure your usage complies with all applicable licenses and legal regulations.


If you have any questions, feel free to open an issue or contact us at chenyy@stu.pku.edu.cn.