🎬ASTRA🎬: Let Arbitrary Subjects Transform in Video Editing

April 15, 2026

ASTRA: Let Arbitrary Subjects Transform in Video Editing [Project] [Code]

Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang, Maocheng Zhao

📅 Release

[2025/10/13] 🎉 We release the inference code, the evaluation metric code, and the MSVBench dataset.
[2025/10/01] 🎉 We launch the project page of ASTRA.

🚀 Key Features

Training-Free, Arbitrary Subjects: ASTRA (arbitrary-subjects training-free retargeting and alignment) transforms any number of designated subjects in open-domain video without finetuning or retraining, while strictly preserving the background and non-target regions.
Prompt-Guided Multimodal Alignment: Leverages large foundation models—e.g., a text-to-image prior plus a vision–language model—to produce aligned multimodal conditions (augmented text and visual instructions), mitigating insufficient prompt-side conditioning and attention dilution in dense, multi-subject layouts.
Prior-Based Mask Retargeting: Tracks per-frame mask state transitions to obtain temporally coherent mask motion that follows source dynamics, alleviating mask boundary entanglement and attribute leakage under heavy occlusion and crowded scenes.
Plug-and-Play with Mask-Driven Video Models: Drop-in compatible with diverse mask-driven video generators; on MSVBench (100 challenging sequences spanning varying subject counts and interactions), ASTRA consistently surpasses strong baselines in multi-subject editing.

Generative models have advanced video editing, yet many methods still focus on single or few subjects and degrade in complex multi-subject settings. In dense layouts with heavy occlusions, common failure modes include mask boundary entanglement, attention dilution, attribute leakage, and temporal instability—edits bleed across instances or drift away from the text prompt.

We present ASTRA (arbitrary-subjects training-free retargeting and alignment), a framework for mask-driven, text-guided editing where an arbitrary number of designated subjects are transformed while the background and non-target regions stay intact—with no model finetuning. ASTRA couples two modules with a pretrained mask-driven video generator:

Prompt-guided multimodal alignment isolates target subjects in the prompt, queries a visual prior from a text-to-image model, and uses a vision–language model to fuse prompt and prior into strong multimodal conditioning.
Prior-based mask retargeting propagates masks over time so that mask motion stays consistent with the source video, reducing entanglement-driven errors.

These conditions and mask sequences are fed into the generator to synthesize the edited video. ASTRA is a versatile plug-in for different mask-driven backbones. We also introduce MSVBench, a multi-subject benchmark of 100 challenging clips covering diverse subject counts, interactions, and scene complexity; experiments show ASTRA consistently outperforms state-of-the-art methods. Code, models, and data are available at this repository.
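The requirement that background and non-target regions stay intact can be illustrated with a small mask-compositing sketch. This is not ASTRA's implementation (the pretrained mask-driven generator handles this internally); it only shows the constraint that a per-frame binary mask imposes. The function name `composite_with_masks` and the array layout `(T, H, W, 3)` are assumptions made for this example.

```python
import numpy as np


def composite_with_masks(source, edited, masks):
    """Keep source pixels outside the mask and edited pixels inside it.

    source, edited: arrays of shape (T, H, W, 3), one RGB frame per time step.
    masks: array of shape (T, H, W), nonzero where a target subject is.
    All names and shapes are hypothetical, chosen for illustration only.
    """
    source = np.asarray(source, dtype=np.float32)
    edited = np.asarray(edited, dtype=np.float32)
    # Add a channel axis so the mask broadcasts over RGB: (T, H, W, 1).
    m = (np.asarray(masks) != 0).astype(np.float32)[..., None]
    if source.shape != edited.shape or source.shape[:3] != m.shape[:3]:
        raise ValueError("source, edited, and masks must share T, H, W")
    # Inside the mask take the edited pixel, elsewhere keep the source pixel.
    return m * edited + (1.0 - m) * source


if __name__ == "__main__":
    src = np.zeros((2, 4, 4, 3))
    out = np.ones((2, 4, 4, 3))
    msk = np.zeros((2, 4, 4))
    msk[:, :2, :2] = 1  # the "subject" occupies the top-left corner
    result = composite_with_masks(src, out, msk)
    print(result[0, 0, 0], result[0, 3, 3])
```

Pixels outside the mask are copied from the source unchanged, which is the behavior the text describes for non-target regions.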

🔥 Examples

_{Three [People -> Super Mario] sitting in car backseat.}	_{Four [People -> Robots] standing on football court.}
_{Four [Hungry Dogs -> Robot Wolves] surrounding a bowl of food outdoors.}	_{A group of [People -> Astronauts] practicing boxing in a fitness studio.}
_{A team of [Men -> Spider-Men] rowing together on a river.}	_{Eight [Hurdlers -> Iron Men] leap mid-race over purple hurdles.}
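The captions above use a bracketed `[Source -> Target]` notation to mark which subjects are edited. As a rough sketch, the notation can be expanded into a source prompt and a target prompt as follows. The helper `expand_caption` is hypothetical and reflects only the caption style used in this README, not the input format expected by the repository's scripts.

```python
import re

# Matches "[Source -> Target]" with optional whitespace around the arrow.
_EDIT = re.compile(r"\[\s*(?P<src>[^\[\]]+?)\s*->\s*(?P<dst>[^\[\]]+?)\s*\]")


def expand_caption(caption):
    """Return (source_prompt, target_prompt, edits) for a bracketed caption.

    edits is a list of (source_subject, target_subject) pairs in order
    of appearance. Hypothetical helper, for illustration only.
    """
    edits = [(m.group("src"), m.group("dst")) for m in _EDIT.finditer(caption)]
    source_prompt = _EDIT.sub(lambda m: m.group("src"), caption)
    target_prompt = _EDIT.sub(lambda m: m.group("dst"), caption)
    return source_prompt, target_prompt, edits


if __name__ == "__main__":
    src, dst, edits = expand_caption(
        "Four [Hungry Dogs -> Robot Wolves] surrounding a bowl of food outdoors."
    )
    print(src)
    print(dst)
    print(edits)
```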

🌈 Multi-Scenario Applications

_{Autumn Forest -> Winter Forest}	_{Snowy Forest -> Lunar Surface}
_{The Eiffel Tower -> The Space Needle}	_{Glasses -> Sunglasses}
_{Left -> Ultraman; Right -> Robot}	_{Left -> Gorilla; Right -> Polar Bear}
_{Left -> Lightning McQueen; Right -> Yellow Cartoon Porsche}	_{Two People (arm wrestling) -> Two Supermen}

_{Original Video}

_{Turn 1: Horse Riders -> Gokus}

_{Turn 2: The two above (Gokus -> Iron-Men)}

_{Original Video}	_{Add Glasses}	_{Change Face To "Durant"}	_{Change Face To "James"}
_{Original Video}	_{Remove Glasses}	_{Plaid Shirt -> Business Suit}	_{Plaid Shirt -> Hawaiian Shirt}

🔧 Requirements

Our method was tested with CUDA 12.2/12.4, Python 3.10.13, and PyTorch >= 2.5.1 on a single NVIDIA A800 GPU.

git clone https://github.com/XWH-A/ASTRA.git
cd ASTRA
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
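After installation, a quick sanity check can confirm that the interpreter and PyTorch build match the tested setup. The script below is a minimal sketch and not part of the repository; `check_environment` is a hypothetical helper, and it only reports what it finds without enforcing any version.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report the Python version and, if installed, the PyTorch/CUDA status."""
    report = {
        "python_version": ".".join(str(v) for v in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, the installed PyTorch wheel and the local CUDA driver are worth checking before running inference.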

To use the evaluation metric code, you also need to set up GroundingDINO.

🌐 Download Weights

The required weights can be downloaded from Hugging Face. Below is a list of the weights you need to download, along with the links.

🎉 How to Use

Important Reminder

Before running the command below, edit the paths in run.sh so that they point to your own local files.

bash run.sh

To test on your own data, we recommend using Grounded-SAM-2 to generate the video masks.
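When preparing masks for several subjects, it can help to check them before editing. The sketch below assumes the per-subject masks have already been loaded into one array of shape `(N, T, H, W)` (N subjects, T frames). That layout, and the helper `merge_subject_masks`, are assumptions for illustration; the mask format that run.sh actually expects is defined by the repository and is not reproduced here.

```python
import numpy as np


def merge_subject_masks(per_subject_masks):
    """Combine per-subject masks and flag pixels claimed by several subjects.

    per_subject_masks: array of shape (N, T, H, W), nonzero inside a subject.
    Returns (union, overlap), both boolean arrays of shape (T, H, W).
    Hypothetical helper, for illustration only.
    """
    masks = np.asarray(per_subject_masks) != 0
    if masks.ndim != 4:
        raise ValueError("expected an array of shape (N, T, H, W)")
    union = masks.any(axis=0)        # pixel belongs to at least one subject
    overlap = masks.sum(axis=0) > 1  # pixel belongs to two or more subjects
    return union, overlap


if __name__ == "__main__":
    demo = np.zeros((2, 1, 4, 4), dtype=np.uint8)
    demo[0, 0, :2, :3] = 1  # subject 0
    demo[1, 0, :2, 2:] = 1  # subject 1, sharing one column with subject 0
    union, overlap = merge_subject_masks(demo)
    print("edited pixels per frame:", union.sum(axis=(1, 2)))
    print("overlapping pixels per frame:", overlap.sum(axis=(1, 2)))
```

Frames with many overlapping pixels are the ones where subjects occlude each other, so they are worth inspecting by eye.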

🙏 Acknowledgement

We thank the contributors of WAN, VACE, SDXL, Qwen-VL, Grounded-SAM-2, and Depth-Anything-V2 for their open research and inspiration.

The ASTRA code is released for academic use. Users must comply with local laws and take responsibility for their own generations. The authors disclaim liability for misuse.

📝 Citation

If you find ASTRA useful for your research, please cite:

@misc{shen2026astraletarbitrarysubjects,
      title={ASTRA: Let Arbitrary Subjects Transform in Video Editing}, 
      author={Fei Shen and Weihao Xu and Rui Yan and Dong Zhang and Xiangbo Shu and Jinhui Tang and Maocheng Zhao},
      year={2026},
      eprint={2510.01186},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.01186}, 
}

🕒 TODO List

👉 Our other projects:

ASTRA: Arbitrary-subject, training-free video editing with multimodal alignment and mask retargeting (MSVBench). [多主体视频编辑 / 免训练对齐与掩码重定向]
IMAGDressing: Controllable dressing generation. [可控穿衣生成]
IMAGGarment: Fine-grained controllable garment generation. [可控服装生成]
IMAGHarmony: Controllable image editing with consistent object layout. [可控多目标图像编辑]
IMAGPose: Pose-guided person generation with high fidelity. [可控多模式人物生成]
RCDMs: Rich-contextual conditional diffusion for story visualization. [可控故事生成]
PCDMs: Progressive conditional diffusion for pose-guided image synthesis. [可控人物生成]
V-Express: Explores strong and weak conditional relationships for portrait video generation. [可控数字人生成]
FaceShot: Talkingface plugin for any character. [可控动漫数字人生成]
CharacterShot: Controllable and consistent 4D character animation framework. [可控4D角色生成]
StyleTailor: An Agent for personalized fashion styling. [个性化时尚Agent]
SignVip: Controllable sign language video generation. [可控手语生成]

📨 Contact

If you have any questions, please feel free to contact whxu@njust.edu.cn or shenfei29@nus.edu.sg.