🎬🔨 VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
April 7, 2026 · View on GitHub
Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
ACL 2026 Findings
✨ VideoRepair can (1) detect misalignments by generating fine-grained evaluation questions and answering them, (2) plan refinement, (3) decompose regions, and finally (4) conduct localized refinement.
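At a glance, the four stages form a generate-evaluate-repair loop. The sketch below is illustrative Python pseudocode only; every helper name in it (generate_evaluation_questions, plan_refinement, decompose_regions, and so on) is hypothetical and not part of this repository's API.

# Illustrative pseudocode of the VideoRepair loop.
# All helper names here are hypothetical, not this repo's actual API.
def video_repair(prompt, t2v_model, rounds=1):
    video = t2v_model.generate(prompt)
    for _ in range(rounds):
        # (1) Misalignment evaluation: generate fine-grained questions
        #     from the prompt and answer them on the video.
        questions = generate_evaluation_questions(prompt)
        answers = answer_questions(video, questions)
        failed = [q for q, a in zip(questions, answers) if not a]
        if not failed:
            break
        # (2) Refinement planning: decide what to keep vs. regenerate.
        plan = plan_refinement(prompt, failed)
        # (3) Region decomposition: segment the regions to preserve.
        keep_mask = decompose_regions(video, plan)
        # (4) Localized refinement: regenerate only outside the mask.
        video = t2v_model.regenerate(prompt, video, keep_mask)
    return video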
🔧 Setup
Environment Setup
You can install all packages from requirements.txt.
conda create -n videorepair python=3.10
conda activate videorepair
pip install -r requirements.txt
Additionally, for Semantic-SAM, you need to install detectron2 as follows:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
OpenAI API Setup
VideoRepair is built on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API configuration. Set your own API information in config.ini:
[openai]
azure_endpoint = your endpoint
api_key = your key
api_version = your version
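For reference, here is a minimal sketch of reading this file with Python's configparser and building a client with the openai package; the repository's own loading code may differ.

# Minimal sketch: read config.ini and build an Azure OpenAI client.
# The repository's own loading code may differ.
import configparser
from openai import AzureOpenAI

config = configparser.ConfigParser()
config.read("config.ini")

client = AzureOpenAI(
    azure_endpoint=config["openai"]["azure_endpoint"],
    api_key=config["openai"]["api_key"],
    api_version=config["openai"]["api_version"],
)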
Download Models
You can download the pre-trained models listed below:
- T2V-turbo
- VideoCrafter2
- MolmoE-1B-0924
- Semantic-SAM (L)
- BLIP-2 (blip2-opt-2.7b)
git lfs install
git clone https://huggingface.co/Salesforce/blip2-opt-2.7b
Next, place all downloaded models in the ./checkpoints directory. The directory structure should look like this:
./checkpoints
├── blip2-opt-2.7b
├── t2v-turbo
│   ├── unet_lora.pt
│   └── inference_t2v_512_v2.0.yaml  # downloaded from the T2V-turbo official repo
└── VideoCrafter
    └── model.ckpt
./SemanticSAM/checkpoint
└── swinl_only_sam_many2many.pth
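To catch path mistakes early, you can verify the layout with a small script like the one below (the paths mirror the tree above; adjust them if yours differ):

# Verify the checkpoint layout shown above.
from pathlib import Path

expected = [
    "./checkpoints/blip2-opt-2.7b",
    "./checkpoints/t2v-turbo/unet_lora.pt",
    "./checkpoints/t2v-turbo/inference_t2v_512_v2.0.yaml",
    "./checkpoints/VideoCrafter/model.ckpt",
    "./SemanticSAM/checkpoint/swinl_only_sam_many2many.pth",
]
for path in expected:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:7} {path}")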
🎨 Apply to your own prompt
We provide a demo (run_demo.sh) for your own prompts. The demo uses main_iter_demo.py.
output_root="your output root"
prompt="your own prompt"

# --model: base T2V model
# --seed=123: global random seed, used for the initial video generation
# --selection_score: video ranking method
# --round: number of refinement rounds
# --seed=369: seed for localized generation
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
    --model="t2vturbo" \
    --output_root="$output_root" \
    --seed=123 \
    --load_molmo \
    --selection_score='dsg_blip' \
    --round=1 \
    --seed=369
💿 Apply to Benchmarks
VideoRepair is tested on EvalCrafter and T2V-CompBench.
We provide our question sets in ./datasets. The structure is as follows:
./datasets
├── compbench
│   ├── consistent_attr.json
│   ├── numeracy.json
│   └── spatial_relationship.json
└── evalcrafter
    ├── dsg_action.json
    ├── dsg_color.json
    ├── dsg_count.json
    └── dsg_none.json
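Each file is a JSON question set. To see its schema before running, you can inspect one entry (a minimal sketch assuming standard json loading; the exact fields depend on the file):

# Peek at one question set to inspect its schema.
import json

with open("./datasets/evalcrafter/dsg_count.json") as f:
    data = json.load(f)

# Print a single entry, whether the top level is a dict or a list.
sample = next(iter(data.values())) if isinstance(data, dict) else data[0]
print(sample)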
Based on the question sets above, you can run the benchmarks as follows:
output_root="your output path"
# Eval dimensions for the target benchmark (e.g., count, action, color)
eval_sections=("count" "action" "color")

# --model: T2V model backbone
# --selection_score: video ranking metric
# --seed: random seed
# --round: iteration round
# --k: number of video candidates
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
        --output_root="$output_root" \
        --eval_section="$section" \
        --model='t2vturbo' \
        --selection_score='dsg_blip' \
        --seed=123 \
        --round=1 \
        --k=10
done
📝 TODO List
- Release EvalCrafter section-wise evaluation code.
- Release the demo + question generation code.
- Release the benchmark generation code.
📖 BibTeX
😊 If you enjoy VideoRepair and find it helpful, citing our paper would be the best support for us!
@article{lee2024videorepair,
  title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
  author={Lee, Daeun and Yoon, Jaehong and Cho, Jaemin and Bansal, Mohit},
  journal={arXiv preprint arXiv:2411.15115},
  year={2024}
}