๐ŸŽฌ๐ŸŽจ VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

April 7, 2026

Project Website arXiv

Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal

ACL 2026 Findings



โœจ VideoRepair can (1) detect misalignments by generating fine-grained evaluation questions and answering, (2) plan refinement, (3) decompose the region and finally (4) conduct localized refinement.

๐Ÿ”ง Setup

Environment Setup

You can install all packages from requirements.txt.

conda create -n videorepair python=3.10
conda activate videorepair
pip install -r requirements.txt 

Additionally, for Semantic-SAM, you should install detectron2 as follows:

python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
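If you want to sanity-check the environment before running anything, a quick check like the following can confirm that the key packages are importable. This is a convenience sketch, not part of the VideoRepair codebase; the package names follow the install steps above.

```python
import importlib.util

# Check that the key dependencies from the setup steps are importable
# (find_spec only locates the package, it does not import it).
status = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ("torch", "detectron2")
}
for pkg, ok in status.items():
    print(pkg, "OK" if ok else "MISSING")
```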

OpenAI API Setup

VideoRepair is built on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API configuration. Add your own API information to config.ini:

[openai]
azure_endpoint = your endpoint   
api_key = your key 
api_version = your version 
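For reference, a file in this format can be parsed with Python's standard configparser. The snippet below is a minimal sketch with placeholder values; the repository's actual loading code may differ.

```python
import configparser

# Placeholder config matching the [openai] section shown above.
SAMPLE = """\
[openai]
azure_endpoint = https://example.openai.azure.com/
api_key = YOUR_KEY
api_version = 2024-02-01
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # with a real file: config.read("config.ini")

openai_cfg = dict(config["openai"])
print(openai_cfg["azure_endpoint"])
```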

Download Models

You can download pre-trained models here.

git lfs install
git clone https://huggingface.co/Salesforce/blip2-opt-2.7b

Next, please place all downloaded models in the ./checkpoints directory. The directory structure should look like this:

./checkpoints
    โ”œโ”€โ”€ blip2-opt-2.7b
    โ”œโ”€โ”€ t2v-turbo 
    โ”‚   โ”œโ”€โ”€ unet_lora.pt
    โ”‚   โ”œโ”€โ”€ inference_t2v_512_v2.0.yaml     # downloaded from T2V-turbo official repo 
    โ”œโ”€โ”€ VideoCrafter
    โ”‚   โ”œโ”€โ”€ model.ckpt

./SemanticSAM/checkpoint
    โ”œโ”€โ”€ swinl_only_sam_many2many.pth

๐ŸŽจ Apply to your own prompt

We provide a demo (run_demo.sh) for your own prompts. The demo uses main_iter_demo.py.

output_root="your output root"
prompt="your own prompt"

# --model:           base t2v model
# --seed (first):    global random seed (used for initial video generation)
# --selection_score: video ranking method
# --seed (second):   localized generation seeds
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
                                                  --model="t2vturbo" \
                                                  --output_root="$output_root" \
                                                  --seed=123 \
                                                  --load_molmo \
                                                  --selection_score='dsg_blip' \
                                                  --round=1 \
                                                  --seed=369

๐ŸŒฟ Apply to Benchmark

VideoRepair is tested on EvalCrafter and T2V-CompBench.

We provide our dsg^{obj} questions in ./datasets. The structure is as follows:

./datasets
    โ”œโ”€โ”€ compbench
    โ”‚   โ”œโ”€โ”€ consistent_attr.json
    โ”‚   โ”œโ”€โ”€ numeracy.json
    โ”‚   โ”œโ”€โ”€ spatial_relationship.json
    โ”œโ”€โ”€ evalcrafter
    โ”‚   โ”œโ”€โ”€ dsg_action.json
    โ”‚   โ”œโ”€โ”€ dsg_color.json
    โ”‚   โ”œโ”€โ”€ dsg_count.json
    โ”‚   โ”œโ”€โ”€ dsg_none.json

Based on the above question sets, you can run the benchmarks as follows:

output_root="your output path"
eval_sections=("count" "action" "color")   # eval dimensions for each benchmark

# --model:           t2v model backbone
# --selection_score: video ranking metric
# --seed:            random seed
# --round:           iteration round
# --k:               number of video candidates
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
                        --output_root="$output_root" \
                        --eval_section="$section" \
                        --model='t2vturbo' \
                        --selection_score='dsg_blip' \
                        --seed=123 \
                        --round=1 \
                        --k=10
done

๐Ÿ“ TODO List

  • Release EvalCrafter section-wise evaluation code.
  • Release the demo + question generation code.
  • Release the benchmark generation code.

๐Ÿ“š BibTeX

๐Ÿ’— If you enjoy our VideoRepair and find some beneficial things, citing our paper would be the best support for us!

@article{lee2024videorepair,
  title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
  author={Lee, Daeun and Yoon, Jaehong and Cho, Jaemin and Bansal, Mohit},
  journal={arXiv preprint arXiv:2411.15115},
  year={2024}
}