๐ŸŽฌ๐ŸŽจ VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

April 7, 2026

Project Website arXiv

Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal

ACL 2026 Findings



โœจ VideoRepair can (1) detect misalignments by generating fine-grained evaluation questions and answering, (2) plan refinement, (3) decompose the region and finally (4) conduct localized refinement.

๐Ÿ”ง Setup

Environment Setup

You can install all packages from requirements.txt.

conda create -n videorepair python=3.10
conda activate videorepair
pip install -r requirements.txt 

Additionally, for Semantic-SAM, you should install detectron2 as follows:

python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
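If you want to sanity-check the environment before running anything, a quick check like the following can confirm that the key packages are importable. This is a convenience sketch, not part of the VideoRepair codebase; the package names follow the install steps above.

```python
import importlib.util

# Check that the key dependencies from the setup steps are importable
# (find_spec only locates the package, it does not import it).
status = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ("torch", "detectron2")
}
for pkg, ok in status.items():
    print(pkg, "OK" if ok else "MISSING")
```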

OpenAI API Setup

VideoRepair is built on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API configuration. Add your own API information to config.ini:

[openai]
azure_endpoint = your endpoint   
api_key = your key 
api_version = your version 
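For reference, a file in this format can be parsed with Python's standard configparser. The snippet below is a minimal sketch with placeholder values; the repository's actual loading code may differ.

```python
import configparser

# Placeholder config matching the [openai] section shown above.
SAMPLE = """\
[openai]
azure_endpoint = https://example.openai.azure.com/
api_key = YOUR_KEY
api_version = 2024-02-01
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # with a real file: config.read("config.ini")

openai_cfg = dict(config["openai"])
print(openai_cfg["azure_endpoint"])
```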

Download Models

You can download pre-trained models here.

git lfs install
git clone https://huggingface.co/Salesforce/blip2-opt-2.7b

Next, please place all downloaded models in the ./checkpoints directory. The directory structure should look like this:

./checkpoints
    โ”œโ”€โ”€ blip2-opt-2.7b
    โ”œโ”€โ”€ t2v-turbo 
    โ”‚   โ”œโ”€โ”€ unet_lora.pt
    โ”‚   โ”œโ”€โ”€ inference_t2v_512_v2.0.yaml     # downloaded from T2V-turbo official repo 
    โ”œโ”€โ”€ VideoCrafter
    โ”‚   โ”œโ”€โ”€ model.ckpt

./SemanticSAM/checkpoint
    โ”œโ”€โ”€ swinl_only_sam_many2many.pth

๐ŸŽจ Apply to your own prompt

We provide a demo (run_demo.sh) for your own prompts. The demo uses main_iter_demo.py.

output_root="your output root"
prompt="your own prompt"

# --model:           base t2v model
# --seed (first):    global random seed (used for initial video generation)
# --selection_score: video ranking method
# --seed (second):   localized generation seeds
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
                                                  --model="t2vturbo" \
                                                  --output_root="$output_root" \
                                                  --seed=123 \
                                                  --load_molmo \
                                                  --selection_score='dsg_blip' \
                                                  --round=1 \
                                                  --seed=369

๐ŸŒฟ Apply to Benchmark

VideoRepair is tested on EvalCrafter and T2V-CompBench.

We provide our dsg^{obj} questions in ./datasets. The structure is as follows:

./datasets
    โ”œโ”€โ”€ compbench
    โ”‚   โ”œโ”€โ”€ consistent_attr.json
    โ”‚   โ”œโ”€โ”€ numeracy.json
    โ”‚   โ”œโ”€โ”€ spatial_relationship.json
    โ”œโ”€โ”€ evalcrafter
    โ”‚   โ”œโ”€โ”€ dsg_action.json
    โ”‚   โ”œโ”€โ”€ dsg_color.json
    โ”‚   โ”œโ”€โ”€ dsg_count.json
    โ”‚   โ”œโ”€โ”€ dsg_none.json

Based on the above question sets, you can run the benchmarks as follows:

output_root="your output path"
eval_sections=("count" "action" "color")   # eval dimensions for each benchmark

# --model:           t2v model backbone
# --selection_score: video ranking metric
# --seed:            random seed
# --round:           iteration round
# --k:               number of video candidates
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
                        --output_root="$output_root" \
                        --eval_section="$section" \
                        --model='t2vturbo' \
                        --selection_score='dsg_blip' \
                        --seed=123 \
                        --round=1 \
                        --k=10
done

๐Ÿ“ TODO List

  • Release EvalCrafter section-wise evaluation code.
  • Release the demo + question generation code.
  • Release the benchmark generation code.

๐Ÿ“š BibTeX

๐Ÿ’— If you enjoy our VideoRepair and find some beneficial things, citing our paper would be the best support for us!

@article{lee2024videorepair,
  title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
  author={Lee, Daeun and Yoon, Jaehong and Cho, Jaemin and Bansal, Mohit},
  journal={arXiv preprint arXiv:2411.15115},
  year={2024}
}