Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion
July 28, 2025 · View on GitHub
Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion
Quick Start
Download requirements
pip install -r requirment.txt
Installation
git clone https://github.com/oooolga/Ctrl-V
cd Ctrl-V
python setup.py develop
Data Directory
In the training and evaluation script, set the DATASET_PATH to the root of the dataset folder. Within this folder, you will find the extracted dataset subfolders. The dataset root folder should be organized in the following format:
Datasets/
├── bdd100k
├── kitti
├── vkitti_2.0.3
└── nuscenes
Preprocessing
We have provided a script to render bounding-box frames and save them to your data directory, which can save time during training. However, this step is optional. You can also choose to render the bounding boxes on the fly by setting the use_preplotted_bbox parameter to False in the get_dataloader call.
To render the bounding-box frames before training, run the following commands
python tools/preprocessing/preprocess_dataset.py $DATASET_PATH
Train a Bounding-box Predictor
A demo script is available at demo_train_bbox_predict.sh.
To run demo_train_bbox_predict.sh, set the $DATASET_PATH and $OUT_DIR to your desired path, and then execute
bash ./scripts/train_scripts/demo_train_bbox_predict.sh
To resume training, set the $NAME variable to the name of the stopped experiment (e.g., bdd100k_bbox_predict_240616_000000). Ensure that you include --resume_from_checkpoint latest and that all the hyperparameter settings match those of the stopped experiment. After this setup, you can resume training by re-executing the training command.
To train on different sets, simply modify DATASET variable's value to kitti, vkitti or bdd100k. You can adjust the number of input frame conditions for your bounding-box predictor by changing the value of num_cond_bbox_frames. To change the last condition bounding-box frame to its trajectory frame, enable if_last_frame_trajectory.
Train a Box2Video Model
A demo script is available at demo_train_video_box2video.sh.
Prior to training the Box2Video model, results may be improved by finetuning the SVD model to the current dataset. To do this run demo_train_video_diffusion.sh, set the $DATASET_PATH and $OUT_DIR to your desired path, and then execute
bash ./scripts/train_scripts/demo_train_video_diffusion.sh
To run demo_train_video_box2video.sh, set the $DATASET_PATH and $OUT_DIR to your desired path, and set OUT_DIR from the previous finetuning step and then execute
bash ./scripts/train_scripts/demo_train_video_box2video.sh
(Note: if you do not wish to start from a finetuned model, simply remove the --finetuned_svd_path argument in demo_train_video_box2video.sh and this will load the (non-finetuned) model from --pretrained_model_name_or_path.
To resume training, set the $NAME variable to the name of the stopped experiment (e.g., bdd100k_ctrlv_240616_000000). Ensure that you include --resume_from_checkpoint latest and that all the hyperparameter settings match those of the stopped experiment. After this setup, you can resume training by re-executing the training command.
To train on different sets, simply modify DATASET variable's value to kitti, vkitti or bdd100k.
Generate and Evaluate Videos
Generate Videos
Running the whole generation pipeline (bounding-box predictor+box2video)
Demo scripts are available at eval_scripts.
To generate videos using the entire generation pipeline (predict bounding boxes and generate videos based on the predicted bounding box sequences), set the following variables in the demo_eval_overall_{}.sh scripts: $DATASET_PATH, $OUT_DIR, $BOX2VIDEO_DIR, and $BBOX_MODEL_DIR, and then execute
bash ./scripts/eval_exripts/demo_eval_overall_{}.sh
For each input sample, the pipeline will predict five bounding-box sequences and select the one with the highest mask-IoU score to generate the final video. We evaluate bounding-box prediction metrics during the generation process, and the results are uploaded to the W&B dashboard.
The generated videos are also uploaded to the W&B dashboard. You can find a local copy of the generated videos in your W&B folder at $OUT_DIR/wandb/run-{run_id}/files/media.
Running the teacher-forced Box2Video generation pipeline
A demo script is available at demo_eval_box2video_tf.sh.
To generate videos using the ground-truth bounding boxes, set the $DATASET_PATH and $OUT_DIR variables in the script ($OUT_DIR should be the same location you used when training the Box2Video model), and then execute the following command:
bash ./scripts/eval_scripts/demo_eval_box2video_tf.sh
The generated videos are also uploaded to the W&B dashboard. You can find a local copy of the generated videos in your W&B folder at $OUT_DIR/wandb/run-{run_id}/files/media.
Evaluations
FVD, LPIPS, SSIM and PSNR
(See src/ctrlv/metrics/fvd.py)
TODO
YOLOv8 Detector and mAP Scores
To compute the mAP and AP scores, run the following command
DATASET_NAME="..." #kitti/vkitti/bdd100k/nuscenes
ABSOLUTE_PATH_TO_WANDB_DIR="/..."
RUN_ID="..."
python tools/run_tracking_metrics.py $ABSOLUTE_PATH_TO_WANDB_DIR/wandb/$RUN_ID/files/media/videos $DATASET_NAME
This code would automatic save the YOLOv8 detection results to $ABSOLUTE_PATH_TO_WANDB_DIR.
Credits
Our library is built on the work of many brilliant researchers and developers. We're grateful for their contributions, which have helped us with this project. Special thanks to the following repositories for providing valuable tools that have enhanced our project:
- @huggingface's diffusion model library.
- @ultralytics's yolov8 library.
Citation
@article{
luo2025ctrlv,
title={Ctrl-V: Higher Fidelity Autonomous Vehicle Video Generation with Bounding-Box Controlled Object Motion},
author={Ge Ya Luo and ZhiHao Luo and Anthony Gosselin and Alexia Jolicoeur-Martineau and Christopher Pal},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=BMGikHBjlx},
}