GETTING_STARTED.md

October 2, 2023

🌟 Instructions for training.

Inference with new images

You can apply InstructCV to new images by following the steps below.

Step 1. Download the pre-trained weights we provide, either with the script below or manually from Google Drive | BaiduNet Disk.

bash scripts/download_pretain_weights.sh

Step 2. Run the following command:

python edit_cli.py --input <path_to_the_directory_you_created> --output <path_to_save> --edit <language_instructions>
# a specific example:
python edit_cli.py --input imgs/ --output outputs/ --edit "segment the cat."
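To run several instructions over the same image folder, the command above can be wrapped in a small script. The following is a minimal sketch, assuming edit_cli.py accepts exactly the --input, --output, and --edit flags shown above. The helper names (build_edit_command, run_instructions) and the per-instruction output subfolders are illustrative assumptions, not part of InstructCV.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List


def build_edit_command(input_dir: str, output_dir: str, instruction: str) -> List[str]:
    """Build the argv list for one edit_cli.py call (flags taken from the command above)."""
    return [
        sys.executable, "edit_cli.py",
        "--input", input_dir,
        "--output", output_dir,
        "--edit", instruction,
    ]


def run_instructions(input_dir: str, output_root: str,
                     instructions: List[str], dry_run: bool = True) -> List[List[str]]:
    """Run one command per instruction; with dry_run=True only print the commands."""
    commands = []
    for index, instruction in enumerate(instructions):
        # Hypothetical layout: one output subfolder per instruction.
        output_dir = str(Path(output_root) / f"instruction_{index}")
        command = build_edit_command(input_dir, output_dir, instruction)
        commands.append(command)
        if dry_run:
            print(shlex.join(command))
        else:
            Path(output_dir).mkdir(parents=True, exist_ok=True)
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Prints the command without executing it; pass dry_run=False to run it.
    run_instructions("imgs/", "outputs/", ["segment the cat."])
```

Keeping dry_run=True as the default makes it easy to inspect the generated commands before any images are processed.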

Training

Training Log

Download pre-trained models

We trained our model starting from the Stable Diffusion V1.5 checkpoint.

#  Stable Diffusion V1.5
bash scripts/download_checkpoints.sh
#  The checkpoint we provide (fine-tuned on our training data for 50 epochs)
bash scripts/download_pretrained_weights.sh

Train with multiple GPUs

python main.py --name <exp_name> --base configs/train.yaml --train --gpus 0,1,2,3,4,5,6,7
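On a machine with fewer than eight GPUs, the --gpus list has to be shortened. The sketch below builds the training command for a given GPU count, assuming --gpus takes a comma-separated list of device indices as in the command above. The helper names and the experiment name are hypothetical, and the sketch does not check how main.py parses a single index or whether other settings (such as batch size) need adjusting.

```python
import shlex
from typing import List


def gpu_list(num_gpus: int) -> str:
    """Return device indices 0..num_gpus-1 as a comma-separated string."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(index) for index in range(num_gpus))


def build_train_command(exp_name: str, num_gpus: int,
                        config: str = "configs/train.yaml") -> List[str]:
    """Build the argv list for main.py using the flags shown above."""
    return [
        "python", "main.py",
        "--name", exp_name,
        "--base", config,
        "--train",
        "--gpus", gpu_list(num_gpus),
    ]


if __name__ == "__main__":
    # "my_experiment" is a placeholder experiment name.
    print(shlex.join(build_train_command("my_experiment", 4)))
```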

Train on Slurm clusters

sbatch scripts/slurm_train

Baseline

Specialized model - Classification

ResNet-50 (pretrained on ImageNet)

python baselines/classification/cls.py --model supervised --dataset pets --steps 100
python baselines/classification/cls.py --model supervised --dataset caltech --steps 100

ViT-16 (pretrained on ImageNet-21k)

python baselines/classification/cls.py --model ViT-16 --dataset pets --steps 300

Specialized model - Semantic Segmentation

SegFormer

Download the pretrained weights (SegFormer-B5) from here.

python tools/test.py local_configs/segformer/B1/segformer.b1.512x512.ade.160k.py /path/to/checkpoint_file

Mask2Former

Download the pretrained weights (Swin-L IN21k with 160k iterations) from here.

Specialized model - Monocular Depth Estimation

BTS

We follow instructions here to reproduce the results.

Binsformer

We follow instructions here to reproduce the results.

Specialized model - Object Detection

Faster R-CNN

We run Faster R-CNN models in Detectron2.

Mask R-CNN

We run Mask R-CNN models (Backbone: R-101-FPN, Lr schd: 2x) in mmdetection.

DETR

We follow instructions here to reproduce the results.

Vision generalists

Generalist models

Unified-IO

We use xl_1000k.bin as the pre-trained model. Inference takes ~27s per image.

Pixel2Seq

To reproduce their results with the repo they provide, change every dict[str, tf.Tensor] annotation to dict. Otherwise you will hit an error such as "TypeError: 'type' is not subscriptable", which is caused by differences in dependency versions.

Change the data_root in dataset_configs.py
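The dict[str, tf.Tensor] change described for Pixel2Seq can be applied mechanically. The sketch below is one way to do it, assuming the annotations appear literally as dict[str, tf.Tensor] in the source files. The function names are hypothetical and not part of either repository. For context, subscripting the built-in dict is only supported from Python 3.9, which is typically why this TypeError appears on older interpreters.

```python
from pathlib import Path

OLD_ANNOTATION = "dict[str, tf.Tensor]"
NEW_ANNOTATION = "dict"


def patch_annotations(source: str) -> str:
    """Replace every literal dict[str, tf.Tensor] annotation with plain dict."""
    return source.replace(OLD_ANNOTATION, NEW_ANNOTATION)


def patch_repo(repo_root: str) -> int:
    """Rewrite all .py files under repo_root in place; return how many files changed."""
    changed = 0
    for path in Path(repo_root).rglob("*.py"):
        original = path.read_text(encoding="utf-8")
        patched = patch_annotations(original)
        if patched != original:
            path.write_text(patched, encoding="utf-8")
            changed += 1
    return changed


if __name__ == "__main__":
    # Small in-memory demonstration; call patch_repo(<repo path>) to edit files.
    sample = "def f(x: dict[str, tf.Tensor]) -> dict[str, tf.Tensor]:\n    return x\n"
    print(patch_annotations(sample))
```

Because patch_repo edits files in place, it is safest to run it on a clean checkout so the changes can be reviewed or reverted with version control.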