GETTING_STARTED.md

October 2, 2023

🌟 Instructions for training.

Inference with new images

You can apply InstructCV to new images by following the steps below.

Step 1. Download the pre-trained weights we provide with the script below, or download them manually from Google Drive | BaiduNet Disk.

bash scripts/download_pretrained_weights.sh

Step 2. Run the following command:

python edit_cli.py --input <path_to_the_directory_you_created> --output <path_to_save> --edit <language_instructions>
# a specific example:
python edit_cli.py --input imgs/ --output outputs/ --edit "segment the cat."
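
To apply several instructions to the same image folder, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_edit_command` is hypothetical, and it uses only the `--input`, `--output`, and `--edit` flags shown above. It prints the commands instead of running them, so it works without the model installed.

```python
import shlex
from typing import List


def build_edit_command(input_dir: str, output_dir: str, instruction: str) -> List[str]:
    """Build the argv list for one edit_cli.py call (hypothetical helper)."""
    return [
        "python", "edit_cli.py",
        "--input", input_dir,
        "--output", output_dir,
        "--edit", instruction,
    ]


if __name__ == "__main__":
    # Instructions to try; only the first one appears in this guide.
    instructions = ["segment the cat."]
    for instruction in instructions:
        cmd = build_edit_command("imgs/", "outputs/", instruction)
        # shlex.join quotes the instruction so it can be pasted into a shell.
        print(shlex.join(cmd))
```

To actually launch each run from the repository root, the list can be passed to `subprocess.run(cmd, check=True)`.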

Training

Training Log

Download pre-trained models

We trained our model starting from the Stable Diffusion V1.5 checkpoint.

# Stable Diffusion V1.5
bash scripts/download_checkpoints.sh
# The checkpoint we provide (fine-tuned on our training data for 50 epochs)
bash scripts/download_pretrained_weights.sh

Train with multiple GPUs

python main.py --name <exp_name> --base configs/train.yaml --train --gpus 0,1,2,3,4,5,6,7
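
If your machine has a different number of GPUs, the `--gpus` value needs to list the device indices you want to use. The sketch below builds the training command for an arbitrary set of devices. The helper names are hypothetical and it assumes `--gpus` accepts a comma-separated list of indices, as in the command above. It deliberately requires at least two devices, because how `main.py` parses a single index is not covered in this guide.

```python
from typing import List, Sequence


def gpu_flag(device_ids: Sequence[int]) -> str:
    """Format device indices as the comma-separated value used by --gpus."""
    if len(device_ids) < 2:
        raise ValueError("this sketch only covers multi-GPU runs (2+ devices)")
    if len(set(device_ids)) != len(device_ids) or min(device_ids) < 0:
        raise ValueError("device ids must be unique, non-negative integers")
    return ",".join(str(i) for i in device_ids)


def build_train_command(exp_name: str, device_ids: Sequence[int],
                        config: str = "configs/train.yaml") -> List[str]:
    """Build the argv list for main.py (hypothetical helper)."""
    return [
        "python", "main.py",
        "--name", exp_name,
        "--base", config,
        "--train",
        "--gpus", gpu_flag(device_ids),
    ]


if __name__ == "__main__":
    # "my_exp" is a placeholder experiment name.
    print(" ".join(build_train_command("my_exp", range(4))))
```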

Train on Slurm clusters

sbatch scripts/slurm_train

Baselines

Specialized model - Classification

ResNet-50 (pretrained on ImageNet)

python baselines/classification/cls.py --model supervised --dataset pets --steps 100
python baselines/classification/cls.py --model supervised --dataset caltech --steps 100

ViT-16 (pretrained on ImageNet-21k)

python baselines/classification/cls.py --model ViT-16 --dataset pets --steps 300

Specialized model - Semantic Segmentation

SegFormer

Download the pretrained weights (SegFormer-B5) from here.

python tools/test.py local_configs/segformer/B1/segformer.b1.512x512.ade.160k.py /path/to/checkpoint_file

Mask2Former

Download the pretrained weights (Swin-L IN2k with 160k iterations) from here.


Specialized model - Monocular Depth Estimation

BTS

We follow instructions here to reproduce the results.

Binsformer

We follow instructions here to reproduce the results.


Specialized model - Object Detection

Faster R-CNN

We run Faster R-CNN models in Detectron2.

Mask R-CNN

We run Mask R-CNN models (Backbone: R-101-FPN, Lr schd: 2x) in mmdetection.

DETR

We follow instructions here to reproduce the results.


Vision generalists

Generalist models

Unified-IO

We use xl_1000k.bin as the pre-trained model. Inference takes about 27 s per image.

Pix2Seq

To reproduce their results with the repo they provide, change every dict[str, tf.Tensor] annotation to dict. Otherwise, a version mismatch in the dependencies raises an error such as "TypeError: 'type' is not subscriptable".

Also change data_root in dataset_configs.py.
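
For context on the dict[str, tf.Tensor] error mentioned above: subscripting the built-in `dict` type is only supported from Python 3.9 onward, so one plausible cause (an assumption here, since the guide only says "dependencies version differences") is running the code on an older interpreter. The sketch below checks this on your interpreter and shows two annotation styles that avoid the problem. `Tensor` is a hypothetical stand-in for `tf.Tensor`, so the example runs without TensorFlow.

```python
import sys
from typing import Dict


class Tensor:
    """Hypothetical stand-in for tf.Tensor so the sketch runs without TensorFlow."""


def builtin_generics_supported() -> bool:
    """Return True if this interpreter can evaluate dict[str, Tensor]."""
    try:
        dict[str, Tensor]  # raises TypeError on Python < 3.9
        return True
    except TypeError:
        return False


def with_plain_dict(features: dict) -> dict:
    """The workaround described above: drop the type parameters."""
    return dict(features)


def with_typing_dict(features: Dict[str, Tensor]) -> Dict[str, Tensor]:
    """Alternative: typing.Dict keeps the parameters and works on older Pythons."""
    return dict(features)


if __name__ == "__main__":
    print("Python", sys.version_info[:2],
          "supports dict[...]:", builtin_generics_supported())
```

Both styles behave identically at runtime; the plain `dict` form simply loses the key and value types for static checkers.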