TRAIN_AND_VALIDATE.md

December 15, 2023

We provide off-the-shelf scripts in the scripts folder.

Training LanguageBind

Cache of pretrained weight	Baidu Yun	Google Cloud	Peking University Yun
Large	Link	Link	Link
Huge	Link	-	Link

For example, to train LanguageBind on Depth-Language with 8 GPUs (1 node x 8 GPUs):

First, download the cache of pretrained weights above and specify CACHE_DIR=path/to/LanguageBind.
Second, set the ANNOTATION and DATA paths here, following the dataset preparation.
Then you can run:

CACHE_DIR="/path/to/LanguageBind"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --do_train \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_d_cls_data "NYUV2"
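
To sanity-check the schedule implied by the flags above, the arithmetic can be sketched as follows. This is a minimal sketch that assumes --batch-size is the per-GPU batch size (as in open_clip-style launchers) and that a trailing partial batch is dropped. The helper names global_batch and steps_per_epoch are illustrative and not part of LanguageBind.

```shell
#!/usr/bin/env bash
# Sketch only: derive the global batch size and steps per epoch from the
# training flags. Assumption: --batch-size is per GPU, not global.

# global_batch <num_gpus> <per_gpu_batch> <accum_freq>
global_batch() {
    echo $(( $1 * $2 * $3 ))
}

# steps_per_epoch <num_samples> <global_batch>
# Integer division: a trailing partial batch is assumed to be dropped.
steps_per_epoch() {
    echo $(( $1 / $2 ))
}

GB=$(global_batch 8 128 1)                                     # 8 GPUs, --batch-size 128, --accum-freq 1
echo "global batch size: ${GB}"                                # prints 1024
echo "steps per epoch:   $(steps_per_epoch 3020000 "${GB}")"   # prints 2949
```

Under these assumptions, and if --warmup is counted in steps, --warmup 200 covers roughly the first 7% of the single epoch.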

Validating LanguageBind

For example, to validate LanguageBind on Depth-Language with 1 GPU:

First, set RESUME to the checkpoint you want to evaluate.
Second, prepare the downstream dataset (see Downstream datasets).
Then you can run:

CACHE_DIR="/path/to/LanguageBind"
RESUME="depth_language.pt"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
    --do_eval \
    --val_d_cls_data "NYUV2"
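
Before launching validation, it can help to confirm that the cache directory and the checkpoint exist. The following is a minimal sketch. The function check_inputs is a hypothetical helper, not part of LanguageBind, and it assumes RESUME points to a regular file (so it does not apply to --resume "latest" in the training command).

```shell
#!/usr/bin/env bash
# Sketch only: fail early if the inputs of the validation command are missing.
# check_inputs is a hypothetical helper, not part of the LanguageBind repo.

# check_inputs <cache_dir> <resume_checkpoint>
check_inputs() {
    local cache_dir="$1"
    local resume="$2"
    local status=0
    if [ ! -d "${cache_dir}" ]; then
        echo "missing cache directory: ${cache_dir}" >&2
        status=1
    fi
    if [ ! -f "${resume}" ]; then
        echo "missing checkpoint: ${resume}" >&2
        status=1
    fi
    return "${status}"
}

# Usage (commented out so the sketch has no side effects):
# check_inputs "${CACHE_DIR}" "${RESUME}" || exit 1
```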

Downstream datasets

Depth

The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. We also provide the processed data below. Change the data_root here.

Datasets	Baidu Yun	Google Cloud	Peking University Yun
NYU	Link	Link	Link
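
For readers who want to reproduce the reformatting themselves, the ImageNet-style layout (one folder per class under a split directory) can be sketched as below. This is an illustration under stated assumptions, not the authors' conversion script: the labels file with one "<filename> <class>" pair per line is a hypothetical format, and to_imagefolder is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Sketch only: copy files into <out_dir>/val/<class>/ based on a labels file.
# Assumption: labels_file has one "<filename> <class>" pair per line
# (hypothetical format; the original annotations may differ).

# to_imagefolder <src_dir> <labels_file> <out_dir>
to_imagefolder() {
    local src="$1"
    local labels="$2"
    local out="$3"
    local file class
    # "class" receives the rest of the line, so class names may contain spaces.
    while read -r file class || [ -n "${file}" ]; do
        [ -n "${file}" ] || continue
        mkdir -p "${out}/val/${class}"
        cp "${src}/${file}" "${out}/val/${class}/"
    done < "${labels}"
}

# Usage (paths are placeholders):
# to_imagefolder /path/to/raw /path/to/labels.txt /path/to/nyuv2/data
```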

Video

Video datasets are downloaded from this repo; the expected layout is shown under Folder structure. Change the data_root here.

Audio

Audio datasets are downloaded from this repo, and Audioset from here. We reformat them to conform to the standard ImageNet format. Change the data_root here1 and here2.

Infrared (Thermal)

We download LLVIP from the official website and FLIR from here. We reformat them to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.

Datasets	Baidu Yun	Google Cloud	Peking University Yun
LLVIP	Link	Link	Link
FLIR V1	Link	Link	Link
FLIR V2	Link	Link	Link
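
After downloading or reformatting a dataset, a quick per-class file count is a cheap way to confirm that the class folders are populated. The sketch below assumes the ImageNet-style layout described above; count_per_class is an illustrative helper, not part of LanguageBind.

```shell
#!/usr/bin/env bash
# Sketch only: print "<class><TAB><number of files>" for each class folder
# in a split directory such as Thermal/llvip/val.

# count_per_class <split_dir>
count_per_class() {
    local split_dir="$1"
    local class_dir
    for class_dir in "${split_dir}"/*/; do
        [ -d "${class_dir}" ] || continue
        printf '%s\t%s\n' \
            "$(basename "${class_dir}")" \
            "$(find "${class_dir}" -type f | wc -l | tr -d ' ')"
    done
}

# Usage (path is a placeholder):
# count_per_class /path/to/downstream_datasets/Thermal/llvip/val
```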

Folder structure

downstream_datasets
├── Audio
│   ├── audiocaps
│   │   └── audio
│   │       ├── test
│   │       ├── train
│   │       └── val
│   ├── audioset
│   │   ├── balanced_train_segments
│   │   ├── eval_segments
│   │   └── unbalanced_train_segments
│   │       ├── unbalanced_train_segments_part00
│   │       ├── unbalanced_train_segments_part01
│   │       ├── ...
│   │       └── unbalanced_train_segments_part40
│   ├── clotho
│   │   ├── CLOTHO_retrieval_dataset
│   │   └── evaluation
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── ...
│   │       └── wind
│   ├── laionaudio
│   │   ├── audios
│   │   ├── freesound_no_overlap
│   │   └── jsons
│   └── vggsound
│       └── test
│           ├── air\ conditioning\ noise
│           ├── air\ horn
│           ├── ...
│           └── zebra\ braying
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
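
The tree above can be checked mechanically before running evaluation. The following sketch tests a handful of the directories listed in the tree. It assumes data_root points at the downstream_datasets folder, and check_layout is a hypothetical helper rather than something shipped with LanguageBind. Extend the list with the datasets you actually use.

```shell
#!/usr/bin/env bash
# Sketch only: report which expected dataset directories are missing.
# Assumption: <data_root> is the downstream_datasets folder shown in the tree.

# check_layout <data_root>; returns 1 if any listed directory is missing.
check_layout() {
    local root="$1"
    local missing=0
    local dir
    for dir in \
        "Audio/audiocaps/audio/test" \
        "Audio/esc50/test" \
        "Depth/nyuv2/data/val" \
        "Thermal/flirv1/val" \
        "Thermal/flirv2/val" \
        "Thermal/llvip/val" \
        "VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos"
    do
        if [ ! -d "${root}/${dir}" ]; then
            echo "missing: ${root}/${dir}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage (path is a placeholder):
# check_layout /path/to/downstream_datasets || exit 1
```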