TRAIN_AND_VALIDATE.md
December 15, 2023
We provide off-the-shelf scripts in the scripts folder.
Training LanguageBind
| Cache of pretrained weights | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| Large | Link | Link | Link |
| Huge | Link | - | Link |
For example, to train LanguageBind on Depth-Language with 8 GPUs (1 node x 8 GPUs):
- First, download the cache of pretrained weights above and specify CACHE_DIR=path/to/LanguageBind.
- Second, set the paths to ANNOTATION and DATA here according to the dataset preparation.
- Then you can run:
CACHE_DIR="/path/to/LanguageBind"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--do_train \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
--do_eval \
--val_d_cls_data "NYUV2"
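As a rough sanity check on the training command above, the following sketch computes the global batch size and the approximate number of optimizer steps per epoch. It assumes that --batch-size is a per-GPU value and that --accum-freq multiplies the effective batch, as is common in OpenCLIP-style training scripts; neither assumption is stated in this document. The function names are hypothetical helpers, not part of the repository.

```shell
# Values copied from the training command above.
NUM_NODES=1           # --nnodes
GPUS_PER_NODE=8       # --nproc_per_node
BATCH_SIZE=128        # --batch-size (assumed to be per GPU)
ACCUM_FREQ=1          # --accum-freq
NUM_SAMPLES=3020000   # --train-num-samples

# Assumption: global batch = nodes * GPUs per node * per-GPU batch * accumulation steps.
global_batch() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Assumption: a trailing partial batch is dropped, so integer division is used.
steps_per_epoch() {
    echo $(( $1 / $2 ))
}

GLOBAL_BATCH=$(global_batch "$NUM_NODES" "$GPUS_PER_NODE" "$BATCH_SIZE" "$ACCUM_FREQ")
STEPS=$(steps_per_epoch "$NUM_SAMPLES" "$GLOBAL_BATCH")

echo "global batch size: ${GLOBAL_BATCH}"
echo "approx. optimizer steps per epoch: ${STEPS}"
```

Under these assumptions the command trains with a global batch of 1024 for roughly 2949 steps, so the 200 warmup steps cover about 7% of the single epoch. If you change the GPU count, adjust --batch-size or --accum-freq to keep the global batch comparable.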
Validating LanguageBind
For example, to validate LanguageBind on Depth-Language with 1 GPU:
- First, specify RESUME (the checkpoint passed to --resume).
- Second, prepare the downstream dataset.
- Then you can run:
CACHE_DIR="/path/to/LanguageBind"
RESUME="thermal_language.pt"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
-m main \
--train-data ${ANNOTATION} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
--do_eval \
--val_d_cls_data "NYUV2"
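Because the validation command above depends on three paths, it can help to check them before launching torchrun. The sketch below is a hypothetical helper, not part of the repository; it assumes CACHE_DIR is a directory, RESUME is a checkpoint file, and ANNOTATION is any existing path.

```shell
# check_inputs CACHE_DIR RESUME ANNOTATION
# Prints one line per missing input to stderr and returns non-zero if any is missing.
check_inputs() {
    rc=0
    [ -d "$1" ] || { echo "missing cache dir: $1" >&2; rc=1; }
    [ -f "$2" ] || { echo "missing checkpoint: $2" >&2; rc=1; }
    [ -e "$3" ] || { echo "missing annotation path: $3" >&2; rc=1; }
    return "$rc"
}

# Usage, placed just before the torchrun command:
# check_inputs "$CACHE_DIR" "$RESUME" "$ANNOTATION" || exit 1
```

Failing early this way avoids waiting for the distributed launcher to start before a missing checkpoint is reported.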
Downstream datasets
Depth
The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. We also provide the data as follows. Change the data_root here.
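The standard ImageNet format means one subfolder per class, with the samples of that class inside it. A minimal sketch for checking a reformatted split is shown below; count_per_class is a hypothetical helper, and the path in the usage line is a placeholder that depends on where you store the data.

```shell
# count_per_class SPLIT_DIR
# Prints "<class_name> <file_count>" for every class folder under SPLIT_DIR.
count_per_class() {
    for class_dir in "$1"/*/; do
        # Skip the unexpanded glob when SPLIT_DIR has no subfolders.
        [ -d "$class_dir" ] || continue
        name=$(basename "$class_dir")
        n=$(find "$class_dir" -type f | wc -l | tr -d ' ')
        echo "$name $n"
    done
}

# Usage (placeholder path):
# count_per_class /path/to/downstream_datasets/Depth/nyuv2/data/val
```

A class that prints a count of 0, or a class that is missing from the output, usually indicates that the reformatting step did not complete.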
Video
Video datasets are downloaded from this repo; the expected layout is shown under Folder structure. Change the data_root here.
Audio
Audio datasets are downloaded from this repo, and AudioSet from here. We reformat them to conform to the standard ImageNet format. Change the data_root here1 and here2.
Infrared (Thermal)
We download LLVIP from the official website, and FLIR from here. We reformat them to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.
| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LLVIP | Link | Link | Link |
| FLIR V1 | Link | Link | Link |
| FLIR V2 | Link | Link | Link |
Folder structure
downstream_datasets
├── Audio
│   ├── audiocaps
│   │   └── audio
│   │       ├── test
│   │       ├── train
│   │       └── val
│   ├── audioset
│   │   ├── balanced_train_segments
│   │   ├── eval_segments
│   │   └── unbalanced_train_segments
│   │       ├── unbalanced_train_segments_part00
│   │       ├── unbalanced_train_segments_part01
│   │       ├── ...
│   │       └── unbalanced_train_segments_part40
│   ├── clotho
│   │   ├── CLOTHO_retrieval_dataset
│   │   └── evaluation
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── ...
│   │       └── wind
│   ├── laionaudio
│   │   ├── audios
│   │   ├── freesound_no_overlap
│   │   └── jsons
│   └── vggsound
│       └── test
│           ├── air\ conditioning\ noise
│           ├── air\ horn
│           ├── ...
│           └── zebra\ braying
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
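To confirm that a local copy matches the folder structure above, a small check such as the following can be run against the dataset root. This is a sketch only: check_layout and EXPECTED_DIRS are hypothetical names, and the list covers just a subset of the directories shown above, so extend it for the datasets you actually use.

```shell
# A subset of relative paths taken from the folder structure above.
EXPECTED_DIRS="
Audio/audiocaps/audio
Audio/esc50/test
Depth/nyuv2/data/val
Thermal/flirv1/val
Thermal/flirv2/val
Thermal/llvip/val
VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos
VideoTextRetrieval/vtRetdata/MSVD/MSVD_Videos
"

# check_layout ROOT
# Prints every expected directory that is missing under ROOT.
# Returns non-zero if at least one directory is missing.
check_layout() {
    rc=0
    for rel in $EXPECTED_DIRS; do
        if [ ! -d "$1/$rel" ]; then
            echo "missing: $1/$rel"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder path):
# check_layout /path/to/downstream_datasets
```

The list relies on word splitting of an unquoted variable, which works in sh and bash because none of the listed paths contain spaces.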