E.T. Chat
January 20, 2025
E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, and it serves as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder, a frame compressor, and an LLM. A special token <vid> is introduced to trigger frame embedding matching for timestamp prediction.
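To make the idea of embedding matching concrete, here is a minimal sketch. It assumes that the hidden state produced at the <vid> token is compared with one embedding per frame by cosine similarity, and that the best-matching frame index is converted to seconds through the sampling rate. The function names, the similarity measure, and the index-to-time mapping are illustrative assumptions, not the model's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_timestamp(vid_embedding, frame_embeddings, fps):
    """Return (best_frame_index, timestamp_in_seconds).

    vid_embedding: hypothetical embedding emitted at the <vid> token.
    frame_embeddings: one hypothetical embedding per sampled frame.
    fps: sampling rate of the frames (assumed mapping from index to time).
    """
    scores = [cosine(vid_embedding, frame) for frame in frame_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, best / fps


# Toy example with 2-D embeddings and frames sampled at 1 FPS.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
index, seconds = match_timestamp([0.1, 0.9], frames, fps=1.0)
print(index, seconds)  # 2 2.0
```

The point of the sketch is that the timestamp is selected from the frame sequence rather than generated digit by digit as text.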
🛠️ Installation
We use the environment listed below. If automatic installation fails, you can install these packages manually.
- CUDA 11.8
- Python 3.12.2
- PyTorch 2.4.0
- Transformers 4.44.2
- DeepSpeed 0.14.5
- NNCore 0.4.5
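If you want to confirm that your environment matches the versions listed above, a short check like the following may help. It is a sketch only: the pip distribution names (`torch`, `transformers`, `deepspeed`, `nncore`) are assumed, and local build tags such as `+cu118` are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the list above; distribution names are assumptions.
EXPECTED = {
    "torch": "2.4.0",
    "transformers": "4.44.2",
    "deepspeed": "0.14.5",
    "nncore": "0.4.5",
}


def check_versions(expected, lookup=version):
    """Map each package to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Drop a local build tag such as "+cu118" before comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (found, matches)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions(EXPECTED).items():
        print(f"{name}: installed={found} expected={EXPECTED[name]} ok={ok}")
```

A mismatch is not necessarily a problem; the listed versions are simply the ones the authors report using.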
Install from source
- Clone the repository from GitHub.
git clone https://github.com/PolyU-ChenLab/ETBench.git
cd ETBench
- Initialize conda environment.
conda create -n etchat python=3.12 -y
conda activate etchat
- Install dependencies.
pip install -r requirements.txt
🚀 Getting Started
We apply a three-stage training recipe for E.T. Chat: the first stage performs modality alignment, the second stage builds general chatting abilities, and the third stage enhances time-sensitive chatting abilities.
Prepare model checkpoints
We compare the learnable modules in each stage, and provide their checkpoints as follows.
Note
We additionally trained a model by mixing the data of stage 2 and stage 3, yielding much better general chatting capabilities but slightly sub-optimal grounding performance. The checkpoint is listed below under the Stage-2+3 tag.
| | Encoder | Q-Former | Aggregator | Projector | LLM (LoRA) | Checkpoint |
|---|---|---|---|---|---|---|
| Stage-1 | ❄️ | ❄️ | 🔥 | 🔥 | ❄️ | |
| Stage-2 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | |
| Stage-3 | ❄️ | 🔥 / ❄️ | 🔥 | 🔥 | 🔥 | |
| Stage-2+3 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | |
If you want to start from Stage-1, the pre-trained weights from Phi3-Mini-4K-Instruct, EVA-ViT-G, and Q-Former are required for initializing the model. Save the downloaded checkpoints in the model_zoo folder.
Prepare datasets
The training data used in each stage is summarized below. We follow the same setting as LLaMA-VID in Stage-1 and Stage-2, and introduce an additional Stage-3 together with the new E.T. Instruct 164K dataset.
| | Video Data | Image Data | Annotations |
|---|---|---|---|
| Stage-1 | WebVid | LCS-558K | llava_558k_with_webvid.json |
| Stage-2 | ActivityNet / VideoChatGPT | LLaVA-1.5-Instruct | llava_v1_5_mix665k_with_video_chatgpt.json |
| Stage-3 | ET-Instruct-164K | - | et_instruct_164k_vid.json |
Download the required datasets and place them in the data folder. It is strongly recommended to compress the videos (to 3 FPS and a 224-pixel short side) using the script provided in E.T. Bench. After processing, make sure the files are organized in the following structure.
ETBench
├─ data
│  ├─ llamavid
│  │  ├─ llava_558k_with_webvid.json
│  │  └─ llava_v1_5_mix665k_with_video_chatgpt.json
│  ├─ llava_pretrain                 ─┐
│  │  └─ images                       │ For
│  ├─ webvid                          │ Stage-1
│  │  └─ videos                      ─┘
│  ├─ llava_instruct                 ─┐
│  │  ├─ coco                         │
│  │  ├─ gqa                          │
│  │  ├─ ocr_vqa                      │ For
│  │  ├─ textvqa                      │ Stage-2
│  │  └─ vg                           │
│  ├─ video_chatgpt                   │
│  │  └─ activitynet                 ─┘
│  ├─ et_instruct_164k               ─┐
│  │  ├─ videos                       │ For
│  │  ├─ et_instruct_164k_txt.json    │ Stage-3
│  │  └─ et_instruct_164k_vid.json   ─┘
│  └─ etbench                        ─┐
│     ├─ annotations                  │ For
│     ├─ videos                       │ Evaluation
│     └─ videos_compressed           ─┘
├─ model_zoo
│  ├─ Phi-3-mini-4k-instruct
│  ├─ eva_vit_g.pth
│  └─ instruct_blip_vicuna7b_trimmed.pth
├─ etchat
├─ scripts
└─ README.md
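Before launching a stage, it can be useful to verify that the paths from the structure above are in place. The helper below is a sketch: the per-stage grouping follows the annotations in the tree, while the function name `missing_paths` and the choice to list the model_zoo files under Stage-1 are assumptions made for illustration.

```python
from pathlib import Path

# Paths copied from the directory structure above, relative to the repo root.
REQUIRED = {
    "stage-1": [
        "data/llamavid/llava_558k_with_webvid.json",
        "data/llava_pretrain/images",
        "data/webvid/videos",
        "model_zoo/Phi-3-mini-4k-instruct",
        "model_zoo/eva_vit_g.pth",
        "model_zoo/instruct_blip_vicuna7b_trimmed.pth",
    ],
    "stage-2": [
        "data/llamavid/llava_v1_5_mix665k_with_video_chatgpt.json",
        "data/llava_instruct/coco",
        "data/llava_instruct/gqa",
        "data/llava_instruct/ocr_vqa",
        "data/llava_instruct/textvqa",
        "data/llava_instruct/vg",
        "data/video_chatgpt/activitynet",
    ],
    "stage-3": [
        "data/et_instruct_164k/videos",
        "data/et_instruct_164k/et_instruct_164k_txt.json",
        "data/et_instruct_164k/et_instruct_164k_vid.json",
    ],
    "evaluation": [
        "data/etbench/annotations",
        "data/etbench/videos",
        "data/etbench/videos_compressed",
    ],
}


def missing_paths(root, stage):
    """Return the required paths for `stage` that do not exist under `root`."""
    root = Path(root)
    return [p for p in REQUIRED[stage] if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the repository root (the ETBench folder).
    for stage in REQUIRED:
        missing = missing_paths(".", stage)
        print(f"{stage}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

The check only tests for existence; it does not validate the contents of any file or folder.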
🔮 Training
Use the following commands to train E.T. Chat. The default setting uses 8 NVIDIA V100 (32G) GPUs. If your device configuration differs, adjust nproc_per_node, per_device_train_batch_size, and gradient_accumulation_steps so that the global batch size stays the same.
# Stage-1 (around 6 hours on 8*V100)
bash scripts/train_stage_1.sh
# Stage-2 (around 32 hours on 8*V100)
bash scripts/train_stage_2.sh [<path-to-stage-1-checkpoint>]
# Stage-3 (around 20 hours on 8*V100)
bash scripts/train_stage_3.sh [<path-to-stage-2-checkpoint>]
The training logs and checkpoints will be saved in the work_dirs folder.
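The note about keeping the global batch size constant comes down to simple arithmetic: in data-parallel training, the global batch size is the product of the number of processes, the per-device batch size, and the gradient accumulation steps. The sketch below shows the calculation. The numbers in the example are illustrative only and are not the defaults of the training scripts; check the scripts for the actual values.

```python
def global_batch_size(nproc_per_node, per_device_train_batch_size,
                      gradient_accumulation_steps, nnodes=1):
    """Number of samples contributing to one optimizer step."""
    return (nnodes * nproc_per_node * per_device_train_batch_size
            * gradient_accumulation_steps)


def accumulation_steps_for(target_global, nproc_per_node,
                           per_device_train_batch_size, nnodes=1):
    """Gradient accumulation steps needed to reach `target_global`."""
    per_step = nnodes * nproc_per_node * per_device_train_batch_size
    if target_global % per_step:
        raise ValueError(
            f"{target_global} is not divisible by {per_step}; "
            "adjust the per-device batch size"
        )
    return target_global // per_step


# Illustrative values only: a reference run on 8 GPUs ...
reference = global_batch_size(nproc_per_node=8,
                              per_device_train_batch_size=16,
                              gradient_accumulation_steps=1)
# ... reproduced on 4 GPUs with a smaller per-device batch.
steps = accumulation_steps_for(reference, nproc_per_node=4,
                               per_device_train_batch_size=8)
print(reference, steps)  # 128 4
```

Halving the GPU count and halving the per-device batch size therefore requires four times as many accumulation steps to keep the same global batch.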
💻 Inference
Use the following command to run inference on E.T. Bench.
bash scripts/inference.sh [<path-to-checkpoint>]
This starts 8 processes (one per GPU) and generates 8 JSON files in the <path-to-checkpoint>/etbench folder. Pass the path to this folder to E.T. Bench's evaluation script to compute the metrics.
python compute_metrics.py <path-to-checkpoint>/etbench
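Before computing metrics, you may want to confirm that every inference process finished and wrote a readable file. The following sketch checks that the output folder contains the expected number of JSON files and that each one parses. The helper name `load_predictions` is hypothetical, and nothing is assumed about the structure of the JSON content.

```python
import json
from pathlib import Path


def load_predictions(folder, expected_files=8):
    """Parse every *.json file in `folder`, keyed by file name.

    Raises RuntimeError if the number of files differs from `expected_files`,
    and json.JSONDecodeError if a file is truncated or malformed.
    """
    files = sorted(Path(folder).glob("*.json"))
    if len(files) != expected_files:
        raise RuntimeError(
            f"expected {expected_files} JSON files in {folder}, found {len(files)}"
        )
    return {f.name: json.loads(f.read_text(encoding="utf-8")) for f in files}
```

For example, `load_predictions("<path-to-checkpoint>/etbench")` raises an error if one of the 8 processes crashed before writing its output.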