E.T. Chat

January 20, 2025

E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder E_v, a frame compressor E_c, and an LLM. A special token <vid> is introduced to trigger frame embedding matching for timestamp prediction.
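
As a rough sketch of the embedding-matching idea (the function name, shapes, and similarity choice below are illustrative assumptions, not the repository's actual API): the hidden state produced at the <vid> token is compared against the compressed frame embeddings, and the index of the best-matching frame is converted into a timestamp.

```python
import numpy as np

def match_timestamp(vid_embedding: np.ndarray,
                    frame_embeddings: np.ndarray,
                    fps: float) -> float:
    """Map the <vid> token's hidden state to a timestamp by cosine
    similarity against per-frame embeddings (illustrative only)."""
    v = vid_embedding / np.linalg.norm(vid_embedding)
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    best_frame = int(np.argmax(f @ v))  # most similar frame index
    return best_frame / fps             # frame index -> seconds
```

Because the prediction is a matching score rather than generated digits, timestamps stay anchored to actual frames instead of free-form text.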

🛠️ Installation

The environment settings below are the ones we use. If automatic installation fails, you can install the packages manually.

Install from source

1. Clone the repository from GitHub.

git clone https://github.com/PolyU-ChenLab/ETBench.git
cd ETBench

2. Initialize the conda environment.

conda create -n etchat python=3.12 -y
conda activate etchat

3. Install dependencies.

pip install -r requirements.txt

🚀 Getting Started

We apply a three-stage training recipe for E.T. Chat: the first stage performs modality alignment, the second acquires general chatting abilities, and the third enhances time-sensitive chatting abilities.

Prepare model checkpoints

We compare the learnable modules in each stage and provide their checkpoints below.

Note

We additionally trained a model by mixing the data of stage 2 and stage 3, yielding much better general chatting capabilities but slightly sub-optimal grounding performance. The checkpoint is listed below under the Stage-2+3 tag.

| | Encoder | Q-Former | Aggregator | Projector | LLM (LoRA) | Checkpoint |
|---|---|---|---|---|---|---|
| Stage-1 | ❄️ | ❄️ | 🔥 | 🔥 | ❄️ | Hugging Face |
| Stage-2 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | Hugging Face |
| Stage-3 | ❄️ | 🔥 / ❄️ | 🔥 | 🔥 | 🔥 | Hugging Face |
| Stage-2+3 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | Hugging Face |

If you want to start from stage-1, the pre-trained weights of Phi3-Mini-4K-Instruct, EVA-ViT-G, and Q-Former are required to initialize the model. Save the downloaded checkpoints in the model_zoo folder.

Prepare datasets

The training data used in each stage is summarized below. Stage-1 and Stage-2 follow the same setting as LLaMA-VID, while Stage-3 is newly introduced together with the E.T. Instruct 164K dataset.

| | Video Data | Image Data | Annotations |
|---|---|---|---|
| Stage-1 | WebVid | LCS-558K | llava_558k_with_webvid.json |
| Stage-2 | ActivityNet / VideoChatGPT | LLaVA-1.5-Instruct | llava_v1_5_mix665k_with_video_chatgpt.json |
| Stage-3 | ET-Instruct-164K | - | et_instruct_164k_vid.json |

Download the required datasets and place them in the data folder. It is strongly recommended to compress the videos (to 3 FPS, shorter side 224) using the script provided in E.T. Bench. After processing, make sure the files are organized in the following structure.

ETBench
├─ data
│  ├─ llamavid
│  │  ├─ llava_558k_with_webvid.json
│  │  └─ llava_v1_5_mix665k_with_video_chatgpt.json
│  ├─ llava_pretrain                 ─┐
│  │  └─ images                       │ For
│  ├─ webvid                          │ Stage-1
│  │  └─ videos                      ─┘
│  ├─ llava_instruct                 ─┐
│  │  ├─ coco                         │
│  │  ├─ gqa                          │
│  │  ├─ ocr_vqa                      │ For
│  │  ├─ textvqa                      │ Stage-2
│  │  └─ vg                           │
│  ├─ video_chatgpt                   │
│  │  └─ activitynet                 ─┘
│  ├─ et_instruct_164k               ─┐
│  │  ├─ videos                       │ For
│  │  ├─ et_instruct_164k_txt.json    │ Stage-3
│  │  └─ et_instruct_164k_vid.json   ─┘
│  └─ etbench                        ─┐
│     ├─ annotations                  │ For
│     ├─ videos                       │ Evaluation
│     └─ videos_compressed           ─┘
├─ model_zoo
│  ├─ Phi-3-mini-4k-instruct
│  ├─ eva_vit_g.pth
│  └─ instruct_blip_vicuna7b_trimmed.pth
├─ etchat
├─ scripts
└─ README.md
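
Once the data is in place, a quick sanity check like the following can catch missing files before a long training run (a minimal sketch; the REQUIRED list samples a few paths from the layout above and is not exhaustive).

```python
from pathlib import Path

# A few representative paths from the expected layout; extend as needed.
REQUIRED = [
    "data/llamavid/llava_558k_with_webvid.json",
    "data/et_instruct_164k/et_instruct_164k_vid.json",
    "model_zoo/eva_vit_g.pth",
]

def missing_paths(root: str) -> list:
    """Return the required paths that do not exist under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]
```

Running missing_paths(".") from the repository root should return an empty list when everything is downloaded.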

🔮 Training

Use the following commands to train E.T. Chat. The default setting uses 8 NVIDIA V100 (32 GB) GPUs. If your hardware differs, adjust nproc_per_node, per_device_train_batch_size, and gradient_accumulation_steps so that the global batch size stays the same.

# Stage-1 (around 6 hours on 8*V100)
bash scripts/train_stage_1.sh

# Stage-2 (around 32 hours on 8*V100)
bash scripts/train_stage_2.sh [<path-to-stage-1-checkpoint>]

# Stage-3 (around 20 hours on 8*V100)
bash scripts/train_stage_3.sh [<path-to-stage-2-checkpoint>]

The training logs and checkpoints will be saved in the work_dirs folder.
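
The global-batch-size bookkeeping above boils down to a simple product; for example, halving the GPU count can be offset by doubling gradient accumulation (the numbers below are illustrative, the actual per-stage values live in the training scripts):

```python
def global_batch_size(nproc_per_node: int,
                      per_device_train_batch_size: int,
                      gradient_accumulation_steps: int) -> int:
    """Effective global batch size for single-node training."""
    return (nproc_per_node
            * per_device_train_batch_size
            * gradient_accumulation_steps)

# 8 GPUs with accumulation 2 matches 4 GPUs with accumulation 4.
assert global_batch_size(8, 4, 2) == global_batch_size(4, 4, 4)
```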

💻 Inference

Use the following command to run inference on E.T. Bench.

bash scripts/inference.sh [<path-to-checkpoint>]

This will start 8 processes (one per GPU) and generate 8 JSON files in the <path-to-checkpoint>/etbench folder. Pass the path to this folder to E.T. Bench's evaluation script to compute metrics.

python compute_metrics.py <path-to-checkpoint>/etbench
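
If you want to inspect the raw predictions yourself before scoring, the shards can be concatenated like this (a sketch assuming each shard is a JSON list, which you should verify against your own outputs):

```python
import json
from pathlib import Path

def load_predictions(result_dir: str) -> list:
    """Concatenate the per-process JSON shards into one prediction list."""
    preds = []
    for shard in sorted(Path(result_dir).glob("*.json")):
        preds.extend(json.loads(shard.read_text()))
    return preds
```

Sorting the shard paths keeps the order deterministic across runs.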