E.T. Chat

January 20, 2025

E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder E_v, a frame compressor E_c, and an LLM. A special token <vid> is introduced to trigger frame embedding matching for timestamp prediction.
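
As a rough sketch of the embedding-matching idea (the function name, shapes, and similarity choice below are illustrative assumptions, not the repository's actual API): the hidden state produced at the <vid> token is compared against the compressed frame embeddings, and the index of the best-matching frame is converted into a timestamp.

```python
import numpy as np

def match_timestamp(vid_embedding: np.ndarray,
                    frame_embeddings: np.ndarray,
                    fps: float) -> float:
    """Map the <vid> token's hidden state to a timestamp by cosine
    similarity against per-frame embeddings (illustrative only)."""
    v = vid_embedding / np.linalg.norm(vid_embedding)
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    best_frame = int(np.argmax(f @ v))  # most similar frame index
    return best_frame / fps             # frame index -> seconds
```

Because the prediction is a matching score rather than generated digits, timestamps stay anchored to actual frames instead of free-form text.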

🛠️ Installation

The environment settings below are the ones we use. If automatic installation fails, you can install the packages manually.

Install from source

1. Clone the repository from GitHub.

git clone https://github.com/PolyU-ChenLab/ETBench.git
cd ETBench

2. Initialize the conda environment.

conda create -n etchat python=3.12 -y
conda activate etchat

3. Install dependencies.

pip install -r requirements.txt

🚀 Getting Started

We apply a three-stage training recipe for E.T. Chat: the first stage performs modality alignment, the second acquires general chatting abilities, and the third enhances time-sensitive chatting abilities.

Prepare model checkpoints

We compare the learnable modules in each stage and provide their checkpoints below.

Note

We additionally trained a model by mixing the data of stage 2 and stage 3, yielding much better general chatting capabilities but slightly sub-optimal grounding performance. The checkpoint is listed below under the Stage-2+3 tag.

| | Encoder | Q-Former | Aggregator | Projector | LLM (LoRA) | Checkpoint |
|---|---|---|---|---|---|---|
| Stage-1 | ❄️ | ❄️ | 🔥 | 🔥 | ❄️ | Hugging Face |
| Stage-2 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | Hugging Face |
| Stage-3 | ❄️ | 🔥 / ❄️ | 🔥 | 🔥 | 🔥 | Hugging Face |
| Stage-2+3 | ❄️ | 🔥 | 🔥 | 🔥 | 🔥 | Hugging Face |

If you want to start from stage-1, the pre-trained weights of Phi3-Mini-4K-Instruct, EVA-ViT-G, and Q-Former are required to initialize the model. Save the downloaded checkpoints in the model_zoo folder.

Prepare datasets

The training data used in each stage is summarized below. Stage-1 and Stage-2 follow the same setting as LLaMA-VID, while Stage-3 is newly introduced together with the E.T. Instruct 164K dataset.

| | Video Data | Image Data | Annotations |
|---|---|---|---|
| Stage-1 | WebVid | LCS-558K | llava_558k_with_webvid.json |
| Stage-2 | ActivityNet / VideoChatGPT | LLaVA-1.5-Instruct | llava_v1_5_mix665k_with_video_chatgpt.json |
| Stage-3 | ET-Instruct-164K | - | et_instruct_164k_vid.json |

Download the required datasets and place them in the data folder. It is strongly recommended to compress the videos (to 3 FPS, shorter side 224) using the script provided in E.T. Bench. After processing, make sure the files are organized in the following structure.

ETBench
├─ data
│  ├─ llamavid
│  │  ├─ llava_558k_with_webvid.json
│  │  └─ llava_v1_5_mix665k_with_video_chatgpt.json
│  ├─ llava_pretrain                 ─┐
│  │  └─ images                       │ For
│  ├─ webvid                          │ Stage-1
│  │  └─ videos                      ─┘
│  ├─ llava_instruct                 ─┐
│  │  ├─ coco                         │
│  │  ├─ gqa                          │
│  │  ├─ ocr_vqa                      │ For
│  │  ├─ textvqa                      │ Stage-2
│  │  └─ vg                           │
│  ├─ video_chatgpt                   │
│  │  └─ activitynet                 ─┘
│  ├─ et_instruct_164k               ─┐
│  │  ├─ videos                       │ For
│  │  ├─ et_instruct_164k_txt.json    │ Stage-3
│  │  └─ et_instruct_164k_vid.json   ─┘
│  └─ etbench                        ─┐
│     ├─ annotations                  │ For
│     ├─ videos                       │ Evaluation
│     └─ videos_compressed           ─┘
├─ model_zoo
│  ├─ Phi-3-mini-4k-instruct
│  ├─ eva_vit_g.pth
│  └─ instruct_blip_vicuna7b_trimmed.pth
├─ etchat
├─ scripts
└─ README.md
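
Once the data is in place, a quick sanity check like the following can catch missing files before a long training run (a minimal sketch; the REQUIRED list samples a few paths from the layout above and is not exhaustive).

```python
from pathlib import Path

# A few representative paths from the expected layout; extend as needed.
REQUIRED = [
    "data/llamavid/llava_558k_with_webvid.json",
    "data/et_instruct_164k/et_instruct_164k_vid.json",
    "model_zoo/eva_vit_g.pth",
]

def missing_paths(root: str) -> list:
    """Return the required paths that do not exist under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]
```

Running missing_paths(".") from the repository root should return an empty list when everything is downloaded.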

🔮 Training

Use the following commands to train E.T. Chat. The default setting uses 8 NVIDIA V100 (32 GB) GPUs. If your hardware differs, adjust nproc_per_node, per_device_train_batch_size, and gradient_accumulation_steps so that the global batch size stays the same.

# Stage-1 (around 6 hours on 8*V100)
bash scripts/train_stage_1.sh

# Stage-2 (around 32 hours on 8*V100)
bash scripts/train_stage_2.sh [<path-to-stage-1-checkpoint>]

# Stage-3 (around 20 hours on 8*V100)
bash scripts/train_stage_3.sh [<path-to-stage-2-checkpoint>]

The training logs and checkpoints will be saved in the work_dirs folder.
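
The global-batch-size bookkeeping above boils down to a simple product; for example, halving the GPU count can be offset by doubling gradient accumulation (the numbers below are illustrative, the actual per-stage values live in the training scripts):

```python
def global_batch_size(nproc_per_node: int,
                      per_device_train_batch_size: int,
                      gradient_accumulation_steps: int) -> int:
    """Effective global batch size for single-node training."""
    return (nproc_per_node
            * per_device_train_batch_size
            * gradient_accumulation_steps)

# 8 GPUs with accumulation 2 matches 4 GPUs with accumulation 4.
assert global_batch_size(8, 4, 2) == global_batch_size(4, 4, 4)
```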

💻 Inference

Use the following command to run inference on E.T. Bench.

bash scripts/inference.sh [<path-to-checkpoint>]

This will start 8 processes (one per GPU) and generate 8 JSON files in the <path-to-checkpoint>/etbench folder. Pass the path to this folder to E.T. Bench's evaluation script to compute metrics.

python compute_metrics.py <path-to-checkpoint>/etbench
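
If you want to inspect the raw predictions yourself before scoring, the shards can be concatenated like this (a sketch assuming each shard is a JSON list, which you should verify against your own outputs):

```python
import json
from pathlib import Path

def load_predictions(result_dir: str) -> list:
    """Concatenate the per-process JSON shards into one prediction list."""
    preds = []
    for shard in sorted(Path(result_dir).glob("*.json")):
        preds.extend(json.loads(shard.read_text()))
    return preds
```

Sorting the shard paths keeps the order deterministic across runs.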