TokenBench

January 13, 2025

Cosmos-Tokenizer Code | Technical Report

https://github.com/user-attachments/assets/72536cfc-5cb5-4b48-88fa-b06f3c8c4495

TokenBench is a comprehensive benchmark for standardizing the evaluation of Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos, and consists of high-resolution, long-duration videos designed to stress-test video tokenizers. We draw on existing video datasets that are commonly used for these tasks: BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M. This repo provides instructions on how to download and preprocess the videos for TokenBench.

Installation

  • Clone the source code
git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench
  • Install the dependencies
pip3 install -r requirements.txt
apt-get install -y ffmpeg

Alternatively (and preferably), build a Docker image using the provided Dockerfile:

docker build -t token-bench -f Dockerfile .

# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
    --workdir ${PWD} token-bench /bin/bash

Download StyleGAN Checkpoints from Hugging Face

You can use this snippet to download StyleGAN checkpoints from huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0:

from huggingface_hub import login, snapshot_download
import os

login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
model_name="LanguageBind/Open-Sora-Plan-v1.0.0"
local_dir = "pretrained_ckpts/" + model_name
os.makedirs(local_dir, exist_ok=True)
print(f"downloading `{model_name}` ...")
snapshot_download(repo_id=f"{model_name}", local_dir=local_dir)

Under pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0, you can find the StyleGAN checkpoints required for the FVD metric:

├── opensora/eval/fvd/styleganv/
│   ├── fvd.py
│   └── i3d_torchscript.pt
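As a quick sanity check after downloading, the expected checkpoint location can be derived from the tree above. The helper below is illustrative only (not part of the repo), and assumes the default `local_dir` from the download snippet:

```python
import os

# Hypothetical helper: build the path to the I3D TorchScript checkpoint
# that the FVD metric loads. The layout follows the directory tree above;
# the default root matches local_dir in the download snippet.
def i3d_checkpoint_path(root="pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0"):
    return os.path.join(root, "opensora", "eval", "fvd", "styleganv", "i3d_torchscript.pt")
```

If `os.path.isfile(i3d_checkpoint_path())` is false after the snapshot download, the FVD computation will not have its feature extractor.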

Instructions to build TokenBench

  1. Download the datasets (BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M) from their official websites.
  2. Pick the videos specified in the token_bench/video/list.txt file.
  3. Preprocess the videos using the script token_bench/video/preprocessing_script.py.

Evaluation on TokenBench

We provide basic scripts to compute the common evaluation metrics for video tokenizer reconstruction, including PSNR, SSIM, and LPIPS. Use the command below to compute a metric between two folders:

python3 -m token_bench.metrics_cli --mode=lpips \
        --gtpath <ground truth folder> \
        --targetpath <reconstruction folder>
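For reference, PSNR (one of the metrics above) is simple to compute by hand. The sketch below illustrates the formula only, assuming 8-bit frames; the repo's metrics_cli may normalize or average differently:

```python
import numpy as np

# Illustrative PSNR between a ground-truth frame and its reconstruction.
# Assumes 8-bit pixel values (data_range=255); treat this as a sketch of
# the formula, not the metrics_cli implementation.
def psnr(gt: np.ndarray, rec: np.ndarray, data_range: float = 255.0) -> float:
    mse = np.mean((gt.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10((data_range ** 2) / mse)
```

The metrics_cli entry point remains the canonical way to reproduce the leaderboard numbers.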

Continuous video tokenizer leaderboard

| Tokenizer | Compression Ratio (T × H × W) | Formulation | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| CogVideoX | 4 × 8 × 8 | VAE | 33.149 | 0.908 | 6.970 |
| OmniTokenizer | 4 × 8 × 8 | VAE | 29.705 | 0.830 | 35.867 |
| Cosmos-CV | 4 × 8 × 8 | AE | 37.270 | 0.928 | 6.849 |
| Cosmos-CV | 8 × 8 × 8 | AE | 36.856 | 0.917 | 11.624 |
| Cosmos-CV | 8 × 16 × 16 | AE | 35.158 | 0.875 | 43.085 |

Discrete video tokenizer leaderboard

| Tokenizer | Compression Ratio (T × H × W) | Quantization | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| VideoGPT | 4 × 4 × 4 | VQ | 35.119 | 0.914 | 13.855 |
| OmniTokenizer | 4 × 8 × 8 | VQ | 30.152 | 0.827 | 53.553 |
| Cosmos-DV | 4 × 8 × 8 | FSQ | 35.137 | 0.887 | 19.672 |
| Cosmos-DV | 8 × 8 × 8 | FSQ | 34.746 | 0.872 | 43.865 |
| Cosmos-DV | 8 × 16 × 16 | FSQ | 33.718 | 0.828 | 113.481 |
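To make the compression-ratio column concrete: a 4 × 8 × 8 tokenizer downsamples time by 4 and each spatial dimension by 8, so a video of shape (T, H, W) maps to a latent grid of shape (T/4, H/8, W/8) — 256× fewer spatiotemporal positions. A tiny sketch (illustrative only, channel dimensions omitted):

```python
# Illustrative only: what a "Compression Ratio (T x H x W)" entry means.
# A ratio of (4, 8, 8) shrinks a (T, H, W) video to (T//4, H//8, W//8)
# latents, i.e. 4 * 8 * 8 = 256x fewer spatiotemporal positions.
def latent_shape(t, h, w, ratio=(4, 8, 8)):
    rt, rh, rw = ratio
    return (t // rt, h // rh, w // rw)
```

For example, a 32-frame 256 × 256 clip becomes an 8 × 32 × 32 latent grid; larger ratios (8 × 16 × 16) compress more aggressively at the cost of reconstruction quality, as the PSNR/rFVD trends above show.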

Core contributors

Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu

Citation

If you find TokenBench useful in your work, please acknowledge it by citing:

@article{agarwal2025cosmos,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={NVIDIA et al.},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}