TokenBench

January 13, 2025

Cosmos-Tokenizer Code | Technical Report

https://github.com/user-attachments/assets/72536cfc-5cb5-4b48-88fa-b06f3c8c4495

TokenBench is a comprehensive benchmark for standardizing the evaluation of Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos, and consists of high-resolution, long-duration videos designed to stress-test video tokenizers. We draw on existing video datasets that are commonly used for these tasks: BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M. This repo provides instructions on how to download and preprocess the videos for TokenBench.

Installation

  • Clone the source code
git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench
  • Install the dependencies
pip3 install -r requirements.txt
apt-get install -y ffmpeg

Alternatively (and preferably), build a Docker image using the provided Dockerfile:

docker build -t token-bench -f Dockerfile .

# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
    --workdir ${PWD} token-bench /bin/bash

Download StyleGAN Checkpoints from Hugging Face

You can use this snippet to download StyleGAN checkpoints from huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0:

from huggingface_hub import login, snapshot_download
import os

login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
model_name="LanguageBind/Open-Sora-Plan-v1.0.0"
local_dir = "pretrained_ckpts/" + model_name
os.makedirs(local_dir, exist_ok=True)
print(f"downloading `{model_name}` ...")
snapshot_download(repo_id=f"{model_name}", local_dir=local_dir)

Under pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0, you can find the StyleGAN checkpoints required for the FVD metric:

├── opensora/eval/fvd/styleganv/
│   ├── fvd.py
│   └── i3d_torchscript.pt
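As a quick sanity check after downloading, the expected checkpoint location can be derived from the tree above. The helper below is illustrative only (not part of the repo), and assumes the default `local_dir` from the download snippet:

```python
import os

# Hypothetical helper: build the path to the I3D TorchScript checkpoint
# that the FVD metric loads. The layout follows the directory tree above;
# the default root matches local_dir in the download snippet.
def i3d_checkpoint_path(root="pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0"):
    return os.path.join(root, "opensora", "eval", "fvd", "styleganv", "i3d_torchscript.pt")
```

If `os.path.isfile(i3d_checkpoint_path())` is false after the snapshot download, the FVD computation will not have its feature extractor.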

Instructions to build TokenBench

  1. Download the datasets (BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M) from their official websites.
  2. Pick the videos specified in the token_bench/video/list.txt file.
  3. Preprocess the videos using the script token_bench/video/preprocessing_script.py.

Evaluation on TokenBench

We provide basic scripts to compute the common evaluation metrics for video tokenizer reconstruction, including PSNR, SSIM, and LPIPS. Use the command below to compute a metric between two folders:

python3 -m token_bench.metrics_cli --mode=lpips \
        --gtpath <ground truth folder> \
        --targetpath <reconstruction folder>
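For reference, PSNR (one of the metrics above) is simple to compute by hand. The sketch below illustrates the formula only, assuming 8-bit frames; the repo's metrics_cli may normalize or average differently:

```python
import numpy as np

# Illustrative PSNR between a ground-truth frame and its reconstruction.
# Assumes 8-bit pixel values (data_range=255); treat this as a sketch of
# the formula, not the metrics_cli implementation.
def psnr(gt: np.ndarray, rec: np.ndarray, data_range: float = 255.0) -> float:
    mse = np.mean((gt.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10((data_range ** 2) / mse)
```

The metrics_cli entry point remains the canonical way to reproduce the leaderboard numbers.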

Continuous video tokenizer leaderboard

| Tokenizer | Compression Ratio (T × H × W) | Formulation | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| CogVideoX | 4 × 8 × 8 | VAE | 33.149 | 0.908 | 6.970 |
| OmniTokenizer | 4 × 8 × 8 | VAE | 29.705 | 0.830 | 35.867 |
| Cosmos-CV | 4 × 8 × 8 | AE | 37.270 | 0.928 | 6.849 |
| Cosmos-CV | 8 × 8 × 8 | AE | 36.856 | 0.917 | 11.624 |
| Cosmos-CV | 8 × 16 × 16 | AE | 35.158 | 0.875 | 43.085 |

Discrete video tokenizer leaderboard

| Tokenizer | Compression Ratio (T × H × W) | Quantization | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| VideoGPT | 4 × 4 × 4 | VQ | 35.119 | 0.914 | 13.855 |
| OmniTokenizer | 4 × 8 × 8 | VQ | 30.152 | 0.827 | 53.553 |
| Cosmos-DV | 4 × 8 × 8 | FSQ | 35.137 | 0.887 | 19.672 |
| Cosmos-DV | 8 × 8 × 8 | FSQ | 34.746 | 0.872 | 43.865 |
| Cosmos-DV | 8 × 16 × 16 | FSQ | 33.718 | 0.828 | 113.481 |
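To make the compression-ratio column concrete: a 4 × 8 × 8 tokenizer downsamples time by 4 and each spatial dimension by 8, so a video of shape (T, H, W) maps to a latent grid of shape (T/4, H/8, W/8) — 256× fewer spatiotemporal positions. A tiny sketch (illustrative only, channel dimensions omitted):

```python
# Illustrative only: what a "Compression Ratio (T x H x W)" entry means.
# A ratio of (4, 8, 8) shrinks a (T, H, W) video to (T//4, H//8, W//8)
# latents, i.e. 4 * 8 * 8 = 256x fewer spatiotemporal positions.
def latent_shape(t, h, w, ratio=(4, 8, 8)):
    rt, rh, rw = ratio
    return (t // rt, h // rh, w // rw)
```

For example, a 32-frame 256 × 256 clip becomes an 8 × 32 × 32 latent grid; larger ratios (8 × 16 × 16) compress more aggressively at the cost of reconstruction quality, as the PSNR/rFVD trends above show.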

Core contributors

Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu

Citation

If you find TokenBench useful in your work, please acknowledge it by citing:

@article{agarwal2025cosmos,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={NVIDIA et al.},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}