TokenBench
January 13, 2025
Cosmos-Tokenizer Code | Technical Report
https://github.com/user-attachments/assets/72536cfc-5cb5-4b48-88fa-b06f3c8c4495
TokenBench is a comprehensive benchmark that standardizes the evaluation of Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos. It consists of high-resolution, long-duration videos and is designed to evaluate the performance of video tokenizers. The videos are drawn from existing datasets that are commonly used for various tasks: BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M. This repo provides instructions on how to download and preprocess the videos for TokenBench.
Installation
- Clone the source code
git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench
- Install the dependencies (Python packages via pip, ffmpeg via apt-get)
pip3 install -r requirements.txt
apt-get install -y ffmpeg
Preferably, build a Docker image using the provided Dockerfile:
docker build -t token-bench -f Dockerfile .
# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
--workdir ${PWD} token-bench /bin/bash
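Whichever installation route you choose, it can help to confirm that ffmpeg is actually discoverable before preprocessing any videos. The following is a minimal sketch using only the Python standard library; the helper name `ffmpeg_version_line` is hypothetical and not part of this repo.

```python
import shutil
import subprocess


def ffmpeg_version_line():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    # check=False: a non-zero exit code is reported as "no version line" rather than raised.
    result = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    line = ffmpeg_version_line()
    print(line if line else "ffmpeg not found on PATH; install it before preprocessing")
```

Run it inside the same environment (host or container) in which you plan to run the preprocessing script, since PATH may differ between the two.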
Download StyleGAN Checkpoints from Hugging Face
You can use this snippet to download StyleGAN checkpoints from huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0:
import os

from huggingface_hub import login, snapshot_download

# Authenticate with your Hugging Face access token.
login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)

model_name = "LanguageBind/Open-Sora-Plan-v1.0.0"
# Files are stored under pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0.
local_dir = os.path.join("pretrained_ckpts", model_name)
os.makedirs(local_dir, exist_ok=True)

print(f"downloading `{model_name}` ...")
snapshot_download(repo_id=model_name, local_dir=local_dir)
Under pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0, you can find the StyleGAN-V files required for the FVD metric, including the I3D TorchScript checkpoint:
├── opensora/eval/fvd/styleganv/
│ ├── fvd.py
│ ├── i3d_torchscript.pt
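After the download finishes, you may want to confirm that the checkpoint file landed where you expect. The sketch below searches a directory tree for `i3d_torchscript.pt`; the function name `find_checkpoint` is hypothetical, and the search root is whatever directory you passed as `local_dir`.

```python
from pathlib import Path


def find_checkpoint(root, filename="i3d_torchscript.pt"):
    """Return all files under `root` whose name equals `filename`, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


if __name__ == "__main__":
    hits = find_checkpoint("pretrained_ckpts")
    if hits:
        for path in hits:
            print("found:", path)
    else:
        print("i3d_torchscript.pt not found; re-run the download step")
```

An empty result usually means the download was interrupted or written to a different directory.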
Instructions to build TokenBench
- Download the datasets from the official websites:
- EgoExo4D: https://docs.ego-exo4d-data.org/
- BridgeData V2: https://rail-berkeley.github.io/bridgedata/
- Panda-70M: https://snap-research.github.io/Panda-70M/
- BDD100K: http://bdd-data.berkeley.edu/
- Pick the videos as specified in the token_bench/video/list.txt file.
- Preprocess the videos using the script token_bench/video/preprocessing_script.py.
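The "pick the videos" step can be sketched as a lookup from list entries to downloaded files. This is only an illustration under stated assumptions: it assumes the list file holds one video name or relative path per line and that entries can be matched by file name alone. Check the actual layout of token_bench/video/list.txt before relying on it. The helper names `read_video_list` and `match_videos` are hypothetical.

```python
from pathlib import Path


def read_video_list(list_path):
    """Read non-empty lines from a text file (ASSUMED format: one entry per line)."""
    entries = []
    for raw in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line:
            entries.append(line)
    return entries


def match_videos(entries, dataset_root):
    """Match each entry to a file under `dataset_root` that has the same file name.

    Returns (found, missing): `found` maps entry -> Path, `missing` lists
    the entries with no matching file.
    """
    by_name = {}
    for p in Path(dataset_root).rglob("*"):
        if p.is_file():
            # If a file name occurs more than once, the first one encountered is kept.
            by_name.setdefault(p.name, p)
    found, missing = {}, []
    for entry in entries:
        hit = by_name.get(Path(entry).name)
        if hit is None:
            missing.append(entry)
        else:
            found[entry] = hit
    return found, missing
```

Reporting the `missing` entries separately makes it easy to spot datasets that were only partially downloaded before you start preprocessing.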
Evaluation on TokenBench
We provide basic scripts to compute common evaluation metrics for video tokenizer reconstruction, including PSNR, SSIM, and LPIPS. To compute a metric between two folders, run:
python3 -m token_bench.metrics_cli --mode=lpips \
--gtpath <ground truth folder> \
--targetpath <reconstruction folder>
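For readers unfamiliar with the metrics, PSNR is the simplest of the three: it is a log-scaled ratio between the maximum pixel value and the mean squared error. The sketch below is a dependency-free illustration of the standard formula on flat sequences of 8-bit pixel values. It is not the code used by token_bench.metrics_cli, which may differ in how it aggregates over frames and videos.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of samples")
    if len(reference) == 0:
        raise ValueError("inputs must be non-empty")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [10, 20, 30, 40]
    reconstructed = [11, 21, 31, 41]  # every pixel off by one, so MSE = 1
    print(f"{psnr(ground_truth, reconstructed):.2f} dB")
```

Higher PSNR means the reconstruction is closer to the ground truth; identical inputs yield infinity, which is why reported averages are usually computed over frames with non-zero error.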
Continuous video tokenizer leaderboard
| Tokenizer | Compression Ratio (T × H × W) | Formulation | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| CogVideoX | 4 × 8 × 8 | VAE | 33.149 | 0.908 | 6.970 |
| OmniTokenizer | 4 × 8 × 8 | VAE | 29.705 | 0.830 | 35.867 |
| Cosmos-CV | 4 × 8 × 8 | AE | 37.270 | 0.928 | 6.849 |
| Cosmos-CV | 8 × 8 × 8 | AE | 36.856 | 0.917 | 11.624 |
| Cosmos-CV | 8 × 16 × 16 | AE | 35.158 | 0.875 | 43.085 |
Discrete video tokenizer leaderboard
| Tokenizer | Compression Ratio (T × H × W) | Quantization | PSNR | SSIM | rFVD |
|---|---|---|---|---|---|
| VideoGPT | 4 × 4 × 4 | VQ | 35.119 | 0.914 | 13.855 |
| OmniTokenizer | 4 × 8 × 8 | VQ | 30.152 | 0.827 | 53.553 |
| Cosmos-DV | 4 × 8 × 8 | FSQ | 35.137 | 0.887 | 19.672 |
| Cosmos-DV | 8 × 8 × 8 | FSQ | 34.746 | 0.872 | 43.865 |
| Cosmos-DV | 8 × 16 × 16 | FSQ | 33.718 | 0.828 | 113.481 |
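When comparing rows across the two leaderboards, it helps to collapse the per-axis compression ratio into a single number: the product of the temporal, height, and width factors. The sketch below does this arithmetic for the ratios listed in the tables. It counts spatio-temporal positions only and ignores latent channel count, codebook size, and any special handling of the first frame, so treat it as a rough guide rather than an exact bitrate comparison. The function name `compression_factor` is hypothetical.

```python
from math import prod


def compression_factor(ratio):
    """Multiply the per-axis factors of a 'T × H × W' string, e.g. '4 × 8 × 8'."""
    parts = ratio.replace("x", "×").split("×")
    factors = [int(p.strip()) for p in parts]
    if len(factors) != 3:
        raise ValueError(f"expected three factors, got {ratio!r}")
    return prod(factors)


if __name__ == "__main__":
    for ratio in ("4 × 4 × 4", "4 × 8 × 8", "8 × 8 × 8", "8 × 16 × 16"):
        print(ratio, "->", compression_factor(ratio))
    # 4 × 4 × 4 -> 64
    # 4 × 8 × 8 -> 256
    # 8 × 8 × 8 -> 512
    # 8 × 16 × 16 -> 2048
```

For example, the 8 × 16 × 16 setting compresses eight times more aggressively than 4 × 8 × 8 (2048 vs. 256), which is useful context when reading the PSNR, SSIM, and rFVD columns.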
Core contributors
Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu
Citation
If you find TokenBench useful in your work, please acknowledge it by citing:
@article{agarwal2025cosmos,
title={Cosmos World Foundation Model Platform for Physical AI},
author={NVIDIA et al.},
journal={arXiv preprint arXiv:2501.03575},
year={2025}
}