Setup Guide
June 1, 2026
Skill: .agents/skills/cosmos3-setup/SKILL.md
System Requirements
- NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer — Hopper (H100) or Blackwell (B200) recommended for full training throughput
- NVIDIA driver compatible with the CUDA version you install
- NVIDIA CUDA >=12.8
- Linux x86-64/aarch64
- glibc >=2.35 (e.g. Ubuntu >=22.04)
- Python >=3.10
- Multi-node training additionally requires a working NCCL setup (IB/RoCE recommended) and a shared filesystem visible to all ranks for checkpoint I/O
- Free disk: ~150 GiB recommended for a first-run inference or training workflow (Hugging Face cache ~90 GiB, uv cache ~20 GiB, run outputs ~30 GiB). See FAQ → Expected disk footprint for the breakdown and how to relocate caches.
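The minimums above can be checked before installing anything. The following is a minimal preflight sketch, not part of the repository: the helper names `version_ge` and `report` are made up for this example, it relies on `sort -V` (GNU coreutils), and it measures free space only on the filesystem holding the current directory, which may differ from where your caches live.

```shell
#!/bin/sh
# Preflight sketch: print whether this host meets the minimums listed above.
# It only reports findings and never aborts.

# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND MINIMUM: print one OK/FAIL/SKIP line.
report() {
  if [ -z "$2" ]; then
    echo "SKIP  $1: could not detect a version"
  elif version_ge "$2" "$3"; then
    echo "OK    $1 $2 (need >= $3)"
  else
    echo "FAIL  $1 $2 (need >= $3)"
  fi
}

# `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.35" on glibc systems.
glibc_ver=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
python_ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || true)
# nvidia-smi prints the highest CUDA version the installed driver supports.
cuda_ver=$(nvidia-smi 2>/dev/null | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n 1)

report glibc  "$glibc_ver"  2.35
report python "$python_ver" 3.10
report cuda   "$cuda_ver"   12.8

# Free space on the current filesystem in GiB (df -Pk reports 1 KiB blocks).
free_gib=$(df -Pk . | awk 'NR==2 {print int($4 / 1024 / 1024)}')
report "free disk (GiB)" "$free_gib" 150
```

A `SKIP` line means the tool used for detection was not found (for example, `nvidia-smi` on a machine without the driver), not that the requirement is met.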
Recommended Base Image
For CUDA 13 builds, the NVIDIA NGC PyTorch container is the recommended starting point — it bundles PyTorch + CUDA 13 + cuDNN + NCCL tuned for NVIDIA hardware, plus Apex, TransformerEngine, and Megatron utilities that training infra users commonly need.
FROM nvcr.io/nvidia/pytorch:25.09-py3
For CUDA 12.8 builds, pin to an earlier NGC tag (e.g. nvcr.io/nvidia/pytorch:25.06-py3) that still ships CUDA 12.
Installation
If you encounter issues, see Troubleshooting.
Clone the repository:
git clone git@github.com:NVIDIA/cosmos-framework.git
cd cosmos-framework
The two supported install paths are the recommended base image and the Docker container. For other paths (standalone venv, custom torch/cuda) see Advanced.
Quickstart: From the Recommended Base Image
If you started from the recommended base image (nvcr.io/nvidia/pytorch:25.09-py3), the following commands set up the full environment in one go. Run them from the root of this repository (i.e. inside the cosmos-framework/ directory you just cloned):
apt-get update
apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# CUDA 13.0 (recommended); for CUDA 12.8 use `--group=cu128-train`
uv sync --all-extras --group=cu130-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
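After the sync finishes, it can help to confirm that the environment's PyTorch imports cleanly and reports the expected CUDA build. The sketch below is an assumption-laden convenience, not a repository script: `verify_torch` is a hypothetical helper name, and it only uses the standard `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` attributes.

```shell
# Post-install sanity check (a sketch): confirm that PyTorch imports and
# report the CUDA version it was built against.
verify_torch() {
  py="${1:-python3}"   # interpreter to test; defaults to python3 on PATH
  if ! command -v "$py" >/dev/null 2>&1; then
    echo "interpreter not found: $py"
    return 1
  fi
  "$py" -c '
import torch
print("torch", torch.__version__)
print("built for CUDA", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())
' 2>/dev/null || { echo "torch is not importable with $py"; return 1; }
}
```

Call it as `verify_torch .venv/bin/python` from the repository root. An import failure here is usually the case covered under Troubleshooting → PyTorch Import Issue.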
Docker Container
Make sure Docker is available on your machine and the NVIDIA Container Toolkit is installed.
Build the container:
image_tag=$(docker build -q .)
Run the container:
docker run -it --runtime=nvidia --ipc=host --rm \
-v .:/workspace -v /workspace/.venv \
-v /root/.cache:/root/.cache \
-e HF_TOKEN="$HF_TOKEN" \
$image_tag
For multi-node training, also bind-mount your shared dataset and checkpoint directories so all ranks see the same filesystem.
Optional arguments:
- --ipc=host: Use the host system's shared memory, since parallel torchrun consumes a large amount of shared memory. If not allowed by security policy, increase --shm-size (see the Docker documentation).
- -v /root/.cache:/root/.cache: Mount the host cache to avoid re-downloading cache entries.
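Where `--ipc=host` is not permitted, the same invocation can be written with an explicit `--shm-size`. The sketch below only prints the command so it can be reviewed first. `docker_run_cmd` is a hypothetical helper, and the `16g` default is an assumed placeholder rather than a project recommendation; size it to your worker count. It passes `-e HF_TOKEN` without a value, which forwards the host variable without echoing the token.

```shell
# Dry-run sketch: print the `docker run` command with --shm-size
# in place of --ipc=host. Nothing is executed.
docker_run_cmd() {
  image_tag="$1"
  shm_size="${2:-16g}"   # assumed default, adjust for your workload
  echo docker run -it --runtime=nvidia --shm-size="$shm_size" --rm \
    -v .:/workspace -v /workspace/.venv \
    -v /root/.cache:/root/.cache \
    -e HF_TOKEN \
    "$image_tag"
}
```

For example, `docker_run_cmd "$image_tag" 32g` prints the full command line for a 32 GiB shared-memory segment.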
If you get docker: Error response from daemon: unknown or invalid runtime name: nvidia, configure Docker to use the NVIDIA runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
See docker/README.md for additional images and build options.
Advanced
Use these paths only when neither the recommended base image nor the Docker container is viable for your environment.
Virtual Environment
Install system dependencies:
sudo apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Install the package using one of the following methods:
UV Sync: fully reproducible environment
Choose the dependency group that matches your CUDA toolkit (see CUDA Variants):
# CUDA 13.0 (recommended)
uv sync --all-extras --group=cu130-train
# Or, for CUDA 12.8:
# uv sync --all-extras --group=cu128-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
UV Pip: virtual environment
# Create virtual environment (skip if using an existing environment)
uv venv --clear && source .venv/bin/activate && export LD_LIBRARY_PATH=
uv pip install -r pyproject.toml --all-extras --group=cu130-train
uv pip install -e .
UV Pip: system environment
uv pip install --system --break-system-packages -r pyproject.toml --all-extras --group=cu130-train
Custom torch/cuda versions
cuda_name=cu130
torch_name=torch210
# 1. Create and activate the virtual environment
uv venv --clear && source .venv/bin/activate
# 2. Install the desired torch/cuda versions
uv pip install "torch==2.10.0" "torchvision" --torch-backend=$cuda_name
# 3. Install the package with desired extras
uv pip install -r pyproject.toml --all-extras --group=cu130-train
# 4. Install one of the following attention backends:
# * Blackwell
uv pip install "natten==0.21.6.dev6+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/natten
# * Hopper
uv pip install "flash-attn-3-nv==1.0.3+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn-3-nv
# * Ada/Ampere
uv pip install "flash-attn==2.7.4.post1+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn
If there is no attention backend wheel for your torch/cuda versions, you can build one using cosmos-dependencies.
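To decide which of the three backends in step 4 applies to a given machine, the GPU's compute capability can be mapped to a package name. This is a sketch under stated assumptions: `attention_backend` is a hypothetical helper, the mapping (8.x for Ampere/Ada, 9.x for Hopper, 10.x and above for Blackwell) reflects NVIDIA's published compute capabilities, and whether a prebuilt wheel exists for a particular Blackwell variant is not something this guide confirms.

```shell
# attention_backend COMPUTE_CAP: print the backend package for a GPU's
# compute capability (e.g. 8.6, 9.0, 10.0).
attention_backend() {
  case "${1%%.*}" in
    8) echo "flash-attn" ;;          # Ampere (8.0, 8.6) and Ada (8.9)
    9) echo "flash-attn-3-nv" ;;     # Hopper (9.0)
    1[0-9]) echo "natten" ;;         # Blackwell (10.x and above, assumed)
    *) echo "unsupported compute capability: $1" >&2; return 1 ;;
  esac
}

# On a GPU host, query the first GPU and look up its backend.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
  if [ -n "$cap" ]; then
    attention_backend "$cap" || true
  fi
fi
```

The `compute_cap` query field requires a reasonably recent driver; on older drivers, look up the capability for your GPU model instead.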
Optional package extras:
- train: Training infrastructure (FSDP, parallelism, checkpointing, datasets)
- eval: Evaluation harnesses for trained checkpoints
CUDA Variants
This repository is training-focused, so the *-train dependency groups are the supported install path. Inference-only groups exist for evaluating trained checkpoints in-tree but are not required for training.
| CUDA Version | Training (recommended) | Notes |
|---|---|---|
| CUDA 13.0 (recommended) | --group=cu130-train | NVIDIA Driver |
| CUDA 12.8 | --group=cu128-train | NVIDIA Driver |
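The table can be expressed as a small lookup so that scripts pick the group from a detected CUDA version. This is a sketch: `cuda_group` is a hypothetical helper, and treating any 12.8+ release as compatible with `cu128-train` and any 13.x release as compatible with `cu130-train` is an assumption based on the major-version matching rule described under Troubleshooting.

```shell
# cuda_group CUDA_VERSION: print the training dependency group for a
# CUDA version given as MAJOR.MINOR (e.g. 12.8 or 13.0).
cuda_group() {
  case "$1" in
    13|13.*) echo "cu130-train" ;;
    12.[89]*|12.[1-9][0-9]*) echo "cu128-train" ;;   # 12.8 and newer 12.x
    *) echo "no training group for CUDA '$1' (need 12.8+ or 13.x)" >&2
       return 1 ;;
  esac
}
```

A typical use is `uv sync --all-extras --group="$(cuda_group 13.0)"`.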
Environment Variables
Export the following before downloading checkpoints or launching training. See environment_variables.md for the full reference.
| Variable | Purpose |
|---|---|
| HF_TOKEN | Hugging Face access token for gated model/dataset downloads. Alternative to uvx hf auth login. |
| HF_HOME | Cache directory for Hugging Face models and datasets. Recommend ≥ 1 TB free. |
| IMAGINAIRE_OUTPUT_ROOT | Output root for training DCP checkpoints and logs. Recommend ≥ 1 TB free. |
| UV_CACHE_DIR | Cache directory for uv-managed dependencies. |
| LD_LIBRARY_PATH= | Clear (set to empty) after sourcing the venv to avoid host library bleed-through into PyTorch imports. |
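As an illustration, the variables in the table might be exported together from one place. The layout below is only a sketch: `COSMOS_DATA_ROOT` is a hypothetical helper variable invented for this example (the project does not read it), and the subdirectory names are arbitrary.

```shell
# Example environment for a training shell. COSMOS_DATA_ROOT is a
# hypothetical variable used only in this sketch; point it at a large volume.
COSMOS_DATA_ROOT="${COSMOS_DATA_ROOT:-$HOME/cosmos-data}"

export HF_HOME="$COSMOS_DATA_ROOT/hf"
export IMAGINAIRE_OUTPUT_ROOT="$COSMOS_DATA_ROOT/outputs"
export UV_CACHE_DIR="$COSMOS_DATA_ROOT/uv-cache"
export LD_LIBRARY_PATH=   # clear after activating the venv

# HF_TOKEN should come from your secret store or login shell, not this file.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is not set; run 'uvx hf auth login' or export it first" >&2
fi
```

Sourcing a file like this after `source .venv/bin/activate` keeps the cache and output locations consistent across shells.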
Downloading Base Checkpoints
Training in this repo typically starts from a pretrained base checkpoint that you fine-tune or post-train. The recommended source is the Hugging Face Hub.
- Get a Hugging Face Access Token with Read permission.
- Authenticate using either mechanism (they are equivalent; pick one, and do not set both with different tokens):
  - HF_TOKEN environment variable: preferred for Docker and non-interactive shells. Export it once and any huggingface_hub call (CLI or library) picks it up.
  - uvx hf auth login: preferred for local interactive use. Writes the token to ~/.cache/huggingface/token, persisted across sessions (and across Docker runs if you bind-mount /root/.cache).
- Accept the license for any gated model you intend to use (e.g. the NVIDIA Open Model License Agreement where applicable).
- Test access:
uvx hf@latest download --repo-type model nvidia/Cosmos-Guardrail1 \
  --revision d6d4bfa899a71454a700907664f3e88f503950cf --include "README.md"
If you encounter issues:
- Check that you don't have conflicting environment variables, e.g. an HF_TOKEN set to a different token than the one cached by hf auth login: printenv | grep HF_
- Check that your token has sufficient permissions.
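The first check can be automated by comparing the exported token with the cached one. The following is a minimal sketch: `hf_token_conflict` is a hypothetical helper, and the token path assumes the default location mentioned above, or `$HF_HOME/token` when `HF_HOME` is set.

```shell
# hf_token_conflict ENV_TOKEN TOKEN_FILE
# Succeeds (exit 0) only when both tokens exist and differ.
hf_token_conflict() {
  env_token="$1"
  cached_path="$2"
  [ -n "$env_token" ] || return 1       # nothing exported, no conflict
  [ -s "$cached_path" ] || return 1     # no cached login, no conflict
  cached_token=$(tr -d '[:space:]' < "$cached_path")
  [ "$env_token" != "$cached_token" ]
}

token_file="${HF_HOME:-$HOME/.cache/huggingface}/token"
if hf_token_conflict "${HF_TOKEN:-}" "$token_file"; then
  echo "HF_TOKEN differs from the token cached in $token_file" >&2
fi
```

If the warning prints, either unset `HF_TOKEN` or log in again so both sources agree.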
Checkpoints are downloaded on demand during training and evaluation. To change the cache location, set HF_HOME. See training.md for DCP conversion and Hugging Face safetensors export.
Troubleshooting
PyTorch Import Issue
Errors:
ImportError: cannot import name '_functionalization' from 'torch._C'
Clear the library path in your current shell:
export LD_LIBRARY_PATH=
This applies to the current session only. To persist, add the line to your Dockerfile or ~/.bashrc.
If this doesn't fix the issue, try reinstalling the venv (see Dependency Issue).
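To persist the setting in a shell startup file without adding a duplicate line on every run, an idempotent append can be used. This is a sketch: `persist_ld_clear` is a hypothetical helper, and it assumes the target file ends with a newline.

```shell
# persist_ld_clear RC_FILE: append the export once; repeated calls are no-ops.
persist_ld_clear() {
  rc_file="$1"
  line='export LD_LIBRARY_PATH='
  touch "$rc_file"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  grep -qxF "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}
```

For example, `persist_ld_clear "$HOME/.bashrc"` adds the line on first use and leaves the file unchanged afterwards.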
Dependency Issue
Errors:
ModuleNotFoundError: No module named '<module_name>'
Reinstall the venv:
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
If this doesn't fix the issue, try reinstalling uv (see Python Issue).
Python Issue
Errors:
fatal error: Python.h: No such file or directory
Reinstall uv and the venv:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install --reinstall
rm -rf .venv
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
CUDA Issue
Errors:
OSError: <lib_name>: cannot open shared object file: No such file or directory
Ensure the CUDA toolkit is installed on the system. Its major version must match the major version of the CUDA build used in the virtual environment (e.g. a 13.x toolkit for the cu130-train group).
sudo apt-get install -y --no-install-recommends cuda-toolkit-<cuda_major_version>
Alternatively, use the Docker container.
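The major-version rule can be checked directly by comparing the system toolkit with the CUDA build reported by the environment's PyTorch. The sketch below makes several assumptions: `cuda_major` and `cuda_majors_match` are hypothetical helpers, `nvcc` must be on `PATH` (it often lives under `/usr/local/cuda/bin`), and `python3` must resolve to the activated venv.

```shell
# cuda_major VERSION: print the major component, e.g. 12.8 -> 12.
cuda_major() { echo "${1%%.*}"; }

# cuda_majors_match SYSTEM_VERSION VENV_VERSION
# Succeeds only when both versions are known and share a major version.
cuda_majors_match() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$(cuda_major "$1")" = "$(cuda_major "$2")" ]
}

# System toolkit version, parsed from the "release X.Y" text printed by nvcc.
system_cuda=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p')
# CUDA version the venv's PyTorch was built against (empty if undetectable).
venv_cuda=$(python3 -c 'import torch; print(torch.version.cuda or "")' 2>/dev/null || true)

if cuda_majors_match "$system_cuda" "$venv_cuda"; then
  echo "CUDA major versions match ($system_cuda vs $venv_cuda)"
else
  echo "Mismatch or undetected: system='$system_cuda' venv='$venv_cuda'"
fi
```

An empty value on either side means detection failed, which is reported as a mismatch rather than silently passing.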