June 1, 2026
EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
Enrico Pallotta*1,2 · Sina Mokhtarzadeh Azar*1,2 · Lars Doorenbos1,2
Serdar Ozsoy1,2 · Umar Iqbal3 · Juergen Gall1,2
1University of Bonn 2Lamarr Institute for Machine Learning and Artificial Intelligence 3NVIDIA
*Equal Contribution
⭐ CVPR 2026
Official implementation of EgoControl, a method for controllable egocentric video generation conditioned on 3D full-body poses.
Getting Started
Installation
We follow the original setup guide of cosmos-predict2; refer to that guide for additional details.
System Requirements:
- NVIDIA GPU with Ampere architecture (RTX 30xx, A100) or newer
- NVIDIA driver compatible with CUDA 12.6
- Linux x86-64
- glibc ≥ 2.31 (e.g. Ubuntu ≥ 22.04)
- Python 3.10
We highly recommend using uv.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Option A — new environment:
uv sync --extra cu126
source .venv/bin/activate
Option B — active environment (e.g. conda):
uv sync --extra cu126 --active --inexact
If you run scripts directly without installing the package, add the local sources to PYTHONPATH:
export PYTHONPATH="$PWD/cosmos-predict2:$PYTHONPATH"
Downloading Checkpoints
- Get a Hugging Face Access Token with Read permission.
- Install the Hugging Face CLI:
  uv tool install -U "huggingface_hub[cli]"
- Log in:
  hf auth login
- Accept the Llama-Guard-3-8B terms.
Download the base cosmos-predict2 2B weights:
python cosmos-predict2/scripts/download_checkpoints.py --model_types video2world --model_sizes 2B --resolution 480
Download the EgoControl fine-tuned checkpoint from Hugging Face:
hf download PallottaEnrico/EgoControl egocontrol-480p.pt --local-dir ./cosmos-predict2/checkpoints/
Dataset
Follow the Nymeria repo to download the dataset and store it as:
nymeria_root/
└── train/
    └── video_name_01/
        └── recording_head.mp4  # video from the main RGB camera
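Before precomputing anything, it can help to confirm that the download matches the layout above. The following is a minimal sketch, not part of the repository: the function name `find_missing_videos` is hypothetical, and it only assumes the `nymeria_root/<split>/<recording>/recording_head.mp4` structure shown here.

```python
from pathlib import Path


def find_missing_videos(nymeria_root, split="train", video_name="recording_head.mp4"):
    """Return the names of recording folders under <nymeria_root>/<split>/
    that do not contain the main RGB video file."""
    split_dir = Path(nymeria_root) / split
    missing = []
    # Each direct subdirectory of the split is treated as one recording.
    for recording_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        if not (recording_dir / video_name).is_file():
            missing.append(recording_dir.name)
    return missing
```

Calling `find_missing_videos("nymeria_root")` returns an empty list when every recording folder contains `recording_head.mp4`.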
We recommend using scripts/data/precompute_latents.py to precompute and cache latents for faster training. It stores a subfolder latents_{height}x{width}_{target_fps}fps_{frames_per_clip}f/ inside each video_name_XX/ directory.
The current dataset implementation assumes this structure. You must also process Nymeria poses and place .pt pose files in the same video folder as the latent files.
Feel free to modify the dataset class to adapt to a different pipeline.
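To make the latent folder naming concrete, here is a small sketch that builds the subfolder path from the template described above. The helper name `latents_dir` is hypothetical and is not the repository's API; it only reproduces the `latents_{height}x{width}_{target_fps}fps_{frames_per_clip}f` pattern.

```python
from pathlib import Path


def latents_dir(video_dir, height, width, target_fps, frames_per_clip):
    """Build the latent cache path for one recording folder, following the
    latents_{height}x{width}_{target_fps}fps_{frames_per_clip}f template."""
    name = f"latents_{height}x{width}_{target_fps}fps_{frames_per_clip}f"
    return Path(video_dir) / name
```

The numeric arguments must match the settings used when running `scripts/data/precompute_latents.py`; this sketch does not assume any particular values.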
Inference
All inference scripts share a common --dit_path argument pointing to the downloaded EgoControl checkpoint.
All inference commands below are run from the cosmos-predict2 directory:
cd cosmos-predict2
Simple Inference
Use scripts/inference/simple.py to generate a video from a single input clip and one or more pose files.
Single pose file:
python ../scripts/inference/simple.py \
--video ../examples/videos/20231113_s0_patricia_gutierrez_act3_0fk89s_chunk_0014.mp4 \
--poses ../examples/poses/000080_003600_003644.pt \
--dit_path path/to/egocontrol-480p.pt \
--output ../outputs/ \
--viz_pose
--viz_pose saves a side-by-side video of the generation and the rendered pose.
Folder of pose files (one output video per pose file):
python ../scripts/inference/simple.py \
--video ../examples/videos/20231113_s0_patricia_gutierrez_act3_0fk89s_chunk_0014.mp4 \
--poses ../examples/poses/ \
--dit_path path/to/egocontrol-480p.pt \
--output ../outputs/
Autoregressive generation (chain multiple pose files into a longer video):
python ../scripts/inference/simple.py \
--video ../examples/videos/20231113_s0_patricia_gutierrez_act3_0fk89s_chunk_0014.mp4 \
--poses ../examples/autoregressive/ \
--dit_path path/to/egocontrol-480p.pt \
--output ../outputs/ar/ \
--autoregressive
Evaluation Inference
Use scripts/inference/evaluation_inference.py to run inference over a folder of ground-truth video chunks. Output videos are saved in a sibling directory named after --exp_name, preserving original filenames for easy metric computation.
Pose files are looked up automatically from --pose_root using the convention:
<video>_chunk_<id>.mp4 → <pose_root>/<video>/pose_<fps>fps_<frames>f/00<id>_*.pt
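The lookup convention above can be sketched in a few lines of Python. This is an illustration of the naming rule only, not the code in `evaluation_inference.py`: the function name `find_pose_files` and its `fps` and `frames` parameters are hypothetical stand-ins for the `<fps>` and `<frames>` placeholders.

```python
import re
from pathlib import Path

# Splits "<video>_chunk_<id>" into the recording name and the chunk id.
CHUNK_RE = re.compile(r"^(?P<video>.+)_chunk_(?P<id>\d+)$")


def find_pose_files(chunk_path, pose_root, fps, frames):
    """Return the pose files matching one ground-truth chunk, following
    <pose_root>/<video>/pose_<fps>fps_<frames>f/00<id>_*.pt."""
    match = CHUNK_RE.match(Path(chunk_path).stem)
    if match is None:
        raise ValueError(f"not a chunk filename: {chunk_path}")
    pose_dir = Path(pose_root) / match["video"] / f"pose_{fps}fps_{frames}f"
    # The chunk id is prefixed with "00" in the pose filename.
    return sorted(pose_dir.glob(f"00{match['id']}_*.pt"))
```

An empty list means no pose file was found for that chunk, which usually points to a mismatch between the `--pose_root` layout and the chunk filenames.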
Single-GPU:
python ../scripts/inference/evaluation_inference.py \
--gt_folder path/to/gt_val_chunks/ \
--pose_root path/to/dataset/val/ \
--dit_path path/to/egocontrol-480p.pt \
--exp_name my_experiment \
--skip 16
--skip 16 subsamples the evaluation set by skipping every 16 chunks.
Multi-GPU (context parallelism):
torchrun --nproc_per_node=4 ../scripts/inference/evaluation_inference.py \
--gt_folder path/to/gt_val_chunks/ \
--pose_root path/to/dataset/val/ \
--dit_path path/to/egocontrol-480p.pt \
--exp_name my_experiment \
--num_gpus 4
Evaluation
We provide two evaluation scripts under scripts/eval/.
Visual Quality Metrics (SSIM, LPIPS, DreamSim, FID, FVD)
scripts/eval/visual_eval.py computes frame-level and dataset-level visual quality metrics between generated and ground-truth videos. FVD must be run separately from the other metrics.
Frame-level metrics (SSIM, LPIPS, DreamSim) + FID:
python scripts/eval/visual_eval.py \
--gen_folder path/to/generated/ \
--gt_folder path/to/gt_chunks/ \
--metrics ssim,lpips,dreamsim,fid \
--cond_frames 13 \
--output_path results.json
Dataset-level FVD:
python scripts/eval/visual_eval.py \
--gen_folder path/to/generated/ \
--gt_folder path/to/gt_chunks/ \
--metrics fvd \
--output_path results.json
Results are saved to a JSON file and merged across runs, so you can compute metrics incrementally.
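As an illustration of incremental merging, the sketch below shows one way results from separate runs could be combined in a single JSON file. It is an assumption about the general pattern, not the implementation in `visual_eval.py`; the function name `merge_results` and the flat metric-name-to-value layout are hypothetical.

```python
import json
from pathlib import Path


def merge_results(output_path, new_metrics):
    """Merge new metric values into an existing results JSON file.
    Metrics already in the file are kept unless new_metrics overrides them."""
    path = Path(output_path)
    results = json.loads(path.read_text()) if path.exists() else {}
    results.update(new_metrics)
    path.write_text(json.dumps(results, indent=2))
    return results
```

With this pattern, a first run that writes the frame-level metrics and a second run that writes FVD to the same `--output_path` end up in one file.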
Body Control Accuracy (mIoU)
scripts/eval/body_control_eval.py evaluates how well generated videos follow the input body poses by computing the mean IoU between SAM2 segmentation masks of body/arms in ground-truth and generated frames. Masks should be pre-computed as .npz files (one per video).
First set up SAM2. We provide a segmentation script with manually annotated input points:
python scripts/eval/utils/sam2_video_segment.py \
--videos_dir path/to/mp4_videos/ \
--annotations_xml scripts/eval/segmentations/sam2_input_points.xml \
--output_dir path/to/output_segmentations/
Then compute mIoU:
python scripts/eval/body_control_eval.py \
path/to/gt_masks/ \
path/to/pred_masks/ \
--out_json body_control_results.json
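For reference, the mean IoU between two mask sequences can be sketched as follows. This is a minimal NumPy illustration of the metric, not the code in `body_control_eval.py`: the function names are hypothetical, and skipping frames where both masks are empty is an assumption rather than documented behavior.

```python
import numpy as np


def mask_iou(gt_mask, pred_mask):
    """Intersection over union of two binary masks of the same shape."""
    gt = np.asarray(gt_mask, dtype=bool)
    pred = np.asarray(pred_mask, dtype=bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        # Assumption: IoU is undefined when both masks are empty.
        return float("nan")
    return float(np.logical_and(gt, pred).sum() / union)


def mean_iou(gt_masks, pred_masks):
    """Average per-frame IoU, ignoring frames where IoU is undefined."""
    ious = [mask_iou(g, p) for g, p in zip(gt_masks, pred_masks)]
    return float(np.nanmean(ious))
```

Here `gt_masks` and `pred_masks` are sequences of per-frame binary masks, for example arrays loaded from the `.npz` files produced by the segmentation step.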
Citation
@InProceedings{Pallotta_2026_CVPR,
author = {Pallotta, Enrico and Azar, Sina Mokhtarzadeh and Doorenbos, Lars and Ozsoy, Serdar and Iqbal, Umar and Gall, Juergen},
title = {EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {4269-4279}
}