ViPRA: Video Prediction for Robot Actions

January 27, 2026

ViPRA teaser

Sandeep Routray¹,², Hengkai Pan¹, Unnat Jain²,³, Shikhar Bahl², Deepak Pathak¹,²

¹Carnegie Mellon University  ²Skild AI  ³University of California, Irvine

Corresponding author: Sandeep Routray

Overview

  • A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
  • A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
  • A flow matching action decoder with action chunking for high-frequency continuous control.
  • Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.

Latent Action Model

The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.

Key Features

  • Actionless Learning: Learns from videos directly; no action annotations required.
  • Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
  • Multi-Dataset: Trained on diverse human and robot data.
  • Optical Flow Consistency: Uses optical flow for temporal consistency regularization.

Architecture

  • Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
  • Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
  • Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
  • Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
  • Flow Network: RAFT-based optical flow estimation for consistency loss.
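
The vector quantizer step can be illustrated with a minimal pure-Python sketch of noise-substitution quantization. This is a toy under stated assumptions, not the LAQ implementation: the helper names (`nearest_code`, `nsvq_quantize`) and the tiny codebook are hypothetical. The idea shown is that, at training time, the quantization error is replaced by random noise rescaled to the same norm, so the output stays a differentiable function of the encoder output.

```python
import math
import random


def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def nsvq_quantize(z, codebook, rng):
    """Training-time approximation: z + ||z - q|| * v / ||v||, with v ~ N(0, I)."""
    idx, q = nearest_code(z, codebook)
    err_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, q)))
    v = [rng.gauss(0.0, 1.0) for _ in z]
    v_norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against a zero vector
    z_hat = [a + err_norm * x / v_norm for a, x in zip(z, v)]
    return idx, z_hat


# Toy 2-D codebook with 4 entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, z_hat = nsvq_quantize([0.9, 0.2], codebook, random.Random(0))
print(idx)  # 1, the index that would be used as the discrete latent action token
```

At inference time the noise is not needed: the nearest codebook index alone serves as the discrete token.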

Environment Setup

cd laq/
conda env create -f environment.yml -n laq
conda activate laq

Configuration

Training configs live in laq/configs/config.py. Key parameters:

  • Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
  • Data: 224×224 crops, 8-frame sequences.
  • Quantization: 32-dim latent space, NSVQ codebook.
  • Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
  • Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.

Dataset Structure Requirements

You can match these layouts or extend laq/model/data.py to support your own.

Something-Something-v2 (SSv2)

ssv2/
├── labels/
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── 20bn-something-something-v2/
│   ├── [video_id].webm
│   └── ...

Example config:

ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",   # "train", "val", "trainval", "test", "all"
    stepsize=2,         # frame sampling stride
)
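
To make the `stepsize` parameter concrete, here is a small sketch of how a frame sampling stride could translate into frame indices for an 8-frame clip. The helper names are hypothetical and the actual sampling logic in laq/model/data.py may differ.

```python
from typing import List


def clip_frame_indices(start: int, num_frames: int = 8, stepsize: int = 2) -> List[int]:
    """Indices of the frames in one clip, taking every `stepsize`-th frame."""
    return [start + i * stepsize for i in range(num_frames)]


def num_valid_starts(video_len: int, num_frames: int = 8, stepsize: int = 2) -> int:
    """Number of start positions that keep the whole clip inside the video."""
    span = (num_frames - 1) * stepsize + 1
    return max(0, video_len - span + 1)


print(clip_frame_indices(0))  # [0, 2, 4, 6, 8, 10, 12, 14]
print(num_valid_starts(40))   # 26
```

A larger stride covers more motion per clip at the cost of fewer usable clips per video.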

OpenX Datasets (Fractal, Bridge, Kuka)

dataset_name/
├── processed/
│   ├── trajectory_001/
│   │   └── images/
│   │       ├── 000000.jpg
│   │       ├── 000001.jpg
│   │       └── ...
│   ├── trajectory_002/
│   └── ...

Example config:

bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)

LIBERO

LIBERO/
├── libero_10_modified/
│   └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│   └── images/...
├── libero_object_modified/
│   └── images/...
└── libero_spatial_modified/
    └── images/...

Example config:

libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
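
The two `num_trajs` styles (absolute counts for Bridge, fractions for LIBERO) could be resolved along the following lines. This is a sketch based only on the "float = percentage" comment; `resolve_num_trajs` is a hypothetical helper, not a function from the repository.

```python
from typing import Union


def resolve_num_trajs(spec: Union[int, float], total: int) -> int:
    """Turn a num_trajs entry into an absolute trajectory count.

    Assumption: an int is an absolute count and a float is a fraction
    of `total` in the range [0, 1].
    """
    if isinstance(spec, bool):
        raise TypeError("num_trajs must be an int or a float")
    if isinstance(spec, float):
        if not 0.0 <= spec <= 1.0:
            raise ValueError("fractional num_trajs must lie in [0, 1]")
        return int(round(spec * total))
    return min(spec, total)


print(resolve_num_trajs(0.1, 500))     # 50
print(resolve_num_trajs(1.0, 500))     # 500
print(resolve_num_trajs(2546, 25460))  # 2546
```

Note that `1` (int) and `1.0` (float) mean different things under this convention, so the type of the config value matters.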

Custom Dataset

  1. Add a discovery function in laq/model/data.py:
def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # return list of frame directories / trajectories
    return list_of_paths
  2. Add your dataset case in VideoDatasetCoTrain.
  3. Add your config block to laq/configs/config.py.
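
As an illustration of the first step, the sketch below fills in a discovery function for a dataset that follows the OpenX-style processed/<trajectory>/images layout. The function name and the decision to ignore `mode` are assumptions made for this example; how VideoDatasetCoTrain consumes the returned paths is not shown here.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_processed_sequences(data_root: Path, mode: str = "trainval", **kwargs) -> List[str]:
    """List trajectory image folders under <data_root>/processed/*/images.

    `mode` is accepted for interface compatibility but ignored in this
    sketch; splitting into train/val is left to the caller.
    """
    processed = Path(data_root) / "processed"
    if not processed.is_dir():
        return []
    return sorted(
        str(traj / "images")
        for traj in processed.iterdir()
        if (traj / "images").is_dir()
    )


# Usage on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("trajectory_002", "trajectory_001"):
        (Path(tmp) / "processed" / name / "images").mkdir(parents=True)
    found = discover_processed_sequences(Path(tmp), mode="trainval")

print([Path(p).parent.name for p in found])  # ['trajectory_001', 'trajectory_002']
```

Returning a sorted list keeps the trajectory order deterministic across runs, which helps when splits are taken by index.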

Training

Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:

bash run_train_laq.sh

Inference and Evaluation

To reproduce the codebook analysis and figures from the paper:

# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage

# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer

To generate latent-action training data for ViPRA policy pretraining with the trained LAQ model, run the dataset-specific latent generation scripts:

# LIBERO
python -m inference.libero.libero_latent

# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka

# SSv2
python -m inference.ssv2.ssv2_latent

These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:

Sample JSONL entry (pretty-printed here for readability; in the file each record occupies a single line):

{
  "instruction": "pick up the red block and place it in the blue bowl",
  "raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
  "image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
  "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
  "latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
  "fields_la": "[instruction],[vision],latent_action",
  "fields_ls": "[instruction],[vision],latent_state", 
  "fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
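
A minimal way to inspect such a file is sketched below. The helper names are hypothetical, and the sketch only splits the `fields_*` strings on commas; it does not interpret what the bracketed and unbracketed field names mean to the trainer.

```python
import io
import json
from typing import Dict, Iterator, List


def read_jsonl(stream) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_fields(spec: str) -> List[str]:
    """Split a fields string such as "[instruction],[vision],latent_action"."""
    return [f.strip() for f in spec.split(",") if f.strip()]


# In practice `stream` would be an open file; a string buffer stands in here.
sample = io.StringIO(
    '{"instruction": "pick up the red block", '
    '"latent_action_idxs": [3, 7, 1, 4], '
    '"fields_la": "[instruction],[vision],latent_action"}\n'
)
for record in read_jsonl(sample):
    print(len(record["latent_action_idxs"]), split_fields(record["fields_la"]))
```

Reading line by line keeps memory use flat even for large merged shards.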

ViPRA Policy

The ViPRA policy builds on Large World Model (LWM), a video-language foundation model. We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and for flow matching, which produces continuous control outputs.
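
For readers unfamiliar with flow matching, the toy sketch below shows the sampling side of the idea: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 to obtain an action. It is not the ViPRA decoder. The velocity function here is a hand-written stand-in for a learned network, and the 7-dimensional target action is a hypothetical value.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a learned network: the exact velocity of the straight-line
# path from the current point to one fixed target action.
target = [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
action = euler_sample(toy_velocity, noise, steps=10)
print([round(a, 6) for a in action])
```

In a real decoder the velocity would be predicted by a network conditioned on the observation and would output a whole chunk of actions rather than a single vector.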

Environment Setup

cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra

Before training, download the VQ-GAN image tokenizer, text tokenizer and pretrained model parameters from LWM-Chat-1M-Jax and place them under vipra/lwm/:

mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/

Pretraining Data

We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:

mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/

cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2. Each training sample includes:

  • history frames
  • latent state target
  • latent action tokens from LAQ
  • natural language task text

This dataset is already chunked into 14-step latent action sequences.

Vision Cache (Optional, speeds up training)

We also release a VQ-GAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:

mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/

The cache contains precomputed VQ-GAN token sequences for each frame, which can be used instead of running the image tokenizer online.

If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.

Running Pretraining

Launch pretraining using the provided script (configured for 8×H200 GPUs):

cd vipra/
bash scripts/pretrain.sh

See vipra/scripts/pretrain.sh for full hyperparameters.


Finetuning

Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:

cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/

For task-specific finetuning, prepare your dataset in JSONL format where each line represents a single timestep with the following structure:

{
  "id": "ep00000/step0000",
  "image": "ep00000/step0000.png",
  "raw_action": [0.016, 0.0, -0.0, 0.0, 0.0, -0.0, -1.0],
  "proprio": [0.003, -0.141, 0.011, -2.431, ...],
  "instruction": "<s> You are a helpful assistant. USER: What action should the robot take to `put the white mug on the left plate` ASSISTANT:"
}
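
One way to produce records in this layout is sketched below. The helper `make_record` is hypothetical; the prompt template is copied from the example entry, and the proprio vector is shortened to the four values visible in that entry, so it is a placeholder rather than a full robot state.

```python
import json

PROMPT_TEMPLATE = (
    "<s> You are a helpful assistant. USER: What action should the robot "
    "take to `{task}` ASSISTANT:"
)


def make_record(episode: int, step: int, task: str, raw_action, proprio) -> dict:
    """Build one timestep record with zero-padded episode/step identifiers."""
    stem = f"ep{episode:05d}/step{step:04d}"
    return {
        "id": stem,
        "image": f"{stem}.png",
        "raw_action": list(raw_action),
        "proprio": list(proprio),
        "instruction": PROMPT_TEMPLATE.format(task=task),
    }


record = make_record(
    0, 0, "put the white mug on the left plate",
    raw_action=[0.016, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0],
    proprio=[0.003, -0.141, 0.011, -2.431],  # truncated placeholder
)
print(json.dumps(record))  # one line of the JSONL file
```

Writing each record with `json.dumps` on its own line yields a valid JSONL file.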

We provide a full data processing pipeline example (shown here with LIBERO Long):

Step 1: Action Discretization

python data/finetune_preprocess_libero.py \
  --input_path ./libero_10_raw.jsonl \
  --output_filename ./libero_10_quant.jsonl \
  --csv_filename ./quant_bins.csv \
  --discretize_bins 2047 \
  --task_name libero_10
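
To give a sense of what discretizing into 2047 bins can look like, here is a sketch of uniform binning over a fixed range. Whether finetune_preprocess_libero.py uses uniform bins, quantile bins, or something else is not stated above, so treat the scheme and the helper names as assumptions.

```python
def discretize(value: float, low: float, high: float, bins: int = 2047) -> int:
    """Map a value in [low, high] to a bin index in [0, bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def undiscretize(idx: int, low: float, high: float, bins: int = 2047) -> float:
    """Return the centre of bin `idx`."""
    return low + (idx + 0.5) * (high - low) / bins


print(discretize(-1.0, -1.0, 1.0))  # 0
print(discretize(1.0, -1.0, 1.0))   # 2046
```

With uniform bins the round-trip error is bounded by half a bin width, here (high - low) / (2 * 2047).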

Step 2: Dynamics Formatting (14-step horizon, history, proprio)

python data/dynamics14_libero.py \
  --input_jsonl ./libero_10_quant.jsonl \
  --data_root ./ \
  --csv_path ./quant_bins.csv \
  --horizon 14 \
  --action_type delta-eef \
  --task_name libero_10
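
The 14-step horizon means that each training sample carries a chunk of future actions. A sketch of such chunking follows; the padding rule (repeating the final action at the end of an episode) is an assumption chosen for illustration, and dynamics14_libero.py may handle episode ends differently.

```python
from typing import List, Sequence


def action_chunks(actions: Sequence, horizon: int = 14) -> List[list]:
    """For every timestep t, collect actions[t : t + horizon].

    Windows that run past the end of the episode are padded by repeating
    the final action (one possible convention, assumed here).
    """
    if len(actions) == 0:
        return []
    chunks = []
    for t in range(len(actions)):
        window = list(actions[t:t + horizon])
        window += [actions[-1]] * (horizon - len(window))
        chunks.append(window)
    return chunks


chunks = action_chunks(list(range(20)), horizon=14)
print(chunks[0])   # actions 0..13
print(chunks[-1])  # action 19 repeated 14 times
```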

Step 3: Action / Proprio Normalization

python data/normalize_libero.py \
  --raw_jsonl ./libero_10_raw.jsonl \
  --dynamics_jsonl ./libero_10_dynamics14_v2.jsonl \
  --output_jsonl ./libero_10_final.jsonl \
  --action_stats_json ./action_stats.json \
  --proprio_stats_json ./proprio_stats.json
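
The statistics files suggest a fit-then-apply normalization. The sketch below uses per-dimension mean and standard deviation; the actual statistic used by normalize_libero.py (for example min/max or quantiles) is not specified above, so this is only one plausible scheme with hypothetical helper names.

```python
import statistics
from typing import Dict, List, Sequence


def fit_stats(rows: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Per-dimension mean and population standard deviation."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
    }


def normalize(row: Sequence[float], stats: Dict[str, List[float]], eps: float = 1e-8) -> List[float]:
    """Standardise one action or proprio vector with precomputed statistics."""
    return [(x - m) / (s + eps) for x, m, s in zip(row, stats["mean"], stats["std"])]


def denormalize(row: Sequence[float], stats: Dict[str, List[float]], eps: float = 1e-8) -> List[float]:
    """Invert `normalize`, e.g. before sending a predicted action to the robot."""
    return [x * (s + eps) + m for x, m, s in zip(row, stats["mean"], stats["std"])]


rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
stats = fit_stats(rows)
print(normalize([2.0, 10.0], stats))  # [0.0, 0.0]
```

The `eps` term keeps constant dimensions (zero standard deviation) from causing a division by zero. The same statistics must be reused at deployment time to map predictions back to raw action units.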

To launch finetuning (LIBERO Long example):

cd vipra/
bash scripts/finetune_libero_long.sh

See vipra/scripts/finetune_libero_long.sh for full hyperparameters.


Deployment

ViPRA uses a client–server architecture for deployment: a server that runs inference and a lightweight client that sends observations and receives actions.

Server

Start the inference server:

cd vipra/
bash scripts/run_server.sh [GPU_ID] [PORT]

# Examples:
bash scripts/run_server.sh 0 8005
bash scripts/run_server.sh 1         # GPU 1, default port 8005
bash scripts/run_server.sh           # GPU 0, default port 8005

The server is configured by the ViPRAConfig class in vipra/inference/dynamics_action_cont_server.py. Default endpoint: http://localhost:8005

Client

The ViPRAClient class in vipra/inference/dynamics_action_cont_client.py provides a simple interface for communicating with the inference server and obtaining robot actions. You can customize the client for your own use case and robot platform.

from inference.dynamics_action_cont_client import ViPRAClient
import numpy as np

client = ViPRAClient(
    server_url="http://localhost:8005",
    timeout=(1.0, 5.0),
    image_size=256
)

task_description = "pick up the red block and place it in the blue bowl"
client.reset_policy(task_description)

# Dummy RGB frames; the upper bound of randint is exclusive, so use 256 to cover 0..255.
image1 = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
image2 = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Two request modes available:
actions = client.get_action([image1, image2], mode="json")   # JSON mode (baseline)
actions = client.get_action([image1, image2], mode="bytes")  # JPEG mode (faster)
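
In a control loop, a chunked policy is typically queried, a few of the returned actions are executed, and the policy is queried again with fresh frames. The sketch below shows that pattern with plain callables so it runs without a server. It assumes that `get_action` returns a sequence of action vectors and that two frames are sent per query, as in the example above; `run_episode` and its arguments are hypothetical.

```python
from typing import Callable, List, Sequence


def run_episode(
    get_actions: Callable[[List], Sequence],
    get_frame: Callable[[], object],
    apply_action: Callable[[object], None],
    max_steps: int,
    execute_per_query: int = 4,
) -> int:
    """Query for a chunk, execute its first few actions, then re-query."""
    if execute_per_query < 1:
        raise ValueError("execute_per_query must be at least 1")
    history = [get_frame(), get_frame()]  # two most recent frames
    steps = 0
    while steps < max_steps:
        chunk = list(get_actions(history))
        if not chunk:
            break  # nothing to execute; stop instead of looping forever
        for action in chunk[:execute_per_query]:
            if steps >= max_steps:
                break
            apply_action(action)
            steps += 1
            history = [history[-1], get_frame()]
    return steps


# Stand-ins for a camera, a robot, and the policy server.
frames = iter(range(1000))
executed = []
n = run_episode(
    get_actions=lambda frames_in: [[0.0] * 7] * 14,
    get_frame=lambda: next(frames),
    apply_action=executed.append,
    max_steps=10,
)
print(n, len(executed))  # 10 10
```

With a real client, `get_actions` could be `lambda imgs: client.get_action(imgs, mode="bytes")`. Executing fewer actions per query makes the policy more reactive at the cost of more server round trips.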

API Endpoints

  1. POST /step – JSON payload with images in nested lists.
  2. POST /step_bytes – multipart form data with JPEG-compressed images (recommended).
  3. POST /reset – reset policy and set a new task instruction.

Client-Only Environment

conda env create -f client_environment.yml -n vipra-client
conda activate vipra-client

  • Lightweight: only requests, OpenCV, and NumPy
  • No JAX / PyTorch required
  • Can run on edge devices, laptops, etc.

Citation

If you find our code or models useful in your work, please cite ViPRA:

@misc{routray2025vipra,
      title={ViPRA: Video Prediction for Robot Actions}, 
      author={Sandeep Routray and Hengkai Pan and Unnat Jain and Shikhar Bahl and Deepak Pathak},
      year={2025},
      eprint={2511.07732},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.07732}, 
}

Acknowledgements

ViPRA builds on LWM and LAPA. We thank the authors of these projects for open-sourcing their code and models.


License

ViPRA's code and model weights are released under the Apache License 2.0.