ViPRA: Video Prediction for Robot Actions
January 27, 2026
Sandeep Routray1,2, Hengkai Pan1, Unnat Jain2,3, Shikhar Bahl2, Deepak Pathak1,2
1Carnegie Mellon University 2Skild AI 3University of California, Irvine
Corresponding author: Sandeep Routray
News
- [2026/01/26] ViPRA accepted at ICLR 2026.
- [2025/12/06] ViPRA won the Best Paper Award at NeurIPS 2025 EWM Workshop.
- [2025/10/13] ViPRA accepted for an Oral at NeurIPS 2025 EWM Workshop.
- [2025/10/01] ViPRA accepted at NeurIPS 2025 SpaVLE Workshop.
Overview
- A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
- A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
- A flow matching action decoder with action chunking for high-frequency continuous control.
- Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.
Latent Action Model
The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.
Key Features
- Actionless Learning: Learns from videos directly; no action annotations required.
- Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
- Multi-Dataset: Trained on diverse human and robot data.
- Optical Flow Consistency: Uses optical flow for temporal consistency regularization.
Architecture
- Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
- Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
- Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
- Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
- Flow Network: RAFT-based optical flow estimation for consistency loss.
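To make the Vector Quantizer item above more concrete, here is a minimal, dependency-free sketch of the noise-substitution idea behind NSVQ on a single vector: during training, the true quantization error is replaced by random noise rescaled to the same norm, so gradients can reach the encoder output. This is a toy under stated assumptions, not the code in laq/; the function names and the three-entry codebook are hypothetical.

```python
import math
import random


def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z (Euclidean)."""
    dists = [math.dist(z, c) for c in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def nsvq_train_output(z, codebook, rng):
    """Training-time NSVQ output: z + ||z - e|| * n / ||n||, with n ~ N(0, I)."""
    idx, e = nearest_code(z, codebook)
    err_norm = math.dist(z, e)                    # norm of the true quantization error
    noise = [rng.gauss(0.0, 1.0) for _ in z]
    noise_norm = math.sqrt(sum(n * n for n in noise)) or 1.0  # guard against zero noise
    z_q = [zi + err_norm * ni / noise_norm for zi, ni in zip(z, noise)]
    return idx, z_q


rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]  # hypothetical 3-entry codebook
z = [0.8, 1.3]
idx, z_q = nsvq_train_output(z, codebook, rng)
```

At inference time the noise is dropped and the nearest codebook entry (here `idx`) would serve as the discrete latent action token.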
Environment Setup
cd laq/
conda env create -f environment.yml -n laq
conda activate laq
Configuration
Training configs live in laq/configs/config.py. Key parameters:
- Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
- Data: 224×224 crops, 8-frame sequences.
- Quantization: 32-dim latent space, NSVQ codebook.
- Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
- Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.
Dataset Structure Requirements
You can match these layouts or extend laq/model/data.py to support your own.
Something-Something-v2 (SSv2)
ssv2/
├── labels/
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── 20bn-something-something-v2/
│   ├── [video_id].webm
│   └── ...
Example config:
ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",  # "train", "val", "trainval", "test", "all"
    stepsize=2,  # frame sampling stride
)
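The interaction between `stepsize` (frame sampling stride) and the 8-frame sequence length can be sketched as follows. This is an illustration only: the helper names are hypothetical, and whether the loader in laq/model/data.py picks clip starts exactly this way is an assumption.

```python
def clip_frame_indices(start, num_frames=8, stepsize=2):
    """Indices of the frames in one clip when every `stepsize`-th frame is kept."""
    return [start + i * stepsize for i in range(num_frames)]


def valid_clip_starts(video_length, num_frames=8, stepsize=2):
    """Start indices for which the whole clip fits inside the video."""
    span = (num_frames - 1) * stepsize   # distance from first to last sampled frame
    return range(max(0, video_length - span))


indices = clip_frame_indices(0)          # frames 0, 2, ..., 14
starts = valid_clip_starts(20)           # a 20-frame video allows starts 0..5
```

With `stepsize=2`, one 8-frame clip spans 15 raw frames, so shorter videos yield no clips under this scheme.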
OpenX Datasets (Fractal, Bridge, Kuka)
dataset_name/
├── processed/
│   ├── trajectory_001/
│   │   └── images/
│   │       ├── 000000.jpg
│   │       ├── 000001.jpg
│   │       └── ...
│   ├── trajectory_002/
│   └── ...
Example config:
bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)
LIBERO
LIBERO/
├── libero_10_modified/
│   └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│   └── images/...
├── libero_object_modified/
│   └── images/...
└── libero_spatial_modified/
    └── images/...
Example config:
libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
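The two `num_trajs` styles used in the configs (integer counts for Bridge, floats for LIBERO) could be resolved along these lines. This is a sketch assuming that a float is a fraction of the available trajectories and an int is an absolute count; `resolve_num_trajs` and the LIBERO total of 500 are hypothetical.

```python
def resolve_num_trajs(spec, total):
    """Turn a num_trajs entry into an absolute trajectory count.

    Assumed convention: a float is a fraction of `total`, an int is a count.
    """
    if isinstance(spec, bool):
        raise TypeError("num_trajs must be an int or a float, not a bool")
    if isinstance(spec, float):
        if not 0.0 <= spec <= 1.0:
            raise ValueError(f"fraction out of range: {spec}")
        return int(round(spec * total))
    return min(int(spec), total)         # never ask for more than exist


libero_spec = dict(trainval=1.0, val=0.1)
bridge_spec = dict(trainval=25460, val=2546)
libero_counts = {k: resolve_num_trajs(v, total=500) for k, v in libero_spec.items()}
bridge_counts = {k: resolve_num_trajs(v, total=25460) for k, v in bridge_spec.items()}
```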
Custom Dataset
- Add a discovery function in laq/model/data.py:
def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # return list of frame directories / trajectories
    return list_of_paths
- Add your dataset case in VideoDatasetCoTrain.
- Add your config block to laq/configs/config.py.
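As a starting point for the discovery function, the sketch below fills in the signature shown above for one possible layout. It assumes a hypothetical structure `data_root/<trajectory>/images/*.jpg`, and the `min_frames` and `val_every` keywords are made up for illustration; adapt the split rule to whatever VideoDatasetCoTrain expects.

```python
from pathlib import Path
from typing import List


def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    """Return sorted trajectory directories under data_root that hold enough frames."""
    min_frames = kwargs.get("min_frames", 8)   # hypothetical: room for one 8-frame clip
    val_every = kwargs.get("val_every", 10)    # hypothetical split rule
    trajs = sorted(
        d for d in Path(data_root).iterdir()
        if d.is_dir() and len(list((d / "images").glob("*.jpg"))) >= min_frames
    )
    if mode == "val":
        trajs = trajs[::val_every]             # every n-th trajectory is held out
    elif mode == "train":
        trajs = [d for i, d in enumerate(trajs) if i % val_every != 0]
    elif mode not in ("trainval", "all"):
        raise ValueError(f"unknown mode: {mode}")
    return [str(d) for d in trajs]
```

Sorting before splitting keeps the train/val assignment deterministic across runs.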
Training
Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:
bash run_train_laq.sh
Inference and Evaluation
To reproduce the codebook analysis and figures shown in the paper:
# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage
# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer
To use the LAQ model to generate training data with latent actions for ViPRA policy pretraining, use the dataset-specific latent generation scripts:
# LIBERO
python -m inference.libero.libero_latent
# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka
# SSv2
python -m inference.ssv2.ssv2_latent
These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:
Sample JSONL Entry:
{
"instruction": "pick up the red block and place it in the blue bowl",
"raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
"image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
"latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
"latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
"fields_la": "[instruction],[vision],latent_action",
"fields_ls": "[instruction],[vision],latent_state",
"fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
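A generated file can be sanity-checked line by line with a few lines of Python. The sketch below parses one JSONL line and verifies the fields shown in the sample entry; `parse_sample` and the choice of required keys are assumptions (for example, `raw_action` is treated as optional here because actionless video has no action labels).

```python
import json

REQUIRED_KEYS = {"instruction", "image", "latent_state", "latent_action_idxs"}


def parse_sample(line):
    """Parse one JSONL line and check the fields shown in the sample entry."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    if not all(isinstance(i, int) and i >= 0 for i in sample["latent_action_idxs"]):
        raise ValueError("latent_action_idxs must be non-negative integers")
    return sample


# One line as it might appear in a generated shard (shortened for illustration)
line = json.dumps({
    "instruction": "pick up the red block and place it in the blue bowl",
    "image": ["libero_10_modified/images/traj_001/step0000.jpg"],
    "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
    "latent_action_idxs": [3, 7, 1, 4],
})
sample = parse_sample(line)
```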
ViPRA Policy
The ViPRA policy builds on Large World Model (LWM), a video-language foundation model. We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and a flow matching decoder for continuous control.
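For readers new to flow matching, the sketch below shows how a chunk of continuous actions can be sampled by integrating a velocity field from noise (t=0) to data (t=1) with Euler steps. It is a toy, not the ViPRA decoder: the learned network is replaced by the closed-form velocity of a straight path toward a known target, and all names, the 14×7 chunk shape, and the step count are assumptions.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a learned network: the exact velocity of the straight path
    x_t = (1 - t) * noise + t * target, rewritten as (target - x_t) / (1 - t)."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
horizon, action_dim = 14, 7                                    # assumed chunk shape
target_chunk = [0.1 * i for i in range(horizon * action_dim)]  # hypothetical "true" chunk
noise = [rng.gauss(0.0, 1.0) for _ in target_chunk]
chunk = euler_sample(make_toy_velocity(target_chunk), noise, num_steps=10)
```

In a real decoder, `velocity` would be a network conditioned on the policy's features, and the flattened `chunk` would be reshaped into 14 actions of 7 dimensions each.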
Environment Setup
cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra
Before training, download the VQ-GAN image tokenizer, text tokenizer and pretrained model parameters from LWM-Chat-1M-Jax and place them under vipra/lwm/:
mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/
Pretraining Data
We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:
mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/
cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2.
Each training sample includes:
- history frames
- latent state target
- latent action tokens from LAQ
- natural language task text
This dataset is already chunked into 14-step latent action sequences.
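The sample JSONL entry in the LAQ inference section pairs frames step0000 and step0001 with a latent state at step0015, which suggests a two-frame history and a target 14 steps ahead of the latest frame. Under that reading (an inference from the sample, not a documented rule), the frame indices for each training sample could be enumerated like this; the function names are hypothetical.

```python
def dynamics_sample_indices(t, horizon=14, history=2):
    """Frame indices for one sample: `history` frames ending at t, target at t + horizon."""
    first = t - history + 1
    if first < 0:
        raise ValueError("not enough history before step t")
    return {"history": list(range(first, t + 1)), "target": t + horizon}


def all_samples(num_frames, horizon=14, history=2):
    """Every valid sample in a trajectory of `num_frames` frames."""
    return [dynamics_sample_indices(t, horizon, history)
            for t in range(history - 1, num_frames - horizon)]


first_sample = dynamics_sample_indices(1)   # history [0, 1], target 15
samples = all_samples(20)                   # a 20-frame trajectory yields 5 samples
```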
Vision Cache (Optional, speeds up training)
We also release a VQGAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:
mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/
This contains precomputed VQGAN token sequences for each frame, which can be used instead of running the image tokenizer online.
If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.
Running Pretraining
Launch pretraining using the provided script (configured for 8×H200 GPUs):
cd vipra/
bash scripts/pretrain.sh
See vipra/scripts/pretrain.sh for full hyperparameters.
Finetuning
Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:
cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/
For task-specific finetuning, prepare your dataset in JSONL format where each line represents a single timestep with the following structure:
{
"id": "ep00000/step0000",
"image": "ep00000/step0000.png",
"raw_action": [0.016, 0.0, -0.0, 0.0, 0.0, -0.0, -1.0],
"proprio": [0.003, -0.141, 0.011, -2.431, ...],
"instruction": "<s> You are a helpful assistant. USER: What action should the robot take to `put the white mug on the left plate` ASSISTANT:"
}
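A small helper can keep the record layout and prompt template consistent across timesteps. The sketch below reproduces the fields and the instruction template from the example above; `make_record` is a hypothetical name, and the proprio vector is a placeholder of arbitrary length.

```python
import json

PROMPT = ("<s> You are a helpful assistant. USER: What action should the robot "
          "take to `{task}` ASSISTANT:")


def make_record(episode, step, raw_action, proprio, task):
    """Build one timestep record with the fields shown in the example above."""
    stem = f"ep{episode:05d}/step{step:04d}"
    return {
        "id": stem,
        "image": f"{stem}.png",
        "raw_action": list(raw_action),
        "proprio": list(proprio),
        "instruction": PROMPT.format(task=task),
    }


record = make_record(0, 0, [0.016, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0],
                     [0.0] * 8,   # placeholder proprio values, length is arbitrary
                     "put the white mug on the left plate")
jsonl_line = json.dumps(record)   # one line of the finetuning JSONL file
```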
We provide a full data processing pipeline example (shown here with LIBERO Long):
Step 1: Action Discretization
python data/finetune_preprocess_libero.py \
--input_path ./libero_10_raw.jsonl \
--output_filename ./libero_10_quant.jsonl \
--csv_filename ./quant_bins.csv \
--discretize_bins 2047 \
--task_name libero_10
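To illustrate what discretizing actions into 2047 bins means, here is a minimal sketch using equal-width bins per action dimension. Whether finetune_preprocess_libero.py uses equal-width or data-driven (for example quantile) bin edges is not stated here, so treat the binning rule, the [-1, 1] range, and all function names as assumptions.

```python
from bisect import bisect_right


def make_bin_edges(lo, hi, num_bins):
    """Interior edges of `num_bins` equal-width bins covering [lo, hi]."""
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(1, num_bins)]


def discretize(value, edges):
    """Bin index in [0, len(edges)]; out-of-range values fall in the end bins."""
    return bisect_right(edges, value)


def bin_center(index, lo, hi, num_bins):
    """Continuous value at the middle of bin `index` (approximate inverse)."""
    width = (hi - lo) / num_bins
    return lo + width * (index + 0.5)


edges = make_bin_edges(-1.0, 1.0, 2047)     # hypothetical per-dimension range
token = discretize(0.016, edges)
recovered = bin_center(token, -1.0, 1.0, 2047)
```

The round trip loses at most half a bin width, which is about 0.0005 for this range.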
Step 2: Dynamics Formatting (14-step horizon, history, proprio)
python data/dynamics14_libero.py \
--input_jsonl ./libero_10_quant.jsonl \
--data_root ./ \
--csv_path ./quant_bins.csv \
--horizon 14 \
--action_type delta-eef \
--task_name libero_10
Step 3: Action / Proprio Normalization
python data/normalize_libero.py \
--raw_jsonl ./libero_10_raw.jsonl \
--dynamics_jsonl ./libero_10_dynamics14_v2.jsonl \
--output_jsonl ./libero_10_final.jsonl \
--action_stats_json ./action_stats.json \
--proprio_stats_json ./proprio_stats.json
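The normalization step can be pictured as computing per-dimension statistics once and applying them to every action or proprio vector. The sketch below uses mean and standard deviation; the actual statistics written by normalize_libero.py (mean/std, min/max, or quantiles) are not specified here, so this choice and the function names are assumptions.

```python
import json
import statistics


def compute_stats(rows):
    """Per-dimension mean and standard deviation over a list of vectors."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
    }


def normalize(vec, stats, eps=1e-8):
    """Standardize one vector; eps keeps constant dimensions from dividing by zero."""
    return [(v - m) / (s + eps) for v, m, s in zip(vec, stats["mean"], stats["std"])]


actions = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]   # hypothetical raw actions
action_stats = compute_stats(actions)
normalized = [normalize(a, action_stats) for a in actions]
stats_json = json.dumps(action_stats)            # what a stats JSON file might hold
```

Saving the statistics matters because the same values are needed at deployment time to map predicted actions back to the robot's units.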
To launch finetuning (for LIBERO Long example):
cd vipra/
bash scripts/finetune_libero_long.sh
See vipra/scripts/finetune_libero_long.sh for full hyperparameters.
Deployment
ViPRA uses a client-server architecture for deployment: a server runs inference, and a lightweight client sends observations and receives actions.
Server
Start the inference server:
cd vipra/
bash scripts/run_server.sh [GPU_ID] [PORT]
# Examples:
bash scripts/run_server.sh 0 8005
bash scripts/run_server.sh 1 # GPU 1, default port 8005
bash scripts/run_server.sh # GPU 0, default port 8005
The server is configured by the ViPRAConfig class in vipra/inference/dynamics_action_cont_server.py.
Default endpoint: http://localhost:8005
Client
The ViPRAClient class in vipra/inference/dynamics_action_cont_client.py provides a simple interface for communicating with the inference server and obtaining robot actions. It can be customized for your use case and robot platform.
from inference.dynamics_action_cont_client import ViPRAClient
import numpy as np

client = ViPRAClient(
    server_url="http://localhost:8005",
    timeout=(1.0, 5.0),
    image_size=256,
)

# Set the task instruction before requesting actions
task_description = "pick up the red block and place it in the blue bowl"
client.reset_policy(task_description)

# Dummy RGB frames; randint's upper bound is exclusive, so 256 covers 0..255
image1 = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
image2 = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Two request modes available:
actions = client.get_action([image1, image2], mode="json")   # JSON mode (baseline)
actions = client.get_action([image1, image2], mode="bytes")  # JPEG mode (faster)
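On a robot, these calls would typically sit inside a control loop that keeps a short frame history, requests a chunk of actions, executes it, and queries again. The sketch below shows one such loop with an offline stand-in for the client so it runs without a server. It assumes `get_action` returns a sequence of actions and that two frames of history are sent per request; `run_episode`, `FakeClient`, `get_frame`, and `apply_action` are hypothetical names.

```python
from collections import deque


def run_episode(client, get_frame, apply_action, task, max_steps=50, history=2):
    """Closed-loop control: query a chunk of actions, execute it, then re-query."""
    client.reset_policy(task)
    frames = deque([get_frame()] * history, maxlen=history)  # pad history with first frame
    steps = 0
    while steps < max_steps:
        actions = client.get_action(list(frames), mode="bytes")
        if not actions:                      # nothing to execute, stop instead of spinning
            break
        for action in actions:
            apply_action(action)
            frames.append(get_frame())       # oldest frame drops out automatically
            steps += 1
            if steps >= max_steps:
                break
    return steps


class FakeClient:
    """Offline stand-in for ViPRAClient so the loop can be tried without a server."""

    def reset_policy(self, task):
        self.task = task

    def get_action(self, images, mode="json"):
        return [[0.0] * 7 for _ in range(14)]   # assumed: a 14-step chunk of 7-D actions


executed = []
n = run_episode(FakeClient(), get_frame=lambda: "frame", apply_action=executed.append,
                task="pick up the red block and place it in the blue bowl", max_steps=30)
```

Executing the full chunk before re-querying minimizes server round trips; executing only the first few actions of each chunk trades latency for faster reaction to new observations.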
API Endpoints
- POST /step: JSON payload with images in nested lists.
- POST /step_bytes: multipart form data with JPEG-compressed images (recommended).
- POST /reset: reset policy and set a new task instruction.
Client-Only Environment
conda env create -f client_environment.yml -n vipra-client
conda activate vipra-client
- Lightweight: only requests, OpenCV, numpy
- No JAX / PyTorch required
- Can run on edge devices, laptops, etc.
Citation
If you find our code or models useful in your work, please cite ViPRA:
@misc{routray2025vipra,
title={ViPRA: Video Prediction for Robot Actions},
author={Sandeep Routray and Hengkai Pan and Unnat Jain and Shikhar Bahl and Deepak Pathak},
year={2025},
eprint={2511.07732},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.07732},
}
Acknowledgements
ViPRA builds on LWM and LAPA. We thank the authors of these projects for open-sourcing their code and models.
License
ViPRA's code and model weights are released under the Apache License 2.0.