NanoI2V: Building an Image-to-Video Model from Scratch

June 19, 2026

Overview

NanoI2V is a from-scratch implementation of an Image-to-Video (I2V) generation pipeline.

The project focuses on understanding and implementing the core building blocks behind modern video generation systems, including:

Variational Autoencoders (VAE)
Latent Video Modeling
Diffusion / Flow Matching
DiT Transformers
Cross-Attention Conditioning
Classifier-Free Guidance
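To make the last item in the list above concrete, here is a minimal sketch of the standard classifier-free guidance combination step. It uses plain Python lists instead of tensors so it runs without dependencies. The function name `cfg_combine` and its parameters are hypothetical and are not taken from the NanoI2V code.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Blend unconditional and conditional model predictions element-wise.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    and values above 1 extrapolate past the conditional prediction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Illustrative values only; a real model would produce these predictions.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]
guided = cfg_combine(uncond_pred, cond_pred, guidance_scale=2.0)
```

In a tensor framework the same expression applies directly to the two model outputs, typically computed in one batched forward pass.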

The series is published as a dedicated website with structured lessons, explanations, and code walkthroughs: shubham2376g.github.io/NanoI2V

Each topic is a self-contained lesson, so you can read them in order or jump straight to what you need.

What This Series Covers

The series is organized around the following areas:

Area	Topics
VAE	Causal 3D convolutions, residual blocks, video encoders & decoders
DiT	Rotary positional embeddings (RoPE), attention mechanisms, adaptive LayerNorm
Flow & Diffusion	Flow matching, schedulers, denoising concepts
Conditioning	Text conditioning, image conditioning, multimodal embeddings
Training	End-to-end training pipeline, optimization, inference

Additional topics and modules will be added as the series evolves.
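As an illustration of the flow matching row in the table above, the following is a minimal sketch of the straight-path (rectified flow) formulation: a linear interpolation between noise and data, the constant velocity used as the regression target, and a fixed-step Euler sampler. The convention that t = 0 is noise and t = 1 is data is an assumption (some implementations reverse it), and all function names are hypothetical rather than taken from flow/scheduler.py.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Constant velocity of the straight path; the model is trained to regress this."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Sanity check with an oracle that returns the true velocity for one
# noise/data pair; a trained network would take its place.
noise = [0.0, 2.0]
data = [1.0, -2.0]
oracle = lambda x, t: velocity_target(noise, data)
sample = euler_sample(oracle, noise, num_steps=4)
```

With the oracle velocity the sampler lands on the data point, which is a quick way to check the sign and time conventions before plugging in a learned model.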

Repository Structure

NanoI2V/
├── vae/
│   ├── conv.py                 # CausalConv3D implementation
│   ├── blocks.py               # 3D ResBlocks and Spatial Attention modules
│   ├── encoder.py              # VAE encoder
│   ├── decoder.py              # VAE decoder
│   └── vae.py                  # VAE model definition
│
├── dit/
│   ├── rope.py                 # 3D Rotary Positional Embeddings (RoPE)
│   ├── attention.py            # Self-Attention and Cross-Attention layers
│   ├── blocks.py               # DiT blocks with Adaptive LayerNorm (adaLN)
│   └── dit.py                  # Diffusion Transformer (DiT) architecture
│
├── flow/
│   └── scheduler.py            # Flow Matching scheduler
│
├── conditioning/
│   └── encoders.py             # Text and image conditioning encoders
│
├── data/
│   ├── download_vidgen.py      # Dataset download utilities
│   ├── prepare_vidgen.py       # Dataset preparation pipeline
│   └── preprocess_vae.py       # VAE preprocessing scripts
│
├── docs/
│   └── index.html              # Project website (GitHub Pages)
│
├── train_vae.py                # VAE training script
├── train_dit.py                # DiT training script
├── inference_dit.py            # Inference and video generation
│
└── README.md                   # Project documentation

Results

The following results were generated using the models implemented in this repository.

VAE Reconstruction Results

The VAE is trained to compress video clips into a latent representation and reconstruct them with minimal quality loss.
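To give a sense of how much compression is involved, here is a small sketch that computes the latent shape for a clip. It assumes a convention common in causal video VAEs, where the first frame is encoded on its own and the remaining frames are grouped by the temporal factor. The default factors (4x temporal, 8x spatial, 16 latent channels) are illustrative assumptions, not the values used by the NanoI2V VAE, and the function names are hypothetical.

```python
def latent_shape(num_frames, height, width,
                 temporal_factor=4, spatial_factor=8, latent_channels=16):
    """Return (channels, frames, height, width) of the latent for one clip.

    Assumes the first frame maps to its own latent frame and every
    following group of `temporal_factor` frames maps to one latent frame.
    """
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames must be 1 + k * temporal_factor")
    if height % spatial_factor != 0 or width % spatial_factor != 0:
        raise ValueError("height and width must be divisible by spatial_factor")
    latent_frames = 1 + (num_frames - 1) // temporal_factor
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


def compression_ratio(num_frames, height, width, **kwargs):
    """Ratio of RGB input values to latent values for one clip."""
    c, t, h, w = latent_shape(num_frames, height, width, **kwargs)
    return (3 * num_frames * height * width) / (c * t * h * w)


# Hypothetical clip: 17 frames at 256x256.
shape = latent_shape(17, 256, 256)
ratio = compression_ratio(17, 256, 256)
```

Under these assumed factors, the DiT operates on a tensor far smaller than the raw clip, which is the main reason generation happens in latent space.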

DiT Video Generation Results

The Diffusion Transformer (DiT) is trained in latent space and generates video sequences conditioned on an input image.

Example 1

Input Image

Generated Video

Text Prompt

Third-person follow shot of a Minecraft-style character running down a stone pathway. The character has brown hair, a grey shirt, and a sword strapped to their back, captured mid-stride from behind. The surrounding environment features textured cobblestone walls and building facades under natural daylight. Smooth animation, blocky voxel aesthetic.

Example 2

Input Image

Generated Video

Text Prompt

High-angle overhead shot of Formula 1 race cars navigating a sharp, wide turn on a grey asphalt track. The camera smoothly zooming and closing in on the bright green and yellow F1 car as it accelerates. The background shows a blurry race barrier and spectator area. Realistic lighting, high speed motion blur.

Prerequisites

A basic understanding of the following will help:

PyTorch basics
Transformer architecture and attention mechanisms
LLM fundamentals
Diffusion model intuition (helpful but not required)

If you've worked with LLMs before, many concepts here will feel familiar.

⭐ Support the Project

If you find this useful:

Star the repository ⭐
Share it with others interested in diffusion or video models