NanoI2V: Building an Image-to-Video Model from Scratch

June 19, 2026

Overview

NanoI2V is a from-scratch implementation of an Image-to-Video (I2V) generation pipeline.

The project focuses on understanding and implementing the core building blocks behind modern video generation systems:

  • Variational Autoencoders (VAE)
  • Latent Video Modeling
  • Diffusion / Flow Matching
  • DiT Transformers
  • Cross-Attention Conditioning
  • Classifier-Free Guidance

The series is published as a dedicated website with structured lessons, explanations, and code walkthroughs: shubham2376g.github.io/NanoI2V

Each topic is a self-contained lesson: read them in order or jump to what you need.


What This Series Covers

The series covers the following areas of modern image-to-video (I2V) models:

| Area             | Topics                                                                         |
|------------------|--------------------------------------------------------------------------------|
| VAE              | Causal 3D convolutions, residual blocks, video encoders & decoders             |
| DiT              | Rotary positional embeddings (RoPE), attention mechanisms, adaptive LayerNorm  |
| Flow & Diffusion | Flow matching, schedulers, denoising concepts                                  |
| Conditioning     | Text conditioning, image conditioning, multimodal embeddings                   |
| Training         | End-to-end training pipeline, optimization, inference                          |
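
To make the Flow & Diffusion row concrete, here is a minimal sketch of how a flow-matching training pair is commonly built: a clean sample and a noise sample are linearly interpolated at a random time t, and the model is trained to regress the constant velocity between them. It uses plain Python lists so it runs without dependencies. The linear path, the convention that t = 0 is data and t = 1 is noise, and the names `make_training_pair` and `mse` are assumptions for illustration, not taken from flow/scheduler.py.

```python
import random

def make_training_pair(x0, noise, t):
    """Return (x_t, target_velocity) for a linear path from data to noise.

    Assumed convention: t = 0 is clean data, t = 1 is pure noise.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]  # d x_t / d t, constant in t
    return x_t, velocity

def mse(pred, target):
    """Mean squared error, a common regression loss for the velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # stand-in for a flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise sample
t = rng.random()                            # uniform timestep in [0, 1)

x_t, v_target = make_training_pair(x0, noise, t)
loss_if_perfect = mse(v_target, v_target)   # zero when the prediction equals the target
```

In a real training loop the prediction passed to `mse` would come from the DiT evaluated at `x_t` and `t`; here the target is compared with itself only to show the shape of the computation.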

Additional topics and modules will be added as the series evolves.


Repository Structure

NanoI2V/
├── vae/
│   ├── conv.py                 # CausalConv3D implementation
│   ├── blocks.py               # 3D ResBlocks and Spatial Attention modules
│   ├── encoder.py              # VAE encoder
│   ├── decoder.py              # VAE decoder
│   └── vae.py                  # VAE model definition

├── dit/
│   ├── rope.py                 # 3D Rotary Positional Embeddings (RoPE)
│   ├── attention.py            # Self-Attention and Cross-Attention layers
│   ├── blocks.py               # DiT blocks with Adaptive LayerNorm (adaLN)
│   └── dit.py                  # Diffusion Transformer (DiT) architecture

├── flow/
│   └── scheduler.py            # Flow Matching scheduler

├── conditioning/
│   └── encoders.py             # Text and image conditioning encoders

├── data/
│   ├── download_vidgen.py      # Dataset download utilities
│   ├── prepare_vidgen.py       # Dataset preparation pipeline
│   └── preprocess_vae.py       # VAE preprocessing scripts

├── docs/
│   └── index.html              # Project website (GitHub Pages)

├── train_vae.py                # VAE training script
├── train_dit.py                # DiT training script
├── inference_dit.py            # Inference and video generation

└── README.md                   # Project documentation

Results

The following results were generated using the models implemented in this repository.

VAE Reconstruction Results

The VAE is trained to compress video clips into a latent representation and reconstruct them with minimal quality loss.
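
The VAE listed under vae/ is built on causal 3D convolutions. The sketch below shows only the causal part along the time axis, reduced to one scalar per frame: the input is padded with kernel_size - 1 frames on the left, so each output frame depends on the current and earlier frames only. Replicating the first frame as padding and the function name `causal_conv_time` are assumptions for illustration; the repository's CausalConv3D may pad differently.

```python
def causal_conv_time(frames, kernel):
    """1D convolution over time where output[t] sees only frames[0..t].

    frames: list of floats, one value per frame (stand-in for a feature map).
    kernel: list of floats; kernel[-1] weights the current frame.
    """
    k = len(kernel)
    # Left-pad with k - 1 copies of the first frame (assumed padding mode);
    # there is no padding on the right, so no future frame leaks in.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]

clip = [1.0, 2.0, 3.0, 4.0]
out = causal_conv_time(clip, kernel=[0.5, 0.5])   # mean of previous and current frame

# Changing the last frame leaves all earlier outputs untouched.
edited = causal_conv_time([1.0, 2.0, 3.0, 100.0], kernel=[0.5, 0.5])
```

The output has as many frames as the input, and `edited` differs from `out` only at the final position, which is the property that makes the convolution causal.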


DiT Video Generation Results

The Diffusion Transformer (DiT) is trained in latent space and generates video sequences conditioned on an input image.
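
Generation with a flow-matching model usually means integrating the learned velocity field from noise back to data, and classifier-free guidance mixes a conditional and an unconditional prediction at each step. The following is a minimal sketch of that loop, with a toy velocity function standing in for the DiT. The Euler integrator, the guidance formula `uncond + scale * (cond - uncond)`, the convention that t = 1 is noise and t = 0 is data, the step count, and all names here are assumptions for illustration; inference_dit.py may differ.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(x, velocity_fn, cond, num_steps=4, scale=1.0):
    """Integrate dx/dt = v from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v_c = velocity_fn(x, t, cond)   # prediction with the condition present
        v_u = velocity_fn(x, t, None)   # prediction with the condition dropped
        v = guided_velocity(v_c, v_u, scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]   # step toward t = 0
    return x

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the DiT: exact velocity of a linear path to `target`."""
    target = cond if cond is not None else [0.0] * len(x)
    # On the path x_t = (1 - t) * target + t * noise, the velocity is (x_t - target) / t.
    return [(xi - gi) / t for xi, gi in zip(x, target)]

start = [2.0, -4.0]   # stand-in for a noise latent
sample = euler_sample(start, toy_velocity, cond=[1.0, 1.0], num_steps=4, scale=1.0)
```

With `scale=1.0` the guided velocity equals the conditional one, so the toy sampler lands on the conditioning target; larger scales push the result further from the unconditional prediction.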

Example 1

Input Image | Generated Video

Text Prompt

Third-person follow shot of a Minecraft-style character running down a stone pathway. The character has brown hair, a grey shirt, and a sword strapped to their back, captured mid-stride from behind. The surrounding environment features textured cobblestone walls and building facades under natural daylight. Smooth animation, blocky voxel aesthetic.


Example 2

Input Image | Generated Video

Text Prompt

High-angle overhead shot of Formula 1 race cars navigating a sharp, wide turn on a grey asphalt track. The camera smoothly zooming and closing in on the bright green and yellow F1 car as it accelerates. The background shows a blurry race barrier and spectator area. Realistic lighting, high speed motion blur.


Prerequisites

A basic understanding of the following will help:

  • PyTorch basics
  • Transformer architecture and attention mechanisms
  • LLM fundamentals
  • Diffusion model intuition (helpful but not required)

If you've worked with LLMs before, many concepts here will feel familiar.


⭐ Support the Project

If you find this useful:

  • Star the repository ⭐
  • Share it with others interested in diffusion or video models