Feste

February 13, 2026 ยท View on GitHub

A GPT-2 style transformer language model implemented from scratch in Rust for educational purposes. Companion code for the blog series Building an LLM From Scratch in Rust.

Why Feste?

Feste is the fool in Shakespeare's Twelfth Night, known for his wordplay and wit. The model trains on Shakespeare's complete works and generates text in his style, making the name a natural fit.

What This Is

A complete trainable transformer that demonstrates how language models work by implementing every component from basic operations. No deep learning frameworks are used.

The implementation trains on Shakespeare's works and generates text in similar style, showing clear perplexity improvements as training progresses.

Blog Series

Each part of the blog has a companion doc with configuration details and implementation reference:

PartBlog PostCode Reference
1Tokenizationdocs/01_TOKENIZATION.md
2Tensor Operationsdocs/02_TENSOR_OPERATIONS.md
3Model Architecturedocs/03_MODEL_ARCHITECTURE.md
4Training Infrastructuredocs/04_TRAINING.md
5A Witless Fooldocs/05_TRAINING_EXAMPLES.md

Quick Start

# Get training data
curl -o shakespeare.txt https://www.gutenberg.org/files/100/100-0.txt

# Train a small model (10-15 minutes)
cargo run --release --example 06_train_shakespeare_small

Reproducing Blog Experiments

The configurable training example lets you reproduce any experiment from the Part 5 blog post using named presets:

# List available presets
cargo run --release --example train -- --list-presets

# Run a preset
cargo run --release --example train -- --preset pocket-bard

# Override parameters
cargo run --release --example train -- --preset spider --steps 10000

# Fully custom configuration
cargo run --release --example train -- \
    --embd 256 --layers 6 --heads 12 --context 448 --vocab 8192

See docs/05_TRAINING_EXAMPLES.md for the full preset table, transfer learning instructions, and details on all training examples.

Examples

Foundation (Parts 1-4)

  • 01_train_tokenizers - BPE tokenization at multiple vocab sizes
  • 02_tensor_operations - Matrix multiplication and operations
  • 03_model_architecture - Transformer architecture exploration
  • 04_training_infrastructure - Training loop components

Training (Part 5)

  • 05_train_shakespeare_tiny - 50K parameters, 2-5 minutes
  • 06_train_shakespeare_small - 200K parameters, 10-20 minutes
  • 07_train_shakespeare_medium - 4M parameters, 1-2 hours
  • 08_train_shakespeare_gpt2 - 163M parameters (GPT-2 Small), 24-30 hours
  • train - Configurable training with blog experiment presets

License

Apache 2.0