nanowhale 🐳

May 4, 2026

A ~110M parameter language model trained from scratch using the DeepSeek-V4 architecture. This repo contains all the code, configs, and tokenizer used to pretrain and fine-tune the model.

Models

| Model | Description | Link |
| --- | --- | --- |
| nanowhale-100m-base | Pretrained base model (5K steps on FineWeb-Edu) | 🤗 Hub |
| nanowhale-100m | SFT chat model (3K steps on SmolTalk) | 🤗 Hub |

Architecture

The model implements the full DeepSeek-V4 feature set at miniature scale:

  • Multi-Head Latent Attention (MLA): 8 heads, 1 KV head (MQA), head_dim=96 (32 RoPE + 64 NoPE), q_lora_rank=160
  • Mixture-of-Experts (MoE): 4 routed + 1 shared expert, top-2 routing, SwiGLU FFN (dim 640)
  • Hyper-Connections: hc_mult=4, Sinkhorn routing (2 iterations)
  • Multi-Token Prediction (MTP): 1 next-token prediction layer
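
To make the "4 routed + 1 shared expert, top-2 routing" item concrete, here is a minimal pure-Python sketch of generic top-k expert routing. It is not taken from modeling_deepseek_v4.py: the softmax gate, the renormalisation of the kept weights, and the names `top_k_route` and `moe_forward` are assumptions chosen for illustration, and the real implementation may score and combine experts differently.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all routed experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the kept probabilities so the mixture weights sum to 1
    # (an assumption; some MoE variants skip this step).
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    """Shared expert output plus the weighted outputs of the top-k routed experts."""
    out = shared_expert(x)  # the shared expert always runs
    for idx, weight in top_k_route(router_logits, k):
        out += weight * routed_experts[idx](x)
    return out


# Toy setup: 4 routed experts that scale a scalar input, plus 1 shared expert.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: x

logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token
print(top_k_route(logits))
print(moe_forward(1.0, logits, routed, shared))
```

Only two of the four routed experts run per token, so the compute per token stays close to that of a dense FFN with two experts' worth of width, while the shared expert contributes to every token.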

| Parameter | Value |
| --- | --- |
| Total params | ~110M (41M embeddings + 69M non-embedding) |
| Hidden size | 320 |
| Layers | 8 |
| Vocab size | 129,280 (DeepSeek-V4 tokenizer) |
| Context length | 2,048 tokens |
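
The 41M embedding figure follows directly from the vocab size and hidden size listed above. The short check below assumes a single vocab-by-hidden embedding matrix is being counted (for example, tied input and output embeddings); that is an inference from the numbers, not something this README states.

```python
VOCAB_SIZE = 129_280        # from the table above
HIDDEN_SIZE = 320           # from the table above
TOTAL_PARAMS = 110_000_000  # "~110M", so this is only approximate

# One embedding matrix of shape (vocab, hidden); assumed, see the note above.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
share = embedding_params / TOTAL_PARAMS

print(f"{embedding_params:,} embedding parameters")  # 41,369,600 embedding parameters
print(f"{share:.1%} of the ~110M total")             # 37.6% of the ~110M total
```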

Repo Structure

├── modeling_deepseek_v4.py         # DeepSeek-V4 model implementation
├── configuration_deepseek_v4.py    # Model config class
├── requirements.txt
├── configs/
│   ├── main_100m.yaml              # Training hyperparameters (100M model)
│   ├── debug.yaml                  # Quick debug config (50 steps)
│   └── fallback_under_1b.yaml      # Alternative config
├── scripts/
│   ├── train_pretrain.py           # Pretraining (SFTTrainer on FineWeb-Edu)
│   ├── train_sft.py                # SFT fine-tuning (SFTTrainer on SmolTalk)
│   ├── eval_smoke.py               # Perplexity evaluation & generation
│   ├── chat.py                     # Interactive chat
│   ├── upload_to_hub.py            # Hub upload utility
│   ├── count_params.py             # Parameter counting
│   ├── prepare_data.py             # Data preparation
│   └── inspect_deepseek_v4.py      # Architecture inspection
└── tokenizer/
    ├── tokenizer.json
    └── tokenizer_config.json

Quick Start

Install

pip install -r requirements.txt

Pretraining

python scripts/train_pretrain.py --config configs/main_100m.yaml

SFT

python scripts/train_sft.py

Chat

python scripts/chat.py

Evaluation

python scripts/eval_smoke.py

Training Results

Pretraining (5,000 steps on FineWeb-Edu)

| Metric | Value |
| --- | --- |
| Tokens seen | ~2.6B |
| Final loss | ~5.3 |
| Token accuracy | 33.8% |
| Hardware | 1× H100 80GB, bf16 |
| Throughput | 72 ms/step (with torch.compile) |
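
As a rough sanity check on the token count, the arithmetic below derives the implied tokens per optimizer step. It assumes every sequence is packed to the full 2,048-token context, which this README does not state; under that assumption the result is consistent with an effective batch of roughly 256 sequences per step.

```python
TOKENS_SEEN = 2_600_000_000  # "~2.6B" from the table, so only approximate
STEPS = 5_000                # pretraining steps
CONTEXT_LENGTH = 2_048       # tokens per sequence, assuming full packing

tokens_per_step = TOKENS_SEEN / STEPS
sequences_per_step = tokens_per_step / CONTEXT_LENGTH

print(f"{tokens_per_step:,.0f} tokens per step")                   # 520,000 tokens per step
print(f"{sequences_per_step:.0f} full-length sequences per step")  # 254 full-length sequences per step
```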

SFT (3,000 steps on SmolTalk)

| Metric | Start | End |
| --- | --- | --- |
| Train loss | 15.41 | 10.22 |
| Eval loss | 2.873 | 2.607 |
| Token accuracy | 36.2% | 48.5% |

Perplexity (held-out English text)

| Model | Perplexity |
| --- | --- |
| Pretrained | 13.62 |
| SFT | 12.90 |
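
For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition in isolation; the function name and the example losses are hypothetical, and scripts/eval_smoke.py may batch, mask, or average differently.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with losses in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses, for illustration only (not measured values).
example_nlls = [2.9, 2.4, 2.7, 2.5]
print(round(perplexity(example_nlls), 2))
```

A model that assigns every token probability 1/4 has a per-token loss of ln 4 and therefore a perplexity of exactly 4, which is a handy way to sanity-check an evaluation loop.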

Known Issues

  • bf16 NaN: The model produces NaNs in bf16 at this small scale because the Hyper-Connections architecture produces values that overflow the bf16 range. Use fp32 for inference and training.
  • from_pretrained quirk: The custom architecture causes from_pretrained to re-initialize some weights. Use manual load_state_dict instead (see model cards for examples).
  • Large vocab / small model: The 129K vocab embedding table consumes 37% of all parameters, limiting capacity for language modeling.

License

MIT