Quantization Layer Placement Experiment

February 5, 2026

An empirical study investigating how the placement of quantized layers affects language model inference quality.

Motivation

When quantizing large language models, a common approach is to apply uniform quantization across all layers. However, different layers may have varying sensitivity to precision reduction. This experiment explores whether strategic placement of higher-precision layers can improve model quality while maintaining compression benefits.

Research Question: Given a fixed budget of layers to keep at higher precision, which layers should we prioritize?

Experimental Setup

Model

  • Architecture: Qwen2-0.5B-Instruct
  • Layers: 24 transformer blocks
  • Base precision: FP16 (948 MB)

Tools

  • llama.cpp (v7920) for quantization and inference
  • llama-quantize with --tensor-type flag for per-layer precision control
  • llama-perplexity for evaluation

Evaluation

  • Dataset: WikiText-2 test set
  • Metric: Perplexity (lower is better)
  • Chunks: 5 (context size 512)
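For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities. The function name and the toy input are illustrative assumptions; llama-perplexity reports this kind of quantity, but its exact chunking is not reproduced here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy input (hypothetical): a model that gives every token probability 1/4
uniform_quarter = [math.log(0.25)] * 8
print(perplexity(uniform_quarter))
```

A model that spreads probability evenly over four choices at every step has a perplexity of about 4, which is why lower values indicate a better fit to the text.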

Methodology

Experiment Workflow

The experiment follows a fixed pipeline: starting from the base FP16 model, we produce 10 quantized configurations with llama-quantize, using its --tensor-type flag for per-layer precision control, then evaluate each configuration and the FP16 baseline with llama-perplexity on the WikiText-2 test set.

Quantization Strategies

We tested 10 quantized configurations alongside the FP16 baseline. Every quantized configuration uses Q4_0 as the base type, with selected layers or components kept at Q8_0:

| Strategy | Description | Layers at Q8 |
|---|---|---|
| baseline_fp16 | No quantization | All (FP16) |
| uniform_q4_0 | All layers Q4 | None |
| first_4_layers_q8 | Early layers protected | 0-3 |
| first_8_layers_q8 | More early layers protected | 0-7 |
| last_4_layers_q8 | Late layers protected | 20-23 |
| last_8_layers_q8 | More late layers protected | 16-23 |
| middle_8_layers_q8 | Middle layers protected | 8-15 |
| first_last_4_layers_q8 | Both ends protected | 0-3, 20-23 |
| alternating_even_q8 | Distributed protection | 0,2,4,...,22 |
| attention_q8 | All attention weights Q8 | attn_q/k/v/output |
| ffn_q8 | All FFN weights Q8 | ffn_up/gate/down |
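The layer-based strategies can be written down as lists of block indices and turned into llama-quantize arguments. This is a minimal sketch, not the contents of quantization_experiment.py: the dictionary, the function name, and the default file names are assumptions, and the blk.N=q8_0 form simply mirrors the command shown in the Recommendations section. The attention and FFN strategies are left out because the report does not show their exact patterns.

```python
N_LAYERS = 24  # Qwen2-0.5B-Instruct transformer blocks

# Block indices kept at Q8_0 for each layer-based strategy in the table
LAYER_STRATEGIES = {
    "uniform_q4_0": [],
    "first_4_layers_q8": list(range(0, 4)),
    "first_8_layers_q8": list(range(0, 8)),
    "last_4_layers_q8": list(range(20, 24)),
    "last_8_layers_q8": list(range(16, 24)),
    "middle_8_layers_q8": list(range(8, 16)),
    "first_last_4_layers_q8": list(range(0, 4)) + list(range(20, 24)),
    "alternating_even_q8": list(range(0, N_LAYERS, 2)),
}

def quantize_command(strategy, src="model.gguf", dst="output.gguf"):
    """Build the llama-quantize argument list for one layer-based strategy."""
    cmd = ["llama-quantize"]
    for layer in LAYER_STRATEGIES[strategy]:
        # Same per-block override form as in the Recommendations section
        cmd += ["--tensor-type", f"blk.{layer}=q8_0"]
    cmd += [src, dst, "Q4_0"]  # Q4_0 is the base type for everything else
    return cmd

print(" ".join(quantize_command("first_4_layers_q8")))
```

Returning a list rather than a string keeps the command ready for subprocess.run without shell quoting. How a pattern such as blk.1 is matched against tensor names depends on the llama.cpp build, so it is worth inspecting the tensor types of the output file before trusting a configuration's label.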

Results

Summary Table

All quantized configurations use Q4_0 as the base, with selected layers or components kept at Q8_0 for higher precision. More Q8 tensors mean a larger file, since Q8_0 stores about 8.5 bits per weight and Q4_0 about 4.5 (each including a per-block scale):

| Configuration | Size (MB) | Perplexity | PPL Δ | Compression |
|---|---|---|---|---|
| FP16 Baseline | 948 | 12.90 | - | 1.0x |
| Q4_0 + first 8 layers Q8 | 492 | 13.07 | +1.3% | 1.9x |
| Q4_0 + first 4 layers Q8 | 464 | 13.23 | +2.5% | 2.0x |
| Q4_0 + first+last 4 layers Q8 | 464 | 13.23 | +2.5% | 2.0x |
| Q4_0 + alternating layers Q8 | 435 | 13.24 | +2.6% | 2.2x |
| Q4_0 + FFN Q8 | 485 | 13.31 | +3.1% | 2.0x |
| Q4_0 + last 8 layers Q8 | 393 | 13.67 | +5.9% | 2.4x |
| Q4_0 + middle 8 layers Q8 | 393 | 13.80 | +7.0% | 2.4x |
| Q4_0 + attention Q8 | 357 | 13.82 | +7.1% | 2.7x |
| Q4_0 + last 4 layers Q8 | 364 | 13.93 | +8.0% | 2.6x |
| Uniform Q4_0 | 336 | 14.16 | +9.8% | 2.8x |
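The two derived columns follow directly from size and perplexity. The sketch below recomputes them from the table's rounded values; the dictionary layout and function name are assumptions made for illustration, and only a few rows are included.

```python
BASELINE = {"size_mb": 948, "ppl": 12.90}  # FP16 row of the summary table

# (size in MB, perplexity) copied from the summary table
RESULTS = {
    "first_8_layers_q8": (492, 13.07),
    "first_4_layers_q8": (464, 13.23),
    "last_4_layers_q8": (364, 13.93),
    "uniform_q4_0": (336, 14.16),
}

def summarize(size_mb, ppl, base=BASELINE):
    """Return (perplexity increase in percent, compression ratio) vs the baseline."""
    ppl_delta_pct = (ppl / base["ppl"] - 1.0) * 100.0
    compression = base["size_mb"] / size_mb
    return ppl_delta_pct, compression

for name, (size_mb, ppl) in RESULTS.items():
    delta, ratio = summarize(size_mb, ppl)
    print(f"{name:24s} {delta:+5.1f}%  {ratio:.1f}x")
```

Because the perplexities in the table are already rounded to two decimals, a recomputed percentage can differ from the printed one by 0.1 in the last digit.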

Visualization

Model Size vs Perplexity Tradeoff

This figure plots model size against perplexity for each configuration. The highlighted region marks the first_8_layers_q8 strategy, which reaches 1.9x compression with a 1.3% perplexity increase.

Perplexity by Configuration

Perplexity by Configuration (lower is better)
Base: Q4_0, selected layers at Q8_0
─────────────────────────────────────────────────────────────────

FP16 Baseline              |█ 12.90
Q4_0 + first 8 layers Q8   |██████ 13.07        ← Best quantized
Q4_0 + first 4 layers Q8   |████████████ 13.23
Q4_0 + first+last 4 Q8     |████████████ 13.23
Q4_0 + alternating Q8      |█████████████ 13.24
Q4_0 + FFN Q8              |███████████████ 13.31
Q4_0 + last 8 layers Q8    |██████████████████████████████ 13.67
Q4_0 + middle 8 layers Q8  |███████████████████████████████████ 13.80
Q4_0 + attention Q8        |███████████████████████████████████ 13.82
Q4_0 + last 4 layers Q8    |████████████████████████████████████████ 13.93
Uniform Q4_0               |█████████████████████████████████████████████████ 14.16

Key Findings

1. Early Layers Are Most Sensitive to Quantization

The most significant finding is that early layers benefit more from higher precision than late layers:

  • First 4 layers at Q8: PPL = 13.23 (+2.5%)
  • Last 4 layers at Q8: PPL = 13.93 (+8.0%)
  • Difference: 0.70 perplexity points

This suggests that early transformer layers capture fundamental features (token embeddings, basic patterns) that degrade significantly when quantized aggressively.

2. Late Layers Are More Robust

Contrary to the intuition that "output layers need precision for generation," the last layers showed remarkable resilience to quantization. Protecting the last 8 layers provided less benefit than protecting the first 4 layers alone.

3. FFN vs Attention Trade-off

| Component | Size | PPL |
|---|---|---|
| FFN at Q8 | 485 MB | 13.31 |
| Attention at Q8 | 357 MB | 13.82 |

FFN weights account for more parameters than attention weights, so keeping them at Q8 costs more space (485 MB vs 357 MB), and it also yields lower perplexity (13.31 vs 13.82). This is somewhat surprising given attention's role in computing precise similarity scores.

4. Optimal Strategy

For this model, the optimal trade-off is:

Strategy: first_8_layers_q8
- Keep layers 0-7 at Q8_0
- Quantize layers 8-23 to Q4_0
- Result: 1.9x compression with only 1.3% perplexity increase

5. Diminishing Returns

The relationship between protected layers and quality is not linear:

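| Protected Layers | PPL Improvement vs Uniform |
|---|---|
| First 4 | 0.93 |
| First 8 | 1.09 |
| 12 alternating (even-indexed) | 0.92 |

Doubling the protected early layers from 4 to 8 adds only 0.16 perplexity points of improvement, and protecting 12 alternating layers does no better than protecting the first 4.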
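The improvement figures are differences against the uniform Q4_0 perplexity of 14.16. A small sketch of that arithmetic, with names chosen here for illustration:

```python
UNIFORM_PPL = 14.16  # Uniform Q4_0 row of the summary table

# (label, perplexity) pairs taken from the summary table
PROTECTED = [
    ("first 4", 13.23),
    ("first 8", 13.07),
    ("12 alternating", 13.24),
]

def improvement(ppl, uniform=UNIFORM_PPL):
    """Perplexity points gained relative to uniform Q4_0 (higher is better)."""
    return round(uniform - ppl, 2)

gains = {label: improvement(ppl) for label, ppl in PROTECTED}
# Extra gain from protecting layers 4-7 on top of layers 0-3
marginal = round(gains["first 8"] - gains["first 4"], 2)

print(gains)
print("marginal gain, first 4 -> first 8:", marginal)
```

The marginal gain of the second group of four layers is much smaller than the gain of the first group, which is the sense in which the returns diminish.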

Recommendations

Memory Constrained (< 400 MB)

llama-quantize model.gguf output.gguf Q4_0

Accept ~10% perplexity increase for maximum compression.

Balanced Quality/Size (400-500 MB)

llama-quantize \
  --tensor-type blk.0=q8_0 \
  --tensor-type blk.1=q8_0 \
  --tensor-type blk.2=q8_0 \
  --tensor-type blk.3=q8_0 \
  --tensor-type blk.4=q8_0 \
  --tensor-type blk.5=q8_0 \
  --tensor-type blk.6=q8_0 \
  --tensor-type blk.7=q8_0 \
  model.gguf output.gguf Q4_0

Best quality-to-size ratio with 1.3% perplexity increase.
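Before relying on a command like this, it helps to check which blocks each pattern actually selects. The sketch below assumes that --tensor-type patterns are applied to tensor names as unanchored regular expressions; that is an assumption to verify against your llama.cpp build, and the helper name and the representative tensor name are hypothetical. Under that assumption, blk.1 also matches blk.10 through blk.19, and blk.2 also matches blk.20 through blk.23.

```python
import re

N_LAYERS = 24  # Qwen2-0.5B-Instruct transformer blocks

def matched_layers(patterns, n_layers=N_LAYERS):
    """Blocks whose tensor name matches any pattern under an unanchored regex search."""
    hits = set()
    for layer in range(n_layers):
        name = f"blk.{layer}.attn_q.weight"  # representative tensor name (assumed)
        if any(re.search(p, name) for p in patterns):
            hits.add(layer)
    return sorted(hits)

first_4 = [f"blk.{i}" for i in range(4)]
first_last_4 = first_4 + [f"blk.{i}" for i in range(20, 24)]
# Escaping the dots and closing with a dot pins each pattern to exactly one block
anchored_first_4 = [rf"blk\.{i}\." for i in range(4)]

print(matched_layers(first_4))
print(matched_layers(first_last_4))
print(matched_layers(anchored_first_4))
```

Under this assumption, the first-4 and first+last-4 pattern sets select the same 18 blocks, which would be consistent with their identical size (464 MB) and perplexity (13.23) in the summary table. If your build behaves this way, escaped patterns of the second form would restrict the override to the intended blocks, at a smaller file size than reported here.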

Quality Focused

Keep more early layers at Q8, or switch the base quantization to Q5_K_M. Neither option was measured in this experiment.

Reproducing the Experiment

Prerequisites

# Install llama.cpp
brew install llama.cpp

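# Or build from source (recent llama.cpp versions build with CMake; binaries land in build/bin)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release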

Download Model

mkdir -p models
curl -L -o models/qwen2-0.5b-instruct-fp16.gguf \
  "https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-fp16.gguf"

Run Experiment

python3 quantization_experiment.py
python3 visualize_results.py
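The experiment script itself is not reproduced in this report. As a rough sketch of what one evaluation step might look like, the code below runs llama-perplexity with the settings from the Evaluation section and extracts the final estimate. The function names are hypothetical, and the "Final estimate: PPL = ..." output line is an assumption about the tool's output format that should be checked against your build.

```python
import re
import subprocess

# Assumed format of the summary line printed by llama-perplexity
PPL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+)")

def parse_ppl(output):
    """Extract the final perplexity estimate from tool output, or None if absent."""
    match = PPL_RE.search(output)
    return float(match.group(1)) if match else None

def run_perplexity(model_path, dataset="results/wikitext-test.txt", ctx=512, chunks=5):
    """Run llama-perplexity on one model file and return the parsed perplexity."""
    cmd = [
        "llama-perplexity",
        "-m", model_path,
        "-f", dataset,
        "-c", str(ctx),            # context size used in this report
        "--chunks", str(chunks),   # number of chunks used in this report
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # The summary may be written to either stream, so search both
    return parse_ppl(proc.stdout + proc.stderr)

# Parsing a made-up line in the assumed format
print(parse_ppl("Final estimate: PPL = 13.0712 +/- 0.52311"))
```

Separating parsing from process execution keeps the parser testable without a llama.cpp installation.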

File Structure

llama-cpp-demo/
├── README.md                      # This report
├── quantization_experiment.py     # Main experiment script
├── visualize_results.py           # Analysis and visualization
├── plot_figures.py                # Generate visualization figures
├── run_quantization_experiment.sh # Shell script alternative
├── models/
│   └── qwen2-0.5b-instruct-fp16.gguf
└── results/
    ├── quantization_results.json  # Raw experiment data
    ├── wikitext-test.txt          # Evaluation dataset
    ├── model_size_perplexity_tradeoff.png  # Size vs PPL plot
    └── cover-2.png                # Workflow diagram

Limitations

  1. Single model: Results are from Qwen2-0.5B only; different architectures may behave differently
  2. Single metric: Perplexity on WikiText-2; task-specific performance may vary
  3. Limited scale: Small model (0.5B); larger models may show different layer sensitivity patterns
  4. Q4 vs Q8 only: Did not test intermediate precisions (Q5, Q6)
  5. Layer selection not verified: The reported sizes do not track the nominal number of protected layers (first 4 and first+last 4 are both 464 MB; first 8 is 492 MB while last 8 and middle 8 are 393 MB). If --tensor-type patterns are applied as unanchored regular expressions, a pattern such as blk.1 would also match blk.10 through blk.19, so the early-layer configurations may protect more blocks than their names indicate
