Performance Log

May 25, 2026

Append-only record of benchmark runs. Use git diff to spot regressions. Correctness is enforced by the pytest suite (see docs/testing.md).

Performance benchmark — 2026-05-24

Hardware: RTX 5070 12 GB, INT4
Model: gsai-ml/LLaDA-8B-Instruct
Script: benchmarks/perf_ci.py

Metric	Value	Gate
Tokens/sec batch=1 (baseline)	32.9	—
Tokens/sec batch=8 (baseline)	110.6	—
Tokens/sec batch=1 (LocalLeap)	58.1	—
Tokens/sec batch=8 (LocalLeap)	146.8	—
LocalLeap speedup batch=1	1.77×	≥1.10× ✓
LocalLeap speedup batch=8	1.33×	≥1.10× ✓
Step duration p50 (baseline)	192.1 ms	—
Step duration p99 (baseline)	192.3 ms	p99/p50 ≤ 3×
p99/p50 ratio (baseline)	1.00×	✓
Steps executed (baseline)	16	—
GPU mem peak (baseline)	5.64 GB	≤ 11.8 GB ✓
Step duration p50 (LocalLeap)	192.4 ms	—
Step duration p99 (LocalLeap)	192.6 ms	p99/p50 ≤ 3×
p99/p50 ratio (LocalLeap)	1.00×	✓
Steps executed (LocalLeap)	11	(fewer than baseline = anchor propagation working)
GPU mem peak (LocalLeap)	5.64 GB	≤ 11.8 GB ✓
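
The gate column above can be reproduced from the raw numbers. The following is a minimal sketch, not the code in benchmarks/perf_ci.py: the `RunStats` container and `check_gates` function are hypothetical names introduced for illustration, and only the thresholds (≥1.10× speedup, p99/p50 ≤ 3×, ≤ 11.8 GB) come from the table.

```python
from dataclasses import dataclass

# Thresholds as they appear in the "Gate" column of the table above.
MIN_SPEEDUP = 1.10
MAX_P99_P50_RATIO = 3.0
MAX_GPU_MEM_GB = 11.8


@dataclass
class RunStats:
    """Hypothetical container for one benchmark configuration."""
    tok_s_b1: float          # tokens/sec at batch=1
    tok_s_b8: float          # tokens/sec at batch=8
    step_p50_ms: float       # median step duration
    step_p99_ms: float       # 99th-percentile step duration
    gpu_mem_peak_gb: float   # peak GPU memory


def check_gates(baseline: RunStats, leap: RunStats) -> dict[str, bool]:
    """Return one pass/fail flag per gate (True means the gate passes)."""
    gates = {
        "speedup_b1": leap.tok_s_b1 / baseline.tok_s_b1 >= MIN_SPEEDUP,
        "speedup_b8": leap.tok_s_b8 / baseline.tok_s_b8 >= MIN_SPEEDUP,
    }
    for name, run in (("baseline", baseline), ("leap", leap)):
        ratio = run.step_p99_ms / run.step_p50_ms
        gates[f"tail_{name}"] = ratio <= MAX_P99_P50_RATIO
        gates[f"mem_{name}"] = run.gpu_mem_peak_gb <= MAX_GPU_MEM_GB
    return gates


# Values copied from the LLaDA-8B-Instruct table above.
baseline = RunStats(32.9, 110.6, 192.1, 192.3, 5.64)
leap = RunStats(58.1, 146.8, 192.4, 192.6, 5.64)

results = check_gates(baseline, leap)
speedup_b1 = round(leap.tok_s_b1 / baseline.tok_s_b1, 2)
speedup_b8 = round(leap.tok_s_b8 / baseline.tok_s_b8, 2)
print(results, speedup_b1, speedup_b8)
```

With the values from this run, every flag is true and the rounded speedups match the 1.77× and 1.33× entries in the table.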

Performance benchmark — 2026-05-24

Hardware: RTX 5070 12 GB, INT4
Model: gsai-ml/LLaDA-1.5
Script: benchmarks/perf_ci.py

Metric	Value	Gate
Tokens/sec batch=1 (baseline)	32.8	—
Tokens/sec batch=8 (baseline)	110.3	—
Tokens/sec batch=1 (LocalLeap)	56.5	—
Tokens/sec batch=8 (LocalLeap)	147.2	—
LocalLeap speedup batch=1	1.72×	≥1.10× ✓
LocalLeap speedup batch=8	1.33×	≥1.10× ✓
Step duration p50 (baseline)	192.1 ms	—
Step duration p99 (baseline)	192.4 ms	p99/p50 ≤ 3×
p99/p50 ratio (baseline)	1.00×	✓
Steps executed (baseline)	16	—
GPU mem peak (baseline)	5.64 GB	≤ 11.8 GB ✓
Step duration p50 (LocalLeap)	193.9 ms	—
Step duration p99 (LocalLeap)	198.4 ms	p99/p50 ≤ 3×
p99/p50 ratio (LocalLeap)	1.02×	✓
Steps executed (LocalLeap)	12	(fewer than baseline = anchor propagation working)
GPU mem peak (LocalLeap)	5.64 GB	≤ 11.8 GB ✓

HF reference comparison — 2026-05-24

Hardware: RTX 5070 12 GB, INT4
Settings: steps=16, gen_length=64, 4 test prompts
Script: benchmarks/compare_hf.py

Path	LLaDA-8B-Instruct	LLaDA-1.5
HF reference (`reference/llada_reference.py`), batch=1 sequential	32.4 tok/s	32.7 tok/s
dlmserve batch=1 sequential	32.8 tok/s (1.01× HF)	32.7 tok/s (1.00× HF)
dlmserve batch=4	81.7 tok/s (2.52× HF)	82.0 tok/s (2.51× HF)

At batch=1, dlmserve matches the HF reference to within measurement noise (1.00–1.01×), which indicates that dlmserve adds no measurable per-request overhead. Batching at N=4 delivers ~2.5× the HF reference throughput on both models.
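
The "× HF" figures in the table are plain throughput ratios. As a small worked example (the dictionary layout and the `relative_throughput` helper are illustrative assumptions, not part of benchmarks/compare_hf.py):

```python
# Throughput in tok/s, copied from the comparison table above.
HF_REFERENCE = {"LLaDA-8B-Instruct": 32.4, "LLaDA-1.5": 32.7}
DLMSERVE = {
    "LLaDA-8B-Instruct": {1: 32.8, 4: 81.7},
    "LLaDA-1.5": {1: 32.7, 4: 82.0},
}


def relative_throughput(model: str, batch: int) -> float:
    """dlmserve throughput divided by the HF reference throughput."""
    return DLMSERVE[model][batch] / HF_REFERENCE[model]


for model in HF_REFERENCE:
    for batch in (1, 4):
        ratio = relative_throughput(model, batch)
        print(f"{model} batch={batch}: {ratio:.2f}x HF")
```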

LocalLeap calibration sweep — LLaDA-1.5 — 2026-05-24

Hardware: RTX 5070 12 GB, INT4
Settings: steps=16, gen_length=256, 20 HumanEval problems
Script: scripts/calibrate_local_leap.py

Baseline pass@1 = 0.100.

κ	τ	W	pass@1	diff (baseline − pass@1)	gate
0.50	0.60	4	0.050	+0.050	✗
0.50	0.75	4	0.100	+0.000	✓
0.60	0.60	4	0.050	+0.050	✗
0.60	0.75	4	0.100	+0.000	✓
0.70	0.60	4	0.050	+0.050	✗
0.70	0.75	4	0.100	+0.000	✓
0.80	0.60	4	0.050	+0.050	✗
0.80	0.75	4	0.100	+0.000	✓
0.90	0.60	4	0.100	+0.000	✓
0.90	0.75	4	0.050	+0.050	✗

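τ=0.75 is the determining threshold: the gate passes at τ=0.75 for every κ ∈ [0.5, 0.8] and fails at τ=0.60 over the same range. The pattern inverts at κ=0.9, which passes at τ=0.60 and fails at τ=0.75. Picked (κ=0.5, τ=0.75, W=4), the lowest passing κ, for maximum anchor-propagation aggressiveness. These values are wired into _LOCAL_LEAP_MODEL_DEFAULTS in dlmserve/engine.py. Confirmed that all 4 LocalLeap quality gates (MMLU, HumanEval, BLEU, determinism) pass for LLaDA-1.5 at these values.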
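
The selection can be sketched as a filter followed by a minimum over κ. This is an illustration under stated assumptions, not the logic in scripts/calibrate_local_leap.py: it assumes the gate passes only when pass@1 does not drop below the baseline (the log shows ✓ only for rows with zero drop, but the script's actual tolerance is not recorded here), and it treats a smaller κ as more aggressive, as the selection above does. `SweepRow`, `passes_gate`, and `pick_most_aggressive` are hypothetical names.

```python
from typing import NamedTuple

BASELINE_PASS_AT_1 = 0.100


class SweepRow(NamedTuple):
    kappa: float
    tau: float
    window: int
    pass_at_1: float


# Rows copied from the sweep table above.
SWEEP = [
    SweepRow(0.50, 0.60, 4, 0.050),
    SweepRow(0.50, 0.75, 4, 0.100),
    SweepRow(0.60, 0.60, 4, 0.050),
    SweepRow(0.60, 0.75, 4, 0.100),
    SweepRow(0.70, 0.60, 4, 0.050),
    SweepRow(0.70, 0.75, 4, 0.100),
    SweepRow(0.80, 0.60, 4, 0.050),
    SweepRow(0.80, 0.75, 4, 0.100),
    SweepRow(0.90, 0.60, 4, 0.100),
    SweepRow(0.90, 0.75, 4, 0.050),
]


def passes_gate(row: SweepRow, tolerance: float = 0.0) -> bool:
    """Assumed gate: pass@1 may drop by at most `tolerance` vs. baseline."""
    drop = BASELINE_PASS_AT_1 - row.pass_at_1
    return drop <= tolerance + 1e-9  # epsilon guards against float rounding


def pick_most_aggressive(rows: list[SweepRow]) -> SweepRow:
    """Among passing rows, choose the smallest kappa (ties: smallest tau)."""
    passing = [r for r in rows if passes_gate(r)]
    if not passing:
        raise ValueError("no configuration passes the quality gate")
    return min(passing, key=lambda r: (r.kappa, r.tau))


passing_rows = [r for r in SWEEP if passes_gate(r)]
chosen = pick_most_aggressive(SWEEP)
print(chosen)
```

On the rows above, five configurations pass and the sketch selects (κ=0.5, τ=0.75, W=4), the same values recorded in this entry.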

LLaDA-8B-Instruct continues to use the paper-published defaults (κ=0.9, τ=0.75, W=4) from arXiv:2510.07081 §4.1. Because (κ=0.9, τ=0.75) fails the gate for LLaDA-1.5, per-model defaults are necessary.
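
A per-model lookup of this kind might look like the sketch below. The layout is an assumption: the real _LOCAL_LEAP_MODEL_DEFAULTS in dlmserve/engine.py may be structured differently, and `LocalLeapParams`, `MODEL_DEFAULTS_SKETCH`, `local_leap_params`, and the fallback to the paper defaults for unknown models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalLeapParams:
    """Hypothetical parameter bundle for LocalLeap."""
    kappa: float
    tau: float
    window: int


# Paper-published defaults (arXiv:2510.07081 §4.1), as recorded in this log.
PAPER_DEFAULTS = LocalLeapParams(kappa=0.9, tau=0.75, window=4)

# Illustrative table only; not the actual contents of dlmserve/engine.py.
MODEL_DEFAULTS_SKETCH = {
    "gsai-ml/LLaDA-8B-Instruct": PAPER_DEFAULTS,
    "gsai-ml/LLaDA-1.5": LocalLeapParams(kappa=0.5, tau=0.75, window=4),
}


def local_leap_params(model_id: str) -> LocalLeapParams:
    """Look up per-model values; unknown models fall back to paper defaults
    (the fallback behaviour is an assumption of this sketch)."""
    return MODEL_DEFAULTS_SKETCH.get(model_id, PAPER_DEFAULTS)


print(local_leap_params("gsai-ml/LLaDA-1.5"))
```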