Qwen3.6-35B-A3B-heretic NVFP4 + DFlash on DGX Spark

May 1, 2026


A production-stable deployment of AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 with DFlash speculative decoding on NVIDIA DGX Spark (GB10 / sm_121a).

⚠️ READ THE REQUIREMENTS SECTION FIRST. This image and its weights are tuned specifically for the DGX Spark (GB10 / sm_120-121 Blackwell) with PyTorch nightly cu130. It will NOT work on Hopper, Ampere, B200, or other Blackwell variants without rebuilding.

Model: AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 (~22 GB, multimodal preserved)
Drafter: z-lab/Qwen3.6-35B-A3B-DFlash (~905 MB, public anonymous pull)
Hardware: DGX Spark (NVIDIA GB10, 128 GB unified memory, sm_121a)
Image: ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2 (~9 GB compressed)

Headline performance (measured)

DGX Spark, production config (--max-num-seqs 128, --max-model-len 262144, --max-num-batched-tokens 65536, DFlash spec decode k=15). Mixed-domain prompt set, enable_thinking=false for clean decode-rate measurement.

Single-stream decode (greedy T=0, 10 trials):

| Statistic | tok/s |
|---|---|
| Median | 83.9 |
| p95 | 127.5 |
| Min | 41.1 |
| Max | 127.5 |

Variance reflects DFlash acceptance differences across prompt classes — math/code prompts hit 127 tok/s; open-ended prompts settle around 60-90 tok/s. Decode rate climbs to ~118 tok/s at 1000-token outputs once DFlash steady-state amortizes.

Concurrent throughput (T=0.7 stochastic, 200-tok output, median of 3 runs):

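| Concurrent | Errors | Agg tok/s | Per-req decode p50 (tok/s) | TTFT p50 | TTFT p95 |
|---|---|---|---|---|---|
| 1 | 0 | 102.9 | 109.1 | 111 ms | 111 ms |
| 4 | 0 | 128.1 | 48.5 | 191 ms | 191 ms |
| 16 | 0 | 227.6 | 19.3 | 501 ms | 503 ms |
| 64 | 0 | 310.8 | 6.9 | 1.07 s | 11.2 s |
| 128 | 0 | 313.6 | 6.5 | 14.1 s | 46.7 s |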

Zero errors across 1,200+ requests in the full benchmark.

Aggregate throughput plateaus at ~313 tok/s from 64 concurrent requests upward: that is the GB10 compute wall for this 35B-total / 3B-active MoE with linear-attention layers plus DFlash drafter overhead. Best concurrency for chat UX: 4-16 (TTFT p50 of roughly 500 ms or less, per-request decode 19-48 tok/s); best for maximum throughput: 64-128.
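The trade-off above can be turned into a small lookup. The sketch below embeds the measured rows from the concurrency table (TTFT converted to milliseconds) and picks the highest measured concurrency whose TTFT p95 fits a latency budget. The function names `bench_table` and `pick_concurrency` are illustrative, not part of this repo.

```shell
# Concurrency table from the benchmark above, as whitespace-separated columns:
# concurrency  agg_tok_s  per_req_tok_s  ttft_p50_ms  ttft_p95_ms
bench_table() {
  cat <<'EOF'
1    102.9  109.1  111    111
4    128.1  48.5   191    191
16   227.6  19.3   501    503
64   310.8  6.9    1070   11200
128  313.6  6.5    14100  46700
EOF
}

# Print the highest measured concurrency whose TTFT p95 fits the budget (ms).
# Returns non-zero when no measured row fits.
pick_concurrency() {
  budget_ms=$1
  bench_table | awk -v budget="$budget_ms" '
    $5 <= budget && $1 > best { best = $1; agg = $2 }
    END {
      if (best == "") exit 1
      printf "%d concurrent (%.1f tok/s aggregate)\n", best, agg
    }'
}

pick_concurrency 600
```

With a 600 ms budget this selects the 16-concurrent row. It only interpolates nothing and extrapolates nothing: concurrency levels between the measured rows would need their own benchmark run.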

DFlash spec-decode acceptance: 62-78% position-0, 2.7-4.4 mean accepted tokens per target step.
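To relate the acceptance rate to tokens emitted per target step, the standard speculative-decoding estimate can be used as a rough guide. The sketch below assumes a constant per-position acceptance rate `a` (the figures above report only position-0 acceptance, so this is a simplification) and counts the one token the target model always contributes. `expected_tokens_per_step` is a hypothetical helper name.

```shell
# Expected tokens emitted per target step under a simplified model:
# each of the k drafted tokens is accepted independently with rate a,
# and the target model always contributes one token of its own.
#   E = (1 - a^(k+1)) / (1 - a)
# Assumption: a is constant across draft positions (not stated in the source).
expected_tokens_per_step() {
  a=$1   # per-position acceptance rate, 0 <= a < 1
  k=$2   # number of drafted tokens per step (k=15 in the production config)
  awk -v a="$a" -v k="$k" 'BEGIN {
    if (a < 0 || a >= 1) { print "acceptance rate must be in [0, 1)" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", (1 - a^(k + 1)) / (1 - a)
  }'
}

expected_tokens_per_step 0.62 15   # low end of the reported position-0 range
expected_tokens_per_step 0.78 15   # high end of the reported position-0 range
```

Under these assumptions the two calls give about 2.6 and 4.5 tokens per step, close to the measured 2.7-4.4 range. Whether the measured figure counts the target model's own token is not stated, and real acceptance varies by draft position and prompt class, so treat this as a sanity check only.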

Stress-tested with 22K-token prompts + multi-hour soak: zero crashes.

Full bench results (8 sections including TTFT-by-prompt-length, decode-by-output-length, sampling, long-prompt prefill, RAG-style concurrent) on the HF model card: AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4. Raw JSON + log: bench/qwen36_v2_2026-04-20.json.


⚠️ Hard Requirements (read FIRST)

Hardware (mandatory — image is purpose-built for this only)

| Component | Required | Notes |
|---|---|---|
| GPU | NVIDIA GB10 (DGX Spark only) | sm_120 / sm_121a Blackwell. Other GPUs WILL NOT WORK with the published image. |
| Unified memory | 128 GB | Spark default |
| Disk | 35 GB free | Image (~22 GB) + weights (~22 GB) + drafter (~1 GB) + headroom |

Image will NOT work on:

  • H100/H200 (sm_90 — Hopper)
  • A100/A40 (sm_80 — Ampere)
  • B200/GB200 (sm_100 — different Blackwell variant; rebuild from source)
  • L40S/RTX 4090/RTX PRO 6000 (sm_89/sm_120 desktop variants — see docs/build.md)

Software (mandatory)

| Component | Version | Notes |
|---|---|---|
| NVIDIA driver | 580.x | nvidia-smi should print "NVIDIA GB10" |
| Docker | ≥ 25.x | with nvidia-container-toolkit |
| OS | Ubuntu 24.04 LTS confirmed | other Linux distros likely fine |
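The requirements above can be checked before pulling anything. This is a minimal sketch: the function names are illustrative, the 35 GB threshold and the "GB10" name match come from the tables above, and the default path `/opt` is an assumption based on the quick-start layout. It does not verify the Docker or driver version, only that the tools respond.

```shell
# Pre-flight helpers for the requirements above. Function names are illustrative.

# Succeeds when the filesystem holding PATH has at least MIN_GB free.
has_free_gb() {
  path=$1
  min_gb=$2
  # -P keeps each filesystem on one line; -k reports 1 KiB blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Succeeds when nvidia-smi reports a GPU whose name contains "GB10".
is_gb10() {
  nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | grep -q 'GB10'
}

# Runs all checks; the first failure is reported on stderr.
preflight() {
  has_free_gb "${1:-/opt}" 35 || { echo "need 35 GB free under ${1:-/opt}" >&2; return 1; }
  is_gb10 || { echo "nvidia-smi does not report an NVIDIA GB10" >&2; return 1; }
  docker --version >/dev/null 2>&1 || { echo "docker not found" >&2; return 1; }
  echo "pre-flight OK"
}
```

Run `preflight /opt` on the Spark before the quick start; a non-zero exit means one of the hard requirements is not met.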

DFlash drafter (no auth required)

The DFlash drafter z-lab/Qwen3.6-35B-A3B-DFlash is now a public HF repo — just hf download it directly, no token needed.

⚠️ If you cloned the drafter before 2026-04-19, you MUST re-pull. The earlier drafter had a long-context bug that caused cudaErrorIllegalAddress crashes after ~16K tokens. The fixed version is on HF as of 2026-04-19.


Quick start (5 commands)

# 1. Pre-flight check — confirm anonymous pull works
docker pull ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2

# 2. Pull both models into the canonical layout
sudo mkdir -p /opt/qwen36 && sudo chown $USER:$USER /opt/qwen36
cd /opt/qwen36
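# Optional speed-up; on huggingface_hub versions that support it, this needs the hf_transfer package installed
export HF_HUB_ENABLE_HF_TRANSFER=1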
hf download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./qwen36-nvfp4 &
hf download z-lab/Qwen3.6-35B-A3B-DFlash         --local-dir ./qwen36-dflash &
wait

# 3. Get the compose file
curl -fsSL https://raw.githubusercontent.com/AEON-7/Qwen3.6-NVFP4-DFlash/main/examples/docker-compose.yml \
  -o docker-compose.yml

# 4. Start the server (3-5 min to first "Application startup complete")
docker compose up -d
docker compose logs -f

# 5. Smoke test (use temperature=0 for greedy → max DFlash speedup)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-fast",
    "messages": [{"role":"user","content":"What is 17 × 23?"}],
    "max_tokens": 2048,
    "temperature": 0
  }'

If max_tokens < ~1500 your response may show content: null with finish_reason: "length" — that's the model hitting max-tokens during reasoning, not a crash. See docs/troubleshooting.md. Use ≥ 2048 for thinking-enabled requests.

For the full step-by-step (with pre-flight + post-deploy verification), see docs/dgx-spark-setup.md.
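Instead of watching the logs for "Application startup complete", startup can be polled. The sketch below assumes the compose file publishes vLLM's OpenAI-compatible server at `http://localhost:8000` (the address used in the smoke test) and relies on vLLM's `/health` endpoint. `wait_for_server` and its defaults are illustrative.

```shell
# Poll until the server answers, instead of watching the logs by hand.
# Assumes the server is reachable at the given base URL; /health is
# vLLM's liveness endpoint.
wait_for_server() {
  base_url=${1:-http://localhost:8000}
  max_tries=${2:-60}     # 60 tries x 10 s covers the 3-5 min startup with margin
  sleep_s=${3:-10}
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$base_url/health" 2>/dev/null; then
      echo "server ready after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$sleep_s"
  done
  echo "server not ready after $max_tries tries" >&2
  return 1
}
```

A typical use is `docker compose up -d && wait_for_server` followed by the smoke test, so the first request is not sent while weights are still loading.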


What this image actually is

vLLM HEAD source-built for CUDA 13.0 / sm_120 + PTX (DGX Spark / GB10 / sm_121a) with the following v1/v1.2 backport patch set. Each entry came from a real deployment failure, but several are now legacy on newer vLLM/FlashInfer bases and are kept here so operators can tell which fixes still apply to their image.

| # | Patch | What it fixes |
|---|---|---|
| 1 | register_qwen3_5_text.py | Adds text-only Qwen3_5MoeForCausalLM to vLLM model registry. Legacy for v2 multimodal weights because they load through the canonical multimodal class, but harmless/backward-compatible for v1 text-layout weights. |
| 2 | patch_cuda_optional_import.py | Wraps import vllm._C_stable_libtorch in RTLD_LAZY dlopen. Needed on older sm_120 builds where _C_stable_libtorch references unresolved SM100-only MXFP4 symbols. Newer vLLM builds may already export the needed stubs and can skip this. |
| 3 | patch_kv_cache_utils.py (×4 sites) | Mamba/linear-attention groups could expose block_size=None to downstream arithmetic. Newer vLLM commits derive/validate mamba_block_size before these paths execute, so this is mainly an older-base backport. |
| 4 | patch_mrope_text_fallback.py | Qwen3.6 declares M-RoPE in config but no model class implements get_mrope_input_positions in vLLM HEAD. Adds inline fallback for the canonical text-only positions (T=H=W=arange). |
| 5 | patch_cudagraph_align.py | Aligns spec-decode CUDA graph capture sizes for pure PIECEWISE mode. Default FULL_AND_PIECEWISE spec-decode deployments already capture FULL decode graphs and have not reproduced this failure in long soaks. |
| 6 | ENV VLLM_TEST_FORCE_FP8_MARLIN=1 | v1/v1.2 compatibility guard only. Current v2 images set this to 0 and use FlashInfer CUTLASS NVFP4 successfully on GB10. Keep Marlin only for older bases or shapes that still reject CUTLASS/grouped kernels. |
| 7 | ENV TORCH_CUDA_ARCH_LIST="12.0+PTX" | Build target for sm_120, runtime JITs to sm_121a on Spark. |
| 8 | flashinfer 0.6.8 | sm_120 NVFP4 KV-cache decode kernels (PRs #2520, #2702). |

All patches live in patches/ and run automatically at image build time (idempotent). The Dockerfile is reproducible — see docs/build.md.

Current v2 note: ghcr.io/aeon-7/vllm-spark-omni-q36:v2 and ghcr.io/aeon-7/vllm-aeon-ultimate:qwen36-v2 bake the production GB10 defaults (VLLM_TEST_FORCE_FP8_MARLIN=0, VLLM_USE_FLASHINFER_MOE_FP4=0) and have been validated with FlashInfer CUTLASS NVFP4. latest intentionally remains on the v1.2 line for compatibility.


What changed in v2 (this release)

The previous v1 weights had the language_model. prefix stripped from safetensors keys to match a text-only model class. That required vLLM registry and key-rename patches and was unstable in production (intermittent cudaErrorIllegalAddress crashes during real chat sessions).

v2 (current) is re-quantized directly from tvall43/Qwen3.6-35B-A3B-heretic with AutoModelForImageTextToText, preserving:

  • Full multimodal architecture (Qwen3_5MoeForConditionalGeneration)
  • 27-block ViT vision encoder (BF16, NVFP4-skipped)
  • Original model.language_model.layers.X.* key layout — vLLM's multimodal class loads natively, no prefix-strip patch needed
  • 30 linear_attention (Mamba/GDN, fp32) + 10 full_attention layers
  • 256 routed experts × 8 active + 1 shared expert per layer
  • All 122,880 per-expert NVFP4 keys (every expert calibrated)
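As a quick sanity check on the key count in the list above, the arithmetic can be written out. This assumes all 40 layers (30 linear_attention + 10 full_attention) carry 256 routed experts each and that the shared expert is not included in the 122,880 figure; neither is stated explicitly here. `keys_per_expert` is a hypothetical helper.

```shell
# Per-expert key count implied by the figures above.
# 30 linear-attention + 10 full-attention layers = 40 layers,
# each assumed to hold 256 routed experts.
keys_per_expert() {
  total_keys=$1
  layers=$2
  experts_per_layer=$3
  echo $(( total_keys / (layers * experts_per_layer) ))
}

keys_per_expert 122880 $((30 + 10)) 256
```

Under those assumptions the count divides evenly into 12 NVFP4 keys per routed expert per layer.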

vLLM serves it via the canonical multimodal class — fewer code paths in the inference hot loop, much better stability under load. Travis ran multiple live chat sessions (Celina) without a single crash where v1 was crashing on virtually every interaction.


OpenClaw integration

The compose serves 3 model aliases for the same backend:

  • qwen36-35b-heretic — canonical name
  • qwen36-fast — intended for greedy/agentic workloads (T=0 → 78% DFlash acceptance, 117 tok/s single-stream)
  • qwen36-deep — intended for sampled/creative workloads (T=0.7 → low DFlash acceptance, ~50 tok/s; tradeoff for diversity)

OpenClaw config (validated against actual zod schemas) is in docs/openclaw.md. The pattern: register two model entries pointing to the same backend with different default params.temperature. Route per agent or per channel binding.
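Outside OpenClaw, the same per-alias temperature pattern can be approximated on the client side. The sketch below maps the two aliases to the temperatures described above and builds a chat-completions body for the local server from the quick start. It assumes the server does not apply a per-alias default itself, so the temperature is always sent explicitly. `chat_payload` and `ask` are illustrative names, and the actual OpenClaw schema lives in docs/openclaw.md.

```shell
# Build a chat-completions request body for one of the compose aliases.
# Limitation: PROMPT must not contain double quotes or backslashes,
# since this sketch does no JSON escaping.
chat_payload() {
  alias_name=$1
  prompt=$2
  case "$alias_name" in
    qwen36-fast) temperature=0 ;;     # greedy: best DFlash acceptance
    qwen36-deep) temperature=0.7 ;;   # sampled: more diverse, slower
    *) echo "unknown alias: $alias_name" >&2; return 1 ;;
  esac
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":2048,"temperature":%s}\n' \
    "$alias_name" "$prompt" "$temperature"
}

# Send the request to the local server from the quick start.
ask() {
  body=$(chat_payload "$1" "$2") || return 1
  printf '%s' "$body" | curl -fsS http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' -d @-
}
```

For example, `ask qwen36-fast 'What is 17 x 23?'` sends a greedy request, while `ask qwen36-deep 'Write a haiku about GPUs'` sends a sampled one to the same backend.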


Documentation map

| Doc | Audience |
|---|---|
| docs/dgx-spark-setup.md | Primary deployment guide; start here for full Spark setup |
| docs/openclaw.md | OpenClaw gateway integration (validated against real zod schemas) |
| docs/dflash.md | DFlash speculative decoding tuning + monitoring |
| docs/dtree.md | Future work: slot DTree in when z-lab releases |
| docs/quantization.md | Recreating the NVFP4 quantization end-to-end (including v2 recipe) |
| docs/build.md | Building the image yourself instead of pulling from GHCR |
| docs/troubleshooting.md | Symptoms → root causes → fixes |
| docs/patches.md | Each patch explained, with the upstream issues they address |

Credits

License

Apache 2.0 (matching upstream vLLM, FlashInfer, llmcompressor). The base model carries its own license — see tvall43/Qwen3.6-35B-A3B-heretic.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH)
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL)
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR)
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.