ane-infer

March 5, 2026

Apple Neural Engine (ANE) LLM Inference Engine — reverse-engineered private APIs, Metal GPU compute shaders, hybrid ANE+GPU+CPU on Apple Silicon M1/M2/M3/M4/M5

Hybrid ANE+Metal+CPU inference engine for LLMs on Apple Silicon.

First implementation of Qwen3.5 (Gated DeltaNet + GQA) running natively on Apple Neural Engine via reverse-engineered private APIs. Metal GPU decode runs at 32 tok/s (llama.cpp: 34.8 tok/s on the same chip), fused ANE mega-kernels reach 3.6 TFLOPS, and everything is built from scratch in Rust + Obj-C + Metal.

Built on the shoulders of maderix/ANE — the project that cracked open ANE training. We took it further into inference with DeltaNet, Metal GPU shaders, and a complete decode pipeline.

Keywords

apple-neural-engine ane apple-silicon metal-gpu llm-inference on-device-ai neural-engine m1 m2 m3 m4 m5 private-api reverse-engineering coreml gguf quantization q4 q8 deltanet qwen rust metal-shaders npu mlx-alternative llama-cpp-alternative macos ios edge-ai low-power-inference

What This Is — Apple Neural Engine LLM Inference

A from-scratch LLM inference engine that runs Qwen3.5-2B on three Apple Silicon accelerators simultaneously:

Apple Neural Engine (ANE) — batched prefill via 1x1 convolutions through private _ANEClient APIs
Metal GPU — single-token decode with 13 custom compute shaders, ONE command buffer per token
CPU (NEON/AMX) — parallel Q8_0 GEMV via rayon, Accelerate BLAS fallback

No CoreML. No Python. No MLX. Just system frameworks + objc_msgSend.
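
The ANE bullet above relies on expressing each linear projection as a 1x1 convolution. The sketch below is a plain CPU illustration of why the two are interchangeable: with a channel-major layout, a 1x1 convolution computes the same dot products as a per-token matrix-vector product. The function names and layouts are hypothetical and are not taken from this repository.

```rust
/// Linear projection, token-major: `x` is [seq][c_in], `w` is [c_out][c_in].
/// Returns y as [seq][c_out].
fn linear(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    x.iter()
        .map(|tok| {
            w.iter()
                .map(|row| row.iter().zip(tok).map(|(a, b)| a * b).sum::<f32>())
                .collect()
        })
        .collect()
}

/// 1x1 convolution, channel-major: `x` is flat [c_in * seq] indexed as
/// (channel, position), `w` is flat [c_out * c_in]. Returns flat [c_out * seq].
fn conv1x1(x: &[f32], w: &[f32], c_in: usize, c_out: usize, seq: usize) -> Vec<f32> {
    assert_eq!(x.len(), c_in * seq);
    assert_eq!(w.len(), c_out * c_in);
    let mut y = vec![0.0f32; c_out * seq];
    for o in 0..c_out {
        for i in 0..c_in {
            let wi = w[o * c_in + i];
            // A 1x1 kernel touches exactly one position at a time,
            // so every position gets the same weight applied.
            for s in 0..seq {
                y[o * seq + s] += wi * x[i * seq + s];
            }
        }
    }
    y
}
```

For any input, `linear(..)[s][o]` equals `conv1x1(..)[o * seq + s]`; only the memory layout differs, which is what lets a whole prompt be pushed through the ANE as one batched convolution.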

What This Is Not

Not faster than llama.cpp (yet). Q8 decode is close to theirs (32 vs 34.8 tok/s); prefill still lags.
Not production-ready. Private API usage means it breaks with macOS updates.
Not a general inference framework. Built specifically for Qwen3.5 DeltaNet hybrid architecture.

Performance — ANE vs Metal GPU vs CPU on Apple Silicon

Qwen3.5-2B Q8_0 on Apple M5 (same chip as llama.cpp benchmarks)

| Backend | Throughput | Power | Notes |
|---|---|---|---|
| Metal GPU Q8 decode | 32 tok/s | ~15W | Close to llama.cpp (34.8 tok/s) |
| Metal GPU Q4 decode | 42 tok/s | ~15W | Q6K dequant WIP |
| CPU Q8 decode | 23 tok/s | ~5W | Rayon + NEON |
| ANE prefill pp16 | 33 tok/s | ~3W | Fused FFN mega-kernel |
| ANE fused FFN | 3.6 TFLOPS | ~3W | ~3x single-op throughput |
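
One way to read the table is energy per token, which is simply power divided by throughput. The sketch below does that arithmetic with the approximate figures from the table; the power values are rough (they carry a "~" in the table), and the ANE row counts prompt tokens rather than generated tokens, so treat the result as an order-of-magnitude comparison only.

```rust
/// Energy per token in joules: watts divided by tokens per second.
fn joules_per_token(tok_per_s: f64, watts: f64) -> f64 {
    watts / tok_per_s
}

/// (backend, joules per token) for the token-throughput rows of the table.
fn energy_table() -> Vec<(&'static str, f64)> {
    // (name, tok/s, approximate watts) copied from the table above.
    let rows: [(&'static str, f64, f64); 4] = [
        ("Metal GPU Q8 decode", 32.0, 15.0),
        ("Metal GPU Q4 decode", 42.0, 15.0),
        ("CPU Q8 decode", 23.0, 5.0),
        ("ANE prefill pp16", 33.0, 3.0),
    ];
    rows.iter()
        .map(|&(name, tps, watts)| (name, joules_per_token(tps, watts)))
        .collect()
}
```

Under those figures the GPU Q8 path costs roughly 0.47 J/token, GPU Q4 about 0.36, CPU about 0.22, and ANE prefill about 0.09.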

Apple Neural Engine (ANE) Reverse Engineering — Private API Discoveries

We went deeper than anyone into Apple's private Neural Engine framework. Key discoveries:

What We Cracked

| Discovery | Impact |
|---|---|
| `doEvaluateDirectWithModel:` | Bypasses ANE daemon, 10% faster eval |
| Multi-procedure MIL models | N functions in one compiled program, dispatch by `procedureIndex` |
| `prepareChainingWithModel:` succeeds | First public success; error 15 was caused by using the wrong `_ANEIOSurfaceOutputSets` factory method |
| `_ANEIOSurfaceOutputSets.objectWithstatsSurRef:outputBuffer:` | The correct factory method (not `outputSetsWithBuffers:`) |
| CoreML MLProgram → `MLProgramEngine` → `MLNeuralNetworkEngine` | Confirmed ANE enabled (`isANEPathForbidden=NO`, `modelIsMIL=YES`) |
| Espresso C++ runtime path | CoreML uses Espresso internally, no `_ANEModel` exposed |
| H11ANE IOKit user client type=1,4 | Direct kernel driver access via `IOServiceOpen` |
| `_ANEDaemonConnection` XPC surface | 19 methods including chaining, RT, telemetry |

ANE Chaining — The Breakthrough

After 7 probe iterations across two sessions, we discovered that ANEProgramChainingPrepare() error 15 was not a firmware limitation — it was caused by using the wrong _ANEIOSurfaceOutputSets factory method.

Before: outputSetsWithBuffers:@[buf_out]  → error 15
After:  objectWithstatsSurRef:ioStats outputBuffer:@[buf_out]  → SUCCESS

Both prepareChainingWithModel: (daemon) and doPrepareChainingWithModel: (direct) succeed. buffersReady remains blocked — the next frontier.

Fused Mega-Kernels

Instead of dispatching one ANE kernel per linear projection (1.1 TFLOPS per op), we fuse multiple operations into single MIL programs:

Fused FFN: gate_proj conv → sigmoid → mul → up_proj conv → mul → down_proj conv = 8 ops, ONE dispatch, 3.6 TFLOPS
Fused QKV: 3 parallel convolutions from same input = 1 dispatch
Fused dual projection: gate + ssm_out in one program

The ANE compiler automatically handles weight blobs larger than the 32MB of on-chip SRAM by spilling them to DRAM, so no manual tiling is needed.
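
As a reference for what the fused FFN program computes, here is a minimal CPU sketch of the same chain (gate projection, sigmoid and multiply to form SiLU, up projection, multiply, down projection). It is not the MIL program itself; the function names and the row-major weight layout are assumptions made for illustration.

```rust
/// y = W x for a row-major matrix `w` with `rows` rows and `x.len()` columns.
fn matvec(w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// SiLU(v) = v * sigmoid(v): the "sigmoid -> mul" pair in the fused chain.
fn silu(v: f32) -> f32 {
    v * (1.0 / (1.0 + (-v).exp()))
}

/// SwiGLU feed-forward: down( silu(gate(x)) * up(x) ).
/// `w_gate` and `w_up` are [hidden x dim], `w_down` is [dim x hidden].
fn swiglu_ffn(x: &[f32], w_gate: &[f32], w_up: &[f32], w_down: &[f32], hidden: usize) -> Vec<f32> {
    let gate = matvec(w_gate, x, hidden);
    let up = matvec(w_up, x, hidden);
    let h: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(w_down, &h, x.len())
}
```

Run unfused, each of the three `matvec` calls would be its own ANE dispatch with the intermediates round-tripping through memory; fusing keeps `gate`, `up`, and `h` inside one program.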

Metal GPU Compute Shaders for LLM Decode on Apple Silicon

13 custom Metal compute shaders encode the entire DeltaNet + FullAttention forward pass into one command buffer per token:

| Shader | Purpose |
|---|---|
| `q8_gemv` | Q8_0 GEMV (NR0=2, NQ=8, 4 simdgroups, simd_sum) |
| `q4_gemv` | Q4_0 GEMV (same pattern, nibble unpacking) |
| `deltanet_recurrence` | Full per-head state update (decay/recall/delta/update/query) |
| `conv1d_silu` | Shift + apply + SiLU activation |
| `compute_beta_decay` | Computes sigmoid(beta) and exp(a*softplus(alpha+bias)) |
| `sdpa_causal` | Flash Attention decode (single-pass online softmax) |
| `rope_apply` | Rotary position embeddings |
| `rmsnorm_simple` | 128-thread reduction RMSNorm |
| `rmsnorm_gated` | Per-head RMSNorm with SiLU gate |
| `sigmoid_gate` | Output gating |
| `q_gate_split` | Deinterleave packed Q+gate projection |
| `residual_add` | Element-wise residual connection |
| `silu_mul` | Fused SiLU(gate) * up |
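
For readers unfamiliar with the Q8_0 layout behind `q8_gemv`, the following is a scalar CPU sketch: weights are stored in blocks of 32 signed 8-bit values that share one scale. It mirrors the arithmetic only, not the Metal shader's threading. The GGUF format stores the scale as f16; this sketch keeps it as f32 to stay dependency-free, and the type and function names are hypothetical.

```rust
const QK8_0: usize = 32; // weights per Q8_0 block

struct BlockQ8 {
    d: f32,          // per-block scale (f16 on disk; f32 here for simplicity)
    qs: [i8; QK8_0], // quantized weights
}

/// Quantize a row whose length is a multiple of 32 into Q8_0 blocks.
fn quantize_q8_0(x: &[f32]) -> Vec<BlockQ8> {
    assert!(x.len() % QK8_0 == 0);
    x.chunks(QK8_0)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let d = amax / 127.0;
            let inv = if d > 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; QK8_0];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { d, qs }
        })
        .collect()
}

/// y[r] = sum over blocks of d * sum(q[i] * x[i]); one output per weight row.
fn q8_gemv(rows: &[Vec<BlockQ8>], x: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(x.chunks(QK8_0))
                .map(|(blk, xs)| {
                    let acc: f32 = blk.qs.iter().zip(xs).map(|(&q, &v)| q as f32 * v).sum();
                    blk.d * acc // apply the scale once per block, not per element
                })
                .sum::<f32>()
        })
        .collect()
}
```

Applying the scale once per block is the property the GPU and NEON kernels exploit: the inner loop is pure int8-times-float accumulation.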

Zero per-token Metal buffer allocations. All params pre-allocated at model load.

The GPU Performance Journey

| Optimization | Speed | Gain |
|---|---|---|
| Starting point (params buffer corruption) | 0.1 tok/s | n/a |
| Fix shared params buffer | 3.5 tok/s | 35x |
| Single command buffer per token | 5.0 tok/s | 1.4x |
| llama.cpp-style Q8 GEMV shader | 32.6 tok/s | 6.5x |
| NR0=2 threadgroup dispatch fix | 34.7 tok/s | 1.06x |
| FullAttention layers on GPU | 30.0 tok/s | (added 6 layers) |
| Flash SDPA (single-pass softmax) | 42.3 tok/s | +10% |
| Total improvement | 0.1 → 42 tok/s | 420x |
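
The "Flash SDPA (single-pass softmax)" row refers to the online-softmax trick: instead of one pass to find the maximum score and a second to normalise, a running maximum, running denominator, and running weighted sum are rescaled as each key arrives. Below is a scalar sketch for a single decode query next to a conventional two-pass reference. The function names are hypothetical, and causal masking is implicit because the cache only holds past positions.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Single-pass attention for one query over cached keys/values.
fn sdpa_decode_online(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let d_v = values.first().map_or(0, |v| v.len());
    let mut m = f32::NEG_INFINITY; // running max of the scores seen so far
    let mut l = 0.0f32;            // running sum of exp(score - m)
    let mut acc = vec![0.0f32; d_v];
    for (k, v) in keys.iter().zip(values) {
        let s = scale * dot(q, k);
        let m_new = m.max(s);
        let corr = (m - m_new).exp(); // rescales old terms; 0 on the first key
        let p = (s - m_new).exp();
        l = l * corr + p;
        for (a, x) in acc.iter_mut().zip(v) {
            *a = *a * corr + p * x;
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}

/// Two-pass reference: max first, then normalised weighted sum.
fn sdpa_decode_two_pass(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| scale * dot(q, k)).collect();
    let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - m).exp()).collect();
    let l: f32 = exps.iter().sum();
    let d_v = values.first().map_or(0, |v| v.len());
    let mut out = vec![0.0f32; d_v];
    for (p, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += p / l * x;
        }
    }
    out
}
```

Both functions return the same result up to rounding; the single-pass form reads each key and value exactly once, which is what matters when the KV cache lives in GPU memory.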

Architecture — Hybrid ANE + Metal GPU + CPU Pipeline

                    ┌─────────────┐
                    │  GGUF Model │
                    │  (Q8/Q4_0)  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
         ┌────▼────┐  ┌───▼───┐  ┌────▼────┐
         │   ANE   │  │  CPU  │  │  Metal  │
         │ Prefill │  │ NEON  │  │   GPU   │
         │ 33 tk/s │  │ 23t/s │  │  32t/s  │
         └─────────┘  └───────┘  └─────────┘
              │            │            │
              │     ┌──────┴──────┐     │
              │     │ DeltaNet    │     │
              │     │ Recurrence  │     │
              │     │ (sequential)│     │
              │     └─────────────┘     │
              │                         │
              └────────┬────────────────┘
                       │
                  ┌────▼─────┐
                  │ Tokenizer│
                  │ (BPE)    │
                  └──────────┘

Qwen3.5-2B Hybrid Architecture

24 layers: 18 DeltaNet (linear attention + SSM recurrence) + 6 Full Attention (GQA)
DeltaNet: O(1) per token, 128-dim recurrent state, conv1d with kernel=4
Full Attention: 8 Q heads, 2 KV heads, head_dim=256, partial RoPE
FFN: SwiGLU, dim=2048 → hidden=6144
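
The "O(1) per token" property of the DeltaNet layers comes from updating a fixed-size state matrix instead of growing a KV cache. The sketch below follows the decay / recall / delta / update / query order named for the `deltanet_recurrence` shader, using the standard gated delta rule. It is an illustrative single-head CPU version: the function names, the state layout, and the exact gate parameterisation are assumptions, not code from this repository.

```rust
/// Gate values, following the formula listed for `compute_beta_decay`:
/// beta = sigmoid(beta_raw), decay = exp(a * softplus(alpha + bias)).
/// (Assumption: `a` is negative, so decay falls in (0, 1).)
fn beta_and_decay(beta_raw: f32, a: f32, alpha: f32, bias: f32) -> (f32, f32) {
    let sigmoid = 1.0 / (1.0 + (-beta_raw).exp());
    let softplus = (1.0 + (alpha + bias).exp()).ln();
    (sigmoid, (a * softplus).exp())
}

/// One gated delta-rule step for a single head.
/// `state` is row-major [d_k * d_v]; `q` and `k` have length d_k, `v` has length d_v.
fn deltanet_step(state: &mut [f32], q: &[f32], k: &[f32], v: &[f32], beta: f32, decay: f32) -> Vec<f32> {
    let (d_k, d_v) = (k.len(), v.len());
    assert_eq!(state.len(), d_k * d_v);
    // 1. decay: fade the old memory
    for s in state.iter_mut() {
        *s *= decay;
    }
    // 2. recall: what the state currently returns for key k
    let mut recalled = vec![0.0f32; d_v];
    for i in 0..d_k {
        for j in 0..d_v {
            recalled[j] += k[i] * state[i * d_v + j];
        }
    }
    // 3. delta: correction towards the new value, scaled by beta
    let delta: Vec<f32> = v.iter().zip(&recalled).map(|(v, r)| beta * (v - r)).collect();
    // 4. update: rank-1 write, state += k (outer product) delta
    for i in 0..d_k {
        for j in 0..d_v {
            state[i * d_v + j] += k[i] * delta[j];
        }
    }
    // 5. query: read the output with q
    let mut out = vec![0.0f32; d_v];
    for i in 0..d_k {
        for j in 0..d_v {
            out[j] += q[i] * state[i * d_v + j];
        }
    }
    out
}
```

With `beta = 1`, `decay = 1`, and a unit-norm key, writing a second value under the same key overwrites the first, which is the "delta" behaviour that distinguishes this rule from plain linear attention. The cost per token depends only on the state size, never on the sequence length.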

Building

# Prerequisites: Rust, Xcode Command Line Tools
git clone https://github.com/youruser/ane-infer
cd ane-infer

# Compile Metal shaders
cd crates/engine/metal
xcrun -sdk macosx metal -c q8_gemv.metal -o q8_gemv.air
xcrun -sdk macosx metal -c deltanet.metal -o deltanet.air
xcrun -sdk macosx metal -c attention.metal -o attention.air
xcrun -sdk macosx metal -c q4_gemv.metal -o q4_gemv.air
xcrun -sdk macosx metallib q8_gemv.air deltanet.air attention.air q4_gemv.air -o q8_gemv.metallib
cd ../../..

# Build
cargo build --release

# Download model (Q8_0)
# Place at ~/models/Qwen3.5-2B-Q8_0.gguf

Usage

# Generate text
ane-infer generate -m model.gguf -p "The capital of France is" --max-tokens 256 --temp 0.7

# Full benchmark suite
ane-infer bench -m model.gguf --prompt-tokens 128 --gen-tokens 32

# Test ANE hardware
ane-infer test-ane

# ANE throughput benchmark
ane-infer bench-ane

# Model info
ane-infer info -m model.gguf

File Structure

crates/
├── ane-bridge/           # ANE private framework FFI
│   ├── objc/
│   │   ├── ane_runtime.m        # _ANEClient, compile/eval/free lifecycle
│   │   ├── ane_runtime.h        # C ABI for Rust FFI
│   │   ├── coreml_probe.m       # CoreML MLProgram reverse engineering
│   │   ├── chaining_e2e.m       # ANE chaining end-to-end test
│   │   ├── iokit_probe.m        # IOKit H11ANE direct access
│   │   └── test_fused_ffn.m     # Fused FFN mega-kernel test
│   └── src/lib.rs               # Safe Rust wrappers (AneKernel, weight blobs)
├── mil-gen/              # MIL program text generation
│   └── src/
│       ├── lib.rs               # MIL header/footer, conv op helper
│       ├── mega.rs              # Fused FFN, dual/triple projections
│       ├── attention.rs         # QKV, output projection
│       └── ffn.rs               # FFN up/down projections
├── engine/               # Core inference engine
│   ├── metal/
│   │   ├── q8_gemv.metal        # Q8_0 GEMV + SiLU (optimized)
│   │   ├── q4_gemv.metal        # Q4_0 GEMV (tiled + simple)
│   │   ├── deltanet.metal       # DeltaNet recurrence shaders (9 kernels)
│   │   └── attention.metal      # RoPE, SDPA, gating (4 kernels)
│   └── src/
│       ├── metal_graph.rs       # GpuContext, GpuGraph, all pipeline states
│       ├── gpu_full_decode.rs   # Full-GPU token decode (ONE cmd buffer)
│       ├── gpu_decode.rs        # GPU weight upload, GpuBuffer types
│       ├── ane_prefill.rs       # ANE batched prefill with mega-kernels
│       ├── deltanet.rs          # CPU DeltaNet recurrence (NEON)
│       ├── q8_gemv.rs           # CPU Q8/Q4 GEMV (rayon parallel)
│       ├── model.rs             # Model weight types, config
│       ├── tokenizer.rs         # GPT-2 BPE tokenizer
│       └── scratch.rs           # Pre-allocated scratch buffers
├── gguf/                 # GGUF file parser
│   └── src/
│       ├── parser.rs            # GGUF v2/v3 parsing
│       ├── to_ane.rs            # Tensor extraction helpers
│       └── dequant.rs           # Q4/Q8/Q6K dequantization
└── cli/                  # CLI binary
    └── src/main.rs              # Commands: generate, bench, bench-ane, test-ane, info
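
As a rough idea of where a GGUF v2/v3 parser starts, the sketch below reads the fixed 24-byte file header: the 4-byte magic `GGUF`, a little-endian u32 version, then u64 tensor and metadata key-value counts. It is a standalone illustration; the struct and function names are hypothetical and are not the API of `parser.rs`.

```rust
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn read_u32_le(b: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&b[..4]);
    u32::from_le_bytes(buf)
}

fn read_u64_le(b: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&b[..8]);
    u64::from_le_bytes(buf)
}

/// Parse the fixed-size header at the start of a GGUF v2/v3 file.
fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("too short for a GGUF header".to_string());
    }
    if !bytes.starts_with(b"GGUF") {
        return Err("bad magic, not a GGUF file".to_string());
    }
    let version = read_u32_le(&bytes[4..]);
    if version != 2 && version != 3 {
        return Err(format!("unsupported GGUF version {}", version));
    }
    Ok(GgufHeader {
        version,
        tensor_count: read_u64_le(&bytes[8..]),
        metadata_kv_count: read_u64_le(&bytes[16..]),
    })
}
```

The metadata key-value pairs and tensor descriptors follow immediately after these 24 bytes; a full parser would continue reading from offset 24.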

Limitations — Apple Neural Engine Private API Caveats

Private APIs: Uses _ANEClient, _ANEInMemoryModel, etc. Will break on macOS updates.
Q6K dequant: Partially broken — Q4 models with Q6K embeddings produce degraded output.
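No speculative decoding: Using the same model as both draft and verifier gives no speedup, because drafting costs about as much as verifying. A separate, much smaller draft model is needed.
Sequential recurrence: During prefill the DeltaNet state is updated one token at a time, so an L-token prompt takes O(L) sequential steps. The chunked parallel algorithm (FLA) is not yet implemented.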
FullAttention prefill: Not yet batched on ANE — only DeltaNet layers use ANE prefill.
Single sequence: No batched inference (batch_size=1 only).

Acknowledgments

maderix/ANE — The breakthrough project that reverse-engineered ANE training. We built on their _ANEInMemoryModelDescriptor, weight blob format, and MIL compilation pipeline.
hollance/neural-engine — Comprehensive ANE documentation.
eiln/ane — Linux ANE driver reverse engineering.
llama.cpp — Metal Q8 GEMV shader patterns, GGUF format, performance reference.
Flash Linear Attention — Chunked parallel DeltaNet algorithm reference.
metalQwen3 — Metal GPU inference reference for Qwen.

Disclaimer

This project uses Apple's private, undocumented frameworks (AppleNeuralEngine.framework). These APIs have no stability guarantee and may change or break with any macOS update. Use at your own risk. Not affiliated with Apple.

License

MIT