README.md

June 11, 2026

TokenSpeed is a speed-of-light LLM inference engine designed for agentic workloads, with TensorRT-LLM-level performance and vLLM-level usability. Our goal is to be the most performant inference engine for production agentic workloads.

Core components:

Modeling layer: local-SPMD design with a static compiler that generates collective communication from module-boundary placement annotations, so users do not hand-write parallelism logic.
Scheduler: C++ control plane and Python execution plane. Request lifecycle, KV cache ownership, and overlap timing are encoded as a finite-state machine, with safe KV resource reuse enforced by the type system at compile time.
Kernels: pluggable, layered kernel system with a portable public API and a centralized registry, including one of the fastest MLA (Multi-head Latent Attention) implementations on Blackwell for agentic workloads.
Entrypoint: SMG-integrated AsyncLLM for low-overhead CPU-side request handling.
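
To make the modeling-layer idea concrete, here is a minimal sketch of how collectives could be derived from placement annotations at module boundaries. It is an illustration only, not TokenSpeed's API: `Placement`, `Module`, and `plan_collectives` are hypothetical names, and the sketch assumes all sharded tensors are split along the same dimension. The transition rules themselves (partial to replicated needs an all-reduce, sharded to replicated needs an all-gather, and so on) are the standard ones for tensor parallelism.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Placement(Enum):
    REPLICATE = "replicate"  # every rank holds the full tensor
    SHARD = "shard"          # each rank holds one slice of the tensor
    PARTIAL = "partial"      # each rank holds a partial sum of the full tensor


# Collective needed to convert a producer's output placement into the
# placement the consumer expects. None means a purely local operation.
_TRANSITIONS = {
    (Placement.PARTIAL, Placement.REPLICATE): "all_reduce",
    (Placement.PARTIAL, Placement.SHARD): "reduce_scatter",
    (Placement.SHARD, Placement.REPLICATE): "all_gather",
    (Placement.REPLICATE, Placement.SHARD): None,  # each rank slices locally
}


@dataclass(frozen=True)
class Module:
    """Hypothetical module description carrying boundary annotations."""
    name: str
    expects: Placement   # placement required at the module's input boundary
    produces: Placement  # placement of the module's output


def plan_collectives(modules: List[Module]) -> List[str]:
    """Walk adjacent module pairs and emit the collectives their boundaries need."""
    plan = []
    for producer, consumer in zip(modules, modules[1:]):
        src, dst = producer.produces, consumer.expects
        if src == dst:
            continue  # placements already agree, nothing to insert
        if (src, dst) not in _TRANSITIONS:
            raise ValueError(
                f"no conversion from {src.value} to {dst.value} "
                f"between {producer.name} and {consumer.name}"
            )
        op = _TRANSITIONS[(src, dst)]
        if op is not None:
            plan.append(f"{op}({producer.name} -> {consumer.name})")
    return plan


# A tensor-parallel MLP: column-parallel up projection, row-parallel down
# projection, then a layer that needs the full activation on every rank.
mlp = [
    Module("up_proj", expects=Placement.REPLICATE, produces=Placement.SHARD),
    Module("down_proj", expects=Placement.SHARD, produces=Placement.PARTIAL),
    Module("norm", expects=Placement.REPLICATE, produces=Placement.REPLICATE),
]
print(plan_collectives(mlp))
```

In this sketch the model author only states what each module expects and produces. The single all-reduce after `down_proj` falls out of the annotations rather than being written by hand, which is the property the modeling layer describes.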

News

[2026/05] 🚀 TokenSpeed hits 580 TPS on Qwen3.5-397B-A17B for agentic workloads. [blog]
[2026/05] TokenSpeed announced — a speed-of-light LLM inference engine for agentic workloads. [blog]

Performance Comparison

TokenSpeed vs. TensorRT-LLM Pareto curves on an agentic workload (Kimi K2.5, B200)

Documentation

Start here: