
June 11, 2026

TokenSpeed: Tokens at the speed of light

TokenSpeed is a speed-of-light LLM inference engine designed for agentic workloads, with TensorRT-LLM-level performance and vLLM-level usability. Our goal is to be the most performant inference engine for production agentic workloads.

Core components:

  • Modeling layer: local-SPMD design with a static compiler that generates collective communication from module-boundary placement annotations, so users do not hand-write parallelism logic.
  • Scheduler: C++ control plane and Python execution plane. Request lifecycle, KV cache ownership, and overlap timing are encoded as a finite-state machine, with safe KV resource reuse enforced by the type system at compile time.
  • Kernels: pluggable, layered kernel system with a portable public API and a centralized registry, including one of the fastest MLA (Multi-head Latent Attention) implementations on Blackwell for agentic workloads.
  • Entrypoint: SMG-integrated AsyncLLM for low-overhead CPU-side request handling.
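To make the scheduler idea concrete, here is a minimal sketch of a request lifecycle modeled as a finite-state machine that also tracks KV cache ownership. It is not TokenSpeed's implementation or API: the state names, `KVPool`, `Request`, and `advance` are all hypothetical and chosen only for illustration.

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()    # queued, owns no KV blocks
    PREFILL = auto()    # owns KV blocks, prompt is being processed
    DECODE = auto()     # owns KV blocks, tokens are being generated
    FINISHED = auto()   # KV blocks have been returned to the pool


# Allowed transitions (hypothetical); anything else is treated as a bug.
TRANSITIONS = {
    State.WAITING: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.DECODE, State.FINISHED},
    State.FINISHED: set(),
}


class KVPool:
    """Hypothetical fixed-size pool of KV cache block ids."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            raise RuntimeError("out of KV blocks")
        blocks, self.free = self.free[:n], self.free[n:]
        return blocks

    def release(self, blocks):
        self.free.extend(blocks)


class Request:
    """Hypothetical request whose KV ownership follows its state."""

    def __init__(self, pool, num_blocks):
        self.pool = pool
        self.num_blocks = num_blocks
        self.state = State.WAITING
        self.blocks = []

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        if new_state is State.PREFILL:
            # Blocks are acquired exactly once, on entering PREFILL.
            self.blocks = self.pool.allocate(self.num_blocks)
        elif new_state is State.FINISHED:
            # Blocks are returned exactly once, on entering FINISHED.
            self.pool.release(self.blocks)
            self.blocks = []
        self.state = new_state
```

This sketch rejects illegal transitions at runtime. The compile-time enforcement described above relies on the type system of the C++ control plane, which a Python example cannot reproduce.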

News

  • [2026/05] 🚀 TokenSpeed hits 580 TPS on Qwen3.5-397B-A17B for agentic workloads. [blog]
  • [2026/05] TokenSpeed announced, a speed-of-light LLM inference engine for agentic workloads. [blog]

Performance Comparison

TokenSpeed vs. TensorRT-LLM Pareto curves on an agentic workload (Kimi K2.5, B200)

Documentation

Start here: