grout

May 8, 2026

A Qwen 3 inference engine written in Rust, built using cutile-rs.

Usage

  1. Follow the installation guide of cutile-rs.
  2. Configure your environment variables for cutile-rs.
     • Set CUDA_TOOLKIT_PATH to your CUDA 13.2 install directory.

    CUDA_TOOLKIT_PATH=/usr/local/cuda-13 cargo run --release -- --model <path-to-qwen3-model> --prompt "Hello, how are you?" --max-new-tokens 128

Options

| Flag | Description |
|------|-------------|
| `--model <path>` | Path to model directory (safetensors + config.json) |
| `--prompt <text>` | Input prompt |
| `--max-new-tokens <n>` | Number of tokens to generate (default: 128) |
| `--max-seq-len <n>` | Override max sequence length |
| `--sample` | Enable sampling (default: greedy) |
| `--raw-prompt` | Skip chat template, use prompt as-is |
| `--device-argmax` | Run argmax on device |
| `--profile` | Print per-step profiling report |
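
To illustrate the difference between the default greedy mode and `--sample`, here is a minimal host-side sketch in plain Rust. It is not grout's implementation: the function names `argmax` and `sample` are hypothetical, the sampler draws from a plain softmax with no temperature or top-k, and the caller supplies the uniform random number `u` so the example needs no external crates.

```rust
/// Greedy decoding: return the index of the largest logit.
/// Ties resolve to the lowest index; an empty slice yields None.
/// Assumes the logits contain no NaN values.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        match best {
            // Keep the current best when the new value is not strictly larger.
            Some((_, b)) if x <= b => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}

/// Sampling: draw an index from softmax(logits), using `u`, a
/// caller-supplied uniform random number in [0, 1).
fn sample(logits: &[f32], u: f32) -> Option<usize> {
    if logits.is_empty() {
        return None;
    }
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Walk the cumulative distribution until it passes u * total.
    let target = u * total;
    let mut acc = 0.0f32;
    for (i, w) in weights.iter().enumerate() {
        acc += w;
        if target < acc {
            return Some(i);
        }
    }
    // Guard against rounding at the upper edge of the distribution.
    Some(logits.len() - 1)
}
```

Greedy selection is deterministic for a given set of logits, while sampling can return different tokens for the same logits depending on the random draw.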

Environment Variables

| Variable | Description |
|----------|-------------|
| `GROUT_CUDA_GRAPH_DECODE` | Set to 1 to enable CUDA graph capture for decode |
| `GROUT_CUBLAS_COMPUTE16` | Set to 1 to use FP16 accumulation in cuBLAS |
| `GROUT_CUBLAS_COMPUTE16_MAX_M` | Max M dimension for FP16 compute |
| `GROUT_CUBLAS_FAST_ALGO` | cuBLAS algorithm selection |
| `GROUT_ATTN_BN_DECODE` | Attention block size for decode |
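
As a rough sketch of how "set to 1" switches and numeric settings like these are commonly read in Rust, the example below uses only `std::env`. It is an assumption about the general pattern, not grout's actual code: `DecodeSettings`, `flag_enabled`, `parse_limit`, `settings_from`, and `settings_from_env` are hypothetical names, and grout may accept other values or apply different defaults.

```rust
use std::env;

/// Interpret a raw variable value as an on/off switch.
/// In this sketch only the exact value "1" enables the feature.
fn flag_enabled(value: Option<&str>) -> bool {
    matches!(value.map(str::trim), Some("1"))
}

/// Parse an optional numeric setting, such as a maximum M dimension.
/// Unset or unparsable values yield None.
fn parse_limit(value: Option<&str>) -> Option<usize> {
    value.and_then(|v| v.trim().parse().ok())
}

/// Hypothetical settings struct; the field names are illustrative only.
#[derive(Debug, PartialEq)]
struct DecodeSettings {
    cuda_graph_decode: bool,
    cublas_compute16: bool,
    cublas_compute16_max_m: Option<usize>,
}

/// Build settings from any lookup function, which keeps the parsing
/// logic testable without touching the real process environment.
fn settings_from(lookup: impl Fn(&str) -> Option<String>) -> DecodeSettings {
    DecodeSettings {
        cuda_graph_decode: flag_enabled(lookup("GROUT_CUDA_GRAPH_DECODE").as_deref()),
        cublas_compute16: flag_enabled(lookup("GROUT_CUBLAS_COMPUTE16").as_deref()),
        cublas_compute16_max_m: parse_limit(
            lookup("GROUT_CUBLAS_COMPUTE16_MAX_M").as_deref(),
        ),
    }
}

/// Read the same settings from the process environment.
fn settings_from_env() -> DecodeSettings {
    settings_from(|name| env::var(name).ok())
}
```

Passing the lookup as a closure is a design choice for this sketch: the same parsing can be exercised with a fixed map of values, and `settings_from_env` is a thin wrapper over `std::env::var`.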

License

Apache-2.0