Roadmap

January 22, 2024

Functionality

  • Batched inference
  • Fine-grained KV cache management
  • Explore tree sparsity
  • Fine-tune Medusa heads together with LM head from scratch
  • Distill from any model without access to the original training data

Integration

Local Deployment

  • mlc-llm
  • exllama
  • llama.cpp

Serving

  • vllm
  • lightllm
  • TGI
  • TensorRT
