Roadmap

January 22, 2024

Functionality

Batched inference
Fine-grained KV cache management
Explore tree sparsity
Fine-tune Medusa heads together with LM head from scratch
Distill from any model without access to the original training data

Integration

Local Deployment

mlc-llm
exllama
llama.cpp

Serving

vllm
lightllm
TGI
TensorRT
