March 17, 2025
UMbreLLa: Deploying LLMs for Personal Agents
News
[2025/03/17] Support QwQ-32B in FP8, achieving 7.54 tokens/sec on 1 x RTX 4090!
[2025/03/17] Support QwQ-32B-AWQ in INT4, achieving 67.98 tokens/sec on 1 x RTX 4090 and 6.04 tokens/sec on 1 x RTX 3070 (with PCIe 3.0)!
1 Models Supported and Benchmarks
The throughput is measured with a batch size of 1 to directly mirror the user experience.
1.1 MT Bench
| GPU | Model | Draft | Stochastic (tokens/sec) | Greedy (tokens/sec) |
|---|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.2 | 8.6 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.0 | 7.4 |
| RTX 4090 | Llama3.1-8B-Instruct | Llama3.2-1B-Instruct | 100.7 | 108.1 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.4 | 8.4 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 6.7 | 7.2 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.5 | 6.1 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.2 | 5.5 |
| L40 | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 37.0 | 38.5 |
| L40 | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 36.3 | 37.1 |
1.2 Code Completion
Evaluated on ananyarn/Algorithm_and_Python_Source_Code.
| GPU | Model | Draft | Throughput (tokens/sec) |
|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.4 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.2 |
| RTX 4090 | Llama3.1-8B-Instruct | CodeDrafter-500M | 174.8 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.2 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.1 |
| RTX 4080 SUPER | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 195.3 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.7 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.6 |
| RTX 4070 Ti | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 162.3 |
| L40 | Llama3.1-70B-Instruct-AWQ | CodeDrafter-500M | 45.6 |
| L40 | Llama3.3-70B-Instruct-AWQ | CodeDrafter-500M | 45.0 |
Offloading results depend heavily on PCIe performance and may vary across instances.
Note: UMbreLLa is not designed for large-scale LLM serving.
2 Deploying your LLMs with UMbreLLa
2.1 Install
conda create -n umbrella python=3.10
bash install.sh
2.2 CLI Chatbot
cd app
python chatbot.py --configuration ../configs/chat_config_24gb.json
Then you can chat with the LLM specified in chat_config_24gb.json.
2.3 Gradio Chatbot
cd app
python gradio_chat.py --configuration ../configs/chat_config_24gb.json
Then you can chat with the LLM specified in chat_config_24gb.json in Gradio.
2.4 API Server/Client
2.4.1 Server
cd app
python api.py --configuration ../configs/chat_config_24gb.json --max_client 1 --port 65432
configuration specifies the LLM and speculative decoding details.
max_client is the maximum number of clients that can connect to the server.
port is the port the server listens on.
2.4.2 Client
Once the server is running, start a client and connect to it with:
from umbrella.api.client import APIClient
client = APIClient(port=port)  # port must match the server's port
client.run()
To get the LLM output,
input1 = {"context": text1, "max_new_tokens": 512, "temperature": 0.0}
output1 = client.get_output(**input1)
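The request passed to get_output is an ordinary keyword dictionary, so it can be built and checked before it is sent. The following is a minimal sketch: `build_request` is a hypothetical helper, not part of UMbreLLa, and it assumes only the three keys shown above.

```python
def build_request(context, max_new_tokens=512, temperature=0.0):
    """Build keyword arguments for client.get_output (hypothetical helper)."""
    if not context:
        raise ValueError("context must be a non-empty string")
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    return {
        "context": context,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }


request = build_request("Tell me what you know about LSH in 100 words.")
# With a running server and a connected client:
# output = client.get_output(**request)
```

Validating on the client side keeps obviously malformed requests from occupying the server, which accepts only a limited number of clients.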
3 Configure the LLM Engine
{
"model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"offload": true,
"cuda_graph": false,
"max_length": 4096,
"num_cache_layers": 0,
"generation_length": 256,
"max_turns": 12,
"topk": 32,
"temperature": 0.6,
"topp": 0.9,
"repetition_penalty": 1.05,
"growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
"width": 16,
"num_beams": 24,
"depth": 16,
"engine": "dynamic",
"template": "meta-llama3"
}
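A config like this one can be sanity-checked before it is handed to the engine. The sketch below is illustrative only: `check_config` and `REQUIRED_KEYS` are hypothetical names, and the rules it applies (which keys are required, and which keys each engine type needs) are assumptions drawn from the option descriptions in this section, not checks that UMbreLLa itself performs.

```python
import json

# Assumption: the keys treated as required here are chosen for illustration.
REQUIRED_KEYS = {"model", "draft_model", "engine", "max_length", "generation_length"}


def check_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    engine = config.get("engine")
    if engine not in ("static", "dynamic"):
        problems.append("engine must be 'static' or 'dynamic'")
    elif engine == "static" and "growmap_path" not in config:
        # Assumption: the static engine reads its tree from growmap_path.
        problems.append("static engine needs growmap_path")
    elif engine == "dynamic":
        # Assumption: the dynamic engine uses width, num_beams and depth.
        for key in ("width", "num_beams", "depth"):
            if key not in config:
                problems.append("dynamic engine needs " + key)
    if config.get("generation_length", 0) > config.get("max_length", 0):
        problems.append("generation_length exceeds max_length")
    return problems


example = json.loads("""
{
  "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
  "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
  "max_length": 4096,
  "generation_length": 256,
  "growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
  "width": 16,
  "num_beams": 24,
  "depth": 16,
  "engine": "dynamic"
}
""")
problems = check_config(example)
```

Returning a list of messages rather than raising on the first failure makes it easier to fix a hand-edited config in one pass.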
Key Configuration Options
- model: Specifies the target LLM to serve, e.g., "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4".
- draft_model: Lightweight draft model, e.g., "meta-llama/Llama-3.2-1B-Instruct".
- offload: Enables offloading of the target model to host memory or disk (true or false).
- cuda_graph: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).
- max_length: The maximum token length for input and output combined.
- num_cache_layers: Sets the number of layers cached during inference (e.g., for memory optimization).
- generation_length: Maximum length of generated responses in tokens.
- max_turns: Limits the number of conversational turns retained in memory.
- topk: Limits token selection during generation to the top k most likely tokens.
- temperature: Controls randomness in token selection (lower values = more deterministic outputs).
- topp: Enables nucleus sampling by limiting token selection to those with cumulative probability ≤ p.
- repetition_penalty: Penalizes repetitive text generation (values > 1 discourage repetition).
- growmap_path: Path to the speculative decoding tree used by the static engine (e.g., "../umbrella/trees/sequoia_tree-3x4.json").
Dynamic Engine-Specific Hyperparameters
- engine: Defines the decoding strategy. Choose between:
  - "static": Optimized for on-device execution.
  - "dynamic": Designed for offloading scenarios.
- width, num_beams, depth: Hyperparameters for speculative decoding in dynamic engines.
Prompt Template
- template: Defines the structure for input prompts. Supported values include:
  - "llama3-code": Optimized for code-related tasks.
  - "meta-llama3": General-purpose instruction-following template.
⚠️ Notice: width, num_beams, depth, and growmap_path must be tuned for each GPU. Several examples are provided in ./configs and ./umbrella/trees.
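The sampling options listed under Key Configuration Options (temperature, topk, topp) are commonly applied in sequence: scale the logits by the temperature, keep the top k tokens, then keep the smallest prefix of those whose cumulative probability reaches p. The pure-Python sketch below illustrates that generic order. It is not UMbreLLa's implementation, `filter_probs` is a hypothetical name, and treating a temperature of 0 as greedy selection is an assumption.

```python
import math


def filter_probs(logits, temperature=0.6, topk=32, topp=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    if temperature <= 0.0:
        # Assumption: temperature 0 means greedy decoding (pick the argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:topk]

    # Top-p: keep the shortest prefix whose cumulative probability reaches topp.
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= topp:
            break

    # Renormalize over the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = filter_probs([2.0, 1.0, 0.5, -1.0], temperature=1.0, topk=3, topp=0.8)
```

In this example the two most likely tokens already cover more than 80% of the probability mass, so only token ids 0 and 1 survive the filter.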
4 Basic Usage
4.1 Initialize a Speculation Engine
from umbrella.speculation.auto_engine import AutoEngine
DEVICE = "cuda:0"
# config is a dict of engine settings, such as the JSON shown in Section 3
engine = AutoEngine.from_config(device=DEVICE, **config)
engine.initialize()
4.2 Prefill, Append and Decode
GEN_LEN = 512
text1 = "Tell me what you know about Reinforcement Learning in 100 words."
text2 = "Tell me what you know about LSH in 100 words."
engine.prefill(text1) # The first operation must be prefilling
engine.speculative_decoding(max_new_tokens=GEN_LEN)
engine.append(text2)
engine.speculative_decoding(max_new_tokens=GEN_LEN)
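The same prefill, append, decode pattern extends to any number of turns. The sketch below wraps it in a loop; `run_conversation` and `RecordingEngine` are hypothetical names. The stub only records calls so the example runs without a GPU, and a real initialized engine would be passed in its place.

```python
def run_conversation(engine, prompts, gen_len=512):
    """Prefill with the first prompt, append the rest, decode after each."""
    for turn, prompt in enumerate(prompts):
        if turn == 0:
            engine.prefill(prompt)  # the first operation must be prefilling
        else:
            engine.append(prompt)
        engine.speculative_decoding(max_new_tokens=gen_len)


class RecordingEngine:
    """Stand-in that records calls; replace with a real initialized engine."""

    def __init__(self):
        self.calls = []

    def prefill(self, text):
        self.calls.append(("prefill", text))

    def append(self, text):
        self.calls.append(("append", text))

    def speculative_decoding(self, max_new_tokens):
        self.calls.append(("decode", max_new_tokens))


stub = RecordingEngine()
run_conversation(stub, ["first question", "follow-up"], gen_len=64)
```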
4.3 Other functions for API and Gradio
output = engine.generate(
context=prompt,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
repetition_penalty=repetition_penalty,
)
# Returns a dict containing the token ids and the detokenized text.
# context=prompt (str) can be replaced by input_ids=tokens (list[int]).
stream = engine.generate_stream(
context=prompt,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
repetition_penalty=repetition_penalty,
)
# Returns a stream of detokenized text.
# context=prompt (str) can be replaced by input_ids=tokens (list[int]).
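A stream is typically consumed chunk by chunk so that text appears as soon as it is generated. The sketch below assumes that the stream is an iterable of incremental string chunks. The source does not specify this, so treat it as an assumption and adjust if the stream yields cumulative text or another type. `print_stream` is a hypothetical helper.

```python
import sys


def print_stream(stream, out=sys.stdout):
    """Write each chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in stream:
        out.write(chunk)
        out.flush()  # show partial output immediately
        pieces.append(chunk)
    return "".join(pieces)


# A plain iterator stands in for the value returned by engine.generate_stream(...).
text = print_stream(iter(["Hello", ", ", "world"]))
```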
Reference
@article{chen2024sequoia,
title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
journal={arXiv preprint arXiv:2402.12374},
year={2024}
}
@article{svirschevski2024specexec,
title={SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices},
author={Svirschevski, Ruslan and May, Avner and Chen, Zhuoming and Chen, Beidi and Jia, Zhihao and Ryabinin, Max},
journal={arXiv preprint arXiv:2406.02532},
year={2024}
}