March 17, 2025

UMbreLLa: Deploying LLMs for Personal Agents

UMbreLLa combines offloading, speculative decoding, and quantization, tailored to single-user LLM deployment scenarios. With UMbreLLa, 70B-level models can generate text at a rate comparable to human reading speed on an RTX 4070 Ti, and it performs especially well on coding tasks.

Demo (GIF): deploying a 4-bit Llama3.1-70B model on an RTX 4070 Ti with UMbreLLa.

News 🚀

[2025/03/17] Support QwQ-32B in FP8, achieving 7.54 tokens/sec on 1 x RTX 4090! 🎉

[2025/03/17] Support QwQ-32B-AWQ in INT4, achieving 67.98 tokens/sec on 1 x RTX 4090 and 6.04 tokens/sec on 1 x RTX 3070 (with PCIE3.0)! 🎉

1 Models Supported and Benchmarks

The throughput is measured with a batch size of 1 to directly mirror the user experience.

1.1 MT Bench

GPU	Model	Draft	Stochastic (tokens/sec)	Greedy (tokens/sec)
RTX 4090	Llama3.1-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	7.2	8.6
RTX 4090	Llama3.3-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	7.0	7.4
RTX 4090	Llama3.1-8B-Instruct	Llama3.2-1B-Instruct	100.7	108.1
RTX 4080 SUPER	Llama3.1-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	7.4	8.4
RTX 4080 SUPER	Llama3.3-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	6.7	7.2
RTX 4070 Ti	Llama3.1-70B-Instruct-AWQ	Llama3.2-1B-Instruct	5.5	6.1
RTX 4070 Ti	Llama3.3-70B-Instruct-AWQ	Llama3.2-1B-Instruct	5.2	5.5
L40	Llama3.1-70B-Instruct-AWQ	Llama3.2-1B-Instruct	37.0	38.5
L40	Llama3.3-70B-Instruct-AWQ	Llama3.2-1B-Instruct	36.3	37.1

1.2 Code Completion

Evaluated on ananyarn/Algorithm_and_Python_Source_Code.

GPU	Model	Draft	Throughput (tokens/sec)
RTX 4090	Llama3.1-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	11.4
RTX 4090	Llama3.3-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	11.2
RTX 4090	Llama3.1-8B-Instruct	CodeDrafter-500M	174.8
RTX 4080 SUPER	Llama3.1-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	12.2
RTX 4080 SUPER	Llama3.3-70B-Instruct-AWQ	Llama3.1-8B-Instruct-AWQ	12.1
RTX 4080 SUPER	Llama3.1-8B-Instruct-AWQ	CodeDrafter-500M	195.3
RTX 4070 Ti	Llama3.1-70B-Instruct-AWQ	Llama3.2-1B-Instruct	9.7
RTX 4070 Ti	Llama3.3-70B-Instruct-AWQ	Llama3.2-1B-Instruct	9.6
RTX 4070 Ti	Llama3.1-8B-Instruct-AWQ	CodeDrafter-500M	162.3
L40	Llama3.1-70B-Instruct-AWQ	CodeDrafter-500M	45.6
L40	Llama3.3-70B-Instruct-AWQ	CodeDrafter-500M	45.0

Offloading results depend heavily on PCIe bandwidth and may vary from one machine or instance to another.
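
A rough back-of-envelope estimate shows why the PCIe link matters. The sketch below assumes that every offloaded weight byte crosses the bus once per target-model forward pass, that compute time is negligible, and that speculative decoding yields some average number of accepted tokens per pass. The function name, the 25 GB/s bandwidth, and the 8 tokens per pass are illustrative assumptions, not measurements from UMbreLLa.

```python
def offload_throughput_estimate(
    n_params: float,
    bits_per_weight: float,
    pcie_gb_per_s: float,
    tokens_per_pass: float,
    resident_fraction: float = 0.0,
) -> tuple[float, float]:
    """Return (seconds_per_target_pass, tokens_per_second).

    Hypothetical model: weight transfer dominates and compute is ignored.
    resident_fraction is the share of weights assumed to stay on the GPU.
    """
    if not 0.0 <= resident_fraction < 1.0:
        raise ValueError("resident_fraction must be in [0, 1)")
    weight_bytes = n_params * bits_per_weight / 8
    streamed_bytes = weight_bytes * (1.0 - resident_fraction)
    seconds_per_pass = streamed_bytes / (pcie_gb_per_s * 1e9)
    return seconds_per_pass, tokens_per_pass / seconds_per_pass


# 70B parameters at 4 bits, an assumed 25 GB/s effective link,
# and an assumed 8 accepted tokens per target pass.
sec, tps = offload_throughput_estimate(70e9, 4, 25.0, 8.0)
print(f"{sec:.2f} s per pass, {tps:.2f} tokens/sec")
```

Under this simplified model, halving the link bandwidth doubles the time per pass, which is why the same GPU can produce different numbers on different hosts.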

❌ UMbreLLa is not designed for large-scale LLM serving.

2 Deploying your LLMs with UMbreLLa

2.1 Install

conda create -n umbrella python=3.10
conda activate umbrella
bash install.sh

2.2 CLI Chatbot

cd app
python chatbot.py --configuration ../configs/chat_config_24gb.json

Then you can chat with the LLM specified in chat_config_24gb.json.

2.3 Gradio Chatbot

cd app
python gradio_chat.py --configuration ../configs/chat_config_24gb.json

Then you can chat with the LLM specified in chat_config_24gb.json in Gradio.

2.4 API Server/Client

2.4.1 Server

cd app
python api.py --configuration ../configs/chat_config_24gb.json --max_client 1 --port 65432

configuration specifies the LLM and speculative decoding details.

max_client is the maximum number of clients that can connect to the server.

port is the port of the server.

2.4.2 Client

Once the server is running, start a client and connect to it:

from umbrella.api.client import APIClient
port = 65432  # must match the --port passed to the server
client = APIClient(port=port)
client.run()

To get the LLM output:

text1 = "Tell me what you know about Reinforcement Learning in 100 words."
input1 = {"context": text1, "max_new_tokens": 512, "temperature": 0.0}
output1 = client.get_output(**input1)
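
If several prompts need to be sent in sequence, the call above can be wrapped in a small helper. The sketch below relies only on the `get_output(context=..., max_new_tokens=..., temperature=...)` call shown above. The helper name `ask_all` and the `EchoClient` stand-in are hypothetical and exist only so the snippet runs without a server; nothing is assumed about the structure of the value that the real `get_output` returns.

```python
from typing import Any, Iterable


def ask_all(client: Any, prompts: Iterable[str],
            max_new_tokens: int = 512, temperature: float = 0.0) -> list:
    """Send each prompt through client.get_output and collect the results in order."""
    results = []
    for prompt in prompts:
        request = {
            "context": prompt,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        }
        results.append(client.get_output(**request))
    return results


class EchoClient:
    """Hypothetical stand-in for APIClient, used only to exercise ask_all offline."""

    def get_output(self, context, max_new_tokens, temperature):
        return f"[{max_new_tokens} tokens @ T={temperature}] {context}"


outputs = ask_all(EchoClient(), ["Explain PCIe.", "Explain AWQ."], max_new_tokens=64)
print(outputs[0])
```

With a real server running, pass the connected `APIClient` instance in place of `EchoClient()`.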

3 Configure the LLM Engine

{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", 
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "offload": true,
    "cuda_graph": false,
    "max_length": 4096,
    "num_cache_layers": 0,
    "generation_length": 256,
    "max_turns": 12,
    "topk": 32,
    "temperature": 0.6,
    "topp": 0.9,
    "repetition_penalty": 1.05,
    "growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
    "width": 16,
    "num_beams": 24,
    "depth": 16,
    "engine": "dynamic",
    "template": "meta-llama3"
}
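
Because the engine is configured from a plain JSON file, it can help to sanity-check the values before launching. The sketch below is not part of UMbreLLa: `check_config` and its rules are assumptions inferred from the option descriptions in this section (for example, that `generation_length` should not exceed `max_length`, and that `engine` is either "static" or "dynamic").

```python
import json


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("model", "draft_model", "engine", "template"):
        if not isinstance(config.get(key), str) or not config[key]:
            problems.append(f"{key} must be a non-empty string")
    if config.get("engine") not in ("static", "dynamic"):
        problems.append('engine must be "static" or "dynamic"')
    # max_length covers input and output combined, so the output budget must fit in it.
    if config.get("generation_length", 0) > config.get("max_length", 0):
        problems.append("generation_length exceeds max_length")
    if config.get("temperature", 0.0) < 0.0:
        problems.append("temperature must be non-negative")
    if not 0.0 < config.get("topp", 1.0) <= 1.0:
        problems.append("topp must be in (0, 1]")
    return problems


raw = """{"model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
  "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
  "engine": "dynamic", "template": "meta-llama3",
  "max_length": 4096, "generation_length": 256,
  "temperature": 0.6, "topp": 0.9}"""
config = json.loads(raw)
print(check_config(config))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-edited config in a single pass.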

Key Configuration Options

model: Specifies the target LLM to serve, e.g., "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4".
draft_model: Lightweight draft model, e.g., "meta-llama/Llama-3.2-1B-Instruct".
offload: Enables offloading of the target model to host memory or disk (true or false).
cuda_graph: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).
max_length: The maximum token length for input and output combined.
num_cache_layers: Sets the number of layers cached during inference (e.g., for memory optimization).
generation_length: Maximum length of generated responses in tokens.
max_turns: Limits the number of conversational turns retained in memory.
topk: Limits token selection during generation to the top k most likely tokens.
temperature: Controls randomness in token selection (lower values = more deterministic outputs).
topp: Enables nucleus sampling by limiting token selection to those with cumulative probability ≤ p.
repetition_penalty: Penalizes repetitive text generation (values > 1 discourage repetition).
growmap_path: Path to the speculative decoding tree used by the static engine (e.g., "../umbrella/trees/sequoia_tree-3x4.json").
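
To make the sampling options above concrete, the following sketch applies repetition penalty, temperature, top-k, and top-p to a list of logits in the order commonly used by open-source samplers. It is a generic illustration using only the standard library, not UMbreLLa's implementation; the function name is hypothetical, and the top-p step keeps the smallest set of tokens whose cumulative probability reaches `topp`, which is one common convention.

```python
import math


def filter_distribution(logits, temperature=0.6, topk=32, topp=0.9,
                        repetition_penalty=1.05, previous_tokens=()):
    """Return {token_id: probability} after the usual sampling filters."""
    logits = list(logits)
    # Repetition penalty: make tokens that were already generated less likely.
    for t in set(previous_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature 0 is treated as greedy decoding: all mass on the best token.
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:topk]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches topp.
    kept, cumulative = [], 0.0
    for i, w in zip(ranked, weights):
        p = w / total
        kept.append((i, p))
        cumulative += p
        if cumulative >= topp:
            break
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}


dist = filter_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, topk=3,
                           topp=0.9, repetition_penalty=1.0)
print(dist)
```

A token would then be drawn from the returned distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.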

Dynamic Engine-Specific Hyperparameters

engine: Defines the decoding strategy. Choose between:
- "static": Optimized for on-device execution.
- "dynamic": Designed for offloading scenarios.
width, num_beams, depth: Hyperparameters for speculative decoding in dynamic engines.
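
For readers new to speculative decoding, the toy below shows the draft-then-verify loop in its simplest form: a linear chain of draft tokens with greedy verification. This is deliberately not UMbreLLa's tree-based static or dynamic engine, and it omits the extra "bonus" token many implementations emit after a fully accepted draft. All names are hypothetical, and `depth` here is only the draft length of this toy, loosely analogous to the `depth` option above.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, depth: int = 4) -> Tuple[List[int], int]:
    """Greedy chain speculative decoding.

    Returns (new_tokens, target_passes). The tokens equal what `target` alone
    would produce; the draft only reduces how many target passes are needed.
    """
    tokens = list(prompt)
    generated = 0
    passes = 0
    while generated < max_new_tokens:
        # 1. The draft model proposes up to `depth` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(depth, max_new_tokens - generated)):
            proposal.append(draft(tokens + proposal))
        # 2. One target pass checks every proposed position. A real engine does
        #    this in a single batched forward pass; the toy calls `target` per position.
        passes += 1
        accepted: List[int] = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)  # the target's token is always kept
            if truth != guess:      # first mismatch: drop the rest of the draft
                break
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[len(prompt):], passes


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if seq[-1] == 5 else toy_target(seq)


new_tokens, target_passes = speculative_greedy(toy_target, toy_draft, [1], 12)
print(new_tokens, target_passes)
```

The closer the draft model tracks the target, the more tokens each target pass accepts, which matters most when each target pass is expensive, as in offloading.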

Prompt Template

template: Defines the structure for input prompts. Supported values include:
- "llama3-code": Optimized for code-related tasks.
- "meta-llama3": General-purpose instruction-following template.

⚠️ Notice: width, num_beams, depth, and growmap_path need to be tuned for the target GPU. Several examples are provided in ./configs and ./umbrella/trees.

4 Basic Usage

4.1 Initialize a Speculation Engine

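import json
from umbrella.speculation.auto_engine import AutoEngine

# Assumes config is a dict loaded from a JSON file like the one in Section 3 (example path).
with open("../configs/chat_config_24gb.json") as f:
    config = json.load(f)
DEVICE = "cuda:0"
engine = AutoEngine.from_config(device=DEVICE, **config)
engine.initialize()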

4.2 Prefill, Append and Decode

GEN_LEN = 512
text1 = "Tell me what you know about Reinforcement Learning in 100 words."
text2 = "Tell me what you know about LSH in 100 words."

engine.prefill(text1) # The first operation must be prefilling
engine.speculative_decoding(max_new_tokens=GEN_LEN)

engine.append(text2)
engine.speculative_decoding(max_new_tokens=GEN_LEN)

4.3 Other functions for API and Gradio

output = engine.generate(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# Returns a dict containing the generated token ids and the detokenized text.
# Instead of context=prompt (str), you can pass input_ids=tokens (list[int]).

stream = engine.generate_stream(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# Returns a stream of detokenized text.
# Instead of context=prompt (str), you can pass input_ids=tokens (list[int]).
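
A streamed result is typically consumed by iterating over it. The sketch below assumes the stream is an iterable that yields incremental string chunks; the text above only states that it contains detokenized text, so check the actual chunk type before relying on this. `drain_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `engine.generate_stream(...)` so the snippet runs without a GPU.

```python
import sys
from typing import Iterable, TextIO


def drain_stream(stream: Iterable[str], out: TextIO = sys.stdout) -> str:
    """Write each text chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in stream:
        out.write(chunk)
        out.flush()  # show partial output immediately
        pieces.append(chunk)
    out.write("\n")
    return "".join(pieces)


def fake_stream():
    """Hypothetical stand-in for engine.generate_stream(...)."""
    yield from ["Speculative ", "decoding ", "works."]


text = drain_stream(fake_stream())
```

Flushing after every chunk keeps the terminal responsive even when chunks arrive slowly.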

Reference

@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}
@article{svirschevski2024specexec,
  title={SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices},
  author={Svirschevski, Ruslan and May, Avner and Chen, Zhuoming and Chen, Beidi and Jia, Zhihao and Ryabinin, Max},
  journal={arXiv preprint arXiv:2406.02532},
  year={2024}
}