Deployment using Llama.cpp
May 28, 2024
This guide demonstrates how to deploy a merged checkpoint created with the AI Workbench-LLaMa-Factory project using the llama.cpp library.
First, convert the merged checkpoint into a (quantized) GGUF binary. Then, run inference using the llama.cpp library or its Python bindings. GGUF binaries generated with this workflow are also compatible with applications such as LMStudio, jan.ai, and text-generation-webUI, as well as any other application that offers a llama.cpp execution backend.
0. Pre-requisites
[!NOTE] Skip this if you have a functional Llama.cpp local environment.
Build Llama.cpp with CUDA acceleration by following the instructions here. Ensure you have the correct pre-requisites.
Clone and build the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
python3 -m pip install -r requirements.txt
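If the build succeeds, the binaries land under build\bin (with the Visual Studio generator and the Release config above, under build\bin\Release). As a quick sanity check, assuming the pre-rename binary names that match the quantize.exe used later in this guide:

build\bin\Release\main.exe --help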
1. HF checkpoint to GGUF conversion
To run inference, first convert the HF checkpoint generated by LLaMa-Factory into the GGUF model format used by llama.cpp. Then quantize the model to the desired quantization level.
python convert-hf-to-gguf.py <path_to_merged_checkpoint> --outfile <path_to_output_gguf>
For example:
python convert-hf-to-gguf.py C:\models\codealpaca-merged --outfile C:\models\codealpaca.gguf
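The converter also accepts an --outtype flag if you want to control the precision of the unquantized GGUF before quantizing; for example, to write an f16 intermediate (same paths as above, assuming a llama.cpp version whose converter supports this flag):

python convert-hf-to-gguf.py C:\models\codealpaca-merged --outfile C:\models\codealpaca.gguf --outtype f16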
Quantize down to Q4:
cd build\bin\Release
quantize.exe C:\models\codealpaca.gguf C:\models\codealpaca_q4.gguf Q4_K_M
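Optionally, smoke-test the quantized model from the command line before moving on to the Python API. A minimal sketch, run from the same directory as quantize.exe; main.exe is the pre-rename name of the llama.cpp CLI example, and -ngl 99 offloads all layers to the GPU:

main.exe -m C:\models\codealpaca_q4.gguf -p "Q: Name the planets in the solar system? A:" -n 32 -ngl 99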
2. Inference using Python API
Set up llama-cpp-python
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
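To confirm the package installed correctly, a quick import check (this only verifies the Python package is importable, not that GPU offload is working):

python -c "import llama_cpp; print(llama_cpp.__version__)"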
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\models\llama-model.gguf",
    n_gpu_layers=-1,   # Offload all layers to the GPU for acceleration
    # seed=1337,       # Uncomment to set a specific seed
    # n_ctx=2048,      # Uncomment to increase the context window
)

output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,      # Generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True           # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output)
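Beyond plain completions, llama-cpp-python also exposes an OpenAI-style chat interface. A minimal sketch reusing the llm object above (the message contents are illustrative):

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=128,  # Cap the length of the reply
)
print(response["choices"][0]["message"]["content"])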