Deployment using Llama.cpp
May 28, 2024
This guide demonstrates how to deploy a merged checkpoint created with the AI Workbench-LLaMa-Factory project using the llama.cpp library.
First, convert the merged checkpoint into a (quantized) GGUF binary. Then, run inference using the llama.cpp library or its Python bindings. GGUF binaries generated with this workflow are also compatible with applications such as LMStudio, jan.ai, and text-generation-webUI, as well as any other application that offers a llama.cpp execution backend.
0. Pre-requisites
[!NOTE] Skip this if you have a functional Llama.cpp local environment.
Build Llama.cpp with CUDA acceleration by following the instructions here. Ensure you have the correct pre-requisites.
Clone and build the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
python3 -m pip install -r requirements.txt
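If the build succeeds, the binaries land under build\bin (with the Visual Studio generator and the Release config above, under build\bin\Release). As a quick sanity check, assuming the pre-rename binary names that match the quantize.exe used later in this guide:

build\bin\Release\main.exe --help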
1. HF checkpoint to GGUF conversion
To run inference, first convert the HF checkpoint generated by LLaMa-Factory into the GGUF model format used by llama.cpp. Then quantize the model to the desired quantization level.
python convert-hf-to-gguf.py <path_to_merged_checkpoint> --outfile <path_to_output_gguf>
For example:
python convert-hf-to-gguf.py C:\models\codealpaca-merged --outfile C:\models\codealpaca.gguf
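The converter also accepts an --outtype flag if you want to control the precision of the unquantized GGUF before quantizing; for example, to write an f16 intermediate (same paths as above, assuming a llama.cpp version whose converter supports this flag):

python convert-hf-to-gguf.py C:\models\codealpaca-merged --outfile C:\models\codealpaca.gguf --outtype f16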
Quantize down to Q4:
cd build\bin\Release
quantize.exe C:\models\codealpaca.gguf C:\models\codealpaca_q4.gguf Q4_K_M
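Optionally, smoke-test the quantized model from the command line before moving on to the Python API. A minimal sketch, run from the same directory as quantize.exe; main.exe is the pre-rename name of the llama.cpp CLI example, and -ngl 99 offloads all layers to the GPU:

main.exe -m C:\models\codealpaca_q4.gguf -p "Q: Name the planets in the solar system? A:" -n 32 -ngl 99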
2. Inference using Python API
Set up llama-cpp-python
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
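To confirm the package installed correctly, a quick import check (this only verifies the Python package is importable, not that GPU offload is working):

python -c "import llama_cpp; print(llama_cpp.__version__)"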
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\models\llama-model.gguf",
    n_gpu_layers=-1,   # Offload all layers to the GPU for acceleration
    # seed=1337,       # Uncomment to set a specific seed
    # n_ctx=2048,      # Uncomment to increase the context window
)

output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,      # Generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True           # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output)
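Beyond plain completions, llama-cpp-python also exposes an OpenAI-style chat interface. A minimal sketch reusing the llm object above (the message contents are illustrative):

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=128,  # Cap the length of the reply
)
print(response["choices"][0]["message"]["content"])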