June 11, 2026

ncnn_llm

LLM, VLM, OCR, translation, and embedding inference on top of ncnn.


ncnn_llm provides a lightweight C++ runtime for running language models and embedding models with ncnn. It focuses on practical local inference on edge devices, desktop CPUs, and Vulkan-capable GPUs.

The project started from nihui's experimental ncnn kvcache work and has since grown to include reusable examples, model loaders, tokenizers, vision preprocessing, OCR inference, and embedding APIs.

Highlights

  • Unified CLI runner for chat and vision-language models
  • KV-cache autoregressive decoding with CPU and optional Vulkan execution
  • Qwen- and MiniCPM-style LLM support
  • Qwen-VL image input support
  • GLM-OCR image-to-text example
  • NLLB translation example
  • Text and multimodal embedding APIs
  • BPE and Unigram tokenizer support
  • xmake-based build with small standalone examples
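
To illustrate what KV-cache autoregressive decoding means in the list above, here is a minimal single-head sketch in plain C++. It is not ncnn_llm's implementation: `ToyKVCache`, `attend`, and `decode_step` are hypothetical names, and the real runtime keeps this state inside ncnn. The point is only that each decode step appends one key/value pair and attends over the stored history instead of recomputing it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy single-head KV cache: one key vector and one value vector per past token.
// Hypothetical type for illustration; not part of ncnn_llm.
struct ToyKVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;

    void append(const std::vector<float>& k, const std::vector<float>& v) {
        keys.push_back(k);
        values.push_back(v);
    }
    std::size_t size() const { return keys.size(); }
};

// Scaled dot-product attention of one query against every cached position.
// Assumes the cache is non-empty and every key has the same length as q.
std::vector<float> attend(const ToyKVCache& cache, const std::vector<float>& q) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));
    std::vector<float> scores(cache.size());
    for (std::size_t t = 0; t < cache.size(); ++t) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < q.size(); ++i) dot += q[i] * cache.keys[t][i];
        scores[t] = dot * scale;
    }

    // Numerically stable softmax over the cached positions.
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        sum += s;
    }

    // Weighted sum of the cached values.
    std::vector<float> out(cache.values[0].size(), 0.0f);
    for (std::size_t t = 0; t < cache.size(); ++t) {
        const float w = scores[t] / sum;
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += w * cache.values[t][i];
    }
    return out;
}

// One decode step: store the new token's key/value, then attend over the
// whole history. Earlier keys and values are reused, never recomputed.
std::vector<float> decode_step(ToyKVCache& cache,
                               const std::vector<float>& q,
                               const std::vector<float>& k,
                               const std::vector<float>& v) {
    cache.append(k, v);
    return attend(cache, q);
}
```

Calling `decode_step` once per generated token grows the cache by one entry each time, so the per-step cost scales with the history length rather than with a full recomputation of every earlier token.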

Supported Models

| Category | Model | Status | Notes |
| --- | --- | --- | --- |
| LLM | YoutuLLM | Supported | Chat / text generation |
| LLM | MiniCPM4 | Supported | Chat / text generation |
| LLM | Qwen3 | Supported | Chat / text generation |
| VLM | Qwen3.5 | Supported | Image + text input |
| VLM | Qwen2.5-VL | Supported | Image + text input |
| OCR | GLM-OCR | Supported | OCR |
| Translation | NLLB | Supported | Translation example |
| Embedding | Jina-Embeddings-v5-Text-Nano | Supported | 768-dim text embeddings |
| Embedding | Jina-CLIP-v2 | Supported | 1024-dim text + image embeddings |

Quick Start

1. Requirements

  • xmake
  • ncnn built from the master branch

2. Clone

git clone https://github.com/futz12/ncnn_llm.git
cd ncnn_llm

3. Build

xmake build

Build a single target:

xmake build llm_ncnn_run

4. Download Models

Download converted ncnn model directories from the mirror:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Put the model directory under assets/, for example:

assets/
└── qwen3_0.6b/
    ├── model.json
    ├── *.ncnn.param
    ├── *.ncnn.bin
    └── tokenizer files

CLI Chat

llm_ncnn_run is the main interactive example for text and vision-language models.

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b

With explicit runtime options:

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --threads 4
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --vulkan --vulkan-device 0

Vision-language input:

xmake run llm_ncnn_run --model ./assets/qwen2.5_vl_3b --image ./assets/test.jpg

CLI Options

| Option | Description |
| --- | --- |
| --model | Model directory |
| --threads | CPU thread count |
| --vulkan | Enable Vulkan compute |
| --vulkan-device | Vulkan device index |
| --image | Image path for VL models |
| --builtin-tools | Enable built-in demo tools |
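
For readers embedding similar options in their own tool, the sketch below shows one way the options in this table could be parsed. It is an illustration only, not the parser used by llm_ncnn_run: `RunOptions` and `parse_options` are hypothetical names, and because the README does not state default values, unset numeric options are left empty rather than guessed.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container for the documented options; not the project's own type.
struct RunOptions {
    std::string model;                 // --model
    std::optional<int> threads;        // --threads (unset if not given)
    bool vulkan = false;               // --vulkan
    std::optional<int> vulkan_device;  // --vulkan-device (unset if not given)
    std::string image;                 // --image
    bool builtin_tools = false;        // --builtin-tools
};

// Parses the arguments that follow the program name. Returns std::nullopt on
// an unknown option, a missing value, or a non-numeric number.
std::optional<RunOptions> parse_options(const std::vector<std::string>& args) {
    RunOptions opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& name = args[i];
        // Flags that take no value.
        if (name == "--vulkan") { opts.vulkan = true; continue; }
        if (name == "--builtin-tools") { opts.builtin_tools = true; continue; }

        // Every remaining option expects exactly one value.
        if (i + 1 >= args.size()) return std::nullopt;
        const std::string& value = args[++i];
        try {
            if (name == "--model") opts.model = value;
            else if (name == "--image") opts.image = value;
            else if (name == "--threads") opts.threads = std::stoi(value);
            else if (name == "--vulkan-device") opts.vulkan_device = std::stoi(value);
            else return std::nullopt;
        } catch (const std::exception&) {
            return std::nullopt;  // std::stoi rejected the value
        }
    }
    return opts;
}
```

Keeping `threads` and `vulkan_device` as `std::optional<int>` lets the caller distinguish "not specified" from an explicit value and apply its own defaults.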

Example session:

llm_ncnn_run (cli). Type 'exit' or 'quit' to end the conversation.
User: Hello
Assistant: Hello! How can I help you today?

OCR

GLM-OCR uses a dedicated image prefill path and the shared text decode runtime.

xmake build ocr_main
xmake run ocr_main --model ./assets/glm_ocr --image ./test_ocr.png --prompt "Read the text in the image."

Example output:

Generating text:
Hello World 123

Embeddings

ncnn_embedding provides a common API for text embeddings and CLIP-style text-image embeddings.

Text Embedding

xmake build embedding_main
xmake run embedding_main --model ./assets/jina-embeddings-v5-text-nano

CLIP Multimodal Embedding

xmake build clip_main
xmake run clip_main --model ./assets/jina_clip_v2 --image ./assets/ganyu.jpg

C++ API

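#include <cstdio>
#include <vector>

#include "ncnn_embedding.h"

int main() {
    ncnn_embedding embed("./assets/jina_clip_v2", false, 4);

    // Encode a sentence into an embedding vector.
    std::vector<float> text_vec = embed.encode_text("Hello world");

    // Only encode an image when the loaded model supports image input.
    if (embed.supports_image()) {
        std::vector<float> image_vec = embed.encode_image_file("./image.jpg");
        float score = cosine_similarity(text_vec, image_vec);
        std::printf("text-image similarity: %f\n", score);
    }
    return 0;
}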
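
The example above compares a text vector and an image vector with `cosine_similarity`. For reference, here is a standalone sketch of that computation. It is an assumed stand-in rather than the function used by ncnn_llm, so it carries the hypothetical name `cosine_similarity_sketch`, and returning 0 for mismatched or zero-length inputs is a choice made here for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical stand-in, not the library's own function. Returns 0 when the
// sizes differ, when the vectors are empty, or when either has zero norm.
float cosine_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0f;

    // Accumulate in double to limit rounding error on long vectors.
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0f;
    return static_cast<float>(dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

The result lies in [-1, 1]. Values closer to 1 mean the two embeddings point in more similar directions, regardless of their magnitudes.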

Other Examples

| Target | Purpose |
| --- | --- |
| llm_ncnn_run | Unified chat / VL CLI |
| ocr_main | GLM-OCR inference |
| embedding_main | Text embedding inference |
| clip_main | CLIP text-image embedding inference |
| nllb_main | NLLB translation example |
| unigram_main | Unigram tokenizer example |
| benchllm | LLM benchmark |
| test_llm | Unit tests |

Build and run tests:

xmake build test_llm
xmake run test_llm

Run benchmark:

xmake build benchllm
xmake run benchllm [loop_count] [num_threads] [powersave] [gpu_device] [cooling_down] [seqlen]

Model Zoo

Converted ncnn model weights are available from:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Each downloaded model directory should contain model.json, ncnn param/bin files, and tokenizer files. Put the directory under assets/ or pass its path with --model.

Configuration

Each model directory is described by model.json. The exact fields depend on the model family, but a typical text model contains:

{
  "model_type": "llm",
  "params": {
    "embed_param": "embed.ncnn.param",
    "embed_bin": "embed.ncnn.bin",
    "decoder_param": "decoder.ncnn.param",
    "decoder_bin": "decoder.ncnn.bin",
    "lm_head_param": "lm_head.ncnn.param",
    "lm_head_bin": "lm_head.ncnn.bin"
  },
  "tokenizer": {
    "type": "bbpe",
    "vocab_file": "vocab.txt",
    "merges_file": "merges.txt"
  },
  "setting": {
    "attn_cnt": 32,
    "hidden_size": 1024,
    "rope": {
      "type": "RoPE",
      "rope_head_dim": 64,
      "rope_theta": 1000000.0
    }
  }
}
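
To show how the `rope` settings are typically used, the sketch below computes rotary position embedding (RoPE) frequencies from `rope_head_dim` and `rope_theta` and rotates one head vector. It follows the standard RoPE formula, with inverse frequency `theta^(-2i/d)` for pair `i`. The function names are hypothetical, and pairing element `i` with element `i + d/2` is an assumption: the layout ncnn_llm actually uses is not stated in this README.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Inverse frequencies theta^(-2i/d) for i = 0 .. d/2 - 1, where d is
// rope_head_dim and theta is rope_theta from model.json.
std::vector<float> rope_inv_freq(int rope_head_dim, double rope_theta) {
    std::vector<float> inv_freq(rope_head_dim / 2);
    for (int i = 0; i < rope_head_dim / 2; ++i) {
        inv_freq[i] = static_cast<float>(
            std::pow(rope_theta, -2.0 * i / rope_head_dim));
    }
    return inv_freq;
}

// Rotates x in place for a token at `position`.
// Assumption: pairs are (x[i], x[i + d/2]); some implementations pair
// adjacent elements (x[2i], x[2i + 1]) instead.
// x must hold at least 2 * inv_freq.size() elements.
void apply_rope(std::vector<float>& x, int position,
                const std::vector<float>& inv_freq) {
    const std::size_t half = inv_freq.size();
    for (std::size_t i = 0; i < half; ++i) {
        const float angle = static_cast<float>(position) * inv_freq[i];
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float a = x[i];
        const float b = x[i + half];
        x[i] = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```

With the values from the example configuration, `rope_inv_freq(64, 1000000.0)` yields 32 frequencies. Each pair is rotated by an angle proportional to the token position, so the vector's length is unchanged.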

Embedding and OCR models use their own model_type and parameter sections. See the model files under assets/ for concrete examples.

Project Layout

ncnn_llm/
├── assets/                 # Local model directories and demo assets
├── benchmark/              # Benchmark entry points
├── examples/               # CLI and feature examples
│   ├── llm_ncnn_run/       # Unified chat / VL runner
│   ├── ocr_main.cpp        # OCR example
│   ├── embedding_main.cpp  # Text embedding example
│   ├── clip_main.cpp       # CLIP example
│   └── nllb_main.cpp       # Translation example
├── export/                 # Export scripts
├── src/                    # Core runtime
│   ├── ncnn_llm_gpt.*      # LLM / VL runtime
│   ├── ncnn_llm_ocr.*      # OCR image prefill + shared decode
│   ├── ncnn_embedding.*    # Embedding runtime
│   ├── ncnn_text_runtime.* # Shared text decode helpers
│   └── utils/              # Tokenizer, image, RoPE, prompt helpers
├── tests/                  # Unit tests
└── xmake.lua               # Build configuration

Roadmap

  • Keep decoder and KV-cache runtime shared across model families
  • Expand supported model architectures and tokenizers
  • Improve Vulkan and CPU performance
  • Add INT8 quantization support
  • Document model export pipelines in more detail

Older export scripts may become outdated as the runtime evolves. Prefer the latest model examples and model.json files as references.

Community

Issues, fixes, converted models, and test results are welcome.

  • QQ group: 767178345

License

Apache License 2.0. See LICENSE.