June 11, 2026

ncnn_llm

LLM, VLM, OCR, translation, and embedding inference on top of ncnn.

Chinese Documentation · Quick Start · Supported Models · Model Zoo

ncnn_llm provides a lightweight C++ runtime for running language models and embedding models with ncnn. It focuses on practical local inference on edge devices, desktop CPUs, and Vulkan-capable GPUs.

The project grew out of nihui's experimental ncnn KV-cache work and expands it into reusable examples, model loaders, tokenizers, vision preprocessing, OCR inference, and embedding APIs.

Highlights

Unified CLI runner for chat and vision-language models
KV-cache autoregressive decoding with CPU and optional Vulkan execution
Qwen- and MiniCPM-style LLM support
Qwen VL image input support
GLM-OCR image-to-text example
NLLB translation example
Text and multimodal embedding APIs
BPE and Unigram tokenizer support
xmake-based build with small standalone examples
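
For readers new to KV-cache decoding, here is a toy sketch of the idea in plain C++. It is not taken from the ncnn_llm sources: `KVCache`, `decode_step`, and the single-head attention are hypothetical simplifications. Each step appends the new token's key and value to the cache and attends over everything cached so far, so earlier tokens are never recomputed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Hypothetical cache: one key and one value vector per token seen so far.
struct KVCache {
    std::vector<Vec> keys;
    std::vector<Vec> values;
};

// One decode step for a single attention head.
// q, k, v belong to the newest token; k and v are stored, then q attends
// over all cached positions with a numerically stable softmax.
Vec decode_step(KVCache& cache, const Vec& q, const Vec& k, const Vec& v) {
    cache.keys.push_back(k);
    cache.values.push_back(v);

    const std::size_t n = cache.keys.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Scaled dot-product score against every cached key.
    Vec scores(n);
    for (std::size_t t = 0; t < n; ++t) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < q.size(); ++i) dot += q[i] * cache.keys[t][i];
        scores[t] = dot * scale;
    }

    // Softmax with the maximum subtracted for stability.
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float denom = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        denom += s;
    }

    // Weighted sum of cached values.
    Vec out(v.size(), 0.0f);
    for (std::size_t t = 0; t < n; ++t)
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] += (scores[t] / denom) * cache.values[t][i];
    return out;
}
```

A real runtime keeps one such cache per layer and head and stores it in tensors rather than nested vectors; the sketch only shows why the per-token cost stays proportional to the cached length.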

Supported Models

Category	Model	Status	Notes
LLM	YoutuLLM	Supported	Chat / text generation
LLM	MiniCPM4	Supported	Chat / text generation
LLM	Qwen3	Supported	Chat / text generation
VLM	Qwen3.5	Supported	Image + text input
VLM	Qwen2.5-VL	Supported	Image + text input
OCR	GLM-OCR	Supported	OCR
Translation	NLLB	Supported	Translation example
Embedding	Jina-Embeddings-v5-Text-Nano	Supported	768-dim text embeddings
Embedding	Jina-CLIP-v2	Supported	1024-dim text + image embeddings

Quick Start

1. Requirements

xmake
ncnn built from the master branch

2. Clone

git clone https://github.com/futz12/ncnn_llm.git
cd ncnn_llm

3. Build

xmake build

Build a single target:

xmake build llm_ncnn_run

4. Download Models

Download converted ncnn model directories from the mirror:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Put the model directory under assets/, for example:

assets/
└── qwen3_0.6b/
    ├── model.json
    ├── *.ncnn.param
    ├── *.ncnn.bin
    └── tokenizer files

CLI Chat

llm_ncnn_run is the main interactive example for text and vision-language models.

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b

With explicit runtime options:

xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --threads 4
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --vulkan --vulkan-device 0

Vision-language input:

xmake run llm_ncnn_run --model ./assets/qwen2.5_vl_3b --image ./assets/test.jpg

CLI Options

Option	Description
`--model`	Model directory
`--threads`	CPU thread count
`--vulkan`	Enable Vulkan compute
`--vulkan-device`	Vulkan device index
`--image`	Image path for VL models
`--builtin-tools`	Enable built-in demo tools
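
To show how the options in this table fit together, here is a small sketch of a parser for the same flags. It is an illustration only and not the parser used by llm_ncnn_run: `CliOptions`, `parse_cli`, the default values, and the error handling are all assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the options listed in the table.
struct CliOptions {
    std::string model;          // --model
    std::string image;          // --image (empty means text-only)
    int threads = 0;            // --threads (0 = let the runtime decide; assumption)
    bool vulkan = false;        // --vulkan
    int vulkan_device = 0;      // --vulkan-device
    bool builtin_tools = false; // --builtin-tools
};

// Parses arguments (without the program name).
// Throws std::invalid_argument on unknown options or missing values.
CliOptions parse_cli(const std::vector<std::string>& args) {
    CliOptions opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        // Fetches the value that follows the current option.
        auto value = [&]() -> const std::string& {
            if (i + 1 >= args.size())
                throw std::invalid_argument(a + " needs a value");
            return args[++i];
        };
        if (a == "--model") opt.model = value();
        else if (a == "--image") opt.image = value();
        else if (a == "--threads") opt.threads = std::stoi(value());
        else if (a == "--vulkan-device") opt.vulkan_device = std::stoi(value());
        else if (a == "--vulkan") opt.vulkan = true;
        else if (a == "--builtin-tools") opt.builtin_tools = true;
        else throw std::invalid_argument("unknown option: " + a);
    }
    return opt;
}
```

In a `main(int argc, char** argv)` the arguments could be passed as `std::vector<std::string>(argv + 1, argv + argc)`.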

Example session:

llm_ncnn_run (cli). Type 'exit' or 'quit' to end the conversation.
User: Hello
Assistant: Hello! How can I help you today?

OCR

GLM-OCR uses a dedicated image prefill path and the shared text decode runtime.

xmake build ocr_main
xmake run ocr_main --model ./assets/glm_ocr --image ./test_ocr.png --prompt "Read the text in the image."

Example output:

Generating text:
Hello World 123

Embeddings

ncnn_embedding provides a common API for text embeddings and CLIP-style text-image embeddings.

Text Embedding

xmake build embedding_main
xmake run embedding_main --model ./assets/jina-embeddings-v5-text-nano

CLIP Multimodal Embedding

xmake build clip_main
xmake run clip_main --model ./assets/jina_clip_v2 --image ./assets/ganyu.jpg

C++ API

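#include <vector>
#include "ncnn_embedding.h"

// Arguments: model directory, then what appear to be a Vulkan toggle (off here) and a thread count.
ncnn_embedding embed("./assets/jina_clip_v2", false, 4);

std::vector<float> text_vec = embed.encode_text("Hello world");

// Image encoding is only available for CLIP-style models.
if (embed.supports_image()) {
    std::vector<float> image_vec = embed.encode_image_file("./image.jpg");
    // Higher scores mean the text and image are closer in embedding space.
    float score = cosine_similarity(text_vec, image_vec);
}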
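
The snippet above compares two embeddings with cosine similarity. For reference, a standalone version of that measure might look like the sketch below. The name `cosine_similarity_sketch` is hypothetical and chosen to avoid clashing with whatever the project provides; returning 0 for mismatched or zero-length inputs is a design assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
// Returns 0 when the vectors differ in length, are empty, or have zero norm.
float cosine_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0f;

    // Accumulate in double to limit rounding error on long vectors.
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0f;
    return static_cast<float>(dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Only embeddings produced by the same model are comparable this way; vectors of different dimensionality, such as 768-dim and 1024-dim outputs, cannot be scored against each other.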

Other Examples

Target	Purpose
`llm_ncnn_run`	Unified chat / VL CLI
`ocr_main`	GLM-OCR inference
`embedding_main`	Text embedding inference
`clip_main`	CLIP text-image embedding inference
`nllb_main`	NLLB translation example
`unigram_main`	Unigram tokenizer example
`benchllm`	LLM benchmark
`test_llm`	Unit tests

Build and run tests:

xmake build test_llm
xmake run test_llm

Run benchmark:

xmake build benchllm
xmake run benchllm [loop_count] [num_threads] [powersave] [gpu_device] [cooling_down] [seqlen]

Model Zoo

Converted ncnn model weights are available from:

https://mirrors.sdu.edu.cn/ncnn_modelzoo/

Each downloaded model directory should contain model.json, ncnn param/bin files, and tokenizer files. Put the directory under assets/ or pass its path with --model.

Configuration

Each model directory is described by model.json. The exact fields depend on the model family, but a typical text model contains:

{
  "model_type": "llm",
  "params": {
    "embed_param": "embed.ncnn.param",
    "embed_bin": "embed.ncnn.bin",
    "decoder_param": "decoder.ncnn.param",
    "decoder_bin": "decoder.ncnn.bin",
    "lm_head_param": "lm_head.ncnn.param",
    "lm_head_bin": "lm_head.ncnn.bin"
  },
  "tokenizer": {
    "type": "bbpe",
    "vocab_file": "vocab.txt",
    "merges_file": "merges.txt"
  },
  "setting": {
    "attn_cnt": 32,
    "hidden_size": 1024,
    "rope": {
      "type": "RoPE",
      "rope_head_dim": 64,
      "rope_theta": 1000000.0
    }
  }
}
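
The rope settings above (rope_head_dim and rope_theta) parameterize rotary position embeddings. The sketch below shows the standard RoPE formulation with those two values. It is not taken from the ncnn_llm sources: `rope_inv_freq` and `apply_rope` are hypothetical names, and the half-split pairing of elements is an assumption, since some implementations pair adjacent elements instead.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-pair inverse frequencies: theta^(-2i/d) for i in [0, d/2).
std::vector<double> rope_inv_freq(int head_dim, double theta) {
    std::vector<double> inv_freq(head_dim / 2);
    for (int i = 0; i < head_dim / 2; ++i)
        inv_freq[i] = std::pow(theta, -2.0 * i / head_dim);
    return inv_freq;
}

// Rotates one head vector in place for token position `pos`.
// Element i is paired with element i + d/2 (half-split convention, an
// assumption). x must hold at least 2 * inv_freq.size() elements.
void apply_rope(std::vector<float>& x, int pos,
                const std::vector<double>& inv_freq) {
    const std::size_t half = inv_freq.size();
    for (std::size_t i = 0; i < half; ++i) {
        const double angle = pos * inv_freq[i];
        const double c = std::cos(angle);
        const double s = std::sin(angle);
        const double a = x[i];
        const double b = x[i + half];
        // 2-D rotation of the pair (a, b) by `angle`.
        x[i] = static_cast<float>(a * c - b * s);
        x[i + half] = static_cast<float>(a * s + b * c);
    }
}
```

With the example config, `rope_inv_freq(64, 1000000.0)` yields 32 frequencies, and applying the rotation to both queries and keys makes their dot product depend on the distance between positions rather than on the absolute positions.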

Embedding and OCR models use their own model_type and parameter sections. See the model files under assets/ for concrete examples.
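
The tokenizer section of the text-model config names a vocabulary file and a merges file. As background, the sketch below shows how BPE merge rules are typically applied to a sequence of symbols: the adjacent pair with the lowest rank is merged first, and the process repeats until no known pair remains. This is a generic illustration, not the ncnn_llm tokenizer; `MergeRanks` and `apply_bpe_merges` are hypothetical names, and byte-level pre-processing is left out.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Maps a symbol pair to its merge priority (lower rank merges first),
// e.g. the line index of the pair in a merges file.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> apply_bpe_merges(std::vector<std::string> symbols,
                                          const MergeRanks& ranks) {
    while (symbols.size() > 1) {
        // Find the adjacent pair with the best (lowest) rank.
        int best_rank = std::numeric_limits<int>::max();
        std::pair<std::string, std::string> best_pair;
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
            auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pair = it->first;
            }
        }
        if (best_rank == std::numeric_limits<int>::max()) break; // nothing left to merge

        // Merge every occurrence of that pair, scanning left to right.
        std::vector<std::string> merged;
        for (std::size_t i = 0; i < symbols.size(); ++i) {
            if (i + 1 < symbols.size() && symbols[i] == best_pair.first &&
                symbols[i + 1] == best_pair.second) {
                merged.push_back(symbols[i] + symbols[i + 1]);
                ++i; // skip the second half of the merged pair
            } else {
                merged.push_back(symbols[i]);
            }
        }
        symbols = std::move(merged);
    }
    return symbols;
}
```

The resulting symbols would then be looked up in the vocabulary to obtain token IDs.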

Project Layout

ncnn_llm/
├── assets/                 # Local model directories and demo assets
├── benchmark/              # Benchmark entry points
├── examples/               # CLI and feature examples
│   ├── llm_ncnn_run/       # Unified chat / VL runner
│   ├── ocr_main.cpp        # OCR example
│   ├── embedding_main.cpp  # Text embedding example
│   ├── clip_main.cpp       # CLIP example
│   └── nllb_main.cpp       # Translation example
├── export/                 # Export scripts
├── src/                    # Core runtime
│   ├── ncnn_llm_gpt.*      # LLM / VL runtime
│   ├── ncnn_llm_ocr.*      # OCR image prefill + shared decode
│   ├── ncnn_embedding.*    # Embedding runtime
│   ├── ncnn_text_runtime.* # Shared text decode helpers
│   └── utils/              # Tokenizer, image, RoPE, prompt helpers
├── tests/                  # Unit tests
└── xmake.lua               # Build configuration

Roadmap

Keep decoder and KV-cache runtime shared across model families
Expand supported model architectures and tokenizers
Improve Vulkan and CPU performance
Add INT8 quantization support
Document model export pipelines in more detail

Older export scripts may become outdated as the runtime evolves. Prefer the latest model examples and model.json files as references.

Community

Issues, fixes, converted models, and test results are welcome.

QQ group: 767178345

License

Apache License 2.0. See LICENSE.