June 11, 2026
ncnn_llm
LLM, VLM, OCR, translation, and embedding inference on top of ncnn.
Chinese Documentation · Quick Start · Supported Models · Model Zoo
ncnn_llm provides a lightweight C++ runtime for running language models and embedding models with ncnn. It focuses on practical local inference on edge devices, desktop CPUs, and Vulkan-capable GPUs.
The project started from nihui's experimental ncnn kvcache work and expands it into reusable examples, model loaders, tokenizers, vision preprocessing, OCR inference, and embedding APIs.
Highlights
- Unified CLI runner for chat and vision-language models
- KV-cache autoregressive decoding with CPU and optional Vulkan execution
- Qwen / MiniCPM style LLM support
- Qwen VL image input support
- GLM-OCR image-to-text example
- NLLB translation example
- Text and multimodal embedding APIs
- BPE and Unigram tokenizer support
- xmake-based build with small standalone examples
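To make the KV-cache item in the list above more concrete, here is a toy single-head attention step in C++. It only illustrates the caching idea and is not ncnn_llm's implementation: `KVCache` and `attend_with_cache` are hypothetical names, and real models add projections, multiple heads, and positional encoding.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical toy cache: one key and one value vector per processed token.
struct KVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;
};

// Append the current token's key/value, then attend over everything cached.
// Keys and values from earlier steps are reused instead of being recomputed.
std::vector<float> attend_with_cache(KVCache& cache,
                                     const std::vector<float>& q,
                                     const std::vector<float>& k,
                                     const std::vector<float>& v) {
    cache.keys.push_back(k);
    cache.values.push_back(v);

    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Scaled dot-product score of the query against every cached key.
    std::vector<float> scores(cache.keys.size());
    float max_score = -std::numeric_limits<float>::infinity();
    for (std::size_t t = 0; t < cache.keys.size(); ++t) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < q.size(); ++i) dot += q[i] * cache.keys[t][i];
        scores[t] = dot * scale;
        if (scores[t] > max_score) max_score = scores[t];
    }

    // Numerically stable softmax over the scores.
    float sum = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        sum += s;
    }

    // Softmax-weighted sum of the cached values.
    std::vector<float> out(v.size(), 0.0f);
    for (std::size_t t = 0; t < cache.values.size(); ++t) {
        const float w = scores[t] / sum;
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += w * cache.values[t][i];
    }
    return out;
}
```

Each decoding step calls `attend_with_cache` once with the new token's vectors, so the per-step cost grows with the cache length rather than requiring the whole prefix to be re-encoded.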
Supported Models
| Category | Model | Status | Notes |
|---|---|---|---|
| LLM | YoutuLLM | Supported | Chat / text generation |
| LLM | MiniCPM4 | Supported | Chat / text generation |
| LLM | Qwen3 | Supported | Chat / text generation |
| VLM | Qwen3.5 | Supported | Image + text input |
| VLM | Qwen2.5-VL | Supported | Image + text input |
| OCR | GLM-OCR | Supported | OCR |
| Translation | NLLB | Supported | Translation example |
| Embedding | Jina-Embeddings-v5-Text-Nano | Supported | 768-dim text embeddings |
| Embedding | Jina-CLIP-v2 | Supported | 1024-dim text + image embeddings |
Quick Start
1. Requirements
- xmake
- ncnn built from master
2. Clone
git clone https://github.com/futz12/ncnn_llm.git
cd ncnn_llm
3. Build
xmake build
Build a single target:
xmake build llm_ncnn_run
4. Download Models
Download converted ncnn model directories from the mirror:
https://mirrors.sdu.edu.cn/ncnn_modelzoo/
Put the model directory under assets/, for example:
assets/
└── qwen3_0.6b/
    ├── model.json
    ├── *.ncnn.param
    ├── *.ncnn.bin
    └── tokenizer files
CLI Chat
llm_ncnn_run is the main interactive example for text and vision-language models.
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b
With explicit runtime options:
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --threads 4
xmake run llm_ncnn_run --model ./assets/qwen3_0.6b --vulkan --vulkan-device 0
Vision-language input:
xmake run llm_ncnn_run --model ./assets/qwen2.5_vl_3b --image ./assets/test.jpg
CLI Options
| Option | Description |
|---|---|
| --model | Model directory |
| --threads | CPU thread count |
| --vulkan | Enable Vulkan compute |
| --vulkan-device | Vulkan device index |
| --image | Image path for VL models |
| --builtin-tools | Enable built-in demo tools |
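For readers who want to accept the same flags in their own tools, the options in the table could be parsed along these lines. This is a hedged sketch, not the parser used by llm_ncnn_run: the `CliOptions` struct, the `parse_cli` function, and the default values are assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical option holder; the defaults are illustrative assumptions.
struct CliOptions {
    std::string model;
    std::string image;
    int threads = 1;
    bool vulkan = false;
    int vulkan_device = 0;
    bool builtin_tools = false;
};

// Parse the flags listed in the table. Flags that take a value consume the
// next argument; unknown flags and missing values raise std::invalid_argument.
CliOptions parse_cli(const std::vector<std::string>& args) {
    CliOptions opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        auto next = [&]() -> const std::string& {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + flag);
            }
            return args[++i];
        };
        if (flag == "--model") opt.model = next();
        else if (flag == "--image") opt.image = next();
        else if (flag == "--threads") opt.threads = std::stoi(next());
        else if (flag == "--vulkan") opt.vulkan = true;
        else if (flag == "--vulkan-device") opt.vulkan_device = std::stoi(next());
        else if (flag == "--builtin-tools") opt.builtin_tools = true;
        else throw std::invalid_argument("unknown option: " + flag);
    }
    return opt;
}
```

A caller would typically build the vector from `argv + 1` and `argv + argc` before passing it in.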
Example session:
llm_ncnn_run (cli). Type 'exit' or 'quit' to end the conversation.
User: Hello
Assistant: Hello! How can I help you today?
OCR
GLM-OCR uses a dedicated image prefill path and the shared text decode runtime.
xmake build ocr_main
xmake run ocr_main --model ./assets/glm_ocr --image ./test_ocr.png --prompt "Read the text in the image."
Example output:
Generating text:
Hello World 123
Embeddings
ncnn_embedding provides a common API for text embeddings and CLIP-style text-image embeddings.
Text Embedding
xmake build embedding_main
xmake run embedding_main --model ./assets/jina-embeddings-v5-text-nano
CLIP Multimodal Embedding
xmake build clip_main
xmake run clip_main --model ./assets/jina_clip_v2 --image ./assets/ganyu.jpg
C++ API
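#include <vector>
#include "ncnn_embedding.h"

// The second and third arguments look like a Vulkan toggle and a thread count;
// check ncnn_embedding.h to confirm.
ncnn_embedding embed("./assets/jina_clip_v2", false, 4);
std::vector<float> text_vec = embed.encode_text("Hello world");

// Only encode an image when the loaded model has an image encoder.
if (embed.supports_image()) {
    std::vector<float> image_vec = embed.encode_image_file("./image.jpg");
    float score = cosine_similarity(text_vec, image_vec);
}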
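The snippet above scores a text vector against an image vector with `cosine_similarity`. For reference, a standalone version of that computation might look like the following. It is a minimal sketch under the usual definition (dot product divided by the product of the L2 norms) and is deliberately named `cosine_similarity_sketch` so it is not mistaken for the library's own function, whose exact behavior may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Standalone illustration of cosine similarity; not the library's implementation.
float cosine_similarity_sketch(const std::vector<float>& a,
                               const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) {
        throw std::invalid_argument("vectors must be non-empty and of equal length");
    }
    // Accumulate in double to limit rounding error on long vectors.
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    // Convention chosen for this sketch: a zero vector has similarity 0.
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0f;
    return static_cast<float>(dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Both vectors must come from the same model so that their dimensions match; comparing a 768-dim text embedding with a 1024-dim CLIP embedding would throw here.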
Other Examples
| Target | Purpose |
|---|---|
| llm_ncnn_run | Unified chat / VL CLI |
| ocr_main | GLM-OCR inference |
| embedding_main | Text embedding inference |
| clip_main | CLIP text-image embedding inference |
| nllb_main | NLLB translation example |
| unigram_main | Unigram tokenizer example |
| benchllm | LLM benchmark |
| test_llm | Unit tests |
Build and run tests:
xmake build test_llm
xmake run test_llm
Run benchmark:
xmake build benchllm
xmake run benchllm [loop_count] [num_threads] [powersave] [gpu_device] [cooling_down] [seqlen]
Model Zoo
Converted ncnn model weights are available from:
https://mirrors.sdu.edu.cn/ncnn_modelzoo/
Each downloaded model directory should contain model.json, ncnn param/bin files, and tokenizer files. Put the directory under assets/ or pass its path with --model.
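Before pointing --model at a download, it can help to confirm the directory has the pieces listed above. The following is a small sketch of such a check over a list of file names. `looks_like_model_dir` and `ends_with` are hypothetical helpers, not part of ncnn_llm, and tokenizer files are not checked because their names vary by model.

```cpp
#include <string>
#include <vector>

// True if `name` ends with `suffix`.
bool ends_with(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: given the file names found in a model directory,
// require model.json plus at least one *.ncnn.param and one *.ncnn.bin file.
bool looks_like_model_dir(const std::vector<std::string>& file_names) {
    bool has_json = false, has_param = false, has_bin = false;
    for (const std::string& name : file_names) {
        if (name == "model.json") has_json = true;
        else if (ends_with(name, ".ncnn.param")) has_param = true;
        else if (ends_with(name, ".ncnn.bin")) has_bin = true;
    }
    return has_json && has_param && has_bin;
}
```

The file names could be collected with any directory listing API; taking them as a plain vector keeps the check easy to test.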
Configuration
Each model directory is described by model.json. The exact fields depend on the model family, but a typical text model contains:
{
  "model_type": "llm",
  "params": {
    "embed_param": "embed.ncnn.param",
    "embed_bin": "embed.ncnn.bin",
    "decoder_param": "decoder.ncnn.param",
    "decoder_bin": "decoder.ncnn.bin",
    "lm_head_param": "lm_head.ncnn.param",
    "lm_head_bin": "lm_head.ncnn.bin"
  },
  "tokenizer": {
    "type": "bbpe",
    "vocab_file": "vocab.txt",
    "merges_file": "merges.txt"
  },
  "setting": {
    "attn_cnt": 32,
    "hidden_size": 1024,
    "rope": {
      "type": "RoPE",
      "rope_head_dim": 64,
      "rope_theta": 1000000.0
    }
  }
}
Embedding and OCR models use their own model_type and parameter sections. See the model files under assets/ for concrete examples.
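The rope_head_dim and rope_theta fields in the text-model example correspond to the usual rotary position embedding parameters. As a rough illustration, the sketch below computes the standard RoPE inverse frequencies, theta^(-2i/d), and rotates adjacent pairs of a vector. It assumes the textbook formulation with an adjacent-pair layout; the function names are hypothetical, and the layout ncnn_llm actually uses may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Inverse frequencies for standard RoPE: theta^(-2i/d) for i in [0, d/2).
std::vector<float> rope_inv_freq(int head_dim, double theta) {
    std::vector<float> inv_freq(head_dim / 2);
    for (int i = 0; i < head_dim / 2; ++i) {
        inv_freq[i] = static_cast<float>(std::pow(theta, -2.0 * i / head_dim));
    }
    return inv_freq;
}

// Rotate each adjacent pair (x[2i], x[2i+1]) in place by position * inv_freq[i].
void apply_rope(std::vector<float>& x, int position,
                const std::vector<float>& inv_freq) {
    for (std::size_t i = 0; i < inv_freq.size() && 2 * i + 1 < x.size(); ++i) {
        const float angle = static_cast<float>(position) * inv_freq[i];
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float x0 = x[2 * i];
        const float x1 = x[2 * i + 1];
        x[2 * i] = x0 * c - x1 * s;
        x[2 * i + 1] = x0 * s + x1 * c;
    }
}
```

With the values from the example, `rope_inv_freq(64, 1000000.0)` yields 32 frequencies, one per rotated pair. Because each pair is only rotated, vector norms are preserved and position 0 leaves the input unchanged.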
Project Layout
ncnn_llm/
├── assets/ # Local model directories and demo assets
├── benchmark/ # Benchmark entry points
├── examples/ # CLI and feature examples
│ ├── llm_ncnn_run/ # Unified chat / VL runner
│ ├── ocr_main.cpp # OCR example
│ ├── embedding_main.cpp # Text embedding example
│ ├── clip_main.cpp # CLIP example
│ └── nllb_main.cpp # Translation example
├── export/ # Export scripts
├── src/ # Core runtime
│ ├── ncnn_llm_gpt.* # LLM / VL runtime
│ ├── ncnn_llm_ocr.* # OCR image prefill + shared decode
│ ├── ncnn_embedding.* # Embedding runtime
│ ├── ncnn_text_runtime.* # Shared text decode helpers
│ └── utils/ # Tokenizer, image, RoPE, prompt helpers
├── tests/ # Unit tests
└── xmake.lua # Build configuration
Roadmap
- Keep decoder and KV-cache runtime shared across model families
- Expand supported model architectures and tokenizers
- Improve Vulkan and CPU performance
- Add INT8 quantization support
- Document model export pipelines in more detail
Older export scripts may become outdated as the runtime evolves. Prefer the latest model examples and model.json files as references.
Community
Issues, fixes, converted models, and test results are welcome.
- QQ group: 767178345
License
Apache License 2.0. See LICENSE.