Deploy ArcLight from Source (CPU)

May 31, 2026 · View on GitHub

ArcLight is a lightweight LLM inference framework written in C/C++ for unified-memory systems. It is designed for inference scenarios beyond high-performance GPU servers, with current v1.0 optimizations focused on many-core CPU platforms and cross-NUMA tensor parallelism.

ArcLight currently supports CPU backends for ARM and x86 platforms, with basic Windows build support. The recommended path today is to build from source and run GGUF models locally.

TL;DR

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -DARCLIGHT_BACKEND=AUTO -DNNML_USE_NUMA=OFF
cmake --build build --config Release -j 32

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Hello!" \
    --numa none --nodes 1 \
    --threads 4

Preparing a Model

ArcLight uses the GGUF model format from llama.cpp. The current nnml backend only ships kernels for f32 / f16 / q4_0 / q8_0 / q6_K / q8_K (see nnml/src/ops/types.cpp); other quants such as Q4_K_M will not load.

The official openbmb/MiniCPM5-1B-GGUF repo publishes F16, Q8_0, and Q4_K_M — note that Q4_0 is not included. For a q4_0 build you have to quantize it yourself from the released F16:

huggingface-cli download openbmb/MiniCPM5-1B-GGUF MiniCPM5-1B-F16.gguf --local-dir .
llama-quantize ./MiniCPM5-1B-F16.gguf ./MiniCPM5-1B-Q4_0.gguf Q4_0

The current codebase includes model definitions for:

MiniCPM5-1B
Qwen3
Llama2

For first-time testing, start with MiniCPM5-1B or another small GGUF model, preferably the locally produced Q4_0 (smallest) or the released Q8_0.

Building

Linux / x86 / ARM

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -DARCLIGHT_BACKEND=AUTO
cmake --build build --config Release -j 32

ARCLIGHT_BACKEND can be set to:

Value	Meaning
`AUTO`	Select the backend from the target CPU architecture. Recommended.
`X86`	Enable the x86 backend explicitly.
`NEON`	Enable the ARM NEON backend explicitly.
`NONE`	Build without architecture-specific backend code.

Make sure your machine has a C++17-capable toolchain, such as GCC/G++ on Linux or MSVC on Windows.

Windows

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -G "Visual Studio 18 2026"
cmake --build build --config Release -j 32

On Windows, executables are emitted under the build output directory configured by CMake, typically build\bin for Visual Studio builds.

Inference

ArcLight currently provides three command-line apps:

al-gen: one-shot generation
al-chat: interactive chat
al-ppl: perplexity evaluation for one text

If you build from source, run them from the build directory, for example ./build/al-gen on Linux.

One-shot Generate

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Explain what unified memory means in one sentence." \
    --numa none --nodes 1 \
    --threads 4 \
    --max_length 4096 \
    --max_gen 256

For a Chinese prompt:

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "用一句话解释什么是统一内存。" \
    --numa none --nodes 1 \
    --threads 4 \
    --max_gen 256

Interactive Chat

./build/al-chat \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --numa none --nodes 1 \
    --threads 4 \
    --max_length 4096 \
    --max_gen 512

Press Ctrl+C during generation to interrupt the current response. Press Ctrl+C while waiting for input to exit and print the performance profile.

Perplexity

./build/al-ppl \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Good morning, Miss Lee!" \
    --numa none --nodes 1 \
    --threads 4

The app prints the evaluated text and a final perplexity: ... line.

NUMA Modes

ArcLight supports single-node inference and cross-node tensor parallelism.

Mode	Required args	When to use
`--numa none`	`--nodes 1`	Single-node mode. Start here for correctness checks and small models.
`--numa tp`	`--nodes N` where `N > 1`	Cross-NUMA tensor parallelism. Use on many-core CPU machines for higher throughput.
`--numa pp`	Not ready	Reserved for future pipeline parallelism; currently not implemented.

For tensor parallelism, --nodes should be a power of 2 in the current version. Choose --threads so that it can be evenly divided across NUMA nodes.

Example for a 4-node many-core machine:

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf\
    --prompt "Hello!" \
    --numa tp --nodes 4 \
    --threads 32

Recommended Settings

Scenario	Suggested settings
First run / small model	`--numa none --nodes 1 --threads <cores-on-one-node>`
Many-core CPU throughput	`--numa tp --nodes <power-of-2> --threads <total-threads>`
Longer context	Increase `--max_length` and `--kv_gb`.
Larger model	Increase `--w_gb`, then tune `--a_gb` and `--work_gb` if allocation fails.

Deploy ArcLight from Source (CPU)

TL;DR

Preparing a Model

Building

Linux / x86 / ARM

Windows

Inference

One-shot Generate

Interactive Chat

Perplexity

NUMA Modes

Recommended Settings

Q&A

The program aborts immediately in single-node mode

Tensor parallel mode fails to start

Pipeline parallelism does not work

Model loading fails

Inference runs out of memory

See Also