Deploy ArcLight from Source (CPU)

May 31, 2026

ArcLight is a lightweight LLM inference framework written in C/C++ for unified-memory systems. It is designed for inference scenarios beyond high-performance GPU servers, with current v1.0 optimizations focused on many-core CPU platforms and cross-NUMA tensor parallelism.

ArcLight currently supports CPU backends for ARM and x86 platforms, with basic Windows build support. The recommended path today is to build from source and run GGUF models locally.

TL;DR

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -DARCLIGHT_BACKEND=AUTO -DNNML_USE_NUMA=OFF
cmake --build build --config Release -j 32

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Hello!" \
    --numa none --nodes 1 \
    --threads 4

Preparing a Model

ArcLight uses the GGUF model format from llama.cpp. The current nnml backend only ships kernels for f32 / f16 / q4_0 / q8_0 / q6_K / q8_K (see nnml/src/ops/types.cpp); other quants such as Q4_K_M will not load.

The official openbmb/MiniCPM5-1B-GGUF repo publishes F16, Q8_0, and Q4_K_M files, but not Q4_0. To get a Q4_0 model, quantize the released F16 file yourself with llama.cpp's llama-quantize tool:

huggingface-cli download openbmb/MiniCPM5-1B-GGUF MiniCPM5-1B-F16.gguf --local-dir .
llama-quantize ./MiniCPM5-1B-F16.gguf ./MiniCPM5-1B-Q4_0.gguf Q4_0

The current codebase includes model definitions for:

  • MiniCPM5-1B
  • Qwen3
  • Llama2

For first-time testing, start with MiniCPM5-1B or another small GGUF model, preferably the locally produced Q4_0 (smallest) or the released Q8_0.
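Before pointing ArcLight at a file, a quick filename check can catch an unsupported quantization early. The sketch below is only a heuristic: it assumes the quantization label is the last dash-separated part of the filename (as in MiniCPM5-1B-Q4_0.gguf), and quant_supported is a hypothetical helper, not part of ArcLight. The tensor types stored inside the file are what actually decide whether it loads.

```shell
# Heuristic pre-check: infer the quantization label from a GGUF filename and
# compare it with the kernel types listed in this section.
# Assumption: files are named <model>-<QUANT>.gguf.
quant_supported() {
    base=${1##*/}        # strip directories
    base=${base%.gguf}   # strip the extension
    label=${base##*-}    # keep the text after the last '-'
    case "$(printf '%s' "$label" | tr '[:lower:]' '[:upper:]')" in
        F32|F16|Q4_0|Q8_0|Q6_K|Q8_K) echo "supported" ;;
        *)                           echo "unsupported" ;;
    esac
}

quant_supported /path/to/MiniCPM5-1B-Q4_0.gguf     # prints "supported"
quant_supported /path/to/MiniCPM5-1B-Q4_K_M.gguf   # prints "unsupported"
```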

Building

Linux / x86 / ARM

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -DARCLIGHT_BACKEND=AUTO
cmake --build build --config Release -j 32

ARCLIGHT_BACKEND can be set to:

| Value | Meaning |
| --- | --- |
| AUTO | Select the backend from the target CPU architecture. Recommended. |
| X86 | Enable the x86 backend explicitly. |
| NEON | Enable the ARM NEON backend explicitly. |
| NONE | Build without architecture-specific backend code. |

Make sure your machine has a C++17-capable toolchain, such as GCC/G++ on Linux or MSVC on Windows.
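The -j 32 in the build commands suits a large machine. On smaller hosts it may be more convenient to derive the job count from the CPU count. The sketch below assumes a Unix-like shell; build_jobs is a hypothetical helper name.

```shell
# Pick a parallel job count from the machine instead of hard-coding 32.
# nproc (GNU coreutils) is tried first; getconf is a fallback for other systems.
build_jobs() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

JOBS=$(build_jobs)
echo "building with $JOBS parallel jobs"
# Then, inside the ArcLight checkout:
#   cmake --build build --config Release -j "$JOBS"
```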

Windows

git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight

cmake -B build -G "Visual Studio 18 2026"
cmake --build build --config Release -j 32

On Windows, executables are emitted under the build output directory configured by CMake, typically build\bin for Visual Studio builds.

Inference

ArcLight currently provides three command-line apps:

  • al-gen: one-shot generation
  • al-chat: interactive chat
  • al-ppl: perplexity evaluation of a single text

If you build from source, run them from the build directory, for example ./build/al-gen on Linux.

One-shot Generate

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Explain what unified memory means in one sentence." \
    --numa none --nodes 1 \
    --threads 4 \
    --max_length 4096 \
    --max_gen 256

For a Chinese prompt:

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "用一句话解释什么是统一内存。" \
    --numa none --nodes 1 \
    --threads 4 \
    --max_gen 256

Interactive Chat

./build/al-chat \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --numa none --nodes 1 \
    --threads 4 \
    --max_length 4096 \
    --max_gen 512

Press Ctrl+C during generation to interrupt the current response. Press Ctrl+C while waiting for input to exit and print the performance profile.

Perplexity

./build/al-ppl \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Good morning, Miss Lee!" \
    --numa none --nodes 1 \
    --threads 4

The app prints the evaluated text, followed by a final line of the form perplexity: ...
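For reference, perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The formula below is that standard definition, not something taken from the al-ppl source, so the exact tokenization and any handling of the first token are assumptions:

```latex
\mathrm{PPL}(x_1, \dots, x_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
```

Here N is the number of evaluated tokens and p is the model's predicted probability for each token given the preceding ones. Lower values mean the model assigns higher probability to the text.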

NUMA Modes

ArcLight supports single-node inference and cross-node tensor parallelism.

| Mode | Required args | When to use |
| --- | --- | --- |
| --numa none | --nodes 1 | Single-node mode. Start here for correctness checks and small models. |
| --numa tp | --nodes N where N > 1 | Cross-NUMA tensor parallelism. Use on many-core CPU machines for higher throughput. |
| --numa pp | Not ready | Reserved for future pipeline parallelism; currently not implemented. |

For tensor parallelism, --nodes should be a power of 2 in the current version. Choose --threads so that it can be evenly divided across NUMA nodes.
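The rules above can be checked before launching a run. The following is a minimal sketch; check_numa_args is a hypothetical helper that mirrors the constraints stated in this section, not a tool shipped with ArcLight.

```shell
# Pre-flight check for the --numa / --nodes / --threads rules in this section.
# Usage: check_numa_args MODE NODES THREADS
check_numa_args() {
    mode=$1 nodes=$2 threads=$3
    case "$mode" in
        none)
            [ "$nodes" -eq 1 ] || { echo "error: --numa none requires --nodes 1"; return 1; }
            ;;
        tp)
            [ "$nodes" -gt 1 ] || { echo "error: --numa tp requires --nodes > 1"; return 1; }
            # A power of 2 has exactly one bit set, so n & (n - 1) is 0.
            [ $(( nodes & (nodes - 1) )) -eq 0 ] || { echo "error: --nodes must be a power of 2"; return 1; }
            [ $(( threads % nodes )) -eq 0 ] || { echo "error: --threads must be divisible by --nodes"; return 1; }
            ;;
        pp)
            echo "error: --numa pp is not implemented"; return 1
            ;;
        *)
            echo "error: unknown mode '$mode'"; return 1
            ;;
    esac
    echo "ok: --numa $mode --nodes $nodes --threads $threads"
}

check_numa_args tp 4 32           # accepted
check_numa_args tp 3 30 || true   # rejected: 3 is not a power of 2
```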

Example for a 4-node many-core machine:

./build/al-gen \
    --model /path/to/MiniCPM5-1B-Q4_0.gguf \
    --prompt "Hello!" \
    --numa tp --nodes 4 \
    --threads 32

| Scenario | Suggested settings |
| --- | --- |
| First run / small model | --numa none --nodes 1 --threads <cores-on-one-node> |
| Many-core CPU throughput | --numa tp --nodes <power-of-2> --threads <total-threads> |
| Longer context | Increase --max_length and --kv_gb. |
| Larger model | Increase --w_gb, then tune --a_gb and --work_gb if allocation fails. |
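To fill in the <cores-on-one-node> and <total-threads> placeholders on Linux, the node and CPU counts can be read from lscpu. The sketch below is one way to turn that output into starting values; suggest_numa_flags is a hypothetical helper, and it counts logical CPUs, so on machines with SMT you may prefer a lower thread count.

```shell
# Read lscpu-style output on stdin and print starting values for the flags.
# Only the "CPU(s):" and "NUMA node(s):" fields are used.
suggest_numa_flags() {
    awk -F: '
        /^[ \t]*CPU\(s\):/       { gsub(/[ \t]/, "", $2); cpus  = $2 + 0 }
        /^[ \t]*NUMA node\(s\):/ { gsub(/[ \t]/, "", $2); nodes = $2 + 0 }
        END {
            if (cpus < 1) exit
            if (nodes < 1) nodes = 1
            printf "--numa none --nodes 1 --threads %d\n", int(cpus / nodes)
            # Suggest tensor parallelism only when the node count is a power of 2.
            n = nodes
            while (n % 2 == 0) n /= 2
            if (nodes > 1 && n == 1)
                printf "--numa tp --nodes %d --threads %d\n", nodes, int(cpus / nodes) * nodes
        }'
}

# LC_ALL=C keeps the field names in English regardless of the system locale.
if command -v lscpu >/dev/null 2>&1; then
    LC_ALL=C lscpu | suggest_numa_flags
fi
```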

Q&A

The program aborts immediately in single-node mode

Use --numa none --nodes 1. The current implementation requires --nodes to be exactly 1 when --numa none is selected.

Tensor parallel mode fails to start

Use --numa tp --nodes N with N > 1. In the current version, N should be a power of 2. Also make sure --threads is large enough and divisible by --nodes.

Pipeline parallelism does not work

--numa pp is planned but not implemented yet. Use --numa none or --numa tp.

Model loading fails

Check that the model is a GGUF checkpoint from a supported model family. The current codebase includes Qwen3, Llama, and MiniCPM5 definitions. Also verify that --w_gb is large enough for the selected model.
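A quick way to rule out a wrong or truncated download is to check the file header: GGUF files begin with the four ASCII bytes "GGUF". The sketch below checks only that magic value, not the model family or tensor types; is_gguf is a hypothetical helper name.

```shell
# Return success if the file exists and starts with the GGUF magic bytes.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model=/path/to/MiniCPM5-1B-Q4_0.gguf
if is_gguf "$model"; then
    echo "header looks like GGUF: $model"
else
    echo "missing or not a GGUF file: $model"
fi
```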

Inference runs out of memory

Increase --a_gb, --kv_gb, or --work_gb. Longer contexts require a larger KV cache, so --max_length 8192 generally needs a larger --kv_gb than --max_length 4096.
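To see why the KV cache grows with context length, a back-of-the-envelope estimate helps. The sketch below uses the usual transformer accounting (one key and one value vector per layer, per KV head, per position). The layer, head, and dimension values are hypothetical placeholders rather than MiniCPM5-1B's real configuration, and how ArcLight maps --kv_gb onto this figure is an assumption.

```shell
# Rough KV cache size in MiB.
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM MAX_LENGTH BYTES_PER_ELEMENT
kv_cache_mib() {
    # The leading 2 accounts for storing both keys and values.
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}

# Hypothetical model: 24 layers, 8 KV heads, head dimension 64, f16 cache (2 bytes).
kv_cache_mib 24 8 64 4096 2   # prints 192
kv_cache_mib 24 8 64 8192 2   # prints 384
```

Doubling --max_length doubles the estimate, which is why the larger context needs a larger --kv_gb.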

See Also