Cactus

May 27, 2026

A low-latency AI engine for mobile devices & wearables.

Fast & accurate: fastest inference on ARM CPUs; 4-bit Cactus Quants match f16 accuracy
Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
Multimodal: one engine for speech, vision, and language models
Cloud fallback: automatically routes requests to cloud models when needed
Model-agnostic: custom PyTorch models can be exported to the Cactus runtime

┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for text, speech, and vision.
└─────────────────┘     
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph ensures 10x lower RAM 
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Fastest ARM SIMD kernels (Apple, Samsung, Pixel, etc)
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Quants   │ ←── Cactus Quants at 4-bit uniform matches f16.
└─────────────────┘  
         │
┌─────────────────┐
│Cactus Transpiler│ ←── Transpiles custom PyTorch model to Cactus.
└─────────────────┘

Quick Demo (Mac)

Step 1: brew install cactus-compute/cactus/cactus
Step 2: cactus transcribe or cactus run

Cactus Engine

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr,          // user data
    nullptr,          // pcm audio buffer
    0                 // pcm buffer size
);

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
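
Reading fields from the response buffer

The response is returned as a JSON string in the caller's buffer, so the caller has to pull out the fields it cares about. The sketch below is one minimal way to read flat boolean and numeric fields such as success, cloud_handoff, and confidence. The helper names raw_field, number_field, and bool_field are hypothetical and not part of cactus.h. It assumes the buffer holds plain JSON without the explanatory // comments shown above, and it is not a general JSON parser; a real application would more likely use a JSON library.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper (not part of cactus.h): returns the raw text after
// "key": up to the next ',' or '}'. Only meant for flat number/bool/null
// fields; string values and nested objects are out of scope.
std::optional<std::string> raw_field(const std::string& json, const std::string& key) {
    const std::string needle = "\"" + key + "\"";
    std::size_t pos = json.find(needle);
    if (pos == std::string::npos) return std::nullopt;
    pos = json.find(':', pos + needle.size());
    if (pos == std::string::npos) return std::nullopt;
    const std::size_t begin = json.find_first_not_of(" \t\r\n", pos + 1);
    if (begin == std::string::npos) return std::nullopt;
    std::size_t end = json.find_first_of(",}\r\n", begin);
    if (end == std::string::npos) end = json.size();
    std::string value = json.substr(begin, end - begin);
    const std::size_t last = value.find_last_not_of(" \t");
    if (last == std::string::npos) return std::nullopt;
    value.erase(last + 1);  // drop trailing whitespace
    return value;
}

// Parses a numeric field, returning `fallback` if it is missing or malformed.
double number_field(const std::string& json, const std::string& key, double fallback) {
    const auto raw = raw_field(json, key);
    if (!raw) return fallback;
    char* end = nullptr;
    const double value = std::strtod(raw->c_str(), &end);
    return end == raw->c_str() ? fallback : value;
}

// Parses a boolean field, returning `fallback` if it is missing or not true/false.
bool bool_field(const std::string& json, const std::string& key, bool fallback) {
    const auto raw = raw_field(json, key);
    if (!raw) return fallback;
    if (*raw == "true") return true;
    if (*raw == "false") return false;
    return fallback;
}
```

With the response buffer from the call above, bool_field(response, "success", false) and number_field(response, "confidence", 0.0) would give the caller enough to decide whether to trust the local answer.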

Cactus Graph

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();
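
Plain C++ reference for the first two graph ops

For readers new to graph-style APIs, it can help to see what the first two nodes above compute when written as eager loops: a 2x3 by 3x4 matrix product followed by a transpose. This is a float32 sketch that does not use Cactus at all. Matrix, matmul, and transpose here are illustrative names rather than the library's types, and the sketch deliberately ignores FP16/INT8 precision handling and the boolean third argument of graph.matmul, whose semantics are not described in this README.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative row-major matrix; not a Cactus type.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;

    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Eager (rows x k) * (k x cols) product using the textbook triple loop.
Matrix matmul(const Matrix& lhs, const Matrix& rhs) {
    if (lhs.cols != rhs.rows) {
        throw std::invalid_argument("inner dimensions do not match");
    }
    Matrix out{lhs.rows, rhs.cols, std::vector<float>(lhs.rows * rhs.cols, 0.0f)};
    for (std::size_t i = 0; i < lhs.rows; ++i) {
        for (std::size_t j = 0; j < rhs.cols; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < lhs.cols; ++k) {
                sum += lhs.at(i, k) * rhs.at(k, j);
            }
            out.at(i, j) = sum;
        }
    }
    return out;
}

// Swaps rows and columns: out(c, r) = m(r, c).
Matrix transpose(const Matrix& m) {
    Matrix out{m.cols, m.rows, std::vector<float>(m.rows * m.cols, 0.0f)};
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            out.at(c, r) = m.at(r, c);
        }
    }
    return out;
}
```

Feeding it the same a_data and b_data as above yields a 2x4 product and a 4x2 transpose. The graph version differs in that nodes are only declared up front and nothing is computed until graph.execute() runs.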

Learn More

Reference	Language	Description
Cactus Engine	C	Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, vector index, cloud handoff
Cactus Graph	C++	Tensor operations, matrix multiplication, attention, normalization, activation functions
Cactus Kernels	C++	ARM NEON SIMD kernels for matmul, attention, convolution, quantization, DSP, image processing
Cactus Quants	C++	Rotation-and-codebook quantization from 4-bit to 1-bit for all weight tensors
Cactus Hybrid	C/Python	Route hard queries to the cloud automatically based on local model confidence
Cactus Transpiler	Python	Convert any PyTorch model to a Cactus runtime graph for on-device inference
Python Package	Python	Python package and CLI
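
Sketch: confidence-based routing

The Cactus Hybrid row describes routing hard queries to the cloud based on local model confidence. The snippet below is only an illustration of that idea as a threshold check. LocalResult, Route, choose_route, and the 0.7 threshold in the usage note are all hypothetical; the actual routing logic and its tuning live in Cactus Hybrid and may differ.

```cpp
// Hypothetical summary of a local completion (fields mirror "success"
// and "confidence" in the example response; the struct itself is not
// a Cactus type).
struct LocalResult {
    bool success;
    double confidence;
};

enum class Route { Local, Cloud };

// Keeps the local answer only if generation succeeded and the model's
// confidence reaches the caller-chosen threshold; otherwise escalates.
Route choose_route(const LocalResult& result, double threshold) {
    if (!result.success) return Route::Cloud;
    return result.confidence >= threshold ? Route::Local : Route::Cloud;
}
```

With an assumed threshold of 0.7, the example response above (confidence 0.8193) would stay local, while a failed or low-confidence generation would be handed off.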

Build

cactus build --apple       # iOS/macOS
cactus build --android     # Android
cactus build --python      # Python
cactus build               # default static lib

Bindings

Model weights: Pre-converted weights for all supported models at huggingface.co/Cactus-Compute.

Benchmarks (CPU-only, no GPU)

All weights INT4 quantised
LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
LFM-VL: 256px input, values are latency / decode tps
Parakeet: 20s audio input, values are latency / decode tps
Missing latency = no NPU support yet

Device	LFM 1.2B	LFM-VL 1.6B	Parakeet 1.1B	RAM
Mac M4 Pro	582/100	0.2s/98	0.1s/900k+	76MB
iPad/Mac M3	350/60	0.3s/69	0.3s/800k+	70MB
iPhone 17 Pro	327/48	0.3s/48	0.3s/300k+	108MB
iPhone 13 Mini	148/34	0.3s/35	0.7s/90k+	1GB
Galaxy S25 Ultra	255/37	-/34	-/250k+	1.5GB
Pixel 6a	70/15	-/15	-/17k+	1GB
Galaxy A17 5G	32/10	-/11	-/40k+	727MB
CMF Phone 2 Pro	-	-	-	-
Raspberry Pi 5	69/11	13.3s/11	4.5s/180k+	869MB
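
Reading the LFM column

The prefill tps / decode tps pairs can be turned into a rough per-request time. The sketch below assumes "1k" means 1000 prompt tokens and that prefill and decode simply run back to back; it ignores model load time and any other overhead, so treat the result as a lower bound. estimate_seconds is an illustrative helper, not part of Cactus.

```cpp
#include <stdexcept>

// Idealized request time in seconds: prompt processing followed by
// token generation, each at its reported throughput.
double estimate_seconds(int prefill_tokens, double prefill_tps,
                        int decode_tokens, double decode_tps) {
    if (prefill_tps <= 0.0 || decode_tps <= 0.0) {
        throw std::invalid_argument("throughput must be positive");
    }
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps;
}
```

For the Mac M4 Pro row (582/100), estimate_seconds(1000, 582.0, 100, 100.0) comes to roughly 2.7 seconds: about 1.7 s of prefill plus 1.0 s of decode.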

Supported Transcription Models

STT: 20s audio input on a MacBook Air (M3)
Benchmark dataset: internal evals with production users

Model	Params	End2End ms	Latency ms	Decode toks/sec	NPU	RTF	WER
UsefulSensors/moonshine-base	61M	361.35	182	262	yes	0.0180	0.1395
openai/whisper-tiny	39M	232.03	137.38	581	yes	0.0116	0.1860
openai/whisper-base	74M	329.37	178.65	358	yes	0.0164	0.1628
openai/whisper-small	244M	856.79	332.63	108	yes	0.0428	0.0930
openai/whisper-medium	769M	2085.87	923.33	49	yes	0.1041	0.0930
openai/whisper-large-v3	1.55B	5994	2050	15.72	no	0.2992	-
nvidia/parakeet-ctc-0.6b	600M	201.77	201.44	5214285	yes	0.0101	0.0930
nvidia/parakeet-tdt-0.6b-v3	600M	718.91	718.82	3583333	yes	0.0359	0.0465
nvidia/parakeet-ctc-1.1b	1.1B	279.03	278.92	4562500	yes	0.0139	0.1628
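
Reading the RTF column

RTF (real-time factor) is conventionally processing time divided by audio duration, so values below 1 mean faster than real time. The values in the table are close to End2End ms divided by the 20 s input, though not always exact, which suggests the clip is slightly longer than 20 s. That reading is inferred from the numbers rather than stated here. real_time_factor below is an illustrative helper, not a Cactus function.

```cpp
#include <stdexcept>

// Real-time factor: how long processing took relative to the audio length.
// Both arguments are in milliseconds.
double real_time_factor(double processing_ms, double audio_ms) {
    if (audio_ms <= 0.0) {
        throw std::invalid_argument("audio_ms must be positive");
    }
    return processing_ms / audio_ms;
}
```

For moonshine-base, real_time_factor(361.35, 20000.0) is about 0.018, in line with the 0.0180 in the table, or roughly 55x faster than real time.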

Supported LLMs

Gemma weights are often gated on HuggingFace and require an access token
Run huggingface-cli login and enter your HuggingFace token

Model	Features
google/gemma-3-270m-it	completion
google/functiongemma-270m-it	tools
google/gemma-3-1b-it	completion, gated
google/gemma-4-E2B-it	vision, audio, completion, tools, Apple NPU
google/gemma-4-E4B-it	vision, audio, completion, tools, Apple NPU
google/gemma-3n-E2B-it	completion, tools
google/gemma-3n-E4B-it	completion, tools
Qwen/Qwen3-0.6B	completion, tools, embed
Qwen/Qwen3-Embedding-0.6B	embed
Qwen/Qwen3.5-0.8B	vision, completion, tools, embed
Qwen/Qwen3-1.7B	completion, tools, embed
Qwen/Qwen3.5-2B	vision, completion, tools, embed
LiquidAI/LFM2.5-350M	completion, tools, embed
LiquidAI/LFM2-700M	completion, tools, embed
LiquidAI/LFM2-8B-A1B	completion, tools, embed
LiquidAI/LFM2.5-1.2B-Thinking	completion, tools, embed
LiquidAI/LFM2.5-1.2B-Instruct	completion, tools, embed
LiquidAI/LFM2-2.6B	completion, tools, embed
LiquidAI/LFM2-VL-450M	vision, txt & img embed, Apple NPU
LiquidAI/LFM2.5-VL-450M	vision, txt & img embed, Apple NPU
LiquidAI/LFM2.5-VL-1.6B	vision, txt & img embed, Apple NPU
tencent/Youtu-LLM-2B	completion, tools, embed
nomic-ai/nomic-embed-text-v2-moe	embed

Using this repo

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│ Step 0: if on Linux (Ubuntu/Debian)                                          │
│ sudo apt-get install python3 python3-venv python3-pip cmake                  │
│   build-essential libcurl4-openssl-dev                                       │
│                                                                              │
│ Step 1: clone and setup                                                      │
│ git clone https://github.com/cactus-compute/cactus && cd cactus              │
│ source ./setup                                                               │
│                                                                              │
│ Step 2: use the commands                                                     │
│──────────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  cactus auth                         manage Cloud API key                    │
│    --status                          show key status                         │
│    --clear                           remove saved key                        │
│                                                                              │
│  cactus run [model]                  chat playground (gemma-4-E2B-it)        │
│    --image <path>                    image file for VLM inference            │
│    --audio <path>                    audio file (WAV) for audio chat         │
│    --system <prompt>                 system prompt                           │
│    --prompt <text>                   send prompt immediately                 │
│    --thinking                        enable thinking/reasoning mode          │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus transcribe [model]           speech-to-text (parakeet-tdt-0.6b-v3)   │
│    --file <audio.wav>                audio file to transcribe (required)     │
│    --language <code>                 language code (default: en)             │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus download <model>          fetch pre-converted CQ from Cactus-Compute │
│    --bits 1|2|3|4                    CQ quantization (default: 4)            │
│    --token <token>                   HuggingFace API token                   │
│                                                                              │
│  cactus convert <model> [dir]        convert model to CQ format              │
│                                      (pre-converted if available, else       │
│                                      built from source)                      │
│    --bits 1|2|3|4                    CQ quantization (default: 4)            │
│    --token <token>                   HuggingFace API token                   │
│    --reconvert                       force build from source                 │
│                                                                              │
│  cactus build                        build for ARM → build/libcactus.a       │
│    --apple                           Apple (iOS/macOS)                       │
│    --android                         Android                                 │
│    --python                          shared lib for Python FFI               │
│                                                                              │
│  cactus test                         run unit tests and benchmarks           │
│    --model <model>                   default: google/gemma-4-E2B-it          │
│    --suite <name>                    run a specific suite (llm, vlm, stt,    │
│                                      embed, rag, graph, index, kernel, etc.) │
│    --token <token>                   HuggingFace API token                   │
│    --reconvert                       force reconversion from source          │
│    --ios                             run on connected iPhone                 │
│    --android                         run on connected Android                │
│                                                                              │
│  cactus clean                        remove all build artifacts              │
│  cactus --help                       show all commands and flags             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Maintaining Organisations

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones & Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

N.B.: Scroll to the top and click the shields links for more resources!