README_EN.md

March 28, 2026


nndeploy: An Easy-to-Use and High-Performance AI deployment framework



Latest Updates

[2025/05/29] 🔥 Jointly launched a free inference framework course with Huawei Ascend (Ascend Official | Bilibili Video). The course is based on nndeploy's internal inference sub-module and helps you quickly master core AI inference deployment techniques.

nndeploy is an easy-to-use and high-performance AI deployment framework. Built around visual workflows and multi-end inference, it lets developers quickly turn an algorithm repository into an SDK for a specific platform and hardware target, saving significant development time. The framework also ships with many already-deployed AI models, including LLMs, AIGC, face swapping, object detection, and image segmentation, all ready to use out of the box.

Easy to Use

Visual Workflow: Deploy AI algorithms by dragging nodes; parameters can be adjusted in real time and the results are visible immediately.
Custom Nodes: Supports custom nodes in Python and C++. Whether you implement preprocessing in Python or write high-performance nodes in C++/CUDA, they integrate seamlessly into the visual workflow.
One-Click Deployment: Workflows can be exported as JSON and called through C++/Python APIs, applicable to platforms such as Linux, Windows, macOS, Android, and iOS.

Demos: building an AI workflow on desktop; deployment on mobile.

High Performance

Parallel Optimization: Supports serial, pipeline-parallel, and task-parallel execution modes.
Memory Optimization: Zero-copy, memory pools, memory reuse, and other optimization strategies.
High-Performance Optimization: Built-in nodes optimized with C++/CUDA/Ascend C/SIMD implementations.
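
To make the difference between the serial and pipeline-parallel modes listed above more concrete, here is a minimal, framework-independent sketch in plain Python. It does not use any nndeploy API and says nothing about how nndeploy implements its executors; the stage functions (`preprocess`, `infer`, `postprocess`) and the helper names are hypothetical placeholders.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the input stream


def run_serial(stages, items):
    """Run every item through all stages, one item at a time."""
    results = []
    for item in items:
        for stage in stages:
            item = stage(item)
        results.append(item)
    return results


def run_pipeline(stages, items):
    """Run each stage in its own thread, connected by FIFO queues.

    While stage 2 works on item N, stage 1 can already work on item N+1.
    One thread per stage plus FIFO queues keeps the output in input order.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is _STOP:
                q_out.put(_STOP)  # forward the sentinel, then exit
                return
            q_out.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for real preprocessing, inference, and postprocessing.
def preprocess(x):
    return x + 1


def infer(x):
    return x * 2


def postprocess(x):
    return f"result={x}"


STAGES = [preprocess, infer, postprocess]

if __name__ == "__main__":
    print(run_serial(STAGES, [1, 2, 3]))
    print(run_pipeline(STAGES, [1, 2, 3]))
```

Both functions return the same results; only the scheduling differs. The sketch omits error handling (a stage that raises would stall the pipeline), and with pure-Python stages the interpreter lock limits any real speedup, so treat it as an illustration of the scheduling idea only.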

Multi-End Inference: One workflow runs on multiple targets. nndeploy integrates 13 mainstream inference frameworks and covers cloud, desktop, mobile, and edge deployment scenarios.


A custom inference framework can also be used entirely on its own, without depending on any third-party framework.

Out-of-the-Box Algorithms

The following models are already deployed, with more than 100 visual nodes available.

Application Scenario	Available Models	Remarks
Large Language Models	Qwen2.5, Qwen3	Supports small models (low billions of parameters)
Image Generation	Stable Diffusion 1.5, Stable Diffusion XL, Stable Diffusion 3, HunyuanDiT, etc.	Supports text-to-image, image-to-image, and image inpainting; based on diffusers
Face Swapping	deep-live-cam
OCR	PaddleOCR
Object Detection	YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv11, YOLOX
Object Tracking	FairMOT
Image Segmentation	RMBGv1.4, PP-Matting, Segment Anything
Classification	ResNet, MobileNet, EfficientNet, PP-LCNet, GhostNet, ShuffleNet, SqueezeNet
API Services	OpenAI, DeepSeek, Moonshot	Supports LLM and AIGC services

For more, see Detailed List of Deployed Models

Quick Start

Step 1: Installation
```
pip install --upgrade nndeploy
```
Step 2: Launch the Visual Interface
```
# Method 1: Command line
nndeploy-app --port 8000
# Method 2: Code startup
cd path/to/nndeploy
python app.py --port 8000
```
After a successful launch, open http://localhost:8000 to access the workflow editor. There you can drag nodes, adjust parameters, and preview results in real time (what you see is what you get).

Step 3: Save and Load for Execution

After building and debugging in the visual interface, click save, and the workflow will be exported as a JSON file, which encapsulates all processing procedures. You can run it in the production environment in the following two ways:

Method 1: Command-line execution

For debugging

```
# Python CLI
nndeploy-run-json --json_file path/to/workflow.json
# C++ CLI
nndeploy_demo_run_json --json_file path/to/workflow.json
```
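
If you want to trigger the Python CLI from a script (for example in a test job), a thin wrapper around `subprocess` is one option. The sketch below only uses the `nndeploy-run-json --json_file` invocation shown above; the helper names `build_run_command` and `run_workflow` are hypothetical, and what the CLI's exit code means is not documented here, so the wrapper simply returns it.

```python
import shutil
import subprocess


def build_run_command(json_file, executable="nndeploy-run-json"):
    """Build the argument list for the CLI invocation shown above."""
    return [executable, "--json_file", str(json_file)]


def run_workflow(json_file):
    """Run the workflow through the CLI and return the process exit code."""
    cmd = build_run_command(json_file)
    if shutil.which(cmd[0]) is None:
        # The executable is not on PATH, e.g. nndeploy is not installed
        # in the active environment.
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


if __name__ == "__main__":
    print(build_run_command("path/to/workflow.json"))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when the JSON path contains spaces.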

Method 2: Load and run in Python/C++ code

You can integrate the JSON file into your existing Python or C++ project. The following examples load and run an LLM workflow:

Python API to load and run LLM workflow

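```
import nndeploy  # assumes this exposes the dag and tokenizer submodules used below

# Create an empty graph and load the exported workflow
graph = nndeploy.dag.Graph("")
graph.remove_in_out_node()
graph.load_file("path/to/llm_workflow.json")
graph.init()

# Feed the prompt through the graph's first input edge
input_edge = graph.get_input(0)
text = nndeploy.tokenizer.TokenizerText()
text.texts_ = ["<|im_start|>user\nPlease introduce NBA superstar Michael Jordan<|im_end|>\n<|im_start|>assistant\n"]
input_edge.set(text)

# Run the workflow and read the result from the first output edge
status = graph.run()
output = graph.get_output(0)
result = output.get_graph_output()

# Release the graph's resources
graph.deinit()
```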

C++ API to load and run LLM workflow

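```
// Assumes the nndeploy headers are included and `using namespace nndeploy;` is in effect.

// Create an empty graph and load the exported workflow
std::shared_ptr<dag::Graph> graph = std::make_shared<dag::Graph>("");
base::Status status = graph->loadFile("path/to/llm_workflow.json");
graph->removeInOutNode();
status = graph->init();

// Feed the prompt through the graph's first input edge
dag::Edge* input = graph->getInput(0);
tokenizer::TokenizerText* text = new tokenizer::TokenizerText();
text->texts_ = {
    "<|im_start|>user\nPlease introduce NBA superstar Michael Jordan<|im_end|>\n<|im_start|>assistant\n"};
input->set(text, false);

// Run the workflow and read the result from the first output edge
status = graph->run();
dag::Edge* output = graph->getOutput(0);
tokenizer::TokenizerText* result =
    output->getGraphOutput<tokenizer::TokenizerText>();

// Release the graph's resources
status = graph->deinit();
```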
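
Both examples pass the prompt as a pre-formatted string with `<|im_start|>` / `<|im_end|>` markers. If you build prompts from user input, a small helper keeps that formatting in one place. This is a sketch: `build_chat_prompt` is a hypothetical name, and the template is copied from the examples above, so other models may expect a different one.

```python
def build_chat_prompt(user_message: str) -> str:
    """Wrap a user message in the single-turn template used in the examples above."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(build_chat_prompt("Please introduce NBA superstar Michael Jordan"))
```

The returned string can be assigned as the single element of `text.texts_` in the Python example.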

Requires Python 3.10+. By default, the package includes the ONNXRuntime and MNN backends. For more inference backends, use developer mode.
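
A quick way to confirm that the active interpreter meets the Python 3.10+ requirement before installing is sketched below. It uses only the standard library; `python_is_supported` is a hypothetical helper name.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated above


def python_is_supported(version_info=None):
    """Return True if the given (or current) interpreter is Python 3.10 or newer."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if python_is_supported() else "too old"
    print(f"Python {current}: {verdict}")
```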

Documentation

Performance Testing

Test environment: Ubuntu 22.04, i7-12700, RTX3060

Pipeline parallel acceleration: end-to-end workflow total time for YOLOv11s, serial vs. pipeline parallel.

Execution Mode \ Inference Engine	ONNXRuntime	OpenVINO	TensorRT
Serial	54.803 ms	34.139 ms	13.213 ms
Pipeline Parallel	47.283 ms	29.666 ms	5.681 ms
Performance Improvement	13.7%	13.1%	57%

Task parallel acceleration: end-to-end total time for combined tasks (segmentation RMBGv1.4 + detection YOLOv11s + classification ResNet50), serial vs. task parallel.

Execution Mode \ Inference Engine	ONNXRuntime	OpenVINO	TensorRT
Serial	654.315 ms	489.934 ms	59.140 ms
Task Parallel	602.104 ms	435.181 ms	51.883 ms
Performance Improvement	7.98%	11.2%	12.2%
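
The "Performance Improvement" rows appear to be the relative reduction in total time, i.e. (serial - parallel) / serial. The short check below recomputes them under that assumption, with the timings copied from the tables above; the function and variable names are hypothetical.

```python
def improvement_percent(serial_ms, parallel_ms):
    """Relative reduction in total time, in percent."""
    return (serial_ms - parallel_ms) / serial_ms * 100.0


# (serial ms, parallel ms) per inference engine, copied from the tables above.
PIPELINE_YOLOV11S = {
    "ONNXRuntime": (54.803, 47.283),
    "OpenVINO": (34.139, 29.666),
    "TensorRT": (13.213, 5.681),
}
TASK_COMBINED = {
    "ONNXRuntime": (654.315, 602.104),
    "OpenVINO": (489.934, 435.181),
    "TensorRT": (59.140, 51.883),
}

if __name__ == "__main__":
    for title, table in (("pipeline", PIPELINE_YOLOV11S), ("task", TASK_COMBINED)):
        for engine, (serial_ms, parallel_ms) in table.items():
            gain = improvement_percent(serial_ms, parallel_ms)
            print(f"{title:8s} {engine:12s} {gain:.2f}%")
```

Under this definition the recomputed values agree with the listed percentages to within 0.1 percentage points.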

Roadmap

Contact Us

If you love open source and enjoy tinkering, whether for learning purposes or to share better ideas, you are welcome to join us.
WeChat: Always031856 (add as a friend to join the group discussion; please include the note nndeploy_name, using your own name, in the request)