
March 28, 2026


nndeploy: An Easy-to-Use and High-Performance AI Deployment Framework

Supported platforms: Linux | Windows | Android | macOS | iOS

Documentation | Ask DeepWiki | WeChat | Discord



Latest Updates

  • [2025/05/29] 🔥 Together with Huawei Ascend, we launched a free course on inference frameworks (Ascend Official | Bilibili Video)! Built on nndeploy's internal inference sub-module, it helps you quickly master the core techniques of AI inference deployment.

Introduction

nndeploy is an easy-to-use, high-performance AI deployment framework. Built on visual workflows and multi-end inference, it lets developers quickly turn algorithm repositories into SDKs for specific platforms and hardware, significantly reducing development time. The framework also ships many already-deployed AI models, including LLMs, AIGC image generation, face swapping, object detection, and image segmentation, ready to use out of the box.

Easy to Use

  • Visual Workflow: Deploy AI algorithms by dragging nodes, adjust parameters in real time, and see the effect immediately.

  • Custom Nodes: Supports custom nodes written in Python or C++. Whether you implement preprocessing in Python or write high-performance nodes in C++/CUDA, they integrate seamlessly into the visual workflow.

  • One-Click Deployment: Workflows can be exported as JSON and invoked through the C++/Python APIs on Linux, Windows, macOS, Android, and iOS.

    (Figures: building an AI workflow on desktop; deployment on mobile)
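The custom-node idea can be illustrated with a minimal Python sketch of a node registry: each node class is registered under a type name, so a serialized workflow can refer to nodes by name and a loader can instantiate them. This is not nndeploy's node API; `Node`, `register_node`, `NODE_REGISTRY`, `build_node`, and the `Normalize`/`Clip` nodes are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, Type

# Hypothetical registry: node type name (as it might appear in a saved
# workflow) -> Python class implementing that node.
NODE_REGISTRY: Dict[str, Type["Node"]] = {}


def register_node(type_name: str) -> Callable[[Type["Node"]], Type["Node"]]:
    """Class decorator that makes a node class discoverable by type name."""
    def decorator(cls: Type["Node"]) -> Type["Node"]:
        NODE_REGISTRY[type_name] = cls
        return cls
    return decorator


class Node:
    """Minimal node interface: configured with params, transforms one input."""
    def __init__(self, **params: Any) -> None:
        self.params = params

    def run(self, data: Any) -> Any:
        raise NotImplementedError


@register_node("Normalize")
class Normalize(Node):
    """Example preprocessing node: multiply every value by a scale factor."""
    def run(self, data: list) -> list:
        scale = self.params.get("scale", 1.0 / 255.0)
        return [x * scale for x in data]


@register_node("Clip")
class Clip(Node):
    """Example node: clamp every value into [low, high]."""
    def run(self, data: list) -> list:
        low, high = self.params.get("low", 0.0), self.params.get("high", 1.0)
        return [min(max(x, low), high) for x in data]


def build_node(type_name: str, **params: Any) -> Node:
    """Instantiate a registered node from its type name and parameters."""
    return NODE_REGISTRY[type_name](**params)


if __name__ == "__main__":
    chain = [build_node("Normalize", scale=0.01), build_node("Clip", low=0.0, high=1.0)]
    data = [50.0, 150.0, -20.0]
    for node in chain:
        data = node.run(data)
    print(data)  # [0.5, 1.0, 0.0]
```

Keying nodes by a string type name is what allows a workflow file to store only names and parameters while the actual implementation (Python here, or C++/CUDA in a real framework) is resolved at load time.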

High Performance

  • Parallel Optimization: Supports serial, pipeline-parallel, and task-parallel execution modes.

  • Memory Optimization: Uses strategies such as zero-copy, memory pools, and memory reuse.

  • High-Performance Optimization: Built-in nodes optimized with C++/CUDA/Ascend C/SIMD implementations.

  • Multi-End Inference: A single workflow can run on multiple inference backends. nndeploy integrates 13 mainstream inference frameworks, covering deployment scenarios across cloud, desktop, mobile, and edge.

    ONNXRuntime, TensorRT, OpenVINO, MNN, TNN, ncnn, CoreML, AscendCL, RKNN, SNPE, TVM, PyTorch, nndeploy_inner

    The built-in inference framework (nndeploy_inner) can be used entirely on its own, without depending on any third-party framework.
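The difference between serial and pipeline-parallel execution can be sketched in plain Python with threads and FIFO queues. This illustrates the general technique only, not nndeploy's scheduler: the three stage functions are hypothetical stand-ins for preprocessing, inference, and postprocessing, and `time.sleep` simulates work. In pipeline mode each stage runs in its own thread, so stage k can work on frame i+1 while stage k+1 works on frame i.

```python
import queue
import threading
import time
from typing import Callable, List

Stage = Callable[[int], int]


# Hypothetical stages. sleep() stands in for real work; CPython threads only
# overlap work that releases the GIL (I/O, sleeps, native inference calls).
def preprocess(x: int) -> int:
    time.sleep(0.01)
    return x + 1


def infer(x: int) -> int:
    time.sleep(0.01)
    return x * 2


def postprocess(x: int) -> int:
    time.sleep(0.01)
    return x - 3


STAGES: List[Stage] = [preprocess, infer, postprocess]
_DONE = object()  # sentinel marking the end of the frame stream


def run_serial(frames: List[int]) -> List[int]:
    """Each frame passes through every stage before the next frame starts."""
    results = []
    for frame in frames:
        for stage in STAGES:
            frame = stage(frame)
        results.append(frame)
    return results


def run_pipeline(frames: List[int]) -> List[int]:
    """One thread per stage, connected by unbounded FIFO queues."""
    queues = [queue.Queue() for _ in range(len(STAGES) + 1)]

    def worker(stage: Stage, inbox: queue.Queue, outbox: queue.Queue) -> None:
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # propagate shutdown downstream
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(STAGES)
    ]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(_DONE)

    # One worker per stage plus FIFO queues keeps the output in input order.
    results = []
    while (item := queues[-1].get()) is not _DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    frames = list(range(20))
    t0 = time.perf_counter()
    serial = run_serial(frames)
    t1 = time.perf_counter()
    pipelined = run_pipeline(frames)
    t2 = time.perf_counter()
    assert serial == pipelined
    print(f"serial {t1 - t0:.3f}s, pipeline {t2 - t1:.3f}s")
```

With S equal-cost stages, serial time grows roughly as frames × S × stage_time, while the pipeline approaches (frames + S - 1) × stage_time once it is full; real gains are smaller when stages are unbalanced, as the measured numbers in Performance Testing show.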

Out-of-the-Box Algorithms

Deployed models, with more than 100 visual nodes available:

| Application Scenario | Available Models | Remarks |
|---|---|---|
| Large Language Models | Qwen2.5, Qwen3 | Supports small-parameter models |
| Image Generation | Stable Diffusion 1.5, Stable Diffusion XL, Stable Diffusion 3, HunyuanDiT, etc. | Supports text-to-image, image-to-image, and inpainting; based on diffusers |
| Face Swapping | deep-live-cam | |
| OCR | PaddleOCR | |
| Object Detection | YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv11, YOLOX | |
| Object Tracking | FairMOT | |
| Image Segmentation | RMBGv1.4, PP-Matting, Segment Anything | |
| Classification | ResNet, MobileNet, EfficientNet, PP-LCNet, GhostNet, ShuffleNet, SqueezeNet | |
| API Services | OpenAI, DeepSeek, Moonshot | Supports LLM and AIGC services |

For more models, see the Detailed List of Deployed Models.

Quick Start

  • Step 1: Installation

    pip install --upgrade nndeploy
    
  • Step 2: Launch the Visual Interface

    # Method 1: Command line
    nndeploy-app --port 8000
    # Method 2: Launch from the source code
    cd path/to/nndeploy
    python app.py --port 8000
    

    After a successful launch, open http://localhost:8000 to access the workflow editor. There you can drag nodes, adjust parameters, and preview results in real time, in a what-you-see-is-what-you-get experience.


  • Step 3: Save and Load for Execution

    After building and debugging in the visual interface, click Save to export the workflow as a JSON file that captures the entire processing pipeline. You can run it in production in either of two ways:

    • Method 1: Command-line execution

      For debugging

      # Python CLI
      nndeploy-run-json --json_file path/to/workflow.json
      # C++ CLI
      nndeploy_demo_run_json --json_file path/to/workflow.json
      
    • Method 2: Load and run in Python/C++ code

      You can integrate the JSON file into an existing Python or C++ project. The following examples load and run an LLM workflow:

      • Python API to load and run LLM workflow
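        import nndeploy

        # Create an empty graph, then load the workflow exported from the editor
        graph = nndeploy.dag.Graph("")
        graph.remove_in_out_node()
        graph.load_file("path/to/llm_workflow.json")
        graph.init()

        # Feed a ChatML-formatted prompt into the graph's first input edge
        input_edge = graph.get_input(0)
        text = nndeploy.tokenizer.TokenizerText()
        text.texts_ = [ "<|im_start|>user\nPlease introduce NBA superstar Michael Jordan<|im_end|>\n<|im_start|>assistant\n" ]
        input_edge.set(text)

        # Run the workflow and fetch the generated text from the first output edge
        status = graph.run()
        output = graph.get_output(0)
        result = output.get_graph_output()
        graph.deinit()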
        
      • C++ API to load and run LLM workflow
        // Assumes the nndeploy dag and tokenizer headers are included
        using namespace nndeploy;

        // Create an empty graph, then load the workflow exported from the editor
        std::shared_ptr<dag::Graph> graph = std::make_shared<dag::Graph>("");
        base::Status status = graph->loadFile("path/to/llm_workflow.json");
        graph->removeInOutNode();
        status = graph->init();

        // Feed a ChatML-formatted prompt into the graph's first input edge
        dag::Edge* input = graph->getInput(0);
        tokenizer::TokenizerText* text = new tokenizer::TokenizerText();
        text->texts_ = {
            "<|im_start|>user\nPlease introduce NBA superstar Michael Jordan<|im_end|>\n<|im_start|>assistant\n"};
        input->set(text, false);

        // Run the workflow and read the generated text from the first output edge
        status = graph->run();
        dag::Edge* output = graph->getOutput(0);
        tokenizer::TokenizerText* result =
            output->getGraphOutput<tokenizer::TokenizerText>();
        status = graph->deinit();
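The prompt strings in both examples follow the ChatML layout used by Qwen chat models: each turn is wrapped as `<|im_start|>{role}\n{content}<|im_end|>\n`, and the string ends with an open `assistant` turn for the model to complete. Rather than hand-writing these strings, a small helper can build them; `build_chatml_prompt` is a hypothetical convenience function, not part of nndeploy.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages in ChatML and open an assistant turn.

    Each message is a dict with a "role" ("system", "user", or "assistant")
    and a "content" string.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


if __name__ == "__main__":
    prompt = build_chatml_prompt(
        [{"role": "user", "content": "Please introduce NBA superstar Michael Jordan"}]
    )
    print(repr(prompt))
```

For a single user message, the result matches the `texts_` string used in the examples above, so it can be assigned to `texts_` directly; adding a `system` message or earlier turns produces a multi-turn prompt in the same format.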
        

Requires Python 3.10+. By default, the package includes the ONNXRuntime and MNN inference backends. For other inference backends, please use developer mode.
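Conceptually, a saved workflow is a serialized graph: nodes plus the edges between them, which a loader must turn into a valid execution order. The sketch below shows that idea with Kahn's topological sort. The JSON layout (`nodes`, `name`, `inputs`) and the node names are a hypothetical simplification, not nndeploy's actual workflow schema; open a file exported from the editor to see the real format.

```python
import json
from collections import deque
from typing import Dict, List

# Hypothetical, simplified workflow description (not nndeploy's real schema).
WORKFLOW_JSON = """
{
  "nodes": [
    {"name": "decode",     "inputs": []},
    {"name": "preprocess", "inputs": ["decode"]},
    {"name": "detect",     "inputs": ["preprocess"]},
    {"name": "segment",    "inputs": ["preprocess"]},
    {"name": "draw",       "inputs": ["detect", "segment"]}
  ]
}
"""


def execution_order(workflow: Dict) -> List[str]:
    """Return node names in a dependency-respecting order (Kahn's algorithm)."""
    nodes = [n["name"] for n in workflow["nodes"]]
    indegree = {name: 0 for name in nodes}
    consumers: Dict[str, List[str]] = {name: [] for name in nodes}
    for node in workflow["nodes"]:
        for src in node["inputs"]:
            consumers[src].append(node["name"])
            indegree[node["name"]] += 1

    # Start from nodes with no inputs; release a node once all inputs are done.
    ready = deque(name for name in nodes if indegree[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for dst in consumers[name]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)

    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return order


if __name__ == "__main__":
    print(execution_order(json.loads(WORKFLOW_JSON)))
    # ['decode', 'preprocess', 'detect', 'segment', 'draw']
```

In this example `detect` and `segment` depend only on `preprocess`, not on each other; that independence is what a task-parallel executor can exploit by running such nodes concurrently.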

Documentation

Performance Testing

Test environment: Ubuntu 22.04, i7-12700, RTX3060

  • Pipeline-parallel acceleration: end-to-end total workflow time for YOLOv11s, serial vs. pipeline parallel

    | Execution Mode \ Inference Engine | ONNXRuntime | OpenVINO | TensorRT |
    |---|---|---|---|
    | Serial | 54.803 ms | 34.139 ms | 13.213 ms |
    | Pipeline Parallel | 47.283 ms | 29.666 ms | 5.681 ms |
    | Performance Improvement (time reduction) | 13.7% | 13.1% | 57% |

  • Task-parallel acceleration: end-to-end total time for a combined task (segmentation with RMBGv1.4 + detection with YOLOv11s + classification with ResNet50), serial vs. task parallel

    | Execution Mode \ Inference Engine | ONNXRuntime | OpenVINO | TensorRT |
    |---|---|---|---|
    | Serial | 654.315 ms | 489.934 ms | 59.140 ms |
    | Task Parallel | 602.104 ms | 435.181 ms | 51.883 ms |
    | Performance Improvement (time reduction) | 7.98% | 11.2% | 12.2% |
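The "Performance Improvement" figures appear to be the relative reduction in end-to-end time, (serial - parallel) / serial, up to rounding. The short script below recomputes them from the measured times in the two tables; it only does arithmetic on the published numbers and measures nothing itself.

```python
# End-to-end times in milliseconds (serial, parallel), copied from the tables.
RESULTS = {
    "pipeline parallel (YOLOv11s)": {
        "ONNXRuntime": (54.803, 47.283),
        "OpenVINO": (34.139, 29.666),
        "TensorRT": (13.213, 5.681),
    },
    "task parallel (RMBGv1.4 + YOLOv11s + ResNet50)": {
        "ONNXRuntime": (654.315, 602.104),
        "OpenVINO": (489.934, 435.181),
        "TensorRT": (59.140, 51.883),
    },
}


def improvement(serial_ms: float, parallel_ms: float) -> float:
    """Relative reduction in end-to-end time, as a percentage of serial time."""
    return (serial_ms - parallel_ms) / serial_ms * 100.0


if __name__ == "__main__":
    for experiment, engines in RESULTS.items():
        for engine, (serial_ms, parallel_ms) in engines.items():
            pct = improvement(serial_ms, parallel_ms)
            print(f"{experiment:48s} {engine:12s} {pct:5.2f}%")
```

Note that a time reduction is not the same as a speedup factor: for example, TensorRT's pipeline result cuts total time by about 57%, which corresponds to roughly a 2.3x speedup (13.213 / 5.681).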

Roadmap

Contact Us

  • If you love open source and enjoy tinkering, whether for learning purposes or to share better ideas, you are welcome to join us.

  • WeChat: Always031856 (add this account as a friend to join the group discussion; please include the note "nndeploy_name" in your request)

Acknowledgements

Contributors
