Triton Inference Server Documentation
February 5, 2026
New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!
Installation
Before you can use the Triton Docker image you must install Docker. If you plan on using a GPU for inference you must also install the NVIDIA Container Toolkit. DGX users should follow Preparing to use NVIDIA Containers.
Pull the image using the following command.
$ docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3
Where <yy.mm> is the version of Triton that you want to pull. For a complete list of all the variants and versions of the Triton Inference Server Container, visit the NGC Page. More information about customizing the Triton Container can be found in this section of the User Guide.
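Once pulled, the server is typically launched with a local model repository mounted into the container. A minimal sketch; the repository path, GPU flags, and the `<yy.mm>` tag are placeholders you must adapt:

```shell
# Placeholder <yy.mm> must be replaced with a real release tag, and the
# host directory ./model_repository is assumed to exist and contain models.
docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 are Triton's default HTTP, gRPC, and metrics endpoints, respectively.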
Getting Started
This guide covers the simplest possible workflow for deploying a model using a Triton Inference Server.
Triton Inference Server has a considerable list of versatile and powerful features. All new users are encouraged to explore the User Guide and the Additional Resources sections for the features most relevant to their use case.
User Guide
The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following:
- Creating a Model Repository [Overview || Details]
- Writing a Model Configuration [Overview || Details]
- Building a Model Pipeline [Overview]
- Managing Model Availability [Overview || Details]
- Collecting Server Metrics [Overview || Details]
- Supporting Custom Ops/layers [Overview || Details]
- Using the Client API [Overview || Details]
- Cancelling Inference Requests [Overview || Details]
- Analyzing Performance [Overview]
- Deploying on edge (Jetson) [Overview]
- Debugging Guide [Details]
Model Repository
Model Repositories are the organizational hub for using Triton. All models, configuration files, and additional resources needed to serve the models are housed inside a model repository.
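A minimal sketch of the expected repository layout, using a hypothetical ONNX model named `densenet_onnx` (the model name and file are illustrative):

```shell
# Create a minimal model repository skeleton (names are illustrative).
mkdir -p model_repository/densenet_onnx/1
# The serialized model file lives inside a numeric version subdirectory.
touch model_repository/densenet_onnx/1/model.onnx
# The model configuration sits next to the version directories.
touch model_repository/densenet_onnx/config.pbtxt
# Inspect the resulting tree.
find model_repository | sort
```

Each numbered subdirectory is one version of the model; Triton decides which versions to serve based on the model's version policy.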
Model Configuration
A Model Configuration file is where you set the model-level options, such as output tensor reshaping and dynamic batch sizing.
Required Model Configuration
Triton Inference Server requires a minimum set of parameters to be filled in the Model Configuration. These required parameters essentially describe the structure of the model. For ONNX and TensorRT models, users can rely on Triton to auto-generate the minimum required model configuration.
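A sketch of a minimal `config.pbtxt` for a hypothetical ONNX model; the platform, tensor names, data types, and shapes below are assumptions that must match the actual model:

```shell
# Write a minimal model configuration (all values are illustrative).
mkdir -p model_repository/densenet_onnx
cat > model_repository/densenet_onnx/config.pbtxt <<'EOF'
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
EOF
```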
Versioning Models
Users need the ability to save and serve different versions of models based on business requirements. Triton allows users to set policies that make different versions of a model available as needed. Learn more.
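For example, a version policy can limit serving to the most recent versions. The fragment below appends such a policy to the hypothetical `densenet_onnx` configuration (the model name and count are illustrative):

```shell
# Serve only the two highest-numbered versions of this model;
# other policies include "all" and an explicit list of versions.
mkdir -p model_repository/densenet_onnx
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
version_policy: { latest: { num_versions: 2 } }
EOF
```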
Instance Groups
Triton allows users to run multiple instances of the same model. Users can specify how many instances (copies) of a model to load and whether to run them on GPU or CPU. If the model is loaded on GPU, users can also select which GPUs to use. Learn more.
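The fragment below sketches an instance-group setting for the hypothetical `densenet_onnx` model; the counts and GPU index are illustrative:

```shell
# Load two copies of the model on GPU 0 plus one copy on CPU.
mkdir -p model_repository/densenet_onnx
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_CPU }
]
EOF
```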
Optimization Settings
The Model Configuration ModelOptimizationPolicy property is used to specify optimization and prioritization settings for a model. These settings control if/how a model is optimized by the backend and how it is scheduled and executed by Triton. See the ModelConfig Protobuf and Optimization Documentation for the currently available settings.
Scheduling and Batching
Triton supports batching individual inference requests together to improve compute resource utilization. This is extremely important, as individual requests typically do not saturate GPU resources and thus fail to fully exploit the parallelism GPUs provide. Learn more about Triton's Batcher and Scheduler.
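Dynamic batching is enabled per model in its configuration. A sketch for the hypothetical `densenet_onnx` model; the batch sizes and queue delay are illustrative values to tune:

```shell
# Let Triton combine individual requests into batches, waiting up to
# 100 microseconds to form a preferred batch of 4 or 8.
mkdir -p model_repository/densenet_onnx
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
EOF
```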
Rate Limiter
The rate limiter manages the rate at which requests are scheduled on model instances by Triton. It operates across all models loaded in Triton, enabling cross-model prioritization. Learn more.
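Rate-limiter settings live inside an `instance_group` entry. A sketch written to a scratch file here for illustration; in practice it belongs in the model's `config.pbtxt`, and the resource name, count, and priority are assumptions:

```shell
# Fragment illustrating rate-limiter settings; rate limiting also has to
# be enabled when starting the server (e.g. --rate-limit=execution_count).
cat > rate_limiter_fragment.pbtxt <<'EOF'
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "R1", count: 4 } ]
      priority: 2
    }
  }
]
EOF
```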
Model Warmup
For a few of the backends (check Additional Resources), some or all of the initialization is deferred until the first inference request is received. This conserves resources but means the initial requests are processed more slowly than expected. Users can "warm up" the model ahead of time by instructing Triton to initialize it. Learn more.
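Warmup is configured per model by describing one or more synthetic requests. A sketch for the hypothetical `densenet_onnx` model; the tensor name, type, and dims are assumptions that must match the model's real inputs:

```shell
# Send one zero-filled sample through the model at load time.
mkdir -p model_repository/densenet_onnx
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
model_warmup [
  {
    name: "zero_sample"
    batch_size: 1
    inputs {
      key: "data_0"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
EOF
```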
Inference Request/Response Cache
Triton can cache inference responses, so that repeated identical requests can be served without re-executing the model. Learn more.
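Caching is opted into per model. A sketch for the hypothetical `densenet_onnx` model; note that a cache must also be configured when starting the server (the flag shown in the comment is an assumption to verify against your Triton version):

```shell
# Enable response caching for this model; the server is assumed to be
# started with a cache configured, e.g. --cache-config=local,size=1048576.
mkdir -p model_repository/densenet_onnx
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
response_cache { enable: true }
EOF
```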
Model Pipeline
Building ensembles is as easy as adding an additional configuration file which outlines the specific flow of tensors from one model to another. Any additional changes required by the model ensemble can be made in the existing (individual) model configurations.
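A sketch of such an ensemble configuration, wiring a hypothetical `preprocess` model into the hypothetical `densenet_onnx` model (all model, tensor, and file names are illustrative):

```shell
# An ensemble "model" has no weights of its own: it needs an (empty)
# version directory and a config that maps tensors between steps.
mkdir -p model_repository/ensemble_example/1
cat > model_repository/ensemble_example/config.pbtxt <<'EOF'
name: "ensemble_example"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "CLASS_PROB", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT", value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "preprocessed" }
    },
    {
      model_name: "densenet_onnx"
      model_version: -1
      input_map { key: "data_0", value: "preprocessed" }
      output_map { key: "fc6_1", value: "CLASS_PROB" }
    }
  ]
}
EOF
```

The `input_map`/`output_map` entries connect each step's tensors to either the ensemble's own inputs/outputs or to intermediate tensors such as `preprocessed`.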
Model Management
Users can specify policies in the model configuration for loading and unloading of models. This section covers user selectable policy details.
Metrics
Triton provides Prometheus metrics like GPU Utilization, Memory Usage, Latency and more. Learn about available metrics.
Framework Custom Operations
Some frameworks provide the option of building custom layers/operations. These can be added to the specific Triton Backends for those frameworks. Learn more.
Client Libraries and Examples
Use the Triton Client API to integrate client applications over the network HTTP/gRPC API or integrate applications directly with Triton using CUDA shared memory to remove network overhead.
- C++ HTTP/GRPC Libraries
- Python HTTP/GRPC Libraries
- Java HTTP Library
- GRPC Generated Libraries
- Shared Memory Extension
Cancelling Inference Requests
Triton can detect and handle requests that have been cancelled from the client side. This document discusses the scope and limitations of the feature.
Performance Analysis
Understanding inference performance is key to better resource utilization. Use Triton's tools to customize your deployment.
Jetson and JetPack
Triton can be deployed on edge devices. Explore resources and examples.
Resources
The following resources are recommended to explore the full suite of Triton Inference Server's functionalities.
- Clients: Triton Inference Server comes with C++, Python, and Java APIs with which users can send HTTP/REST or gRPC requests (with possible extensions for other languages). Explore the client repository for examples and documentation.
- Configuring Deployment: Triton comes with three tools which can be used to configure deployment settings, measure performance, and recommend optimizations.
- Model Analyzer: Model Analyzer is a CLI tool built to recommend deployment configurations for Triton Inference Server based on the user's Quality of Service requirements. It also generates detailed reports about model performance to summarize the benefits and trade-offs of different configurations.
- Perf Analyzer: Perf Analyzer is a CLI application built to generate inference requests and measure the latency of those requests and the throughput of the model being served.
- Model Navigator: The Triton Model Navigator is a tool that automates the process of moving a model from source to the optimal format and configuration for deployment on Triton Inference Server. The tool supports exporting a model from source to all possible formats and applies Triton Inference Server backend optimizations.
- Backends: Triton supports a wide variety of frameworks used to run models. Users can extend this functionality by creating custom backends.
- PyTorch: Widely used Open Source DL Framework
- TensorRT: NVIDIA TensorRT is an inference acceleration SDK that provides a range of graph optimizations, kernel optimizations, use of lower precision, and more.
- ONNX: ONNX Runtime is a cross-platform inference and training machine-learning accelerator.
- OpenVINO: OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference.
- PaddlePaddle: Widely used Open Source DL Framework
- Python: Users can add custom business logic, or any python code/model for serving requests.
- Forest Inference Library: Backend built for forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML)
- DALI: NVIDIA DALI is a Data Loading Library purpose-built to accelerate the pre-processing and data-loading steps in a Deep Learning pipeline.
- HugeCTR: HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and to estimate Click-Through Rates.
- Managed Stateful Models: This backend automatically manages the input and output states of a model. The states are associated with a sequence ID and need to be tracked across inference requests belonging to that sequence.
- Faster Transformer: NVIDIA FasterTransformer (FT) is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner.
- Building Custom Backends
- Sample Custom Backend (repeat_backend): Backend built to demonstrate sending zero, one, or multiple responses per request.
Customization Guide
This guide describes how to build and test Triton and also how Triton can be extended with new functionality.