Triton OpenAI

March 10, 2025

A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.

Official Documentation

Official Triton Documentation

My daily challenge (Triton day by day)

This project is a step-by-step learning journey where we implement various types of Triton kernels—from the simplest examples to more advanced applications—while exploring GPU programming with Triton. The goal of this repository is to help you (and others) get comfortable with Triton by:

Starting simple: begin with basic kernels such as vector addition, and understand the building blocks of writing GPU code with Triton.
Incremental learning: each day introduces a new challenge, progressively covering more complex topics, techniques, and optimizations.
Hands-on experience: code, test, and benchmark your kernels against standard implementations (e.g., PyTorch) to see performance improvements and better understand GPU behavior.

Daily challenges: every day, a new challenge is posted in this repository. Each challenge focuses on a specific aspect of Triton, such as:

Basic operations (e.g., vector addition)
Memory management and optimizations
Advanced indexing and dynamic shapes
Multi-dimensional kernels
Reduction operations and more
Detailed explanations: each kernel comes with an in-depth explanation of the code, helping you understand the concepts behind the implementation.
Benchmarking and stress tests: learn how to measure performance by comparing custom Triton kernels with standard PyTorch implementations. Get hands-on experience with benchmarking on real-world GPU workloads.
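
The "test, then benchmark against a reference" workflow described above can be sketched without a GPU. The block below is a plain-Python stand-in, not code from the challenges: `constant_add_reference` plays the role of the PyTorch baseline, `constant_add_blocked` plays the role of the custom kernel, and `best_time` is a hypothetical timing helper. All three names and the block size are assumptions made for illustration.

```python
import time

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for readability


def constant_add_reference(x, c):
    """Reference implementation, standing in for the PyTorch baseline."""
    return [v + c for v in x]


def constant_add_blocked(x, c, block_size=BLOCK_SIZE):
    """Candidate: processes the vector block by block, like a kernel grid."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_blocks):
        for offset in range(pid * block_size, (pid + 1) * block_size):
            if offset < n:  # guard the partially filled last block
                out[offset] = x[offset] + c
    return out


def best_time(fn, *args, repeats=5):
    """Return the fastest wall-clock time (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


x = [float(i) for i in range(10)]
expected = constant_add_reference(x, 3.0)
actual = constant_add_blocked(x, 3.0)
assert actual == expected  # check correctness before timing anything

t_ref = best_time(constant_add_reference, x, 3.0)
t_blocked = best_time(constant_add_blocked, x, 3.0)
print(f"reference: {t_ref:.6f}s, blocked: {t_blocked:.6f}s")
```

On a real GPU, kernel launches are asynchronous, so wall-clock timing like this needs a synchronization step before the clock is read. Triton provides `triton.testing.do_bench` for this kind of measurement.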

Day	Kernel	Description
#1	Constant add	This challenge is the first puzzle in our Daily Triton Challenge series. The goal is to write a Triton kernel that adds a constant value to each element of a vector.
#2	Add two vectors	A simple example of how to add two vectors using a custom GPU kernel written in Triton and compare the result to a standard PyTorch implementation.
#3	Add two vectors with speed benchmarking	This is almost the same as #2, but we measure kernel execution speed and compare it to the PyTorch implementation.
#4	ReLU Activation with Triton	In this challenge, you will implement the ReLU (Rectified Linear Unit) activation function using Triton. ReLU is defined as: ReLU(x)=max(0,x)
#5	ReLU Activation forward and backward	In this challenge, you will implement the ReLU activation function in a way that is fully compatible with PyTorch’s autograd. That means you’ll write a custom autograd function that uses a Triton kernel for the forward pass (computing y = max(0, x)) and a second Triton kernel for the backward pass (computing the gradient of ReLU, where grad_input = grad_output if x > 0 and 0 otherwise).
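
The ReLU rules from challenges #4 and #5 can be checked on the CPU before writing any GPU code. The sketch below is plain Python, not Triton and not the repository's solution. It emulates the usual launch pattern for element-wise kernels: each program instance handles one block of offsets and masks out the lanes that fall past the end of the vector. The function names, `BLOCK_SIZE`, and the `launch` helper are hypothetical.

```python
# CPU-only emulation of the launch pattern used by element-wise kernels.
# Plain Python: no Triton or GPU required. All names here are hypothetical.

BLOCK_SIZE = 4  # illustrative; real kernels usually use larger powers of two


def cdiv(n, block):
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block


def relu_forward_program(pid, x, y, n):
    """One emulated program instance: y = max(0, x) for its block."""
    start = pid * BLOCK_SIZE
    for offset in range(start, start + BLOCK_SIZE):
        if offset < n:  # mask: skip lanes past the end of the vector
            y[offset] = max(0.0, x[offset])


def relu_backward_program(pid, x, grad_out, grad_in, n):
    """One emulated program instance: grad_in = grad_out if x > 0 else 0."""
    start = pid * BLOCK_SIZE
    for offset in range(start, start + BLOCK_SIZE):
        if offset < n:  # same mask as in the forward pass
            grad_in[offset] = grad_out[offset] if x[offset] > 0 else 0.0


def launch(program, n, *buffers):
    """Run one program instance per block, like a 1D launch grid."""
    for pid in range(cdiv(n, BLOCK_SIZE)):
        program(pid, *buffers, n)


x = [-2.0, -0.5, 0.0, 1.5, 3.0, -1.0]
n = len(x)

y = [0.0] * n
launch(relu_forward_program, n, x, y)

grad_out = [1.0] * n  # pretend the upstream gradient is all ones
grad_in = [0.0] * n
launch(relu_backward_program, n, x, grad_out, grad_in)

print(y)        # forward result
print(grad_in)  # gradient with respect to x
```

With six elements and a block size of four, the second block is only half full, which is the case the mask exists to handle. In an autograd-compatible version, the forward pass would also need to save `x` (or the mask `x > 0`) so the backward pass can reuse it.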

Articles

Gain deeper insights into Triton through these detailed articles:

Understanding the Triton Tutorials Part 1 and Part 2
Softmax in OpenAI Triton -> more detailed Fused Softmax Triton example explanation (step-by-step)
Accelerating AI with Triton: A Deep Dive into Writing High-Performance GPU Code
Accelerating Triton Dequantization Kernels for GPTQ
Triton Tutorial #2
Triton: OpenAI’s Innovative Programming Language for Custom Deep-Learning Primitives
Triton Kernel Compilation Stages
Deep Dive into Triton Internals Part 1, Part 2 and Part 3
Exploring Triton GPU programming for neural networks in Java
Using User-Defined Triton Kernels with torch.compile
Mamba: The Hard Way
FP8: Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
FP8: Deep Dive on CUTLASS Ping-Pong GEMM Kernel
FP8: Deep Dive on the Hopper TMA Unit for FP8 GEMMs
Technical Review on PyTorch2.0 and Triton
Towards Agile Development of Efficient Deep Learning Operators
Developing Triton Kernels on AMD GPUs
CUDA-Free Inference for LLMs
Enabling advanced GPU features in PyTorch - Warp Specialization - Fully automated warp specialization in Triton.
Teaching AI to Write GPU Code: A Deep Dive into Reinforcement Fine-Tuning

Blackwell and Triton

Accelerating the Future: Triton on Blackwell Architecture
OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability - Triton compiler now supports the NVIDIA Blackwell architecture.
Running PyTorch and Triton on the RTX 5080

Research Papers

Explore the academic foundation of Triton:

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

Videos

Learn by watching these informative videos:

Triton community meetup

Watch the Triton community meetups to stay up to date with recent Triton topics.

2024-11-09

Triton-Puzzles

Challenge yourself with these engaging puzzles:

Tools

Enhance your Triton development workflow with these tools:

Triton Deja-vu - A framework to reduce the autotune overhead of triton-lang to zero for well-known deployments. This small framework is based on the Triton autotuner and contributes two features to the Triton community: (1) storing and safely restoring autotuner states using JSON files, and (2) ConfigSpaces to explore a defined space exhaustively. Additionally, it allows heuristics to be used in combination with the autotuner.
Triton Profiler, and a video explaining how to use it: Dev Tools: Proton/Interpreter
Triton-Viz: A Visualization Toolkit for Programming with Triton
Make Triton easier - Triton-util provides simple higher-level abstractions for frequent but repetitive tasks. This allows you to write code that is closer to how you actually think.
TritonBench is a collection of PyTorch operators used to evaluate the performance of Triton and its integration with PyTorch.
TritonBench features two distinct channels: TritonBench-G and TritonBench-T, each with its own evaluation framework.

Conferences

Catch up on the latest advancements from Triton Conferences:

Sample Kernels

Explore practical implementations with these sample kernels:

attorch is a subset of PyTorch's nn module, written purely in Python using OpenAI's Triton
FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
Linger-Kernel
Triton Kernels for Efficient Low-Bit Matrix Multiplication
Unsloth Kernels
This is an attempt at implementing a Triton kernel for GPTQ inference. This code is based on the GPTQ-for-LLaMa codebase, which is itself based on the GPTQ codebase.
triton-index - Catalog openly available Triton kernels
Triton-based implementation of Sparse Mixture-of-Experts (SMoE) on GPUs
Variety of Triton and CUDA kernels for training and inference
EquiTriton is a project that seeks to implement high-performance kernels for commonly used building blocks in equivariant neural networks, enabling compute-efficient training and inference.
Expanded collection of Neural Network activation functions and other function kernels in Triton by OpenAI.
Fused kernels
Triton activations only feed forward
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance
Bitsandbytes - A lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions
MInference Triton Kernels - FlashAttention
GridQuant - This repository tries to implement the ideas presented in the blog post "Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton". Designed specifically for NVIDIA H100 GPUs, it leverages advanced features like float8 computation, Triton's high-performance GPU programming capabilities, and the Tensor Memory Accelerator (TMA).
Efficient Triton implementations for Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention.

Triton integrations

jax-triton

Triton backends

Intel® XPU Backend for Triton

Triton communities

CUDA-MODE

Triton Kernel Index

Kernel	Description	Resource
VectorAdd	A simple kernel that performs element-wise addition of two vectors. Useful for understanding the basics of GPU programming in Triton.	1 2
Matmul	An optimized kernel for matrix multiplication, achieving high performance by leveraging memory hierarchy and parallelism.	1 2 Grouped GEMM
Softmax	A kernel for efficient computation of the softmax function, commonly used in machine learning models like transformers.	1 2 3
Dropout	A kernel for implementing low-memory dropout, a regularization technique to prevent overfitting in neural networks.	1 2
Layer Normalization	A kernel for layer normalization, which normalizes activations within a layer to improve training stability in deep learning models.	1 2 3
Fused Attention	A kernel that efficiently implements attention mechanisms by combining multiple operations, key to transformers and similar architectures.	1 2
Conv1d	A kernel for 1D convolution, often used in processing sequential data like time series or audio signals.	1
Conv2d	A kernel for 2D convolution, a fundamental operation in computer vision tasks such as image classification or object detection.	1
MultiheadAttention	A kernel for multi-head attention, a crucial component in transformer-based models for capturing complex relationships in data.	1
Hardsigmoid	A kernel for the Hardsigmoid activation function, an efficient approximation of the sigmoid function used in certain neural network layers.	1
GeLU	GeLU	1
GeGLU	GeGLU	1
RMSNorm	RMSNorm	1

Triton updates, news, new features

Automatic Warp Specialization Optimization

Contribution

Feel free to contribute more resources or suggest updates by opening a pull request or issue in this repository.

License

This resource list is open-sourced under the MIT license.