Triton OpenAI

March 10, 2025

A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.

Official Documentation

My daily challenge (Triton day by day)

This project is a step-by-step learning journey where we implement various types of Triton kernels—from the simplest examples to more advanced applications—while exploring GPU programming with Triton. The goal of this repository is to help you (and others) get comfortable with Triton by:

  • Starting simple: begin with basic kernels such as vector addition, and understand the building blocks of writing GPU code with Triton.
  • Incremental learning: each day introduces a new challenge, progressively covering more complex topics, techniques, and optimizations.
  • Hands-on experience: code, test, and benchmark your kernels against standard implementations (e.g., PyTorch) to see performance improvements and better understand GPU behavior.
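As a concrete starting point, a vector-addition kernel and its check against PyTorch might look roughly like the sketch below. It is not the repository's code: the names `add_kernel`, `triton_add`, `emulated_add`, and the block size of 1024 are assumptions chosen for illustration. The Triton path is only defined when PyTorch, Triton, and a CUDA device are available; the plain-Python emulation mirrors the same program-id, offset, and mask logic so the idea can be followed without a GPU.

```python
# Illustrative sketch only: names and BLOCK_SIZE are assumptions, not repository code.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_GPU_TRITON = torch.cuda.is_available()
except Exception:  # PyTorch or Triton missing: only the CPU emulation is usable
    HAVE_GPU_TRITON = False


def emulated_add(x, y, block_size=4):
    """Pure-Python emulation of the launch grid: one outer iteration per program."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division, like triton.cdiv
    for pid in range(num_programs):
        for lane in range(block_size):
            offset = pid * block_size + lane
            if offset < n:  # plays the role of the load/store mask
                out[offset] = x[offset] + y[offset]
    return out


if HAVE_GPU_TRITON:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                    # which block this program handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                    # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def triton_add(x, y):
        out = torch.empty_like(x)
        n_elements = out.numel()
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
        return out

    if __name__ == "__main__":
        x = torch.rand(1 << 20, device="cuda")
        y = torch.rand(1 << 20, device="cuda")
        torch.testing.assert_close(triton_add(x, y), x + y)  # correctness first
        # do_bench reports a runtime in milliseconds
        print("triton ms:", triton.testing.do_bench(lambda: triton_add(x, y)))
        print("torch  ms:", triton.testing.do_bench(lambda: x + y))
```

For an operation this simple, the two timings are likely to be close, since element-wise addition is limited by memory bandwidth rather than compute.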

Daily challenges: every day, a new challenge is posted in this repository. Each challenge focuses on a specific aspect of Triton, such as:

  • Basic operations (e.g., vector addition)
  • Memory management and optimizations
  • Advanced indexing and dynamic shapes
  • Multi-dimensional kernels
  • Reduction operations and more

Detailed explanations: each kernel comes with an in-depth explanation of the code, helping you understand the concepts behind the implementation.

Benchmarking and stress tests: learn how to measure performance by comparing custom Triton kernels with standard PyTorch implementations. Get hands-on experience with benchmarking on real-world GPU workloads.

| Day | Kernel | Description |
| --- | --- | --- |
| #1 | Constant add | This challenge is the first puzzle in our Daily Triton Challenge series. The goal is to write a Triton kernel that adds a constant value to each element of a vector. |
| #2 | Add two vectors | A simple example that adds two vectors using a custom GPU kernel written in Triton and compares the result to a standard PyTorch implementation. |
| #3 | Add two vectors with speed benchmarking | This is almost the same as #2, but we measure kernel execution speed and compare it to the PyTorch implementation. |
| #4 | ReLU Activation with Triton | In this challenge, you will implement the ReLU (Rectified Linear Unit) activation function using Triton. ReLU is defined as: ReLU(x) = max(0, x). |
| #5 | ReLU Activation forward and backward | In this challenge, you will implement the ReLU activation function in a way that is fully compatible with PyTorch’s autograd. That means you’ll write a custom autograd function that uses a Triton kernel for the forward pass (computing y = max(0, x)) and a second Triton kernel for the backward pass (computing the gradient of ReLU, where grad_input = grad_output if x > 0 and 0 otherwise). |
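The forward/backward pairing described for challenge #5 could be sketched as follows. This is one possible shape under stated assumptions, not the repository's solution: the names `relu_fwd_kernel`, `relu_bwd_kernel`, `TritonReLU`, and the block size of 1024 are hypothetical. The two reference functions are plain Python and encode the same rules (y = max(0, x), and grad_input = grad_output where x > 0, else 0), so they can serve as a check on any GPU implementation.

```python
# Illustrative sketch only: kernel and class names are assumptions, not repository code.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_GPU_TRITON = torch.cuda.is_available()
except Exception:  # PyTorch or Triton missing: only the references below are usable
    HAVE_GPU_TRITON = False


def relu_forward_reference(xs):
    """y = max(0, x), element by element."""
    return [max(0.0, v) for v in xs]


def relu_backward_reference(xs, grad_out):
    """grad_input = grad_output where x > 0, otherwise 0."""
    return [g if v > 0 else 0.0 for v, g in zip(xs, grad_out)]


if HAVE_GPU_TRITON:

    @triton.jit
    def relu_fwd_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        offsets = tl.program_id(axis=0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(y_ptr + offsets, tl.maximum(x, 0.0), mask=mask)

    @triton.jit
    def relu_bwd_kernel(x_ptr, grad_out_ptr, grad_in_ptr, n_elements,
                        BLOCK_SIZE: tl.constexpr):
        offsets = tl.program_id(axis=0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        g = tl.load(grad_out_ptr + offsets, mask=mask)
        tl.store(grad_in_ptr + offsets, tl.where(x > 0, g, 0.0), mask=mask)

    class TritonReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            x = x.contiguous()               # kernels assume flat, contiguous memory
            y = torch.empty_like(x)
            n = x.numel()
            grid = (triton.cdiv(n, 1024),)
            relu_fwd_kernel[grid](x, y, n, BLOCK_SIZE=1024)
            ctx.save_for_backward(x)         # the backward pass needs the sign of x
            return y

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            grad_output = grad_output.contiguous()
            grad_input = torch.empty_like(x)
            n = x.numel()
            grid = (triton.cdiv(n, 1024),)
            relu_bwd_kernel[grid](x, grad_output, grad_input, n, BLOCK_SIZE=1024)
            return grad_input
```

With a CUDA device available, `TritonReLU.apply(x)` on a tensor created with `requires_grad=True` should produce gradients that match `torch.relu(x)` after calling `.backward()` on a scalar loss.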

Articles

Gain deeper insights into Triton through these detailed articles:

Blackwell and Triton

Research Papers

Explore the academic foundation of Triton:

Videos

Learn by watching these informative videos:

Triton community meetup

Watch the Triton community meetups to stay up to date with recent Triton topics.

Triton-Puzzles

Challenge yourself with these engaging puzzles:

Tools

Enhance your Triton development workflow with these tools:

Conferences

Catch up on the latest advancements from Triton Conferences:

Sample Kernels

Explore practical implementations with these sample kernels:

Triton integrations

Triton backends

Triton communities


Triton Kernel Index

| Kernel | Description | Resource |
| --- | --- | --- |
| VectorAdd | A simple kernel that performs element-wise addition of two vectors. Useful for understanding the basics of GPU programming in Triton. | 1, 2 |
| Matmul | An optimized kernel for matrix multiplication, achieving high performance by leveraging memory hierarchy and parallelism. | 1, 2, Grouped GEMM |
| Softmax | A kernel for efficient computation of the softmax function, commonly used in machine learning models like transformers. | 1, 2, 3 |
| Dropout | A kernel for implementing low-memory dropout, a regularization technique to prevent overfitting in neural networks. | 1, 2 |
| Layer Normalization | A kernel for layer normalization, which normalizes activations within a layer to improve training stability in deep learning models. | 1, 2, 3 |
| Fused Attention | A kernel that efficiently implements attention mechanisms by combining multiple operations, key to transformers and similar architectures. | 1, 2 |
| Conv1d | A kernel for 1D convolution, often used in processing sequential data like time series or audio signals. | 1 |
| Conv2d | A kernel for 2D convolution, a fundamental operation in computer vision tasks such as image classification or object detection. | 1 |
| MultiheadAttention | A kernel for multi-head attention, a crucial component in transformer-based models for capturing complex relationships in data. | 1 |
| Hardsigmoid | A kernel for the Hardsigmoid activation function, an efficient approximation of the sigmoid function used in certain neural network layers. | 1 |
| GeLU | GeLU | 1 |
| GeGLU | GeGLU | 1 |
| RMSNorm | RMSNorm | 1 |
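To give a feel for what an entry in this index involves, here is a rough sketch of a row-wise softmax kernel in which each program handles one row. It is written from scratch for illustration and is not taken from any of the linked resources; `softmax_kernel`, `triton_softmax`, and `softmax_reference` are hypothetical names. The sketch assumes a 2D float32 tensor whose last dimension is contiguous and whose rows fit in a single block.

```python
# Illustrative sketch only: names are assumptions, not code from the linked resources.
import math

try:
    import torch
    import triton
    import triton.language as tl
    HAVE_GPU_TRITON = torch.cuda.is_available()
except Exception:  # PyTorch or Triton missing: only the reference below is usable
    HAVE_GPU_TRITON = False


def softmax_reference(row):
    """Numerically stable softmax of one row: subtract the max before exponentiating."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


if HAVE_GPU_TRITON:

    @triton.jit
    def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride, n_cols,
                       BLOCK_SIZE: tl.constexpr):
        row = tl.program_id(axis=0)                # one program per row
        cols = tl.arange(0, BLOCK_SIZE)
        mask = cols < n_cols
        # Padding lanes load -inf so they contribute exp(-inf) = 0 to the sum.
        x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask,
                    other=-float("inf"))
        x = x - tl.max(x, axis=0)                  # stability: largest exponent is 0
        numerator = tl.exp(x)
        denominator = tl.sum(numerator, axis=0)
        tl.store(out_ptr + row * out_row_stride + cols, numerator / denominator,
                 mask=mask)

    def triton_softmax(x):
        # Assumes x is 2D, float32, with a contiguous last dimension.
        n_rows, n_cols = x.shape
        out = torch.empty_like(x)
        block_size = triton.next_power_of_2(n_cols)  # tl.arange needs a power of two
        softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0), n_cols,
                                  BLOCK_SIZE=block_size)
        return out
```

Loading a whole row into one block keeps the max, exponentials, and sum on chip, so each element is read from and written to global memory only once; very wide rows would need a looped variant instead.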

Triton updates, news, new features

Contribution

Feel free to contribute more resources or suggest updates by opening a pull request or issue in this repository.


License

This resource list is open-sourced under the MIT license.