AOCL-DLP (Deep Learning Primitives)

April 6, 2026

AOCL-DLP is a library of optimized deep learning primitives for AMD processors. It implements low-precision GEMM and batch GEMM for deep learning applications, supporting multiple data types as well as pre-operations and post-operations. The library is tuned for AMD hardware and targets efficient, scalable deep learning workloads.

Features of AOCL-DLP

  • Highly Optimized GEMM Operations: Implements high-performance matrix multiplication operations targeting AMD CPUs with specialized instruction sets (AVX2, AVX512, AVX512_VNNI, AVX512_BF16, AVX512_FP16)
  • Multiple Data Type Support: Works with various precision formats including FP32, FP16, BF16, INT8, INT4 for efficient model training and inference
  • Pre-operations and Post-operations: Supports zero-point and scale pre-operations, plus bias, scale, element-wise, and matrix add/multiply post-operations
  • Batch GEMM Support: Optimized for handling multiple GEMM operations simultaneously
  • Symmetric Quantization Support: Provides specialized routines for symmetric quantization
  • Extensive Thread Support: Optimized for parallel execution via OpenMP
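
To make the symmetric quantization feature concrete, here is a minimal sketch of per-tensor symmetric int8 quantization in plain C. It is not an AOCL-DLP API: the function name `sym_quantize_s8`, the per-tensor granularity, and the [-127, 127] range are assumptions chosen for illustration. "Symmetric" means the zero point is fixed at 0, so only a scale is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of AOCL-DLP): per-tensor symmetric
 * quantization of float data to int8 with the zero point fixed at 0.
 * Returns the scale, so that x[i] is approximately scale * q[i]. */
float sym_quantize_s8(const float *x, size_t n, int8_t *q)
{
    assert(x != NULL && q != NULL);

    /* Largest magnitude: [-max_abs, max_abs] is mapped onto [-127, 127]. */
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }

    /* Avoid dividing by zero for an all-zero tensor. */
    float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;

    for (size_t i = 0; i < n; ++i) {
        float v = x[i] / scale;
        /* Round half away from zero, then saturate to [-127, 127]. */
        int32_t r = (int32_t)(v + (v >= 0.0f ? 0.5f : -0.5f));
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}
```

Because the zero point is 0, a quantized GEMM needs no zero-point correction terms; the result is rescaled by the product of the input scales.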

Data Type Terminology

Terminology  Description
-----------  --------------------
u4/s4        uint4_t or int4_t
u8/s8        uint8_t or int8_t
u32/s32      uint32_t or int32_t
f32          float32
f16          float16
bf16         bfloat16

Supported Data Types

AOCL-DLP provides support for various data type combinations for GEMM operations:

Input A   Input B  Output C            Accumulator  Function Suffix
--------  -------  ------------------  -----------  -------------------------------------
u8/s8     s8       s32/s8/u8/f32/bf16  s32          <u8|s8>s8s32o<s32|s8|u8|f32|bf16>
bf16/f32  s8       s32/s8/u8/f32/bf16  s32          <bf16|f32>s8s32o<s32|s8|u8|f32|bf16>
bf16      s4/u4    f32/bf16            f32          bf16<s4|u4>f32o<f32|bf16>
bf16      bf16     f32/bf16            f32          bf16bf16f32o<f32|bf16>
f32       f32      f32                 f32          f32f32f32of32
f16       f16      f16                 f16          f16f16f16of16
u8        s4       s32                 s32          u8s4s32os32

Notes:

  • u8s4s32os32 only has reorder and get_reorder_buf_size APIs (no gemm)
  • s8s8s32o<f32|bf16> also has _sym_quant variants for symmetric quantization
  • Mixed-precision reorder: f32obf16 (converts f32 input to bf16 reordered output)
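
Reading the table, the function suffix appears to encode, in order, the type of A, the type of B, the accumulator type, the letter `o`, and the output type. The sketch below spells out what the `u8s8s32os32` combination computes, using a naive row-major loop. The function name `ref_gemm_u8s8s32os32` and its signature are hypothetical and only illustrate the arithmetic; they are not the library's API, which also involves reordering and optimized kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference semantics only (hypothetical name, not an AOCL-DLP function).
 * Row-major C[m x n] = A[m x k] * B[k x n], with
 *   A: uint8_t, B: int8_t, accumulator: int32_t, output: int32_t. */
void ref_gemm_u8s8s32os32(size_t m, size_t n, size_t k,
                          const uint8_t *a, const int8_t *b, int32_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0; /* the s32 accumulator from the table */
            for (size_t p = 0; p < k; ++p) {
                /* Widen both operands before multiplying to avoid overflow. */
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = acc; /* output type equals accumulator type here */
        }
    }
}
```

For the other output types in the same row (s8, u8, f32, bf16), the s32 accumulator would be converted after the loop, typically as part of the post-operations.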

Pre-Operations

AOCL-DLP supports the following pre-operations:

Pre-Op Type   Description
------------  ---------------------------------------------------------------
Zero Point    Input tensor zero point compensation for quantized operations
Scale Factor  Input tensor scaling for quantized operations

These pre-operations support different storage types:

  • AOCL_GEMM_F32 (float)
  • AOCL_GEMM_BF16 (bfloat16)
  • AOCL_GEMM_INT8 (int8_t)
  • AOCL_GEMM_UINT8 (uint8_t)
  • AOCL_GEMM_INT32 (int32_t)
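
As an illustration of what zero point and scale factor mean for a quantized input, the following sketch applies the usual affine dequantization formula, real = scale * (q - zero_point), to an int8 buffer. The function name `ref_dequantize_s8` and the per-tensor granularity are assumptions; how AOCL-DLP applies these pre-operations internally (for example per group or per channel) is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper, not an AOCL-DLP API.
 * Applies a per-tensor zero point and scale factor to int8 data:
 *   out[i] = scale * (q[i] - zero_point) */
void ref_dequantize_s8(const int8_t *q, size_t n,
                       int8_t zero_point, float scale, float *out)
{
    assert(q != NULL && out != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Subtract in int32 so that the difference cannot overflow int8. */
        int32_t centered = (int32_t)q[i] - (int32_t)zero_point;
        out[i] = scale * (float)centered;
    }
}
```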

Post-Operations

AOCL-DLP supports the following post-operations:

Post-Op Type  Description
------------  ---------------------------------------------------------
ELTWISE       Element-wise operations including activation functions
BIAS          Bias addition to result
SCALE         Scaling operation
MATRIX_ADD    Matrix addition with optional scaling
MATRIX_MUL    Matrix multiplication with optional scaling
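
Post-operations are applied to the GEMM result before it is written out. The sketch below chains four of them on an s32 accumulator buffer: BIAS, SCALE, an ELTWISE ReLU, and MATRIX_ADD with a scaling factor. The function name, the argument layout, and in particular the order of the operations are assumptions made for illustration; they do not describe the library's API or its default ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference chain, not an AOCL-DLP function.
 * acc:    m x n row-major s32 GEMM result
 * bias:   length-n vector added to every row
 * addend: m x n matrix for MATRIX_ADD, or NULL to skip that step */
void ref_post_ops(const int32_t *acc, size_t m, size_t n,
                  const float *bias, float scale,
                  const float *addend, float add_scale, float *out)
{
    assert(acc != NULL && bias != NULL && out != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float v = (float)acc[i * n + j];
            v += bias[j];                 /* BIAS */
            v *= scale;                   /* SCALE */
            if (v < 0.0f) v = 0.0f;       /* ELTWISE: ReLU */
            if (addend != NULL)           /* MATRIX_ADD with scaling */
                v += add_scale * addend[i * n + j];
            out[i * n + j] = v;
        }
    }
}
```

Fusing these steps into the GEMM avoids extra passes over the output matrix, which is the usual motivation for post-operation support.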

Eltwise Algorithm Types

The following eltwise algorithm types are supported:

Eltwise Type  Description
------------  ------------------------------------------------------
RELU          Rectified Linear Unit activation
PRELU         Parametric Rectified Linear Unit activation
GELU_TANH     Gaussian Error Linear Unit (tanh approximation)
GELU_ERF      Gaussian Error Linear Unit (erf-based formulation)
CLIP          Clipping values to a specified range
SWISH         Swish activation function
TANH          Hyperbolic tangent activation
SIGMOID       Sigmoid activation function

Utility Functions

Standalone utility functions include:

Utility Function  Description
----------------  -----------------------------------------------------
gelu_tanh_f32     GELU activation with tanh approximation for float
gelu_erf_f32      GELU activation using the erf formulation for float
softmax_f32       Softmax function for float

Eltwise Operations

The library provides specialized element-wise operations:

Eltwise Operation  Description
-----------------  -----------------------------------
bf16of32           bfloat16 input to float output
bf16obf16          bfloat16 input to bfloat16 output
f32of32            float input to float output
f32obf16           float input to bfloat16 output
f32os32            float input to int32_t output
f32os8             float input to int8_t output
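
The bf16 variants imply a conversion between float and bfloat16, which keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of an IEEE-754 float. The sketch below shows one common way to convert in software, using round-to-nearest-even. The function names are hypothetical, and the rounding mode is an assumption; the library's kernels may rely on hardware conversion instructions instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not AOCL-DLP functions. bf16 values are carried
 * as raw uint16_t bit patterns. */

/* float -> bf16 with round-to-nearest-even on the discarded 16 bits. */
uint16_t ref_f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* type-pun without undefined behavior */

    /* NaN: truncate and force a mantissa bit so the result stays a NaN. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    uint32_t lsb = (bits >> 16) & 1u; /* lowest bit that will be kept */
    bits += 0x7fffu + lsb;            /* ties go to the even neighbor */
    return (uint16_t)(bits >> 16);
}

/* bf16 -> float is exact: place the 16 bits in the upper half. */
float ref_bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```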

Hardware Requirements

AOCL-DLP is optimized for AMD processors and requires specific minimum architecture support based on the functions being used:

Minimum Architecture Requirements

Function Type       Minimum Required ISA   Available On
------------------  ---------------------  -------------------------------------------------
f32 (float)         AVX2/FMA3              AMD Zen1 and newer, Intel Haswell and newer
bf16 (bfloat16)     AVX2/FMA3              AMD Zen1 and newer, Intel Haswell and newer
bf16 (bfloat16)     AVX512_BF16 (optimal)  AMD Zen4 and newer, Intel Cooper Lake and newer
int8 (int8, uint8)  AVX512_VNNI            AMD Zen4 and newer, Intel Cascade Lake and newer

While optimized for AMD processors, the library is compatible with any x86_64 CPU that meets these minimum requirements. For best performance on AMD processors, Zen4 or newer is recommended, since those architectures support both AVX512_VNNI and AVX512_BF16.
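
An application can check these requirements at run time before calling into the library. The sketch below is one way to do that with the GCC/Clang builtin `__builtin_cpu_supports`; it assumes a reasonably recent compiler (the `avx512bf16` feature name needs GCC 10 or later). The types and `pick_*` functions are hypothetical and only mirror the table above; they are not how AOCL-DLP itself dispatches.

```c
#include <stdbool.h>

/* Hypothetical types and helpers that mirror the requirements table. */
typedef struct {
    bool avx2, fma, avx512_vnni, avx512_bf16;
} cpu_features_t;

typedef enum {
    PATH_UNSUPPORTED,
    PATH_AVX2_FMA,
    PATH_AVX512_BF16,
    PATH_AVX512_VNNI
} isa_path_t;

/* f32 needs AVX2 and FMA3. */
isa_path_t pick_f32_path(const cpu_features_t *f)
{
    return (f->avx2 && f->fma) ? PATH_AVX2_FMA : PATH_UNSUPPORTED;
}

/* bf16 prefers AVX512_BF16 and falls back to AVX2/FMA3. */
isa_path_t pick_bf16_path(const cpu_features_t *f)
{
    if (f->avx512_bf16) return PATH_AVX512_BF16;
    return (f->avx2 && f->fma) ? PATH_AVX2_FMA : PATH_UNSUPPORTED;
}

/* int8 needs AVX512_VNNI. */
isa_path_t pick_int8_path(const cpu_features_t *f)
{
    return f->avx512_vnni ? PATH_AVX512_VNNI : PATH_UNSUPPORTED;
}

/* Queries the running CPU; reports no features on non-x86 builds. */
cpu_features_t detect_cpu_features(void)
{
    cpu_features_t f = { false, false, false, false };
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.avx2        = __builtin_cpu_supports("avx2") != 0;
    f.fma         = __builtin_cpu_supports("fma") != 0;
    f.avx512_vnni = __builtin_cpu_supports("avx512vnni") != 0;
    f.avx512_bf16 = __builtin_cpu_supports("avx512bf16") != 0;
#endif
    return f;
}
```

Keeping detection separate from the selection logic makes the selection easy to test with hand-built feature sets, for example a Zen1-like CPU without any AVX512 support.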

Build

Refer to BUILD.md for detailed build instructions, including support for both GNU Make and Ninja.

Install

Refer to INSTALL.md for installation steps, including default and custom prefix install.

Wiki

Refer to the Wiki for more information.

License

AOCL-DLP is licensed under the terms and conditions as specified in the LICENSE file.