Optimizing Components of fastRAG on Intel Hardware

December 24, 2023

Models can be further optimized through software frameworks to improve latency and throughput. Software packages such as optimum-intel, developed by Intel and its partners, are designed to leverage the CPU extensions found in the most recent Intel processors. With the optimum-intel library, Transformer-based models can be quantized, sparsified, or improved through knowledge distillation.

Quantization

Quantization is a process that reduces both computational overhead and memory footprint during inference. It does so by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8), instead of the standard 32-bit floating-point numbers (float32). To facilitate these optimizations, frameworks like the Intel Extension for PyTorch and optimum-intel provide specialized support for the latest Intel CPU features.
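To make the float32-to-int8 mapping concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is an illustration only, not the scheme used by optimum-intel or the Intel Extension for PyTorch; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.423, -1.27, 0.058, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a single signed byte plus one shared scale, and the reconstruction error stays within half of the step size `scale`.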

Why should we optimize using quantization?

Representing each value with fewer bits yields a model that needs less memory, may consume less energy, and can run operations such as matrix multiplication faster by using integer arithmetic. For example, storing weights as int8 instead of float32 cuts their size by a factor of four.
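The two benefits named here, smaller storage and integer arithmetic, can be sketched in a few lines of standard-library Python. The parameter count, vectors, and helper names below are hypothetical, and real int8 kernels run in optimized CPU instructions rather than Python loops.

```python
from array import array

# Storage: bytes needed for a hypothetical one-million-parameter tensor.
n_params = 1_000_000
fp32_bytes = n_params * array("f").itemsize  # 4 bytes per float32 value
int8_bytes = n_params * array("b").itemsize  # 1 byte per int8 value
ratio = fp32_bytes / int8_bytes


def quantize(values):
    """Symmetric quantization to integer codes in [-127, 127] with one scale."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale


def int8_dot(a_codes, a_scale, b_codes, b_scale):
    """Dot product accumulated in integers, rescaled to float once at the end."""
    acc = sum(x * y for x, y in zip(a_codes, b_codes))  # integer-only arithmetic
    return acc, acc * a_scale * b_scale


a = [0.3, -1.0, 0.2]
b = [1.0, 0.6, -0.4]
exact = sum(x * y for x, y in zip(a, b))

a_codes, a_scale = quantize(a)
b_codes, b_scale = quantize(b)
acc, approx = int8_dot(a_codes, a_scale, b_codes, b_scale)
print(ratio, exact, approx)
```

The multiply-accumulate loop touches only integers; the two scales are applied once to the final sum, which is the same pattern that lets integer matrix multiplication stand in for the floating-point version at a small accuracy cost.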

Available Optimizations

| Optimization | Framework | Backend |
|---|---|---|
| LLM Quantization | `optimum-intel` | CPU |
| Bi-encoder Quantization | `optimum-intel` | CPU |
| Cross-encoder Quantization | `neural-compressor`, `ipex` | CPU |
| LlamaCPP LLMs | `llama_cpp` | CPU |