Uniform Quantization with Fine-Tuning

March 18, 2026

A uniform "fake" quantization method supports an arbitrary number of bits (>= 2) for representing weights and activations. The method performs differentiable sampling of the continuous signal (for example, activations or weights) during the forward pass, simulating inference with integer arithmetic.

Common Quantization Formula

Quantization is parametrized by clamping range and number of quantization levels. The sampling formula is the following:

$$ZP = \lfloor -input\_low * s \rceil$$

$$output = \frac{\left\lfloor (clamp(input; input\_low, input\_high) - input\_low) * s - ZP \right\rceil}{s}$$

$$clamp(input; input\_low, input\_high) = \min(\max(input, input\_low), input\_high)$$

$$s = \frac{levels - 1}{input\_high - input\_low}$$

$input\_low$ and $input\_high$ represent the quantization range and $\left\lfloor \cdot \right\rceil$ denotes rounding to the nearest integer.
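The common formula can be sketched in plain Python for a single scalar. This is only an illustration under stated assumptions: the function names `clamp` and `fake_quantize` are hypothetical, and Python's built-in `round` (which resolves ties to the nearest even integer) stands in for the rounding operator, whose tie-breaking rule is not specified here.

```python
def clamp(x, low, high):
    """Restrict x to the closed interval [low, high]."""
    return min(max(x, low), high)


def fake_quantize(x, input_low, input_high, levels):
    """Scalar sketch of the common quantization formula (hypothetical helper)."""
    s = (levels - 1) / (input_high - input_low)  # scale factor
    zp = round(-input_low * s)                   # zero point
    shifted = (clamp(x, input_low, input_high) - input_low) * s - zp
    return round(shifted) / s                    # back to the real-valued domain


if __name__ == "__main__":
    for value in (-1.0, 0.3, 2.0):
        print(value, fake_quantize(value, 0.0, 1.0, levels=5))
```

With `levels=5` on the range [0, 1], the representable outputs are multiples of 0.25, and values outside the range are clamped before rounding.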

Symmetric Quantization

During training, we optimize the scale parameter, which represents the range [input_low, input_high] of the original signal, using gradient descent:

$$input\_low = scale * \frac{level\_low}{level\_high}$$

$$input\_high = scale$$

In the formula above, $level\_low$ and $level\_high$ represent the range of the discrete signal.

  • For weights:

    $level\_low = -2^{bits-1} + 1$

    $level\_high = 2^{bits-1} - 1$

    $levels = 255$

  • For unsigned activations:

    $level\_low = 0$

    $level\_high = 2^{bits} - 1$

    $levels = 256$

  • For signed activations:

    $level\_low = -2^{bits-1}$

    $level\_high = 2^{bits-1} - 1$

    $levels = 256$

For all the cases listed above, the common quantization formula is simplified after substitution of $input\_low$, $input\_high$ and $levels$:

$$output = \left\lfloor clamp(input * \frac{level\_high}{scale}, level\_low, level\_high) \right\rceil * \frac{scale}{level\_high}$$
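The level ranges and the simplified symmetric formula can be sketched as follows. The helper names `level_range` and `symmetric_fake_quantize` and the string labels for the three cases are assumptions made for this illustration; they are not NNCF identifiers.

```python
def level_range(bits, kind):
    """Return (level_low, level_high) for the three cases listed above."""
    if kind == "weights":
        return -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1
    if kind == "unsigned_activations":
        return 0, 2 ** bits - 1
    if kind == "signed_activations":
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"unknown quantizer kind: {kind}")


def symmetric_fake_quantize(x, scale, level_low, level_high):
    """Scalar sketch of the simplified symmetric formula."""
    q = x * level_high / scale              # map to the discrete grid
    q = min(max(q, level_low), level_high)  # clamp to the level range
    return round(q) * scale / level_high    # round, then map back


if __name__ == "__main__":
    low, high = level_range(8, "weights")
    print(low, high, symmetric_fake_quantize(10.3, 127.0, low, high))
```

Choosing `scale` equal to `level_high` makes the grid step exactly 1, which is convenient for checking the clamping and rounding behaviour by hand.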

Use the num_init_samples parameter from the initializer group to initialize the values of scale, and to determine which activations should be signed or unsigned, from statistics collected over the given number of samples.

Asymmetric Quantization

During training, we optimize the input_low and input_range parameters using gradient descent:

$$input\_high = input\_low + input\_range$$

$$levels = 256$$

$$level\_low = 0$$

$$level\_high = 2^{bits} - 1$$

For better accuracy, floating-point zero should lie within the quantization range and map exactly onto a quantization level (a quant), without rounding error. Therefore, the following scheme is applied to the ranges of weight and activation quantizers before applying actual quantization:

$${input\_low}' = min(input\_low, 0)$$

$${input\_high}' = max(input\_high, 0)$$

$$ZP = \left\lfloor \frac{-{input\_low}' * (levels - 1)}{{input\_high}' - {input\_low}'} \right\rceil$$

$${input\_high}'' = \frac{ZP - levels + 1}{ZP} * {input\_low}'$$

$${input\_low}'' = \frac{ZP}{ZP - levels + 1} * {input\_high}'$$

$$
input\_low, input\_high = \begin{cases}
{input\_low}', {input\_high}', & ZP \in \{0, levels - 1\} \\
{input\_low}', {input\_high}'', & {input\_high}'' - {input\_low}' > {input\_high}' - {input\_low}'' \\
{input\_low}'', {input\_high}', & {input\_high}'' - {input\_low}' \le {input\_high}' - {input\_low}''
\end{cases}
$$
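The range adjustment above can be sketched for scalar ranges as follows. The function name `adjust_range_for_exact_zero` is hypothetical, and Python's `round` (ties to even) again stands in for the rounding operator.

```python
def adjust_range_for_exact_zero(input_low, input_high, levels=256):
    """Widen one end of the range so that 0.0 falls exactly on a level."""
    low = min(input_low, 0.0)    # input_low'
    high = max(input_high, 0.0)  # input_high'
    zp = round(-low * (levels - 1) / (high - low))
    if zp in (0, levels - 1):
        # Zero already coincides with the first or the last level.
        return low, high
    high2 = (zp - levels + 1) / zp * low   # input_high''
    low2 = zp / (zp - levels + 1) * high   # input_low''
    if high2 - low > high - low2:
        return low, high2
    return low2, high


if __name__ == "__main__":
    print(adjust_range_for_exact_zero(-1.0, 1.0))
```

After the adjustment, the quantity `-low * (levels - 1) / (high - low)` is an integer up to floating-point error, so zero is represented without rounding.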

You can use the num_init_samples parameter from the initializer group to initialize the values of input_low and input_range from statistics collected over the given number of samples.

Quantizer setup and hardware config files

NNCF can quantize models for best results on a given Intel hardware type when executed using the OpenVINO runtime. To achieve this, the quantizer setup should be performed with the following considerations in mind:

  1. every operation that can accept quantized inputs on a given HW (i.e. can be executed using quantized input values) should have its inputs quantized in NNCF
  2. the quantized inputs should be quantized with a configuration that is supported on a given HW for a given operation (e.g. per-tensor vs per-channel quantization, or 8 bits vs. 4 bits)
  3. for operations that are agnostic to quantization, the execution should handle quantized tensors rather than full-precision tensors.
  4. certain operation sequences will be runtime-optimized to execute in a single kernel call ("fused"), and additional quantizer insertion/quantization simulation within such operation sequences will be detrimental to overall performance

These requirements are fulfilled by the quantizer propagation algorithm. The algorithm first searches the internal NNCF representation of the model's control flow graph for predefined patterns that are "fusible", and applies the fusing to the internal graph representation as well. Next, the operations in the graph that can be associated with input-quantizable operations on a given target hardware are assigned a single quantizer for each of their quantizable activation inputs, with a number of possible quantizer configurations attached to it (those that are feasible on the target HW). The quantizers are then "propagated" against the data flow in the model's control flow graph as far as possible, potentially merging with other quantizers. Once all quantizers have reached a standstill in their propagation process, each will have a final (possibly reduced) set of possible quantizer configurations, from which a single one is chosen either manually or using a precision initialization algorithm (which accepts the potential quantizer locations and associated potential quantizer configuration sets). The resulting configuration is then applied as the final quantizer setup configuration.

Note that this algorithm applies to activation quantization only - the weight quantizers do not require propagation. However, the possible configurations of weight quantizers themselves are also sourced from the HW config file definitions.

The HW to target for a given quantization algorithm run can be specified in NNCF config using the global "target_device" option. The default corresponds to CPU-friendly quantization. "TRIAL" corresponds to a configuration that uses the general quantizer propagation algorithm, but does not use any HW-specific information about quantizability of given operation types or possible quantizer configs for associated inputs or operation weights. Instead it uses a default, basic 8-bit symmetric per-tensor quantization configuration for each quantizer, and quantizes inputs of a certain default operation set, which at the moment is defined internally in NNCF. The quantization configuration in the "target_device": "TRIAL" case may be overridden using the regular "activations" and "weights" sections in the quantization compression algorithm sub-config, see below.

For all target HW types, parts of the model graph can be marked as non-quantizable by using the "ignored_scopes" field - inputs and weights of matching nodes in the NNCF internal graph representation will not be quantized, and the downstream quantizers will not propagate upwards through such nodes.
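As a rough illustration of the options named above, the following Python dictionary mirrors what such an NNCF config might look like. Treat the layout as an assumption: the nesting under a "compression" entry, the "mode", "bits" and "per_channel" keys, and the scope string are placeholders chosen for this sketch, and a real config needs further fields that are omitted here.

```python
import json

# Hypothetical config fragment; key layout and values are assumptions.
nncf_config_fragment = {
    "target_device": "TRIAL",  # generic propagation, no HW-specific info
    "compression": {
        "algorithm": "quantization",
        # Overrides of the default 8-bit symmetric per-tensor setup.
        "activations": {"mode": "asymmetric", "bits": 8},
        "weights": {"mode": "symmetric", "bits": 8, "per_channel": True},
        # Placeholder scope name; matching nodes would stay unquantized.
        "ignored_scopes": ["HypotheticalModel/HypotheticalLayer[head]"],
    },
}

if __name__ == "__main__":
    print(json.dumps(nncf_config_fragment, indent=2))
```

Serializing the dictionary with `json.dumps` gives a fragment in the JSON form that a config file would use.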

Quantization Implementation

In our implementation, we use a slightly transformed formula. It is equivalent, in the order of floating-point operations, to the simplified symmetric formula and to the asymmetric one. The small differences are the addition of a small positive number eps to prevent division by zero, and taking the absolute value of the range, since it might become negative during the backward pass:

$$output = \frac{clamp(\left\lfloor (input - input\_low^{*}) * s - ZP \right\rceil, level\_low, level\_high)}{s}$$

$$s = \frac{level\_high}{|input\_range^{*}| + eps}$$

$$ZP = \lfloor -input\_low * s \rceil$$

For asymmetric:

$$input\_low^{*} = input\_low$$

$$input\_range^{*} = input\_range$$

For symmetric:

$$input\_low^{*} = 0$$

$$input\_range^{*} = scale$$
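The effect of eps and of the absolute value can be seen in a small sketch of the symmetric case. The names `safe_scale` and `symmetric_fake_quantize_impl` and the default `eps=1e-16` are assumptions for illustration; the sketch also assumes that the zero point is 0 in symmetric mode, which follows from the starred lower bound being 0.

```python
def safe_scale(input_range_star, level_high, eps=1e-16):
    """s = level_high / (|input_range*| + eps); eps is an assumed placeholder."""
    return level_high / (abs(input_range_star) + eps)


def symmetric_fake_quantize_impl(x, scale, level_low, level_high, eps=1e-16):
    """Symmetric case of the transformed formula (zero point assumed to be 0)."""
    s = safe_scale(scale, level_high, eps)
    q = round(x * s)                        # round first ...
    q = min(max(q, level_low), level_high)  # ... then clamp on the integer grid
    return q / s


if __name__ == "__main__":
    print(safe_scale(0.0, 255))  # finite even for a zero range
    print(symmetric_fake_quantize_impl(10.3, -127.0, -127, 127))
```

A range that has drifted to a negative value during training gives the same scale factor as its positive counterpart, and a zero range no longer causes a division by zero.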

The most common case of applying quantization is 8-bit uniform quantization.

Gradient Computation for Quantization-Aware Training

The forward quantization formula contains two non-differentiable operations: clamping and rounding. To enable gradient-based optimization of the quantization parameters during QAT, NNCF defines custom surrogate gradients using a Straight-Through Estimator (STE) for rounding and piecewise-defined surrogate gradients for the clamp boundaries.

This approach is a form of learned-range fake quantization — it is related to Learned Step Size Quantization (LSQ), but uses a different parameterization (input_low, input_range) instead of (step size, zero point), and omits LSQ's gradient scaling factor.

In this section, $x$ denotes an element of the input tensor and $FQ(x)$ denotes the corresponding fake-quantized output. The quantization parameters $input\_low$ and $input\_range$ are the values that enter the forward kernel (i.e. after the absolute-value and range-tuning steps described above), so $input\_range > 0$. We write $s = (levels - 1) / input\_range$ for the scale factor.

Input Partitioning

The input tensor is partitioned element-wise into three regions based on the quantization range:

  • Below range: $x < input\_low$
  • In range: $input\_low \le x \le input\_low + input\_range$
  • Above range: $x > input\_low + input\_range$

Gradient w.r.t. $x$ (STE)

The upstream gradient is passed through unchanged when $x$ is within the quantization range, and zeroed out otherwise:

$$\frac{\partial \mathcal{L}}{\partial x} = \begin{cases} \dfrac{\partial \mathcal{L}}{\partial FQ} & \text{in range} \\ 0 & \text{below or above range} \end{cases}$$
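The masking rule can be sketched over plain Python lists. The function name `grad_wrt_input` is hypothetical, and lists of floats stand in for tensors.

```python
def grad_wrt_input(xs, grad_outputs, input_low, input_range):
    """Pass the upstream gradient through in range, zero it elsewhere."""
    input_high = input_low + input_range
    return [
        g if input_low <= x <= input_high else 0.0
        for x, g in zip(xs, grad_outputs)
    ]


if __name__ == "__main__":
    xs = [-2.0, -1.0, 0.3, 1.0, 1.5]
    upstream = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(grad_wrt_input(xs, upstream, input_low=-1.0, input_range=2.0))
```

Elements exactly on either boundary count as in range, matching the non-strict inequalities in the partition above.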

Gradient w.r.t. $input\_range$

The gradient with respect to $input\_range$ depends on which region the input falls in. Per-element gradients are summed to match the shape of $input\_range$ (scalar for per-tensor quantization, or per-channel).

$$\frac{\partial \mathcal{L}}{\partial \mathrm{input\_range}} = \begin{cases} \dfrac{\partial \mathcal{L}}{\partial FQ} \cdot \dfrac{FQ(x) - x}{\mathrm{input\_range}} & \text{in range} \\ \dfrac{\partial \mathcal{L}}{\partial FQ} & \text{above range} \\ \dfrac{\partial \mathcal{L}}{\partial FQ} \cdot \dfrac{\mathrm{level\_low}}{\mathrm{level\_high}} & \text{below range} \end{cases}$$

Derivation of the in-range term

For in-range $x$, the forward pass is:

$$FQ(x) = \frac{\left\lfloor (x - \mathrm{input\_low}) \cdot s - ZP \right\rceil}{s}$$

where $ZP = \lfloor -input\_low \cdot s \rceil$. The STE treats each rounding as identity plus a constant residual: $\lfloor u \rceil = u + \epsilon$ where $\epsilon = \lfloor u \rceil - u$ is held constant during differentiation. Applying this to $ZP$ and then to the outer rounding, the two $input\_low \cdot s$ contributions cancel and we obtain:

$$FQ(x) \approx \frac{x \cdot s + \epsilon}{s} = x + \frac{\epsilon}{s}$$

where $\epsilon$ is the combined residual from both rounding operations. The $x$ term is independent of $input\_range$; the $\epsilon / s$ term depends on it through $1/s = input\_range / (levels - 1)$, with $\epsilon$ treated as constant:

$$\frac{\partial FQ}{\partial \mathrm{input\_range}} = \frac{\epsilon}{levels - 1}$$

To re-express $\epsilon$ in terms of knowable quantities: from the STE expansion above, $\epsilon = (FQ(x) - x) \cdot s$. Substituting:

$$\frac{\partial FQ}{\partial \mathrm{input\_range}} = \frac{(FQ(x) - x) \cdot s}{levels - 1} = \frac{FQ(x) - x}{\mathrm{input\_range}}$$

This gradient nudges $input\_range$ to reduce quantization error: if $FQ(x) > x$, the gradient encourages shrinking $input\_range$ (finer step size), and vice versa.
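A small numeric check of the in-range derivation, for one scalar and a deliberately tiny grid. The helper names are hypothetical, and `round` (ties to even) is used as the rounding operator.

```python
def fake_quantize_in_range(x, input_low, input_range, levels):
    """Forward pass for an in-range x (no clamping needed)."""
    s = (levels - 1) / input_range
    zp = round(-input_low * s)
    return round((x - input_low) * s - zp) / s


def in_range_grad_wrt_range(x, input_low, input_range, levels):
    """Surrogate derivative dFQ/d(input_range) = (FQ(x) - x) / input_range."""
    fq = fake_quantize_in_range(x, input_low, input_range, levels)
    return (fq - x) / input_range


if __name__ == "__main__":
    x, input_low, input_range, levels = 0.3, -1.0, 2.0, 5
    s = (levels - 1) / input_range
    fq = fake_quantize_in_range(x, input_low, input_range, levels)
    epsilon = (fq - x) * s  # combined rounding residual
    # The two expressions from the derivation should agree.
    print(epsilon / (levels - 1))
    print(in_range_grad_wrt_range(x, input_low, input_range, levels))
```

Both printed values correspond to the same quantity, once written with the residual and once with the observable difference between the quantized and the original value.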

Above-range term

For $x > input\_low + input\_range$, the output is clamped: $FQ(x) = input\_low + input\_range$, so $\partial FQ / \partial input\_range = 1$.

Below-range term

For $x < input\_low$, the output is clamped to $input\_low$, which does not depend on $input\_range$ (in asymmetric mode). The code uses the surrogate gradient $\alpha = level\_low / level\_high$ rather than the analytic derivative (which would be $0$). For asymmetric quantization ($level\_low = 0$), this gives $\alpha = 0$. For symmetric quantization with signed range ($level\_low < 0$), the non-zero $\alpha$ matches the analytic derivative of $input\_low$ with respect to the $scale$ parameter, since $input\_low = scale \cdot level\_low / level\_high$ in symmetric mode.

Gradient w.r.t. $input\_low$

$$\frac{\partial \mathcal{L}}{\partial \mathrm{input\_low}} = \begin{cases} 0 & \text{in range} \\ \dfrac{\partial \mathcal{L}}{\partial FQ} & \text{below or above range} \end{cases}$$

Per-element gradients are summed to match the shape of $input\_low$.

In-range term. Under the STE, shifting $input\_low$ moves the quantization grid, but the zero-point $ZP = \lfloor -input\_low \cdot s \rceil \approx -input\_low \cdot s$ shifts to compensate. In the STE expansion above, these two contributions cancel (the $-input\_low \cdot s$ and $+input\_low \cdot s$ terms), making $FQ(x) \approx x + \epsilon/s$ with no dependence on $input\_low$. This is unlike $input\_range$, which affects the step size $1/s$ and therefore scales the rounding residual $\epsilon$.

Below- and above-range terms. Outside the range, the clamped output is either $input\_low$ (below) or $input\_low + input\_range$ (above), both of which have $\partial / \partial input\_low = 1$.

Note: In symmetric quantization mode, $input\_low$ is derived from $input\_range$ (i.e. $scale$) and is not an independent learnable parameter, so its gradient is not used directly.
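The piecewise parameter gradients can be combined into one per-tensor sketch. This is a plain-Python illustration, not the NNCF kernel: the function name `backward_params` is hypothetical, lists of floats stand in for tensors, and only the scalar (per-tensor) reduction is shown.

```python
def backward_params(xs, grad_outputs, fq_outputs, input_low, input_range,
                    level_low, level_high):
    """Return (grad_input_low, grad_input_range) summed over all elements."""
    input_high = input_low + input_range
    alpha = level_low / level_high  # surrogate factor for the below-range case
    grad_low = 0.0
    grad_range = 0.0
    for x, g, fq in zip(xs, grad_outputs, fq_outputs):
        if x < input_low:            # below range
            grad_low += g
            grad_range += g * alpha
        elif x > input_high:         # above range
            grad_low += g
            grad_range += g
        else:                        # in range: only input_range gets a gradient
            grad_range += g * (fq - x) / input_range
    return grad_low, grad_range


if __name__ == "__main__":
    xs = [-2.0, 0.3, 3.0]
    fq = [-1.0, 0.5, 1.0]  # fake-quantized outputs for levels=5 on [-1, 1]
    print(backward_params(xs, [1.0, 1.0, 1.0], fq, -1.0, 2.0, 0, 4))
```

For per-channel quantization the same accumulation would be carried out separately for each channel rather than over the whole tensor.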


NOTE

There is a known issue with AVX2 and AVX512 CPU devices. The issue appears in 8-bit matrix calculations with tensors whose elements are close to the maximum or saturated. AVX2 and AVX512 utilize a 16-bit register to store the result of operations on tensors. When the tensors are saturated, this buffer overflows, which leads to accuracy degradation. For more details of the overflow issue please refer here.

To fix this issue inside NNCF, by default, all weight tensors are quantized in 8 bits but only 7 bits are effectively used. This regime is used when target_device=TargetDevice.CPU or target_device=TargetDevice.ANY is set. This fix may require longer fine-tuning.

To control the application of the overflow fix, the nncf.AdvancedQuantizationParameters(overflow_fix=OverflowFix.ENABLE) config option is introduced.
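The arithmetic behind the overflow can be illustrated with a short sketch. It rests on assumptions that are not stated in this document: that two adjacent products of an unsigned 8-bit activation and a signed 8-bit weight are summed into a saturating signed 16-bit intermediate, and that "7 bits effectively used" means weights confined to [-64, 63]. The function name is hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_pair_sum(a0, w0, a1, w1):
    """Sum two activation*weight products into an assumed saturating int16.

    Returns (stored_value, exact_value) so the loss of precision is visible.
    """
    exact = a0 * w0 + a1 * w1
    stored = max(INT16_MIN, min(INT16_MAX, exact))
    return stored, exact


if __name__ == "__main__":
    # Saturated 8-bit case: activations at 255, weights at 127.
    print(saturating_pair_sum(255, 127, 255, 127))
    # Same activations with weights limited to the assumed 7-bit range.
    print(saturating_pair_sum(255, 63, 255, 63))
    print(saturating_pair_sum(255, -64, 255, -64))
```

Under these assumptions the full 8-bit weight range can exceed the 16-bit limit for saturated inputs, while the reduced weight range keeps every pair sum representable.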