Pipelined sort (linear time)

March 7, 2022 ยท View on GitHub

A sort has to be n.log(n), right? Well, not if you can design a pipelined architecture!

Note: This is a toy example demonstrating Silice pipelines, as well as Silice pre-processor for code generation.

What this does

This algorithm is designed to sort a stream of no more than N values as they come in. The sort is pipelined such that if N values are received, they are sorted N cycles after the last one came in (linear time!).

Of course there is no free lunch, and this comes at the cost of a pipeline of depth N. So beyond being a fun -- and hopefully interesting -- example, this will only be useful in practice for relatively small values of N. But such cases do occur in practice, and this algorithm may come in handy.

How to test

Run make verilator or make icarus (the latter will open gtkwave with the simulated waves).

You will see a list of sorted entries, such as:

[          0] =   1
[          1] =   4
[          2] =  45
[          3] =  50
[          4] =  78
[          5] =  89
[          6] = 122
[          7] = 131
[          8] = 144
[          9] = 149
[         10] = 181
[         11] = 190
[         12] = 206
[         13] = 209
[         14] = 219
[         15] = 228

Walkthrough the code

Now, let's have a look at main.si.

The algorithm is built around a main while loop:

  uint8 i = 0;
  while (i<\$2*N$) {
     ...
  }

The pipeline is within the loop, and the loop iterates 2.N times to ensure the pipeline is fully flushed at the end (N cycles after the last value is inserted).

The syntax $2*N$ is using the Lua preprocessor. Indeed N is a preprocessor variable defined before with this line:

$$N=16

Using $$ at the start of a line indicates that the entire line is preprocessor code, while using $...$ within a line of code inserts the preprocessor result within the current code line. So for instance $2*N$ becomes 32 when N=16.

This loop inserts (streams) the values from the in_values array into the pipeline. For each iteration below N, a value is read from in_value and placed in front of the pipeline top stage. For iterations above N we insert the MAX value, which in our case is 255, so that they no longer influence the result.

to_insert = i < $N$ ? in_values[i] : 255;

Again, the last N iterations are only here to flush the pipeline, ensuring the last inserted value can fully propagate ; with this simple algorithm this requires N cycles.

From the point of view of a single pipeline stage things are fairly simple. Stage $n$ is responsible for the value stored at rank n in the sorted array, sorted_$n$. The variables sorted_$n$ together are the result array: At the end sorted_0 contains the smallest value and sorted_$N-1$ the largest (we sort in ascending order). The role of each pipeline stage is to maintain the value of sorted_$n$ given incoming values.

At each clock cycle each stage receives a value to be inserted, to_insert. Each stage compares this incoming value to the value currently in its sorted_$n$. If the incoming value is larger, nothing is changed and the value is passed further down the pipeline. If the incoming value is smaller, it replaces the current value in sorted_$n$, and the evicted value is passed to the next stage, for further insertion: next cycle the evicted value becomes the one to insert at stage n+1. Let's have a look at the code for a single stage:

    if ( i > $n$                    // do nothing before the pipeline reached this stage
      && to_insert < sorted_$n$ ) { // if the value to insert is smaller, we insert here
        // the current value is evicted and becomes the next one to insert
        uint8 tmp  = uninitialized;
        tmp        = sorted_$n$;
        sorted_$n$ = to_insert;
        to_insert  = tmp;
    }

The really important thing here is that all stages execute in parallel, such that the evicted values trickle down the pipeline all together, at each clock cycle. There are, in fact, N versions of to_insert at all times, one per stage. It takes N cycles to flush the pipeline, as if the last inserted value is the largest one it has to trickle down all N stages.

Now that we understand each stage, how do we tell Silice to build a pipeline? The syntax is simply:

{
  // stage 0
} -> {
  // stage 1
} -> {
  // final stage
}

Note how Silice takes care of ensuring that to_insert is passed from one stage to another. In a Silice pipeline, as soon as a variable is written it automatically trickles down the pipeline: it is passed from one stage to another. Here, we capture to_insert in the pipeline in the first stage, by writing it as:

to_insert = i < $N$ ? in_values[i] : 255;

Please refer to the documentation for more details on pipelines (Note: this part of the documentation is not yet written ;) ).

As our pipeline depth depends on the value of N, we build it with the preprocessor:

$$for n=0,N-1 do
    {
      // [removed] code for stage n
    }
$$if n < N-1 then
    -> // pipe to next stage (if not last)
$$end
$$end

And that's it! We have a linear time sorting algorithm for an incoming stream of values.

Example run

The following figure illustrates a run for N=3. It takes 6 cycles to guarantee that the sort is fully terminated.

The figure shows 6 clock cycles from left to right. For each cycle, we see the incoming stream value at the top, the comparisons of each stage (vertically) and the values to insert trickling down the pipeline (orange arrows). The horizontal green arrows show when a value is changed in the sorted array.

For instance, consider the value 20. Its insertion starts at cycle 2. Stage 0 pushed it down as it is greater than 5, already inserted there. On cycle 3, 20 evicts MAX and replaces it in the result array. However, 1 started its insertion at cycle 3 and evicted 5. On cycle 4, 5 reaches stage 1 and evicts 20. On cycle 5, in stage 2, 20 evicts MAX and replaces it in the result array.

pipeline sort

Further reading

For efficient sorting in parallel refer to sorting networks. Typical parallel algorithms are odd-even sort and bitonic sort.