Columnar Datatoad

November 2, 2025 · View on GitHub

Datatoad, the interpreted Datalog I've been working on, has recently gone "full columnar". This is an exciting milestone for me, and it has some stories attached to it that seemed worth sharing! There will be performance regressions, and while I don't love them they are real, and worth unpacking. There is a minor redemption arc where we recover much of the performance, but not all.

I should stress that this is 100%-amateur-hour for columnar data; many other people have spent a lot more time here than I have, and you should not uncritically adopt anything written here. At the same time, I didn't find existing solutions to the things I wanted to do, which were not OLAP-style data warehousing, and perhaps there is still space to learn and communicate.

"Full-columnar"

This is not a technical term, I imagine.

Columnar data systems organize their data collections not by lists of records (each a "datum"), but by shredding the data into lists for each "attribute" or "column". You end up with a few lists of homogenous primitive types, rather than one list of records with complex types. As an oversimplified example, consider things we might know about some people:

first	last	age	animal
frank	mcsherry	49	chicken
frank	oz	81	pig
frank	zappa	52	weasel
jim	henson	53	frog
...	...	...	...

We could view this as one long list of "records", each with four different attributes (row-oriented). Alternately, we could view it as four long lists, one for each attributes, of primitive types (column-oriented).

There are various reasons to prefer one to another, but I'm currently most excited about the columnar representations. This isn't the right place to spill out all of the reasons, but I personally find it helps me with abstraction. Most of the algorithms I work with aren't too particular about the types of their columns, and work the same "way" no matter what the types are. To execute, the computer needs to resolve the generality of the types of the columns at some point, and doing it efficiently row-by-row seems to require compilation for each configuration of types. Doing things column-by-column allows you to resolve the generality at run-time, for a cost you pay once per column rather than once per row.

There are many more reasons, pro and con, and I'm certainly not the first to have thoughts here.

The main requirement I'm going to use is that the type of each column could be entirely unknown to rest of the system. From our point of view, each column will be a list of .. "things", though not things that we know enough about to even hold in our hands. Only the columns themselves will understand how to handle their values properly. The forces us to write algorithms that are generic over the types of the columns, and which interact with the columns through a narrow interface.

In the limit, I imagine we'll have a trait Column, and each instance of a column will be a Box<dyn Column>. This prevents folks holding them from knowing any more than that certain column operations are supported.

Columnar Representation

The first shift is from row-oriented data to column-oriented data.

This happened a long time ago for datatoad, when we shifted from a "list of rows" format to a layered trie format. Informally, we represent our whole dataset as a list of columns, where each column has as many entries as there are distinct prefixes up through that column. For example, in the dataset above our first column would only have two entries, frank and jim. Repeating this results in a tree, with one layer per column, which spells out the rows in our set as we move through the tree.

Although we can see the whole collection of data, we can also see that perhaps some columns might use different types. First and last name might be strings, age is probably an integer, and favorite animal .. could be a string or an enumeration or an actual physical animal, I guess. Changing the representation, perhaps pivoting from default strings to integers or enumerations, perhaps engaging dictionary compression or entropy coding, all of these are relatively easier with the data as columns than as fields in a row.

More was typed previously talking about the columnar representation and some of the plans. In fact, there's also a bit of content there about columnar operation, which is what we'll need to do next.

Columnar Operation

Writing your data as columns is a great start, but one also needs to be able to operate on columnar data. Some operations are much easier on columns ("what is the average age?") but others are much harder ("which row has the least hash value?").

There is a general principle to follow for data-parallel algorithms, where you record the state you might have on the stack for a row-oriented representation, do it all at once for a column, and then move that state (as a column itself) on to the next data column. This doesn't always work great, but often does.

For example, to intersect two columnar collections, you start with the first pairs of columns and mark which rows match (perhaps as pairs of integers). You take this information to the second column, and as of these matching pairs, which match under the second column. You repeat through each of the columns, and end up identifying the row indexes in each of the final columns whose whole row matches. None of the colunms needed to understand any details of other columns, except that the pairs of columns that were expected to match should understand each other.

We've previously discussed some computation on columnar representations, for example intersection as above, but also union, semijoin, antijoin, and then started to explain a general equijoin and the hardest operator of all .. project. We stalled out at join and project previously, because they are hard. But we'll have something to say here, and some measurements to report, and generally some conclusion if not unconditional success.

Columnar Projection

The main challenging operator for me was columnar projection. You have a collection of data represented as columns, each a list of ordered lists of values (this is how the trie layout works). Some jerk decides that you want the same data, but with a subset of columns and in a different order.

If you weren't too bothered about the trie-structure, where the lists are of ordered lists, you could probably just swing the columns into the new order. We care a lot about the trie-structure because it is what unlocks (relatively) fast random access to the facts. The tries are navigable because each list is ordered, so as you descend you can find matches using binary search.

Transforming one columnar representation to another, without the ability to land at a flat row-oriented representation in the middle, is tricky. The approach we'll use is to approach each column with a list [(group, index)] of integer pairs, indicating the distinct prefixes so far (distinct values of group) and their association with items in the column (index). This information is sufficient to produce a new output column, by

selecting and sorting (group, value[index]),
deduplicating the resulting pairs,
forming lists for each distinct group.

This produces a list of ordered lists, as we expect for a column, where the values for each group are colocated, sorted, and deduplicated. The groups themselves are put in a specific order, and we have the column we want.

We are missing just a little bit, in that we'll also need to prepare the [(group, index)] information for the next column. Each distinct (group, value[index]) turns in to a new group value for the next layer, and we'll want to associate that back to the original index, as it is the only frame of reference that makes sense to the next column.

All put together, we have a column sorting method that takes in auxiliary grouping information, and produces a new column and the grouping information needed to sort the next column.

The sorting we have to do can be best implemented by each column. For example, when the column contains u32 data, I find LSB radixsort to be a great answer, especially as the grouping information is also dense integers. If the column contained text data instead, I can imagine a MSB radixsort being best, or some other flavor of text sorting (e.g. burstsort, which I've never implemented). Ultimately, the column itself is best positioned to make the call on the basis of its knowledge about its contained type, and the data distribution.

We didn't say too much about where the [(group, index)] information comes from, and it derives from the relative order of input columns and the target order for the output columns. An output column that now follows an column it previously preceded projects its index information along the trie to each group of the preceding column. Otherwise, the grouping information is projected along the trie to each index of the column. Probably not very clear at this point, but .. these end up being the two cases.

Columnar Join

When joining two columnar collections, we will attempt equate their first arity columns. It is on the caller to put the collections in the right order (see "columnar projection" just above), and to tell us the correct arity.

The first thing we do is an intersection between the first arity columns of each input. This provides a list of matching indexes, one from each input, identifying values in the last of the matched columns corresponding to prefixes that are equal in the two inputs. Each pair (index0, index1) tells us of a prefix match in the two inputs, and all we need to do now is emit the product of all pairs of rows that continue the prefix, one from each input. In fact, most of the "hard work" is now done, in that we only need to enumerate the outputs, which the list of index pairs and the input trie structure reveal to us.

The surprising part, for me at least, is that this is where the computational work begins. We don't just get rest on our laurels for having found the breadcrumbs for the matches. We don't just get to count the number of matching pairs, or anything as easy as that. Like with projection, we have to produce a columnar projection of the set of matching rows, and that projection is almost always in a computationally unhelpful order (we are joining to correlate values, which are at the tail ends of the input columns).

The second half of joining is surprisingy similar to projection, with the complication that the output columns now come from three distinct "regions".

The output column could be an equated column, present in each input,
The output column could be a non-equated column from the first input,
The output column could be a non-equated column from the second input.

Algorithmically, it will turn out we really only have two cases, as the second and third have identical logic (with different arguments).

Actually, we'll use columnar projection on each input, restricted to their matched indexes, to put the non-equated columns in the order we need to produce them for the join output. This will dramatically simplify (for my brain) the bookkeeping we'll need to do.

With this change, our list [(index0, index1)] is a "todo list" of the products we still need to take from each of the inputs. We can move through the output columns, which are either an equated column (special), or the "next" column from either of the two inputs. If from either of the two inputs, we use the same columnar sorting approach from projection to put the values of the column in an order that works for us, recording grouping information as we go. Unlike projection, we'll need to update our todo list, as when we produce an output column we need to replace each (index0, index1) with either (lower..upper, index1) or (index0, lower..upper), where lower and upper are indexes in the next column.

Producing an equated column is similar to the special case from projection where an output column now follows a column it used to precede. All equated columns "precede" the aligned indexes, and for each such column there is exactly one value for each pair of aligned indexes. We just need to shuttle it forward in that case, and then subject it to the column sorting.

Perhaps a bit underdeveloped, but this is most of the algorithm. We very quickly intersect the column prefixes of the inputs, and then spend a great deal of time producing their matched rows as columnar tries in a different column order. In fairness, this checks out: joining is fast when things are sorted, and the hard work is doing the sorting to put fresh data into this representation.

Columnar Construction

A third thing I had totally forgotten about was the need to produce columnar data in the first place. This code used to be row-oriented, with a row-based builder to extract the columns. However, there's a cheeky way to get this done by showing up with your data in columns, just not trie columns. All you need to do is say "this is a trie, just one where everyone has exactly one child". The column sorting works fine even if the data didn't start as a trie, and will turn the non-trie columns into trie columns.

Performance

Now we get to the drama!

Landing the code was awkward because performance went in the wrong direction on my test cases. These test cases all have the property that they line up great for row-oriented representations: they all have u32 data, so it's easy to talk about a [u32; K] row, and sort long lists of them using radixsort. The columnar approach is more accommodating of heterogenous columns, but we didn't have any of those and things just looked worse.

workload	row-oriented	col-oriented
galen	13.20s	19.42s
snomed	6.34s	9.10s
hep-th	40.41s	34.70s
vsp-fin	192.25s	271.80s
graspan	6.62s	10.75s

The special case is hep-th, which suffers from wanting a much larger row accumulation buffer than we were using, and the columnar approach didn't much care. If you make the accumulator much larger the hep-th times go down to about 20s.

What were the regressions, then? It's a mix of a few things.

First up, we are just doing more work than the row-oriented form has to do. If you have three columns and they are all u32, then sorting a list of [u32;3] is about as much work as well do for each of the columns. Our group and index indirections can just be replaced by the data themselves, if you are allowed to do so. Not much to do here, other than have a drink and get over it. Either find some broader benchmarks, or sneak in some performance wins elsewhere.

Secondly, we introduced some regressions when generalizing some methods, like radix sorting.

Finally, we were wasting a bunch of time with mediocre implementations of the "control information" that would either be compiled away, or live briefly on the stack.

We'll go through a few PRs that started to address each of these, not arriving at the row-oriented performance yet, but telling a bit of a story along that way that suggests that perhaps we can get there. We'll talk through them first, and report with a chart at the end.

Flattening bytes to radixsort

The first PR flattened [[u8;K];W] down to [u8;K*W], which resulted in better codegen for the radixsort in Rust. I'm not certain of the exact reason, but plausibly if you aren't certain indexes will be in-bounds then you might want to know which indexing (inner or outer) was the source of errors. All we really wanted was to access a byte, and convincing Rust to just do that without many checks took some work.

It's possible this ends up being redundant with a later PR that made eliding bounds checks easier.

Columnar control state

Our list [(index0, index1)] is better represented as individual lists of indexes. The next PR did that, and saw a relatively healthy performance improvement. Potentially there was better codegen from just having lists of values (rather than lists of pairs), or less memory traversed, but decoupling the data was quite helpful.

Ensure radixing is in-bounds

When we generalized our sorting needs, we also went from "sort all the bytes" to "sort only these bytes". Sorting a variable number of user-specified bytes raises the reasonable questions of 1. how many are there, and 2. are they valid? The next PR reframed the code to traverse all in-bounds radix indexes, and optionally do work for them. This convinced the codegen that we could flatten out the work (non-variable number) and elide bounds checks.

Specialize column sorting

Most column sorting need to produce output [(group, index)] information for the next column. The last column to be sorted, which also happens to be the longest, doesn't need to do this, and can save a fair bit of time by skipping it. I couldn't figure out how to write this generically in Rust, so the next PR introduced a fair bit of copy/paste that gave the performance.

Columnar control state, redux

Finally, for now at least, the operations on the columnar control state benefited from being columnar themselves. Specifically, when expanding (index0, index1) to (lower..upper, index1), it makes sense to break that down into:

For each index0 replace it by lower and produce a list of upper-lower.
Sum upper-lower to find the total length of indexes we'll need, and pre-allocate.
For each lower, produce lower .. upper.
For each index1, produce std::iter::repeat(index1).take(upper-lower).

This avoided memcpy overhead from the dynamic sizing needed by a Rust iterator that just evaluates the ranges as it goes, and also made the logic much simpler at the same time. I can imagine generalizing it further to "just keep the list of upper-lower and don't actually copy anything around", but haven't gotten there yet.

Evaluation updated

I didn't capture a measurement for PR #9, and I didn't do a great job measuring with much more than a timer, once. Nonetheless, the arc of progress is pretty clear, even if the certainty about which idea contributed which result is not.

workload	pre-#8	PR #8	PR #10	PR #11	PR #12	PR #13
galen	13.20s	19.42s	17.02s	15.69s	14.94s	14.47s
snomed	6.34s	9.10s	8.03s	7.50s	7.19s	6.89s
hep-th	40.41s	34.70s	33.00s	34.49s	23.33s	23.61s
vsp-fin	192.25s	271.80s	240.51s	235.02s	219.77s	208.91s
graspan	6.62s	10.75s	8.92s	8.14s	7.93s	6.89s

By the end, we've gotten a fair bit closer to the original measurements, and with a framework that is now full-columnar. Any of the columns in our data could be text instead of u32, and we'd be able to accommodate it without falling back to the "all text" implementation the row-oriented approach would require. For example, one of the columns in galen happens to fit into a u16, and you could swap that in for some light gains (I tried that, but didn't like the copy/paste nedeed for each distinct column implementation).

There are more improvement opportunities still available. Things get a bit faster if you use u32 rather than usize for indexes, for example. Not quite to pre-#8 days, but about half of the gap. I'm not sure I want to do that, but it makes it clear that the trade-off is the overhead from the control state against the columnar execution kernels.

I'm most likely to instead move on to worst-case optimal joins, as the algorithmic pattern I'd rather optimize for. Columnar WCOJ, at least, tends to intersect many columns and add relatively fewer, which has the potential to highlight different opportunities within the implementation. No point getting too stressed about the limited performance of a sub-par approach to complex joins.