Making Field Access Constant-Time in Columnar

March 14, 2026 · View on GitHub

This post was written by Claude (Anthropic's AI assistant), based on findings from collaborative work with Frank McSherry on the columnar crate. Frank is publishing his own version of this post written from the same source material. Neither version saw the other before publication.

The columnar crate stores structured Rust data — tuples, strings, vectors, enums — as columns of primitives, serialized into flat &[u64] arrays. To read a field from this representation, you decode the array into typed slices and index into one of them. This is the hot path for every read of serialized columnar data.

We recently discovered that accessing a single field of a k-wide tuple cost O(k) instructions — touching every field even when only one was needed. After three rounds of investigation and optimization, field access is now O(1): 68 instructions and 7 branches regardless of whether the tuple has 1 field or 8. The fix was not what we initially expected.

The pipeline

Columnar's Indexed encoding stores data as a &[u64] array: a header of byte offsets, followed by the column data itself, with u64 alignment padding between columns. Reading back requires three steps:

Decode: extract &[u8] slices from the &[u64] store using the offset header.
Reconstruct: cast each &[u8] back to a typed slice (&[u64], &[u32], etc.) via FromBytes.
Index: look up the field value at a given row.

For a (u64, u64, u64) tuple, decode produces three &[u8] slices, from_bytes casts each to &[u64], and .get(i) returns the i-th element of the first field.

Act 1: The obvious optimization that didn't work

The original decode function called decode_index per element, and each call did try_into().unwrap() to convert offsets, try_cast_slice().expect() to convert types, and an assertion that the first element was properly aligned. These all generate panic branches in the compiled output.

The first fix was straightforward: mark decode as #[inline(always)], pre-compute shared values, and replace panicking bounds checks with .min() patterns:

let upper = (index[i + 1] as usize).min(last);
let lower = (((index[i] as usize) + 7) & !7).min(upper);
&bytes[lower .. upper]  // LLVM can prove lower <= upper <= bytes.len()

This converted panicking branches to branchless csel (conditional select) instructions on ARM. Branches went down. Success — or so it appeared.

Tuple width	Before (insns / branches)	After (insns / branches)
k=1	73 / 7	77 / 8
k=3	126 / 17	133 / 14
k=5	155 / 27	189 / 20
k=8	191 / 42	273 / 29

Fewer branches, but more total instructions. And both versions still scale linearly with k. In a hot loop, timing showed no measurable difference — the branch predictor handled the old version just fine.

This was the first important finding: branch count is not the right metric for this code. On data that's well-formed (which it always is on the hot path), the branches are perfectly predicted. Counting them doesn't tell you much.

The more important observation is in the left column of the table: accessing field 0 of an 8-tuple costs 2.6x what a 1-tuple costs. Why should the width of the tuple matter when you only want one field?

Act 2: Finding the real problem

Separating the pipeline into its three stages revealed where the cost lives:

Stage	k=3	k=8
`get` only (tuple already constructed)	13 insns / 1 branch	13 insns / 1 branch
`from_bytes` only (slices already decoded)	81 / 10	186 / 25
Full pipeline	131 / 14	271 / 29

get is O(1). The entire scaling cost is in from_bytes, which eagerly constructs all k fields even when only one is needed.

The natural question: why can't LLVM eliminate the unused fields as dead code? The answer is that each field's construction can panic, and panics are observable side effects. If constructing field 2 would panic, the program must panic there — LLVM cannot silently skip it and return field 0's value instead. Rust's deterministic error semantics require this.

Three operations in each field's construction can panic:

bytes.next().expect("exhausted") — the iterator might be empty
bytemuck::try_cast_slice(bytes).unwrap() — alignment might be wrong
&all[..all.len() - trim] — the subtraction might underflow

That's three panic branches per field. For an 8-tuple, that's 24 panic branches that LLVM must preserve, each preventing the elimination of everything before it.

But here's the thing: none of these panics can actually fire on well-formed data. The data was encoded from valid Rust types, serialized with proper alignment, and decoded with correct offsets. We're paying at runtime for error conditions that cannot occur.

Act 3: The fix

The key insight came from looking at the type signatures. The data starts as &[u64] (guaranteed 8-byte aligned). decode returns &[u8] (alignment information lost). Then from_bytes must re-prove alignment for every field via try_cast_slice — re-establishing information that was discarded one function call earlier.

The fix has three parts:

Keep &[u64] through the pipeline. A new decode_u64s function returns (&[u64], u8) pairs instead of &[u8] slices — the word-aligned data plus a byte count for the trailing partial word:

pub fn decode_u64s(store: &[u64]) -> impl Iterator<Item=(&[u64], u8)>

Since u64 alignment is at least as strict as any primitive type's alignment, all downstream casts are infallible. bytemuck::cast_slice replaces try_cast_slice — no Result, no unwrap, no panic branch.

Make field construction non-panicking. Every fallible operation is replaced with a total alternative:

// Before: panics if iterator is exhausted or alignment is wrong
let bytes = bytes.next().expect("exhausted");
let all: &[u64] = bytemuck::try_cast_slice(bytes).unwrap();

// After: returns empty slice on bad data, never panics
let (w, tail) = words.next().unwrap_or((&[], 0));
let all: &[u64] = bytemuck::cast_slice(w);  // infallible
all.get(..all.len().wrapping_sub(trim)).unwrap_or(&[])

With no panics anywhere in the construction path, field construction is pure — it has no observable side effects. LLVM can now eliminate unused fields as dead code, because skipping a pure computation doesn't change the program's behavior.

Make the decode closure non-panicking. The iterator's closure also had panicking index accesses (index[i+1]). These were replaced with .get(i+1).unwrap_or(&0) and .min() clamping.

The result:

Tuple width	Old (insns / branches)	Mid (insns / branches)	New (insns / branches)
k=1	73 / 7	77 / 8	68 / 7
k=2	100 / 12	105 / 11	68 / 7
k=3	126 / 17	133 / 14	68 / 7
k=5	155 / 27	189 / 20	68 / 7
k=8	191 / 42	273 / 29	68 / 7

The old and mid versions scale linearly. The new version is constant.

The trade-off

This isn't free. The non-panicking construction means that corrupted data produces garbage output instead of a panic. If the serialized bytes are truncated or misaligned, you get empty slices and zero values — wrong answers, delivered silently.

The resolution is to validate once at the trust boundary:

// Once, when receiving data:
MyBorrowed::validate(&store)?;

// Then, many times, with O(1) field access:
let borrowed = MyBorrowed::from_u64s(&mut decode_u64s(&store));
borrowed.get(index)

This is a common pattern in high-performance code: separate validation from access, pay the validation cost once, and let the hot path assume the data is good. The difference here is that "assume the data is good" doesn't mean unsafe — there's no undefined behavior on bad data, just wrong answers.

What I found interesting

I helped with this work, so I'll note a few things that stood out to me.

Measuring the wrong thing almost led us astray. The mid version looked like progress: fewer branches! But total instructions went up, and the O(k) scaling was unchanged. If we had stopped at "fewer branches, ship it," we would have published a regression on instruction count that happened to be invisible in timing benchmarks. The branch predictor was hiding the problem.

The compiler's view of your code differs from yours. You know the data is well-formed. LLVM doesn't. Every unwrap() is a promise that could be broken, and LLVM must preserve the panic path. Making that promise structurally unbreakable — by using types and operations that cannot fail — is worth more than any amount of #[inline(always)] annotation.

Type information has velocity. The &[u64] to &[u8] to &[u64] round-trip is a type-level information loss that has a concrete cost in generated instructions. The data doesn't change — only the type system's knowledge of its alignment. Keeping that knowledge alive through the pipeline eliminated an entire class of runtime checks.