SlabWriter Performance Experiments

March 19, 2026 · View on GitHub

Platform: M1 Pro, 10 goroutines, parallel file I/O, 200B JSON log entries. Baseline: original SlabWriter with atomic.Int64 counters = 425 ns/op.

What shipped

Change	ns/op	Delta	Mechanism
Baseline (atomic counters)	425	—
Remove `atomic.Int64.Add` for `written` inside mutex	348	-18%	Atomic inside mutex causes cache-line bouncing across cores; plain `int++` under mutex has zero atomic overhead
+ Remove write loop, direct `copy`	348	-2%	After early swap, message guaranteed to fit — loop with bounds check replaced by single `copy`
+ Conditional `msgCount++` (only in dropOnFull)	~347	~1 ns	Avoid unnecessary increment on hot path
+ Early swap (doesn't fit in remainder)	no regression		Prevents torn writes; one extra comparison on hot path
+ Oversized message support (> slabSize)	no regression		Dedicated `make+copy` only for rare large messages
+ `safeReport` with `recover()`	no impact		Prevents errW panic from crashing ioLoop

Final: 348 ns/op (was 425). -18% from baseline.

Idea: Replace mutex with CAS on packed state [writers:16|pos:48]. Each Write does CompareAndSwap to reserve space, copies data, then atomic.Add(-writersInc) to signal completion. Mutex only for swap.

Result: 385 ns (+11% vs shipped 348 ns).

Why: CAS + atomic decrement = 2 atomic ops per Write. Under 10-goroutine contention, these atomic ops cause the same cache-line bouncing that killed the written counter. The mutex version does 1 Lock + 1 Unlock (which are also atomic ops, but the runtime's adaptive spinning is more efficient than our CAS retry loop for ~25 ns critical sections).

Lesson: Lock-free is not always faster. For very short critical sections (<50 ns), Go's mutex with adaptive spinning beats hand-rolled CAS under moderate contention.

Copy outside mutex (reserve under lock, copy outside)

Idea: Reserve space and capture slab pointer under mutex, copy data outside mutex. Use atomic.Int32 writers counter so swap knows when all copies are done.

Result: 428 ns (+23% vs shipped 348 ns).

Why: Two writers.Add calls (increment under lock + decrement outside) are two atomic ops — same problem as lock-free CAS. The 15 ns saved by moving copy outside is overwhelmed by 2×5 ns atomic overhead amplified by cache bouncing.

Lesson: Moving work outside a mutex only helps if the synchronization cost of the "completion signal" is cheaper than the work itself. For 200-byte memcpy (~15 ns), it's not.

Spinlock (CAS-based, no goroutine parking)

Idea: Replace sync.Mutex with a custom spinlock that spins (with runtime.procyield PAUSE hint) instead of parking goroutines. Avoids the ~1µs park/wake overhead visible in CPU profiles.

Result: 340 ns (-2.3% vs shipped 348 ns).

Why it wasn't shipped: Requires //go:linkname runtime.procyield — an unstable internal API that can break on Go upgrades. The 8 ns gain doesn't justify the maintenance risk.

Lesson: Spinlock does help for <50 ns critical sections, but Go's //go:linkname is too fragile for production code.

Sharded SlabWriter (per-P or per-goroutine shards)

Idea: N independent slab shards (one per GOMAXPROCS), each with its own mutex. Goroutines pick a shard via stack-address hash. Eliminates cross-goroutine mutex contention almost entirely.

Result: 175 ns (-50% vs shipped 348 ns).

Why it wasn't shipped: Breaks strict message ordering between goroutines (different shards → different slabs → reordered at destination). Also: +50% memory, stack-address hash is imperfect (occasional collisions cause contention spikes), more complex Flush/Close (must iterate all shards).

Sharding results by shard count:

Shards	ns/op	vs baseline
1 (shipped)	348	—
2	293	-16%
4	194	-44%
auto (GOMAXPROCS=10)	175	-50%

Lesson: Sharding is the only technique that fundamentally reduces contention. All other approaches (lock-free, spinlock, copy-outside) are limited by the single-mutex bottleneck. The trade-off is ordering.

Key insight: atomics inside mutex

The single most impactful discovery: atomic.Int64.Add inside a mutex causes a +22% performance regression (425→348 ns) on ARM64. The atomic instruction's memory barrier forces the CPU to drain its store buffer, invalidating cache lines on all cores. The next mutex.Unlock (which is also an atomic) then suffers a cache miss.

Rule: never use atomic operations on data protected by a mutex. Use plain int++ instead. Read the value under mutex in Stats().

This effect is invisible in single-goroutine benchmarks (~5 ns overhead) and only manifests under parallel contention with cache pressure from other work (encoder pool operations, buffer allocation). The BenchmarkParallelConcurrentSlabCachePressure benchmark was created specifically to catch this class of regression.

CPU profile breakdown (fileparallel, 10 goroutines)

Component	% CPU
Mutex contention (usleep + pthread_cond)	65%
Mutex lock/unlock	20%
Encoder (JSON)	5%
File I/O (ioLoop)	3%
Everything else	7%

The mutex is 85% of CPU. The encoder is already fast (5%). Further optimization requires reducing contention (sharding) or reducing critical section time (already minimized to ~25 ns memcpy).