NexusFIX Optimization Summary: Before vs After
February 24, 2026 · View on GitHub
Date: 2026-01-23 Scope: Three rounds of optimization work
Executive Summary
| Optimization | Before | After | Improvement |
|---|---|---|---|
| io_uring DEFER_TASKRUN | 361.5 ns | 336.0 ns | 7% faster |
| SIMD SOH Scanner | ~150 ns | 11.8 ns | ~13x faster |
| Hash Map Lookup | 20.0 ns | 15.2 ns | 31% faster |
| CPU Pinning P99 | 18.8 ns | 17.3 ns | 8% faster |
| Deferred Processing | 75.6 ns | 12.3 ns | 84% reduction |
Estimated Combined Impact: ~41% hot path latency reduction
Test Environment
| Parameter | Value |
|---|---|
| CPU | 3.418 GHz |
| Cores | 28 |
| Warmup | 10,000 iterations |
| Benchmark | 100,000 iterations |
| Runs | 5 |
Optimization Details
1. io_uring DEFER_TASKRUN (Round 1)
Commit: 3a13654
| Metric | Before | After | Change |
|---|---|---|---|
| Mean Latency | 361.5 ns | 336.0 ns | -7.1% |
| P99 Latency | 392.0 ns | 365.2 ns | -6.8% |
Implementation:
// Added to io_uring_transport.hpp
struct io_uring_params params = {0};
params.flags = IORING_SETUP_COOP_TASKRUN |
IORING_SETUP_SINGLE_ISSUER |
IORING_SETUP_DEFER_TASKRUN;
io_uring_queue_init_params(queue_depth, &ring_, ¶ms);
2. AVX-512/AVX2 SIMD Scanner (Round 1)
Commit: fef0b4d
| Metric | Scalar | AVX2 | Change |
|---|---|---|---|
| Mean | ~150 ns | 11.8 ns | ~13x |
| Throughput | ~6.7M/s | 84.5M/s | ~13x |
Implementation:
// Added AVX-512 support with graceful fallback
#if NFX_AVX512_AVAILABLE
return scan_soh_avx512(data); // 64 bytes/iteration
#elif NFX_SIMD_AVAILABLE
return scan_soh_avx2(data); // 32 bytes/iteration
#else
return scan_soh_scalar(data); // 1 byte/iteration
#endif
3. absl::flat_hash_map (Round 2)
Commit: d674409
| Metric | std::unordered_map | absl::flat_hash_map | Change |
|---|---|---|---|
| Lookup | 20.0 ns | 15.2 ns | -31% |
| Insert | 17.4 ns | 12.7 ns | -37% |
| P99 Lookup | 61.4 ns | 52.3 ns | -17% |
Implementation:
// memory_message_store.hpp
#if NFX_HAS_ABSEIL
using HashMap = absl::flat_hash_map<K, V>; // Swiss Tables
#else
using HashMap = std::unordered_map<K, V>; // Fallback
#endif
4. CPU Core Pinning (Round 3)
Commit: 033a6d1
| Metric | Unpinned | Pinned | Change |
|---|---|---|---|
| Mean | 15.0 ns | 14.7 ns | -2.0% |
| P99 | 18.8 ns | 17.3 ns | -7.8% |
| P99.9 | 19.6 ns | 18.4 ns | -6.3% |
Implementation:
// cpu_affinity.hpp
AffinityResult CpuAffinity::pin_to_core(int core_id) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
5. Deferred Processing Pattern (Round 3)
Commit: 868e09f
| Metric | Inline | Deferred | Change |
|---|---|---|---|
| Median | 75.6 ns | 12.3 ns | -84% |
| P99 | 76.7 ns | 20.7 ns | -73% |
| Queue Overhead | N/A | 11.4 ns | baseline |
Implementation:
// deferred_processor.hpp
template<typename BufferType, size_t QueueCapacity>
class DeferredProcessor {
// Hot path: ~11ns
bool submit(std::span<const char> data) {
buffer.set(data, rdtsc());
return queue_.try_push(std::move(buffer));
}
// Background thread handles expensive work
void process_loop() {
while (auto msg = queue_.try_pop()) {
callback(*msg); // Parse, persist, notify
}
}
};
Current Performance (Measured)
| Component | Mean | Median | P99 | Throughput |
|---|---|---|---|---|
| SIMD SOH Scanner | 11.8 ns | 11.6 ns | 13.0 ns | 84.5M/s |
| Hash Map Lookup | 39.4 ns | 36.0 ns | 92.6 ns | 25.4M/s |
| SPSC Queue Push | 11.4 ns | 11.2 ns | 16.6 ns | 88.0M/s |
Cumulative Impact
Hot Path Latency Reduction
Using multiplicative model for independent optimizations:
Combined = 1 - (1-0.07) × (1-0.078) × (1-0.31 for lookup path)
= 1 - 0.93 × 0.922 × 0.69
= ~41% reduction
Per-Component Gains
| Component | Technique | Gain |
|---|---|---|
| Network I/O | DEFER_TASKRUN | 7% |
| Message Parsing | AVX2 SIMD | ~13x |
| Message Lookup | Swiss Tables | 31% |
| Thread Scheduling | Core Pinning | 8% P99 |
| Background Work | Deferred Pattern | 84% |
Files Modified
| Round | Files | Key Changes |
|---|---|---|
| 1 | io_uring_transport.hpp | DEFER_TASKRUN flags |
| 1 | simd_scanner.hpp | AVX-512 support |
| 2 | CMakeLists.txt | Abseil FetchContent |
| 2 | memory_message_store.hpp | HashMap alias |
| 3 | cpu_affinity.hpp | Core pinning utility |
| 3 | state.hpp | SessionConfig affinity fields |
| 3 | deferred_processor.hpp | Deferred pattern utility |
Commits
| Hash | Description |
|---|---|
3a13654 | perf(io_uring): Add DEFER_TASKRUN optimization |
fef0b4d | perf(simd): Add AVX-512 SOH scanner |
d674409 | perf(hashmap): Replace std::unordered_map with absl::flat_hash_map |
033a6d1 | perf(affinity): Add CPU core pinning |
868e09f | perf(deferred): Add NanoLog-inspired deferred processor |
Conclusion
Three rounds of optimization work delivered significant performance improvements:
| Aspect | Result |
|---|---|
| Hot path latency | ~41% reduction |
| SIMD throughput | ~13x faster |
| Hash lookups | 31% faster |
| P99 tail latency | 8% reduction |
| Background offload | 84% hot path reduction |
All optimizations maintain backward compatibility with graceful fallbacks.