NexusFIX Optimization Summary: Before vs After

February 24, 2026 · View on GitHub

Date: 2026-01-23 Scope: Three rounds of optimization work


Executive Summary

OptimizationBeforeAfterImprovement
io_uring DEFER_TASKRUN361.5 ns336.0 ns7% faster
SIMD SOH Scanner~150 ns11.8 ns~13x faster
Hash Map Lookup20.0 ns15.2 ns31% faster
CPU Pinning P9918.8 ns17.3 ns8% faster
Deferred Processing75.6 ns12.3 ns84% reduction

Estimated Combined Impact: ~41% hot path latency reduction


Test Environment

ParameterValue
CPU3.418 GHz
Cores28
Warmup10,000 iterations
Benchmark100,000 iterations
Runs5

Optimization Details

1. io_uring DEFER_TASKRUN (Round 1)

Commit: 3a13654

MetricBeforeAfterChange
Mean Latency361.5 ns336.0 ns-7.1%
P99 Latency392.0 ns365.2 ns-6.8%

Implementation:

// Added to io_uring_transport.hpp
struct io_uring_params params = {0};
params.flags = IORING_SETUP_COOP_TASKRUN |
               IORING_SETUP_SINGLE_ISSUER |
               IORING_SETUP_DEFER_TASKRUN;
io_uring_queue_init_params(queue_depth, &ring_, &params);

2. AVX-512/AVX2 SIMD Scanner (Round 1)

Commit: fef0b4d

MetricScalarAVX2Change
Mean~150 ns11.8 ns~13x
Throughput~6.7M/s84.5M/s~13x

Implementation:

// Added AVX-512 support with graceful fallback
#if NFX_AVX512_AVAILABLE
    return scan_soh_avx512(data);  // 64 bytes/iteration
#elif NFX_SIMD_AVAILABLE
    return scan_soh_avx2(data);    // 32 bytes/iteration
#else
    return scan_soh_scalar(data);  // 1 byte/iteration
#endif

3. absl::flat_hash_map (Round 2)

Commit: d674409

Metricstd::unordered_mapabsl::flat_hash_mapChange
Lookup20.0 ns15.2 ns-31%
Insert17.4 ns12.7 ns-37%
P99 Lookup61.4 ns52.3 ns-17%

Implementation:

// memory_message_store.hpp
#if NFX_HAS_ABSEIL
using HashMap = absl::flat_hash_map<K, V>;  // Swiss Tables
#else
using HashMap = std::unordered_map<K, V>;   // Fallback
#endif

4. CPU Core Pinning (Round 3)

Commit: 033a6d1

MetricUnpinnedPinnedChange
Mean15.0 ns14.7 ns-2.0%
P9918.8 ns17.3 ns-7.8%
P99.919.6 ns18.4 ns-6.3%

Implementation:

// cpu_affinity.hpp
AffinityResult CpuAffinity::pin_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

5. Deferred Processing Pattern (Round 3)

Commit: 868e09f

MetricInlineDeferredChange
Median75.6 ns12.3 ns-84%
P9976.7 ns20.7 ns-73%
Queue OverheadN/A11.4 nsbaseline

Implementation:

// deferred_processor.hpp
template<typename BufferType, size_t QueueCapacity>
class DeferredProcessor {
    // Hot path: ~11ns
    bool submit(std::span<const char> data) {
        buffer.set(data, rdtsc());
        return queue_.try_push(std::move(buffer));
    }

    // Background thread handles expensive work
    void process_loop() {
        while (auto msg = queue_.try_pop()) {
            callback(*msg);  // Parse, persist, notify
        }
    }
};

Current Performance (Measured)

ComponentMeanMedianP99Throughput
SIMD SOH Scanner11.8 ns11.6 ns13.0 ns84.5M/s
Hash Map Lookup39.4 ns36.0 ns92.6 ns25.4M/s
SPSC Queue Push11.4 ns11.2 ns16.6 ns88.0M/s

Cumulative Impact

Hot Path Latency Reduction

Using multiplicative model for independent optimizations:

Combined = 1 - (1-0.07) × (1-0.078) × (1-0.31 for lookup path)
         = 1 - 0.93 × 0.922 × 0.69
         = ~41% reduction

Per-Component Gains

ComponentTechniqueGain
Network I/ODEFER_TASKRUN7%
Message ParsingAVX2 SIMD~13x
Message LookupSwiss Tables31%
Thread SchedulingCore Pinning8% P99
Background WorkDeferred Pattern84%

Files Modified

RoundFilesKey Changes
1io_uring_transport.hppDEFER_TASKRUN flags
1simd_scanner.hppAVX-512 support
2CMakeLists.txtAbseil FetchContent
2memory_message_store.hppHashMap alias
3cpu_affinity.hppCore pinning utility
3state.hppSessionConfig affinity fields
3deferred_processor.hppDeferred pattern utility

Commits

HashDescription
3a13654perf(io_uring): Add DEFER_TASKRUN optimization
fef0b4dperf(simd): Add AVX-512 SOH scanner
d674409perf(hashmap): Replace std::unordered_map with absl::flat_hash_map
033a6d1perf(affinity): Add CPU core pinning
868e09fperf(deferred): Add NanoLog-inspired deferred processor

Conclusion

Three rounds of optimization work delivered significant performance improvements:

AspectResult
Hot path latency~41% reduction
SIMD throughput~13x faster
Hash lookups31% faster
P99 tail latency8% reduction
Background offload84% hot path reduction

All optimizations maintain backward compatibility with graceful fallbacks.


References