NexusFIX Optimization Diary
March 27, 2026
How we achieved a 3x performance improvement over QuickFIX: from 730 ns to 246 ns per parsed message.
Executive Summary
| Phase | Technique | Before | After | Improvement |
|---|---|---|---|---|
| 1 | Zero-Copy Parsing | 730 ns | 520 ns | 1.4x |
| 2 | O(1) Field Lookup | 520 ns | 380 ns | 1.4x |
| 3 | SIMD Delimiter Scan | 380 ns | 290 ns | 1.3x |
| 4 | Compile-Time Offsets | 290 ns | 260 ns | 1.1x |
| 5 | Cache-Line Alignment | 260 ns | 246 ns | 1.05x |
| Total | | 730 ns | 246 ns | 3.0x |
Phase 1: Zero-Copy Parsing
Problem
QuickFIX copies every field value into std::string:
// QuickFIX approach - constructs std::string for each field
std::string orderID = message.getField(37); // string construction + memcpy
std::string symbol = message.getField(55); // string construction + memcpy
Each getField() call triggers:
- std::string construction with memcpy into internal buffer
- Heap allocation for fields exceeding SSO capacity (typically >15 bytes on libstdc++/MSVC, >22 bytes on libc++)
- Non-deterministic latency: short fields (e.g. "AAPL") hit SSO, longer fields (e.g. UUIDs, FreeText) allocate on heap - this variance is problematic for HFT
- Destructor overhead for each std::string object
Solution
Use std::span<const char> to create views into the original buffer:
// NexusFIX approach - zero allocation
std::span<const char> orderID = message.get_view(Tag::OrderID); // just pointer + length
std::span<const char> symbol = message.get_view(Tag::Symbol); // no copy
Why It Works
- std::span is just 16 bytes (pointer + size) on stack on a 64-bit target
- No heap allocation, no malloc() call
- CPU cache stays hot - data never moves
Result
730ns → 520ns (1.4x improvement)
Phase 2: O(1) Field Lookup
Problem
QuickFIX uses std::map<int, std::string> for field storage:
// QuickFIX internal structure
std::map<int, std::string> fields_; // O(log n) lookup, ~5-7 comparisons
Tree traversal causes:
- Multiple pointer dereferences
- Cache misses on each node visit
- Branch mispredictions
Solution
Pre-indexed array by tag number:
// NexusFIX internal structure
struct FieldEntry {
uint16_t offset; // position in buffer
uint16_t length; // field length
};
std::array<FieldEntry, 1024> fields_; // O(1) direct indexing
Lookup is single array access:
auto& entry = fields_[tag]; // One memory access
return std::span{buffer + entry.offset, entry.length};
Why It Works
- Direct indexing: fields_[37] compiles to single mov instruction
- Array is cache-friendly - sequential memory layout
- No branch mispredictions
Result
520ns → 380ns (1.4x improvement)
Phase 3: SIMD Delimiter Scanning
Problem
FIX messages use SOH (\x01) as field delimiter. Sequential scanning:
// Traditional approach - 1 byte per iteration
for (size_t i = 0; i < len; ++i) {
if (buffer[i] == '\x01') {
// found delimiter
}
}
Processing 1 byte per iteration on modern CPU is wasteful.
Solution
AVX2 SIMD processes 32 bytes simultaneously:
// NexusFIX approach - 32 bytes per iteration
// Performance: AVX2 compares 32 bytes per iteration vs 1 byte per iteration sequentially
// The scan is still O(n), but needs roughly 32x fewer loop iterations
__m256i soh = _mm256_set1_epi8('\x01');
__m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ptr));
__m256i cmp = _mm256_cmpeq_epi8(chunk, soh);
uint32_t mask = _mm256_movemask_epi8(cmp);
if (mask) {
return ptr + __builtin_ctz(mask); // position of first SOH
}
Why It Works
- Single instruction compares 32 bytes
- _mm256_movemask_epi8 extracts comparison results to 32-bit integer
- __builtin_ctz finds first set bit in a few cycles (single TZCNT/BSF instruction)
Result
380ns → 290ns (1.3x improvement)
Phase 4: Compile-Time Field Offsets
Problem
Runtime offset calculation for well-known fields:
// Runtime calculation
int offset = calculate_header_size(version); // branches, function call
Solution
consteval computes offsets at compile time:
// Compile-time calculation
consteval size_t header_offset() {
// 8=FIX.4.4 | 9=xxx | 35=x | ... ('|' stands for the SOH delimiter)
return 9 + 1 + 5 + 1 + 4 + 1; // each "tag=value" length plus its SOH; computed during compilation
}
static constexpr size_t HEADER_OFFSET = header_offset();
Why It Works
- Zero runtime cost - offset is embedded as immediate value
- Compiler can optimize subsequent code knowing exact value
- No branch for version checking in hot path
Result
290ns → 260ns (1.1x improvement)
Phase 5: Cache-Line Alignment
Problem
Hot data structures crossing cache line boundaries:
struct ParseState {
char* buffer; // 8 bytes
size_t position; // 8 bytes
size_t length; // 8 bytes
FieldTable fields; // 4096 bytes - crosses cache lines
};
Solution
Align hot data to 64-byte cache lines:
struct alignas(64) ParseState {
// Hot data - first cache line
char* buffer;
size_t position;
size_t length;
uint32_t field_count;
// Cold data - separate cache line
alignas(64) FieldTable fields;
};
Also applied [[gnu::hot]] to critical functions:
[[gnu::hot]] [[nodiscard]]
auto parse(std::span<const char> buffer) noexcept -> ParseResult;
Why It Works
- Hot data fits in single cache line (64 bytes)
- [[gnu::hot]] hints compiler to optimize for speed over size
- Prevents false sharing in multi-threaded scenarios
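Layout claims like these can be checked at compile time. The sketch below repeats the struct with a stand-in FieldTable (assumed here to be the 1024-entry array from Phase 2, which is 4096 bytes) and asserts that the hot members fit in the first 64 bytes and that the table starts on its own cache line.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct FieldEntry {
    std::uint16_t offset;
    std::uint16_t length;
};
// Assumption: FieldTable is the 1024-entry array from Phase 2 (4096 bytes).
using FieldTable = std::array<FieldEntry, 1024>;

struct alignas(64) ParseState {
    // Hot data - first cache line
    char*         buffer;
    std::size_t   position;
    std::size_t   length;
    std::uint32_t field_count;
    // Cold data - starts on the next cache-line boundary
    alignas(64) FieldTable fields;
};

static_assert(alignof(ParseState) == 64);
// All hot members end within the first 64 bytes.
static_assert(offsetof(ParseState, field_count) + sizeof(std::uint32_t) <= 64);
// The table begins exactly at the second cache line.
static_assert(offsetof(ParseState, fields) == 64);
static_assert(sizeof(ParseState) == 64 + sizeof(FieldTable));
```

If someone later adds a member that pushes the hot fields past 64 bytes, the build fails instead of the latency quietly regressing.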
Result
260ns → 246ns (1.05x improvement)
Additional Optimizations
Branch Hints
if (tag == Tag::MsgType) [[likely]] {
// Most messages have MsgType early
return fast_path();
} else [[unlikely]] {
return slow_path();
}
Restrict Pointers
void parse(const char* __restrict input,
FieldEntry* __restrict output) {
// Compiler knows input and output don't alias
// Enables better vectorization
}
Link-Time Optimization
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE) # -flto
Enables cross-file inlining and dead code elimination.
Lessons Learned
- Measure First - RDTSC + perf counters before any optimization
- Memory is King - Most gains came from eliminating allocations
- Cache Matters - Data layout impacts performance more than algorithms
- Compiler is Smart - consteval and LTO let compiler do heavy lifting
- SIMD Carefully - Only where data is naturally parallel (delimiter scanning)
Benchmark Methodology
# CPU isolation
taskset -c 0 ./benchmark
// Warm-up (C++, inside the benchmark binary)
for (int i = 0; i < 10000; ++i) parse(msg); // I-Cache warming
// Measurement
auto start = rdtsc_fenced();
for (int i = 0; i < 100000; ++i) parse(msg);
auto end = rdtsc_fenced();
// Statistics
// Report P50, P99, P999, min, max
All measurements on isolated CPU core with governor set to performance.
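Reporting percentiles requires one sample per iteration rather than a single start/end pair around the whole loop. The sketch below shows one possible shape for that harness. It swaps the rdtsc_fenced() timer for std::chrono::steady_clock to stay portable, and the names LatencyStats, percentile and measure are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LatencyStats {
    std::int64_t min, p50, p99, p999, max;  // nanoseconds
};

// Simple index-based percentile on a sorted, non-empty sample:
// picks element floor(q * (n - 1)) for q in [0, 1].
inline std::int64_t percentile(const std::vector<std::int64_t>& sorted,
                               double q) {
    auto idx = static_cast<std::size_t>(
        q * static_cast<double>(sorted.size() - 1));
    return sorted[idx];
}

// Runs `fn` `warmup` times untimed, then times each of `iterations` calls.
// Precondition: iterations > 0.
template <typename Fn>
LatencyStats measure(Fn&& fn, std::size_t warmup, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    for (std::size_t i = 0; i < warmup; ++i) fn();  // warm caches and predictors

    std::vector<std::int64_t> samples;
    samples.reserve(iterations);  // no allocation inside the timed loop
    for (std::size_t i = 0; i < iterations; ++i) {
        auto t0 = clock::now();
        fn();
        auto t1 = clock::now();
        samples.push_back(static_cast<std::int64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
                .count()));
    }
    std::sort(samples.begin(), samples.end());
    return LatencyStats{samples.front(),
                        percentile(samples, 0.50),
                        percentile(samples, 0.99),
                        percentile(samples, 0.999),
                        samples.back()};
}
```

A call such as measure([&] { parse(msg); }, 10000, 100000) would mirror the warm-up and iteration counts above. Note that the clock call itself can cost tens of nanoseconds, which is significant next to a ~250 ns parse, so a fenced TSC read with measured overhead subtracted is likely the better timer for the final numbers.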
References
- Modern C++ for Quantitative Trading - Full technique catalog
- Benchmark Report - Detailed comparison data
- Intel Intrinsics Guide - AVX2 instruction reference