Memory Ordering
June 27, 2026 ยท View on GitHub
๐ Practice & deep-dive on EmbeddedInterviewLab
Get these architecture concepts as ranked interview questions with model answers, plus interactive deep-dive guides.
๐ Browse MCU & architecture questions โ ย ยทย Browse the MCU & Architecture guides โ
Memory Ordering
Understanding Memory Barriers and Atomic Operations
Comprehensive coverage of memory ordering models, synchronization primitives, and concurrent programming
๐ Table of Contents
- Memory Ordering Fundamentals
- Memory Ordering Models
- Memory Barriers and Fences
- Atomic Operations
- Synchronization Primitives
- Memory Ordering in Multi-core Systems
- Performance Implications
- Best Practices and Guidelines
๐๏ธ Memory Ordering Fundamentals
What is Memory Ordering?
Memory ordering refers to the rules that govern how memory operations from different threads or cores appear to execute relative to each other. In modern multi-core systems, memory operations can be reordered by hardware and software optimizations, making the actual execution order different from the program order.
Understanding memory ordering is crucial for writing correct concurrent programs, as the behavior of multi-threaded applications depends on the memory ordering guarantees provided by the underlying hardware and programming language.
Memory Ordering Philosophy
Memory ordering embodies the principle of relaxed consistency, where the hardware and compiler are allowed to reorder operations for performance optimization, as long as the reordering doesn't violate the specified memory ordering constraints.
This approach provides several benefits:
- Performance: Allows hardware and compiler optimizations
- Scalability: Enables better performance in multi-core systems
- Flexibility: Provides different ordering guarantees for different use cases
- Efficiency: Minimizes unnecessary synchronization overhead
Why Memory Ordering Matters
In single-threaded programs, memory operations appear to execute in program order. However, in multi-threaded programs, this assumption can lead to subtle bugs:
Memory Ordering Problem Example:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Thread 1 Thread 2 โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ x = 1; โ โ while (flag == 0) { } โ โ
โ โ flag = 1; โ โ print(x); โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Problem: Without proper memory ordering, Thread 2 might โ
โ see flag = 1 before x = 1, printing an uninitialized value โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Memory Reordering Types
Memory operations can be reordered in several ways:
- Compiler Reordering: Compiler optimizations that reorder operations
- Hardware Reordering: CPU out-of-order execution and memory subsystem reordering
- Cache Coherency: Different cores seeing operations in different orders
- Memory Controller Reordering: Memory controller optimizing access patterns
๐ Memory Ordering Models
Sequential Consistency
Sequential consistency is the strongest memory ordering model, requiring that all memory operations appear to execute in a single sequential order that respects the program order of each thread.
Sequential Consistency Example:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Thread 1: Thread 2: โ
โ x = 1; y = 1; โ
โ r1 = y; r2 = x; โ
โ โ
โ Possible outcomes under sequential consistency: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ x โ y โ r1 โ r2 โ Description โ โ
โ โ 1 โ 1 โ 0 โ 0 โ Both reads see initial values โ โ
โ โ 1 โ 1 โ 1 โ 0 โ Thread 1 sees Thread 2's write โ โ
โ โ 1 โ 1 โ 0 โ 1 โ Thread 2 sees Thread 1's write โ โ
โ โ 1 โ 1 โ 1 โ 1 โ Both see each other's writes โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Impossible outcome: โ
โ x=1, y=1, r1=0, r2=0 (if both reads happen after both writes)โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Total Store Ordering (TSO)
Total Store Ordering allows store-load reordering but maintains other ordering constraints. This model is used by x86 processors and provides a good balance between performance and programmer expectations.
TSO Memory Ordering:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ TSO Constraints: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Load-Load โ Store-Store โ Store-Load โ โ
โ โ Ordering โ Ordering โ Ordering โ โ
โ โ Maintained โ Maintained โ May be reordered โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Example of allowed reordering: โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Original: store A, load B โ โ
โ โ Reordered: load B, store A โ โ
โ โ (Store-Load reordering allowed) โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Partial Store Ordering (PSO)
Partial Store Ordering allows both store-load and store-store reordering, providing more flexibility for hardware optimization but requiring more careful programming.
Weak Ordering
Weak ordering allows most reorderings but requires explicit synchronization operations to establish ordering constraints. This model provides maximum flexibility for hardware optimization.
Release Consistency
Release consistency provides synchronization at specific points (acquire and release operations) while allowing reordering between these points.
๐ง Memory Barriers and Fences
Memory Barrier Fundamentals
Memory barriers (also called fences) are instructions that enforce ordering constraints on memory operations. They prevent certain types of reordering and ensure that memory operations are visible to other threads in the expected order.
Types of Memory Barriers
Load-Load Barrier (LL)
A load-load barrier ensures that all loads before the barrier complete before any loads after the barrier begin.
Load-Load Barrier Example:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Without Barrier: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ load A โ load B โ load C โ โ
โ โ โ โ โ โ
โ โ Possible reordering: load B, load A, load C โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ With Load-Load Barrier: โ โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ load A โ LL Barrier โ load B โ โ
โ โ โ โ โ โ
โ โ load C โ โ load D โ โ
โ โ โ โ โ โ
โ โ A and C must complete before B and D begin โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Store-Store Barrier (SS)
A store-store barrier ensures that all stores before the barrier complete before any stores after the barrier begin.
Store-Store Barrier Example:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Without Barrier: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ store A โ store B โ store C โ โ
โ โ โ โ โ โ
โ โ Possible reordering: store B, store A, store C โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ With Store-Store Barrier: โ โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ store A โ SS Barrier โ store B โ โ
โ โ โ โ โ โ
โ โ store C โ โ store D โ โ
โ โ โ โ โ โ
โ โ A and C must complete before B and D begin โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Load-Store Barrier (LS)
A load-store barrier ensures that all loads before the barrier complete before any stores after the barrier begin.
Store-Load Barrier (SL)
A store-load barrier ensures that all stores before the barrier complete before any loads after the barrier begin.
Full Memory Barrier (MF)
A full memory barrier ensures that all memory operations before the barrier complete before any memory operations after the barrier begin.
Memory Barrier Implementation
Memory barriers are implemented differently on different architectures:
x86 Memory Barriers
// x86 memory barriers
#include <immintrin.h>
void x86_memory_barriers() {
// Compiler barrier (prevents compiler reordering)
__asm__ __volatile__("" ::: "memory");
// Full memory barrier
_mm_mfence();
// Store barrier
_mm_sfence();
// Load barrier
_mm_lfence();
}
ARM Memory Barriers
// ARM memory barriers
void arm_memory_barriers() {
// Full memory barrier
__asm__ __volatile__("dmb ish" ::: "memory");
// Store barrier
__asm__ __volatile__("dmb ishst" ::: "memory");
// Load barrier
__asm__ __volatile__("dmb ishld" ::: "memory");
}
โ๏ธ Atomic Operations
Atomic Operation Fundamentals
Atomic operations are operations that appear to execute as a single, indivisible unit. They are essential for implementing synchronization primitives and ensuring correct behavior in concurrent programs.
Types of Atomic Operations
Read-Modify-Write Operations
Read-modify-write operations atomically read a value, modify it, and write it back:
- Compare-and-Swap (CAS): Atomically compare and conditionally swap
- Fetch-and-Add: Atomically add a value and return the old value
- Test-and-Set: Atomically set a value and return the old value
Atomic Compare-and-Swap:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CAS Operation: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ bool CAS(T* ptr, T expected, T desired) โ โ
โ โ { โ โ
โ โ if (*ptr == expected) { โ โ
โ โ *ptr = desired; โ โ
โ โ return true; โ โ
โ โ } โ โ
โ โ return false; โ โ
โ โ } โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Usage Example: โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ while (!CAS(&lock, 0, 1)) { โ โ
โ โ // Spin until lock is acquired โ โ
โ โ } โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Load and Store Operations
Atomic load and store operations ensure that the operation is performed atomically:
// C11 atomic operations
#include <stdatomic.h>
void atomic_operations_example() {
atomic_int counter = ATOMIC_VAR_INIT(0);
// Atomic load
int value = atomic_load(&counter);
// Atomic store
atomic_store(&counter, 42);
// Atomic fetch-and-add
int old_value = atomic_fetch_add(&counter, 10);
// Atomic compare-and-swap
int expected = 42;
int desired = 43;
bool success = atomic_compare_exchange_weak(&counter, &expected, desired);
}
Memory Ordering in Atomic Operations
Atomic operations can specify memory ordering constraints:
// Atomic operations with memory ordering
void atomic_with_ordering() {
atomic_int flag = ATOMIC_VAR_INIT(0);
atomic_int data = ATOMIC_VAR_INIT(0);
// Release store: ensures all previous operations are visible
atomic_store_explicit(&data, 42, memory_order_release);
// Acquire load: ensures all subsequent operations are visible
int value = atomic_load_explicit(&data, memory_order_acquire);
// Relaxed operation: no ordering guarantees
atomic_fetch_add_explicit(&flag, 1, memory_order_relaxed);
}
๐ Synchronization Primitives
Mutex Implementation with Memory Ordering
Mutexes can be implemented using atomic operations and memory barriers:
// Simple spinlock mutex implementation
typedef struct {
atomic_int locked;
} spinlock_t;
void spinlock_init(spinlock_t* lock) {
atomic_store(&lock->locked, 0);
}
void spinlock_acquire(spinlock_t* lock) {
while (atomic_exchange_explicit(&lock->locked, 1,
memory_order_acquire)) {
// Spin until lock is acquired
while (atomic_load_explicit(&lock->locked,
memory_order_relaxed)) {
// Optional: yield or pause
}
}
}
void spinlock_release(spinlock_t* lock) {
atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
Semaphore Implementation
Semaphores can also be implemented using atomic operations:
// Binary semaphore implementation
typedef struct {
atomic_int count;
} semaphore_t;
void semaphore_init(semaphore_t* sem, int initial_count) {
atomic_store(&sem->count, initial_count);
}
void semaphore_wait(semaphore_t* sem) {
int expected;
do {
expected = atomic_load(&sem->count);
if (expected <= 0) {
// Wait for signal
continue;
}
} while (!atomic_compare_exchange_weak(&sem->count, &expected, expected - 1));
}
void semaphore_signal(semaphore_t* sem) {
atomic_fetch_add(&sem->count, 1);
}
๐ Memory Ordering in Multi-core Systems
Cache Coherency and Memory Ordering
Cache coherency protocols ensure that all cores see a consistent view of memory, but they don't guarantee memory ordering. Memory barriers are still needed to establish ordering constraints.
Cache Coherency vs. Memory Ordering:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Cache Coherency: โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Ensures all cores see the same value for a memory location โ โ
โ โ Does NOT guarantee the order of operations โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Memory Ordering: โ โ
โ โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Establishes the order of operations โ โ
โ โ Requires explicit memory barriers โ โ
โ โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Memory Ordering in Different Architectures
Different processor architectures provide different memory ordering guarantees:
x86/x64 Architecture
- TSO (Total Store Ordering): Store-load reordering allowed
- Strong memory model: Most reorderings prevented by default
- Explicit barriers:
mfence,sfence,lfenceinstructions
ARM Architecture
- Weak memory model: Most reorderings allowed by default
- Explicit barriers:
dmb,dsb,isbinstructions - Load-acquire/Store-release: Built-in ordering semantics
PowerPC Architecture
- Weak memory model: Most reorderings allowed by default
- Explicit barriers:
sync,lwsync,ptsyncinstructions - Load-reserved/Store-conditional: Atomic operations
โก Performance Implications
Memory Barrier Performance Cost
Memory barriers have performance costs that vary by architecture:
- Pipeline Stalls: Barriers can cause pipeline stalls
- Memory Access Serialization: Some barriers serialize memory access
- Cache Effects: Barriers can affect cache behavior
- Compiler Limitations: Barriers limit compiler optimizations
Optimization Strategies
Several strategies can minimize the performance impact of memory barriers:
Barrier Placement Optimization
// Minimize barrier usage
void optimized_synchronization() {
// Do work that doesn't need ordering
int local_result = compute_something();
// Place barrier as late as possible
atomic_thread_fence(memory_order_release);
// Only the final result needs ordering
atomic_store(&shared_result, local_result);
}
Batching Operations
// Batch operations to reduce barrier overhead
void batched_operations() {
// Collect multiple updates
int updates[10];
for (int i = 0; i < 10; i++) {
updates[i] = compute_update(i);
}
// Single barrier for all updates
atomic_thread_fence(memory_order_release);
// Apply all updates atomically
for (int i = 0; i < 10; i++) {
atomic_store(&shared_data[i], updates[i]);
}
}
๐ฏ Best Practices and Guidelines
General Memory Ordering Guidelines
- Use the Weakest Ordering Possible: Choose the weakest memory ordering that provides the required guarantees
- Understand Your Target Architecture: Different architectures have different default behaviors
- Test Thoroughly: Memory ordering bugs can be subtle and architecture-dependent
- Document Assumptions: Clearly document memory ordering requirements
Common Pitfalls to Avoid
- Assuming Sequential Consistency: Don't assume operations execute in program order
- Ignoring Compiler Reordering: Compiler optimizations can reorder operations
- Mixing Memory Models: Be consistent with memory ordering within a program
- Over-synchronization: Don't use stronger ordering than necessary
Debugging Memory Ordering Issues
Memory ordering issues can be difficult to debug:
- Use Memory Ordering Tools: Tools like ThreadSanitizer can detect some issues
- Stress Testing: Run concurrent tests under various system loads
- Architecture-Specific Testing: Test on different processor architectures
- Formal Verification: Use formal methods for critical concurrent code
๐ Further Reading and Resources
- Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney
- The Art of Multiprocessor Programming by Herlihy and Shavit
- C++ Concurrency in Action by Anthony Williams
- ARM Architecture Reference Manual
- Intel 64 and IA-32 Architectures Software Developer's Manual
This comprehensive guide to Memory Ordering provides the foundation for understanding how modern multi-core systems handle concurrent memory access. The concepts covered here are essential for embedded software engineers working with concurrent programming and understanding the behavior of multi-threaded applications.