I know that a lot of people struggle with understanding the memory model introduced in C11/C++11. In a recent conversation I was alerted to the existence of this blog post, which while correct (by my understanding) is in my opinion aimed at those who already have a good understanding of some of the underlying concepts. I’m going to try to set out my understanding of the model, which is hopefully correct, in more straightforward terms and with a more “ground up” explanation.
Edit: some redditors complained that the description of cache operation (such as it was) in the original version of this post contained technical inaccuracies, if interpreted as a description of current actual mainstream architecture. I believe those inaccuracies were not important for the understanding of the C/C++ memory model, however, they have now been rectified. Still: If you really want to understand anything about how processor architecture works, this is not the article for you. Rather, this is about a software model that is designed to abstract away much of that detail, and is deliberately light on hardware-level specifics.
Introduction: Program order versus memory order
You have your C/C++ program and it stores values (sets a variable or stores a value through a pointer) and it loads values (reads a variable or reads a value through a pointer). The order in which your program specifies these loads and stores is the “program order”. However, there are a number of things that can cause the order in which the stores and loads actually occur to your system’s main memory to be different to that of the program order. It’s worth noting that the effects of these can never be visible to a single-threaded program, because the compiler must make sure that the code behaves as if the loads/stores are in the order that the program specifies. Some examples:
- the compiler will often cache values in registers rather than storing them to memory immediately;
- the compiler may re-order reads and writes to memory when it can determine that such re-ordering won’t affect program behaviour (so long as the program obeys the language rules);
- the processor may perform out-of-order execution in some cases;
- even when the generated code stores values to memory rather than a register, the store might go into a processor cache and stay there for some time before it goes through to main memory;
- similarly, loads may read values which are already in the cache, which means they don’t need to be read from main memory (at all);
- finally, when a cache writes lines back out to main memory it may do so in an order that is different to the order in which the values entered the cache.
The good news is, you don’t really need to understand any of the above points: all you need to understand is that loads and stores can be re-ordered before they go to memory, but this won’t affect the actual behaviour of a single-threaded program.
Now for the important part: in a multi-threaded program, some of these re-orderings – together with some additional issues associated with the methods used by the hardware for maintaining cache coherency – can also cause different threads to see loads and stores made by other threads as occurring in a different order. It’s probably worth noting that a program in which this actually happened could well be exhibiting undefined behaviour – what should happen is that the program uses the memory consistency primitives to make sure that it sees a “correct” ordering. We’re coming to that now.
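To make the hazard concrete, here is a deliberately broken sketch of the classic “message passing” pattern (the variable and function names are mine; this program has a data race, and therefore undefined behaviour):

```cpp
int data = 0;
int ready = 0;

// Intended to run on thread A:
void producer() {
    data = 42;   // (1) store the payload
    ready = 1;   // (2) signal that it is ready
}

// Intended to run on thread B:
void consumer() {
    while (ready != 1) { }  // (3) wait for the signal
    int r = data;           // (4) hope to see 42
    (void)r;
}
```

Nothing here stops (1) and (2) from appearing re-ordered as seen by thread B – indeed, the compiler is free to hoist the load of ready out of the loop entirely, so thread B may spin forever. The rest of this article is about the tools that make this sort of pattern safe.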
Atomic operations
If we forget about standard synchronisation tools such as mutexes, then there are two essential mechanisms to making ordering consistent between threads: atomic operations, and memory barriers.
Atomic operations are operations which complete indivisibly. One classic example is the “atomic increment”. A “classic” (non-atomic) increment can be broken into three separate parts: it loads a value from memory, it increments the value, and then it stores the value back to the same location in memory. In multi-threaded programs, the problem with this is that the value might be changed by another thread in between the load and the store; the store based on the stale value then overwrites the other thread’s update, which is lost. In particular consider two threads that are both trying to increment a value starting at 0:
- Thread A loads the value 0
- Thread B loads the value 0
- Thread A increments the value, calculating a result of 1
- Thread B also calculates a result of 1
- Thread A stores the value 1
- Thread B stores the value 1
The problem is that the value has gone from 0 to 1, despite there being two increments – one of the increments got lost. (Again, if this actually happened in a C/C++ program, that program would be exhibiting undefined behaviour). An atomic increment happens as one step, atomically (indivisibly). So, to avoid the problem above, you could use an atomic integer type and an atomic increment operation.
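As a minimal sketch of the fix, here is a complete program (assuming a C++11 compiler) in which two threads each perform 100,000 atomic increments and none are lost:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            // fetch_add performs the load, increment and store as a
            // single indivisible operation, so no increment can be lost.
            counter.fetch_add(1);
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter << '\n';  // always prints 200000
    return 0;
}
```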
Another classic atomic operation is “compare and exchange”. This can be used to implement a spin lock (a kind of mutex) – you treat a value of 0 as “unlocked” and a value of 1 as “locked”. Then, to lock the mutex, a thread can atomically compare the current value with 0 and, if it is equal to 0, set it to 1. The operation returns the original value – if 0, the lock was successfully acquired; if 1, another thread holds the lock and the attempt must be retried until that thread releases it (typically this would be done in a loop, “spinning” on the lock until it becomes available).
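Here is a sketch of such a spin lock using std::atomic; the names lock_word, lock and unlock are my own, and the operations use the default memory ordering constraint (discussed further below):

```cpp
#include <atomic>

std::atomic<int> lock_word{0};  // 0 = unlocked, 1 = locked

void lock() {
    int expected = 0;
    // Atomically: if lock_word == 0, set it to 1 and return true;
    // otherwise write the current value into 'expected' and return false.
    while (!lock_word.compare_exchange_strong(expected, 1)) {
        expected = 0;  // reset and keep spinning until the lock is free
    }
}

void unlock() {
    lock_word.store(0);
}
```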
Before we go any further, let’s note the following important points about atomic operations:
- Atomic operations on a particular object (variable) always have a total ordering which is consistent with program ordering. This means that every atomic operation from every thread which operates on the same atomic variable happens either before or after every other such operation, and if a thread does two operations one after the other then they will appear in the same sequence in the total ordering.
- If an atomic operation is performed in one thread and another thread then performs another atomic operation on the same variable, the second thread will definitely see the effect of the operation performed by the first (assuming there are no other threads stepping in to perform another operation in the meantime).
What atomic operations do not necessarily do is enforce ordering on any loads and stores that occur around them (I say necessarily because, as you may be aware, you can specify memory ordering constraints when you perform an atomic operation*. We’re getting to that).
Consider the case of the spin lock that we discussed above: the lock is probably designed to protect some data structure. So you might have a sequence that looks like this:
- Thread A acquires lock
- Thread A stores values to data structure
- Thread A releases lock
- Thread B acquires lock
- Thread B reads values from data structure
- Thread B releases lock
The problem is that the atomic operations used to implement the lock by themselves don’t impose ordering on steps 2 and 5, so thread B might not see the values written by thread A, or even worse, might see some of the values but not all of them (this is a data race, and is not actually allowed to happen in a correct program; a correct program would use the appropriate memory ordering controls to prevent it from happening).
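To illustrate: if the lock operations were performed with std::memory_order_relaxed (i.e. with no ordering constraints at all), the writing thread might look like the sketch below, in which nothing stops the protected store from drifting outside the critical section (the names are again my own):

```cpp
#include <atomic>

std::atomic<int> lock_word{0};
int shared_value;  // the data the lock is meant to protect

void writer() {
    int expected = 0;
    while (!lock_word.compare_exchange_strong(
            expected, 1, std::memory_order_relaxed)) {
        expected = 0;
    }
    shared_value = 42;  // may be re-ordered past the lock operations!
    lock_word.store(0, std::memory_order_relaxed);
}
```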
To prevent re-ordering of non-atomic loads and stores, we use barriers (aka fences).
Barriers
A barrier prevents loads and stores from being re-ordered past the barrier, in one or in both directions. Specifically: an acquire barrier operation prevents loads/stores which are after the barrier from being re-ordered to appear before the barrier, and a release barrier prevents loads/stores which are before the barrier from being moved after the barrier (there are other types of barrier operation, but I won’t cover them here right now).
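As a sketch of how such barriers are used, here is the “message passing” pattern from the introduction, fixed with standalone fences (this relies on the slightly stronger guarantees of standalone barriers discussed a little further below; the names are my own):

```cpp
#include <atomic>

std::atomic<int> ready{0};
int data;

void producer() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);  // release barrier
    ready.store(1, std::memory_order_relaxed);
}

void consumer() {
    while (ready.load(std::memory_order_relaxed) != 1) { }
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire barrier
    int r = data;  // guaranteed to see 42
    (void)r;
}
```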
Through the use of appropriate barriers, we can almost fix our spinlock example:
- Thread A acquires lock
- Thread A issues an acquire barrier
- Thread A stores values to the data structure
- Thread A issues a release barrier
- Thread A then releases the lock
- Thread B acquires the lock
- Thread B issues an acquire barrier
- Thread B reads values from the data structure
- Thread B issues a release barrier
- Thread B then releases the lock
There is still one minor issue, which is that the operations which actually acquire and release the lock could be moved past their corresponding barriers (e.g. step 5 could be re-ordered in front of step 4). There are two ways that C++11 resolves this:
First, you can combine an atomic operation which acquires or releases the lock with a barrier, so that the operation and the barrier are essentially indivisible (this is done by passing the appropriate memory ordering constraint constant as a parameter to the atomic operation);
Second, a standalone barrier (std::atomic_thread_fence) actually acts as a stronger barrier than what is stated above. Specifically, a standalone release barrier actually also prevents subsequent stores – including atomic stores – from being moved in front of the barrier, and a standalone acquire barrier also prevents earlier loads from being moved behind the barrier.
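Putting this together, here is the spin lock from earlier with the barriers folded into the atomic operations via the first method; a sketch under the same naming assumptions as before:

```cpp
#include <atomic>

std::atomic<int> lock_word{0};

void lock() {
    int expected = 0;
    // Acquire semantics: the lock operation and the acquire barrier are
    // indivisible, so protected accesses cannot move above the lock.
    while (!lock_word.compare_exchange_strong(
            expected, 1, std::memory_order_acquire)) {
        expected = 0;
    }
}

void unlock() {
    // Release semantics: protected accesses cannot move below the unlock.
    lock_word.store(0, std::memory_order_release);
}
```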
Conclusion and key take-aways
Some key points that I wanted to cover which some more advanced tutorials tended to gloss over:
- Atomic operations and memory ordering constraints are really two separate things;
- Memory barriers are useless without some kind of synchronisation (which atomic operations provide). Memory barriers do not themselves provide any synchronisation. Atomic operations themselves do not provide a memory barrier;
- Atomic operations can have a memory ordering constraint specified as a parameter. This effectively creates a barrier of the specified type, and forces the atomic operation to stick on the appropriate side of that barrier.
There are of course a lot of things I didn’t cover here, but there’s plenty of other material out there which covers the concepts in more detail. I hope the explanations here, which I’ve tried to reduce to the bare essentials, are helpful to some.
* Edit: it should be noted that the default ordering constraint on atomic operations is actually memory_order_seq_cst, which implies both acquire and release semantics (depending on the operation) and provides additional guarantees beyond those. So, while an atomic operation doesn’t itself imply a memory barrier, an atomic operation with the default memory ordering constraint does. It may be preferable, however, to specify acquire/release semantics explicitly, since the stronger default constraint can affect performance.
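For example (a minimal sketch):

```cpp
#include <atomic>

std::atomic<int> x{0};

void example() {
    x.fetch_add(1);  // default: std::memory_order_seq_cst
    x.fetch_add(1, std::memory_order_acq_rel);  // explicit, potentially cheaper
}
```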