My previous blog post, Understanding the C/C++ memory model, received (and still receives) more attention than I ever thought it would, so I feel obliged to add a follow-up expanding on the topic a little more.
First, a few things that might not have been entirely clear or were missing from the previous post:
- The post focused on acquire and release memory ordering constraints for atomic operations (and memory barriers), but the language model provides various other memory ordering constraint options. While some of these should be reasonably obvious (memory_order_relaxed, which essentially implies no constraints, and memory_order_acq_rel, which combines both acquire and release semantics in a single operation), some others warrant explanation. (A short sketch illustrating these options follows this list.)
- In case it was not clear, the default memory ordering constraint for the atomic operations provided by the standard library is actually memory_order_seq_cst. Importantly, if this default is used exclusively, some of the issues applicable to the use of acquire/release orderings are no longer of concern. However, this ordering potentially imposes a performance penalty if used when it is not actually required.
- In some cases acquire/release is not enough and we do need to make use of a stronger ordering constraint.
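For concreteness, here is a minimal sketch – entirely my own, not taken from either post – of how these ordering constraints are spelled with the standard atomics:

```cpp
#include <atomic>

std::atomic<int>  counter{0};
std::atomic<bool> ready{false};

void example()
{
    // No ordering constraints: fine for, e.g., a simple statistics counter.
    counter.fetch_add(1, std::memory_order_relaxed);

    // Acquire and release semantics combined in one read-modify-write.
    counter.fetch_add(1, std::memory_order_acq_rel);

    // No explicit ordering given: defaults to memory_order_seq_cst.
    ready.store(true);
}
```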
However, with all that in mind, in this post I mostly want to focus on presenting a more detailed model for understanding exactly what acquire and release entail. In particular, I aim to present a comprehensible model of processor hardware – even if the model is not accurate to any particular hardware – that allows the C/C++ memory model to be understood without relying solely on language-defined semantics. I’m not claiming that this model is complete or even entirely correct, but I believe it may help in understanding acquire/release semantics.
A hardware model to explain acquire/release
So, bearing in mind that this is not what real processor architecture necessarily looks like, let’s suppose that each processor (or processor core) is connected directly to a pair of caches (or buffers if you prefer) which are then connected to the main memory of the system:
[Diagram: each processor core connects to its own store buffer and read buffer, which in turn connect to the system’s main memory.]
Note that I’ve opted to call them “buffers” in the diagram, since I’ve learned that describing them as caches might lead to pedants claiming that the model is wrong. It’s worth bearing in mind, though, that (a) many people will best understand the behaviour of the buffers as being that of a cache, and (b) this model is not supposed to accurately represent any particular hardware.
I hope that the function of the store buffer and read buffer in the diagram is obvious: writes performed by the processor core go through the store buffer before reaching main memory, and reads need to go through the read buffer if the value to be read isn’t otherwise already available in either buffer. Both buffers can hold multiple values, each value corresponding to a different address. At arbitrary times and in unspecified order, values from the read buffer can be discarded, and values from the store buffer can be flushed to main memory.
Note that the buffers allow reads and writes to be observed in different orders by different threads (cores), but a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.
What may be a little confusing is that the link between the processor core and each buffer is depicted as two-way. In fact, this is necessary for maintaining consistency between the buffers: if a single core stores a value to a particular address and then reads that same address, it shouldn’t be possible for the read to load a value from main memory before the prior store has actually reached main memory. So in fact, a store operation from the processor core both updates (or inserts) the value for the corresponding address in the store buffer, and updates any existing value for that address in the read buffer.
With all this considered, how do acquire and release operations tie in? Quite simply: an acquire operation forcefully flushes the read buffer (discarding its contents), and a release operation forcefully flushes the store buffer (writing its contents to main memory). Importantly, this means that for a value written by one thread to be reliably seen by another, the write must be followed by a release in the writing thread, and the reading thread must perform an acquire before it reads the value. This is what the language model itself requires.
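To make the model concrete, here is a toy sketch of the buffers as just described. It is purely illustrative – every name is my own invention, it ignores the spontaneous flushing and discarding mentioned earlier, and it makes no attempt to model real hardware:

```cpp
#include <cstdint>
#include <unordered_map>

using addr_t  = std::uintptr_t;
using value_t = int;

// Shared between all cores in the model.
std::unordered_map<addr_t, value_t> main_memory;

// Per-core state in the model.
struct core {
    std::unordered_map<addr_t, value_t> store_buffer;
    std::unordered_map<addr_t, value_t> read_buffer;

    void store(addr_t a, value_t v) {
        store_buffer[a] = v;          // value waits here until flushed
        auto it = read_buffer.find(a);
        if (it != read_buffer.end())
            it->second = v;           // keep the read buffer consistent
    }

    value_t load(addr_t a) {
        auto s = store_buffer.find(a);
        if (s != store_buffer.end())
            return s->second;         // a core always sees its own stores
        auto r = read_buffer.find(a);
        if (r != read_buffer.end())
            return r->second;
        return read_buffer[a] = main_memory[a];  // pull from main memory
    }

    void acquire() { read_buffer.clear(); }  // discard possibly-stale values

    void release() {                         // write pending stores to memory
        for (auto& [a, v] : store_buffer)
            main_memory[a] = v;
        store_buffer.clear();
    }
};
```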
What about atomic operations themselves? Unfortunately they are a bit hard to explain directly in terms of this simple model. We could say that atomic operations simply bypass the buffers and operate directly on main memory; the main issue with this is that it doesn’t allow for re-ordering of atomic operations on different addresses, which the language model does allow (although the orderings must still obey the associated constraints).
What happens if two processor cores have a store in their store buffer to the same address? Well, that’s most likely a race condition; similarly if a store in one processor buffer matches a read from another.
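Tying the model back to the language level: the canonical release/acquire pairing, where one thread publishes a value and another consumes it, looks something like the following (a minimal sketch; the variable names are mine):

```cpp
#include <atomic>

int payload;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // plain write
    ready.store(true, std::memory_order_release);   // "flush the store buffer"
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // "flush the read buffer"
        ;                                           // spin until published
    // The acquire above synchronizes with the release in producer(),
    // so this read is guaranteed to see 42.
    int v = payload;
    (void)v;
}
```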
Sequential Consistency
The default ordering constraint for atomic operations provided by the C/C++ standard library is memory_order_seq_cst, where “seq_cst” is a shortening of “sequentially consistent”. Put simply, an atomic operation with sequential consistency implies both acquire semantics (if the operation is a load) and release semantics (if the operation is a store), as well as preventing other atomic operations (of any kind) from being re-ordered with respect to the sequentially consistent operation.
Why is this ever necessary, though? Well, consider the example in the previous post of a mutual exclusion device (mutex) implemented via a spinlock, using a compare-and-exchange operation to acquire the mutex and a store to release it. Previously I suggested that acquiring the mutex could use (naturally enough) acquire semantics and releasing it could use, similarly, release semantics. In the normal case of a single mutex protecting some data structure, this is generally correct.
Sometimes, however, more complex applications can require more than a single mutex to be locked at the same time. With careless design, this can lead to deadlocks: if there are two mutexes, A and B, and one thread locks them in [A,B] order while another does so in [B,A] order, you potentially get the situation where each thread holds one mutex and requires the other (but can never obtain it). A typical approach to avoiding such problems is to impose a consistent ordering of mutex acquisition: the program could be designed, for instance, to always acquire the mutexes in the order [A,B]. Of course it is still possible to acquire either mutex alone without causing deadlock, so long as the B mutex is released before acquiring the A mutex: the important thing is that the mutexes are never acquired in the order [B,A].
In our particular example, even though atomic operations may be re-ordered, the acquisition of the B mutex will not be re-ordered with respect to the acquisition of the A mutex, since no store or load (including via an atomic operation) can move before an acquire operation. However, the release of one mutex can be re-ordered with respect to the acquisition of another (by moving the release from before the acquire to after it), and this is problematic. Suppose that we have the following sequence in one thread:
- acquire B
- release B
- acquire A
- release A
This should be ok – it doesn’t acquire the mutex A while B is held – but as has just been suggested, this sequence could be re-ordered (by the compiler or processor) as follows:
- acquire B
- acquire A
- release B
- release A
If another thread is obtaining both mutexes in the prescribed order, A and then B, this can lead to deadlock! The solution is to use sequentially consistent ordering, which prevents any atomic operation from being re-ordered with respect to the sequentially consistent operation. Note that it’s not necessary to make every mutex operation sequentially consistent; I’ll leave it as an exercise to decide on the best solution.
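To make this concrete, here is one possible shape for such a spinlock – a hedged sketch, not a definitive answer to the exercise, with class and member names of my own choosing. Both operations are left at the default sequentially consistent ordering, which rules out the problematic re-ordering; deciding which of them could safely be weakened is the exercise in question:

```cpp
#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // memory_order_seq_cst by default.
        while (!locked.compare_exchange_weak(expected, true)) {
            expected = false;  // compare_exchange_weak updates it on failure
        }
    }
    void unlock() {
        locked.store(false);   // memory_order_seq_cst by default
    }
};
```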
Edit 3 April: Preshing believes that the problematic re-ordering detailed above isn’t allowed by the C++ language standard (he doesn’t mention C, and I’ve not checked whether it has similar wording). I’m not entirely convinced that the paragraph he quotes necessarily means this is the case, nor that it is intended to, though I’ve not yet given it full consideration. Another example of where sequential consistency makes a difference is given in an answer to this stackoverflow question.
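A well-known pattern where sequential consistency makes an observable difference – not necessarily the exact code in that answer, but the same idea in sketch form – is the following:

```cpp
#include <atomic>

std::atomic<bool> x{false}, y{false};
bool r1, r2;

void thread1() {
    x.store(true);   // memory_order_seq_cst by default
    r1 = y.load();
}

void thread2() {
    y.store(true);
    r2 = x.load();
}

// With sequential consistency there is a single total order over all four
// operations, so at least one of r1 and r2 must end up true. With only
// release stores and acquire loads, each thread's load may effectively move
// before its own store, and both r1 and r2 can be false.
```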
Conclusion
The hardware model I’ve presented is hopefully simple enough to understand while providing comprehensible rationale for the particulars of the release/acquire semantics provided by the language model. It may however not be correct in every aspect and is not complete; use it only for what it is. I’ve also discussed sequential consistency and why it is sometimes necessary to avoid deadlocks.
I haven’t discussed the “consume” memory ordering, and may leave that for another time, but I encourage anyone who’s interested to seek out explanations elsewhere. I hope that this post and the one preceding it are enough to give a good grounding in the issues addressed by the memory model.
Comments
I’m curious about the use of the word “thread”, for instance in this passage:
“Note that the buffers allow reads and writes to be observed in different orders by different threads (cores), but a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.”
By “thread” do you really mean “core”? Or, if you really do mean “thread”, is there some mechanism that ensures that as threads get moved between cores their non-observed state gets moved also?
Never mind — I see that you are apparently using “thread” and “core” interchangeably. That’s a bit confusing if you ask me.
That aside, thanks for a nice write-up — it’s been quite helpful in understanding some of the (very) un-intuitive behavior exhibited by modern CPUs.
You’re right that I could/should have made the relationship between cores and threads more explicit, though for the purposes of understanding the language memory model you can assume that each thread runs on its own core. In the hardware model I’ve presented, if you wanted to allow for migration of threads between (a potentially limited number of) cores, then a context switch – that is, a switch to running a different thread on some core – would require that the read and store buffers be flushed (which would need to be arranged by the system).
To my mind, it feels more natural to think in terms of a cache in which the state of every address is “absent”, “clean”, “dirty”, or “poisoned”. An acquire would set all clean or poisoned entries to absent, and a release would convert all dirty entries to clean ones (by writing them to memory). Such a model could handle a couple of situations which yours presently does not:
(1) Given the sequence “read X (yielding value #1); write value #2 to X; release; read X”, the read buffer would receive the value #1 from the first read, and nothing that happens later would change that. The act of writing value #2 needs to remove value #1 from the read buffer. With a single cache, that would happen implicitly; if using two caches, you would need to specify the means by which it occurs.
(2) Given “core #1 write; core #2 acquire; core #1 release; core #2 read” (using other intervening operations, not involving the address in question, to ensure that the core #1 release occurs before the core #2 read), your model would suggest that core #2’s acquire would flush the read buffer, which would thus be empty when it performs the read. I’m pretty sure, though, that core #2 would be allowed to load data from memory into the read buffer at any arbitrary time between the acquire and the read request. Saying that spontaneous loads cannot occur for dirty addresses seems cleaner than saying that data can only appear spontaneously in the read buffer when no corresponding data is in the write buffer.
Perhaps the model with separate read/write buffers can be adjusted to handle all corner cases, but I think the description above misses important cases which would be awkward to describe under that model.
For (1), I think this is handled by:
So in fact, a store operation from the processor core both updates (or inserts) the value for the corresponding address in the store buffer, and updates any existing value for that address in the read buffer
For (2), I think we could just say that loads into the read buffer can occur at any time, without an explicit read operation. (Note that it wasn’t actually explicitly specified otherwise.)
Handling either situation individually wouldn’t be too bad even with the two-buffer model, but handling them both in a two-buffer model gets tricky. If acquire operations flush the read buffer, then it will be necessary either to have spontaneous reloads of read-buffer contents check for pending data in the write buffer, or to have actions that remove committed data from the write buffer clean up any superseded data that might be sitting in the read buffer. While doing such things wouldn’t be impossible, all of them would effectively require an arrow connecting the read and write buffers, and so far as I can tell would be semantically equivalent to simply having one buffer where each row can be absent, clean, or dirty.
I think you are overthinking the issue in an attempt to get a complete model. This model wasn’t intended to be complete – it’s intended to be simple to understand, and to give an understanding of when, in common cases, it is necessary to use an acquire or release. It doesn’t capture all race conditions, though (nor, as far as I can see, does your own proposed model).