Understanding the C/C++ memory model – part 2

My previous blog post, Understanding the C/C++ memory model, received (and still receives) more attention than I ever thought it would, so I feel obliged to add a follow-up expanding on the topic a little more.

First, a few things that might not have been entirely clear or were missing from the previous post:

  1. The post was focused on acquire and release memory ordering constraints for atomic operations (and memory barriers), while the language model provides various other memory ordering constraint options. While some of these should be reasonably obvious (memory_order_relaxed, which essentially implies no constraints, and memory_order_acq_rel, which combines both acquire and release semantics in a single operation), some others warrant explanation; a brief example of specifying these orderings follows this list.
  2. In case it was not clear, the default memory ordering constraint for the atomic operations provided by the standard library is actually memory_order_seq_cst. Importantly, if this default is used exclusively, some of the issues applicable to the use of acquire/release orderings are no longer of concern. However, this ordering potentially imposes a performance penalty if used when it is not actually required.
  3. In some cases acquire/release is not enough and we do need to make use of a stronger ordering constraint.
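
To make the ordering options (and the seq_cst default) concrete, here is a minimal C++ sketch of how a constraint is specified on an atomic operation. The variable and function names are my own, purely for illustration:

```cpp
#include <atomic>

std::atomic<int> counter{0};

void orderings_example() {
    counter.store(1);                                   // default: memory_order_seq_cst
    counter.store(1, std::memory_order_seq_cst);        // equivalent, spelled out
    counter.store(1, std::memory_order_release);        // release semantics only
    int a = counter.load(std::memory_order_acquire);    // acquire semantics only
    int b = counter.fetch_add(1, std::memory_order_acq_rel); // acquire + release
    int c = counter.load(std::memory_order_relaxed);    // atomicity only, no ordering
    (void)a; (void)b; (void)c;
}
```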

However, with all that in mind, I mostly want to focus in this post on presenting a more detailed model for understanding exactly what acquire/release entail. In particular, I aim to present a comprehensible model of processor hardware – even if the model is not accurate to any particular hardware – that allows for understanding the C/C++ memory model without having to rely solely on language-defined semantics. I’m not claiming that this model is complete or even entirely correct, but I believe it may help in understanding acquire/release semantics.

A hardware model to explain acquire/release

So, bearing in mind that this is not what real processor architecture necessarily looks like, let’s suppose that each processor (or processor core) is connected directly to a pair of caches (or buffers if you prefer) which are then connected to the main memory of the system:

[Diagram (acquire-release.svg): a processor core connected to a store buffer and a read buffer, which in turn connect to main memory]

Note that I’ve opted to call them “buffers” in the diagram, since I’ve learned that describing them as caches might lead to pedants claiming that the model is wrong, although it’s worth bearing in mind that (a) many people will best understand the behaviour of the buffers as being that of a cache and that (b) this model is not supposed to accurately represent any particular hardware.

I hope that the function of the store buffer and read buffer in the diagram is obvious: writes performed by the processor core go through the store buffer before reaching main memory, and reads need to go through the read buffer if the value to be read isn’t otherwise already available in either buffer. Both buffers can hold multiple values, each value corresponding to a different address. At arbitrary times and in unspecified order, values from the read buffer can be discarded, and values from the store buffer can be flushed to main memory.

Note that the buffers allow reads and writes to be observed in different orders by different threads (cores), but that a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.

What may be a little confusing is that the link between the processor core and each buffer is depicted as being two-way. In fact, this is necessary for maintaining consistency between the buffers: if a single core is to store a value to a particular address, and then read that particular address, then it shouldn’t be possible that the read loads a value from main memory before the prior store has actually reached main memory. So in fact, a store operation from the processor both updates (or inserts) the value for the address in the store buffer, and updates any existing value for the corresponding address in the read buffer.

With all this considered, how do acquire and release operations tie in? Quite simply: an acquire operation forcefully flushes the read buffer (discarding its contents), and a release operation forcefully flushes the store buffer (writing its contents to main memory). Importantly, this means that a written value must be followed by a release in the same thread, and then an acquire in another thread, before it can reliably be read by that second thread. This is what the language model itself requires.
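
To make the model concrete, here is a rough C++ rendering of the hypothetical buffers (a toy for illustration only – the types and names are my own invention, it models a single core in isolation, and it captures none of the atomicity or race-condition aspects):

```cpp
#include <cstdint>
#include <unordered_map>

// Addresses map to values; "Memory" stands in for main memory.
using Memory = std::unordered_map<std::uintptr_t, int>;

struct Core {
    Memory& main;      // shared main memory
    Memory store_buf;  // pending writes, not yet visible to other cores
    Memory read_buf;   // previously loaded values, possibly stale

    void store(std::uintptr_t addr, int value) {
        store_buf[addr] = value;            // write goes into the store buffer
        if (auto it = read_buf.find(addr); it != read_buf.end())
            it->second = value;             // keep the read buffer consistent
    }

    int load(std::uintptr_t addr) {
        if (auto it = store_buf.find(addr); it != store_buf.end())
            return it->second;              // a core sees its own pending writes
        if (auto it = read_buf.find(addr); it != read_buf.end())
            return it->second;              // value already buffered (maybe stale)
        return read_buf[addr] = main[addr]; // otherwise fetch from main memory
    }

    void acquire() { read_buf.clear(); }    // discard possibly-stale values

    void release() {                        // flush pending writes to main memory
        for (const auto& [addr, value] : store_buf)
            main[addr] = value;
        store_buf.clear();
    }
};
```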

What about atomic operations themselves? Unfortunately they are a bit hard to explain directly in terms of this simple model. We could say that atomic operations simply bypass the buffers and operate directly on main memory; the main issue with this is that it doesn’t allow for re-ordering of atomic operations on different addresses, which the language model does allow (although the orderings must still obey the associated constraints).

What happens if two processor cores have a store in their store buffer to the same address? Well, that’s most likely a race condition; similarly if a store in one processor buffer matches a read from another.

Sequential Consistency

The default ordering constraint for atomic operations provided by the C/C++ standard library is memory_order_seq_cst, where “seq_cst” is a shortening of “sequentially consistent”. Put simply, an atomic operation with sequential consistency implies both acquire semantics (if the operation is a load) and release semantics (if the operation is a store), as well as preventing other atomic operations (of any kind) from being re-ordered with respect to the sequentially consistent operation.

Why is this ever necessary, though? Well, consider the example in the previous post of a mutual exclusion device (mutex) implemented via a spinlock, using a compare-and-exchange operation to acquire the mutex and a store to release it. Previously I suggested that acquiring the mutex could use (naturally enough) acquire semantics and releasing it could use, similarly, release semantics. In the normal case of a single mutex protecting some data structure, this is generally correct.
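
As a refresher, a minimal spinlock along those lines might look something like the following sketch (not necessarily the exact code from the previous post):

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Acquire ordering on success; relaxed on failure, since a failed
        // attempt gains us nothing that needs to be ordered.
        while (!locked.compare_exchange_weak(expected, true,
                std::memory_order_acquire, std::memory_order_relaxed)) {
            expected = false; // compare_exchange writes the observed value back
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```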

Sometimes, however, more complex applications can require more than a single mutex to be locked at the same time. With careless design, this can lead to deadlocks: if there are two mutexes, A and B, and one thread locks them in [A,B] order while another does so in [B,A] order, you potentially get the situation where each thread holds one mutex and requires the other (but can never obtain it). A typical approach to avoiding such problems is to impose a consistent ordering of mutex acquisition: the program could be designed, for instance, to always acquire the mutexes in the order [A,B]. Of course it is still possible to acquire either mutex alone without causing deadlock, so long as the B mutex is released before acquiring the A mutex: the important thing is that the mutexes are never acquired in the order [B,A].

In our particular example, even though atomic operations may be re-ordered, the acquisition of the B mutex will not be re-ordered with respect to the acquisition of the A mutex, since no store or load (including via an atomic operation) can move prior to an acquire operation. However, what can happen is that the release of one mutex can be re-ordered with respect to the acquisition of another (by moving the release from before the acquisition to after it), and this is problematic. Suppose that we have the following sequence in one thread:

  1. acquire B
  2. release B
  3. acquire A
  4. release A

This should be ok – it doesn’t acquire the mutex A while B is held – but as has just been suggested, this sequence could be re-ordered (by the compiler or processor) as follows:

  1. acquire B
  2. acquire A
  3. release B
  4. release A

If another thread is obtaining both mutexes in the prescribed order, A and then B, this will lead to deadlock! The solution is to use sequentially consistent ordering, which prevents any atomic operation from being re-ordered with respect to the sequentially consistent operation. Note that it’s not necessary to make every mutex operation sequentially consistent; I’ll leave it as an exercise to decide on the best solution.
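
As a starting point for that exercise, here is one possibility (a sketch only, and not necessarily the best solution): strengthening the unlock store in the earlier SpinLock sketch to sequential consistency, so that it cannot move past a later lock operation.

```cpp
// Variation on the earlier SpinLock sketch: a seq_cst unlock cannot be
// re-ordered with respect to any other atomic operation, including a
// subsequent acquire-CAS that locks another mutex.
void unlock() {
    locked.store(false, std::memory_order_seq_cst);
}
```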

Edit 3 April: Preshing believes that the problematic re-ordering detailed above isn’t allowed by the C++ language standard (he doesn’t mention C and I’ve not checked whether it has similar wording). I’m not entirely convinced myself either that the paragraph he quotes necessarily means that this is the case, nor that it is intended to do so, though I’ve not given it full consideration yet. Another example of where sequential consistency makes a difference is given in an answer to this stackoverflow question.

Conclusion

The hardware model I’ve presented is hopefully simple enough to understand while providing comprehensible rationale for the particulars of the release/acquire semantics provided by the language model. It may however not be correct in every aspect and is not complete; use it only for what it is. I’ve also discussed sequential consistency and why it is sometimes necessary to avoid deadlocks.

I haven’t discussed the “consume” memory ordering, and may leave that for another time, but I encourage anyone who’s interested to seek out explanations elsewhere. I hope that this post and the one preceding it are enough to give a good grounding in the issues addressed by the memory model.


13 thoughts on “Understanding the C/C++ memory model – part 2”

  1. I’m curious about the use of the word “thread”, for instance in this passage:

    “Note that the buffers allow reads and writes to be observed in different orders by different threads (cores), but that a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.”

    By “thread” do you really mean “core”? Or, if you really do mean “thread” is there some mechanism that ensures that as threads get moved between cores their non-observed state gets moved also?

    1. Never mind — I see that you are apparently using “thread” and “core” interchangeably. That’s a bit confusing if you ask me.

      That aside, thanks for a nice write-up — it’s been quite helpful in understanding some of the (very) un-intuitive behavior exhibited by modern CPUs.

      1. You’re right that I could/should have put the relationship between cores and threads more explicitly, though for purposes of understanding the language memory model you can assume that each thread runs on its own core. In the hardware model I’ve presented, if you wanted to allow for migration of threads between (a potentially limited number of) cores, then a context switch – that is, a switch to running a different thread on some core – requires that the read and store buffers be flushed (which would need to be arranged by the system).

  2. To my mind, it feels more natural to think in terms of a cache for which the state of every address is “absent”, “clean”, “dirty”, or “poisoned”. An acquire would set all clean or poisoned entries to absent, and a release would convert all dirty entries to clean ones (by writing them to memory). Such a model could handle a couple of situations which yours presently does not:

    (1) Given the sequence “read X (yielding value #1); write value #2 to X; release; read X”, the read buffer would receive the value #1 from the first read, and nothing that happens later would change that. The act of writing value #2 needs to remove value #1 from the read buffer. With a single cache, that would happen implicitly; if using two caches, you would need to specify the means by which it occurs.

    (2) Given “core #1 write; core #2 acquire; core #1 release; core #2 read” (using other intervening operations, not involving the address in question, to ensure that core #1 release occurs before core #2 read), your model would suggest that core #2’s acquire would flush the read buffer, and it would thus be empty when it performs the read. I’m pretty sure, though, that core #2 would be allowed to load data from memory into the read buffer at any arbitrary time between the acquire and the read request. Saying that spontaneous loads cannot occur with dirty addresses seems cleaner than saying that data can only appear spontaneously in the read buffer when no corresponding data is in the write buffer.

    Perhaps the model with separate read/write buffers can be adjusted to handle all corner cases, but I think the description above misses important cases which would be awkward to describe under that model.

    1. for (1) I think this is handled by:

      So in fact, a store operation from the processor both updates (or inserts) the value for the address in the store buffer, and updates any existing value for the corresponding address in the read buffer

      for (2) I think we could just say that loads into the read buffer can occur at any time without an explicit read operation. (Note that it wasn’t actually explicitly specified otherwise.)

      1. Handling either situation individually wouldn’t be too bad even with the two-buffer model, but handling them both in a two-buffer model gets tricky. If acquire operations flush the read buffer, then it will be necessary to either have spontaneous loads of read-buffer contents check for pending data in the write buffer, or have actions that remove committed data from the write buffer clean up any superseded data that might be sitting in the read buffer. While doing such things wouldn’t be impossible, all of them would effectively require an arrow that connects the read and write buffers, and so far as I can tell would be semantically equivalent to simply having one buffer where each row can be absent, clean, or dirty.

        1. I think you are overthinking the issue in an attempt to get a complete model. This model wasn’t intended to be complete – it’s intended to be simple to understand and to give an understanding of when, in common cases, it is necessary to do an acquire or release. It doesn’t capture all race conditions, though (and nor, as far as I can see, does your own proposed model).

  3. It is definitely the case that the dual mutex reordering shouldn’t happen, and it would almost certainly be treated as a bug in the standard if it can.

    This area is full of tricky corner cases, many of which the authors of the standard didn’t foresee. In that light, statements of intent are more powerful than the actual text of the standard — they indicate where issues like this would be considered bugs.

    Preshing mentioned in the link that the C++17 language added just such an unambiguous statement of intent: that, at the very least, compiler optimizations should not lead to this scenario.

    There is, however, a far more basic such statement that covers this scenario: the names “acquire” and “release” come from the ordering semantics needed to implement “acquire” and “release” for a mutex, and so the use of these words by the C and C++ languages indicates that they should be sufficient for acquire and release of (simple) mutexes, which they would not be if your scenario were allowed.

    1. It is definitely the case that the dual mutex reordering shouldn’t happen

      “Definitely” is a strong word and you’re using it here without sufficiently strong justification, in my opinion.

      I guess that by “statement of intent” you are referring to 4.7.2:18:

      An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

      I don’t think that’s meant to be a statement of intent, rather, it’s a hard requirement. Interpreted literally (as it arguably should be) Preshing rightly points out that it precludes re-ordering of acquire/release operations where it would potentially prevent another thread from seeing an update indefinitely. However, I’m not sure that this was really the intent of that paragraph and Preshing also retains doubts. I suspect instead that it is meant to cover the simple case of a value being stored atomically in one thread while another thread continuously loads it (with appropriate ordering constraints in both cases) – the implementation must ensure that the reading thread does see the value written by the other, at some point.

      There is, however, a far more basic such statement that covers this scenario: the names “acquire” and “release” come from the ordering semantics needed to implement “acquire” and “release” for a mutex, and so the use of these words by the C and C++ languages indicates that they should be sufficient for acquire and release of (simple) mutexes, which they would not be if your scenario were allowed.

      This is not the sort of logic you can apply when reading a formal language specification. Terms are only meaningful in so far as they are defined in the document itself. In any case, the ordering as specified is certainly sufficient for acquire and release of individual mutexes and that’s enough to explain the use of these terms.

      1. I don’t think that’s meant to be a statement of intent, rather, it’s a hard requirement.

        If it were a hard requirement, the standard would have used the word “must” rather than “should”.

        This is not the sort of logic you can apply when reading a formal language specification.

        Formal language specifications, like programs, have bugs. That’s a fact of life. When you’re trying to figure out what a buggy program is supposed to be doing, you have to take into account context; you’re not going to figure it out by mechanically parsing what the buggy code does.

        Use of commonly understood terminology like “acquire” and “release” has the same sort of power as a code comment — it doesn’t tell you anything directly about what the standard formally says right now, but it says a lot as to what future versions of the formal standard will say, and what implementers of said standard will do.

        Further, practically speaking there are enough existing users that assume acquire/release RMWs are sufficient — certainly, for example, abseil’s mutex and spinlock implementations do. A compiler that made acquire/release insufficient for mutexes would break a lot of code out there.

        1. If it were a hard requirement, the standard would have used the word “must” rather than “should”.

          Fair point. A soft requirement, not a hard requirement. I still wouldn’t personally call it a “statement of intent”, but that’s just a matter of word choice, I suppose.

          Use of commonly understood terminology like “acquire” and “release” has the same sort of power as a code comment

          Sure, but I already pointed out that the described acquire/release semantics are sufficient for lone mutexes. That’s enough to explain the use of these names. I don’t know the “abseil” you speak of, but it seems to me that it’s equally possible that its mutex/spinlock implementations are buggy. At least, either the wording of the standard is wrong (or incomplete) or the code in that implementation is wrong. It could be either. (In practice, of course, the code might well behave as intended on all current compilers). But the existence of some code (even a lot of code) which relies on a particular interpretation of the language standard doesn’t by itself mean that this interpretation is correct. For example, there are/were plenty of C programs that are considered definitely wrong by compiler vendors, but which worked properly for years (until compilers started performing more aggressive optimisations).

          1. Most C implementations are designed to produce an executable which will be run in an environment outside the implementation’s control. If an environment offers a means of attempting an operation on a “best effort” basis that will usually succeed, but it can’t guarantee success, allowing an implementation to support the standard means of using that feature even though it’s not perfect may be more useful than requiring that the implementation report that it doesn’t support the feature.

            In fact, one of my biggest complaints with the C and C++ Standards is the excessively strong desire to classify things into those that implementations must do and those that they should not be particularly expected to do, without recognizing the large number of actions which implementations should be expected to do absent a documented reason for being unable to do so. The Standard shouldn’t need to concern itself with whether a reason is “good enough”; the marketplace should be able to do that if implementations are required to admit their weaknesses.

        2. In addition, something I just noticed, going back to your first comment:

          It is definitely the case that the dual mutex reordering shouldn’t happen, and it would almost certainly be treated as a bug in the standard if it can.

          I should clarify that the post is not saying that mutex lock/unlock re-ordering is allowed by the standard; it’s saying that a particular mutex implementation may be problematic in that it would allow such re-ordering.
