Wrap on integer overflow is not a good idea

A discussion of undefined behaviour and compiler optimisation, particularly in regard to signed integer overflow.

C (and C++) compilers are becoming notorious for exploiting the notion of undefined behaviour – the idea that certain things a program might do have no behaviour prescribed by the language standard, and that the compiler can assume the program doesn’t do these things when it is generating object code. Quite a few people have been objecting to this, since it can result in the generated code not doing what the programmer intended; the problem is becoming more noticeable over time, as compilers introduce more sophisticated optimisation techniques that are more likely to exploit the notion.

One prominent example is that of signed integer overflow. Most C programmers are developing for machines which use a 2’s complement representation of integers; addition and subtraction, with such a representation, are implemented in exactly the same way as for unsigned arithmetic. If the addition of two positive signed integers overflows – that is, if the result is larger than can be represented – the processor will produce a number that, when interpreted as a 2’s complement signed integer, will appear to be negative. This is called “wrapping” because the value has “wrapped around” from the high end of the numeric range to the low end.

For this reason, you occasionally see C code that looks something like this:

int b = a + 1000;
if (b < a) { // overflow
    puts("input too large!"); return;
}

The “if” statement is designed to detect the overflow condition (in this case from adding 1000 to the value from the variable ‘a’) and report an error. The problem is that, in C, signed integer overflow is one case of undefined behaviour. Compilers, for some time now, have performed an analysis which shows that the condition can never be true: if I add 1000 (or any positive number) to another value, the result cannot be smaller than the original value; if overflow occurred, that is undefined behaviour, and it is the programmer’s responsibility (arguably) to ensure that their program never exhibits such behaviour. Therefore, the compiler may decide that the entire if statement can be removed as an optimisation (it can never be true, it can never have an effect, it may as well not be there).

The problem with this compiler optimisation, in this case, is that it has removed the test that the programmer specifically used in an attempt to detect the overflow situation and handle it. An example of this with a real compiler can be seen here. (Side note: the godbolt.org site on which that example is hosted is great! You can edit the code and see the compiled form with a wide range of compilers. Play with it!). Observe that the overflow check is not removed if the type is changed to an unsigned integer, since unsigned overflow has defined behaviour in C (or rather, more accurately, unsigned arithmetic is defined to wrap and thus the overflow does not actually occur).

So is this wrong? Some have argued that it is, though it’s clear that many compiler vendors feel that it’s legitimate. The main arguments made by proponents of (edit: implementation-defined) wrapping overflow behaviour, if I understand them correctly, boil down to variants of the following:

  • Wrapping on overflow is a useful behaviour.
  • Wrapping is the behaviour expected by programmers.
  • Exploiting undefined behaviour semantics on overflow gives no significant benefit.
  • The C language standard, in regard to undefined behaviour, gives license for implementations “ignoring the situation completely, with unpredictable results”, but this doesn’t allow optimisations to assume that the situation for which the undefined behaviour is prescribed will not come about.

Let’s look at these one by one:

Wrapping on overflow is a useful behaviour?

The main utility of wrapping behaviour is being able to detect overflow after it occurs. (If there are other uses that could not be handled using unsigned integers instead, I am unable to immediately think of any and suspect they are rare). While this would indeed simplify the problem of avoiding the use of erroneously overflowed results, it certainly doesn’t help in all cases (consider multiplication, or addition of two unknown quantities with unknown sign).

For the trivial case where wrapping behaviour does allow simply detecting overflow after it occurs, it is also straightforward to determine whether overflow would occur, before it actually does so. The example above can be rewritten as follows:

if (a > INT_MAX - 1000) { // would overflow
    puts("input too large!");
    return;
}
int b = a + 1000;

That is, you can perform a check to see whether the result of an addition will exceed the maximum representable value, rather than performing the addition and then trying to determine whether that overflow occurred by checking if the result is mathematically inconsistent. (If the sign of both operands is unknown, the check becomes significantly more complicated, but this is also true when checking for overflow after the operation with wrapping overflow semantics).
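
For illustration, here is a sketch (my own, not from the original post) of such a pre-check for signed addition when the signs of both operands are unknown:

#include <climits>

// Returns true if a + b would overflow the range of int.
// The subtractions themselves cannot overflow, given the sign checks.
bool add_would_overflow(int a, int b) {
    if (b > 0) return a > INT_MAX - b;   // result would exceed INT_MAX
    if (b < 0) return a < INT_MIN - b;   // result would fall below INT_MIN
    return false;
}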

With this in mind, I’m not really convinced that wrapping overflow is generally useful.

Wrapping is the behaviour expected by programmers?

It’s more difficult to argue against this point, since clearly at least some C programmers have written code which expects wrapping semantics for signed integer overflow. However, I don’t think that this alone is a strong argument for implementing wrapping semantics by default (note that several compilers implement options for wrapping overflow, if it really is desired).

An obvious mitigation for the problem of programmers expecting this particular behaviour is for the compiler to issue a warning when it optimises based on the alternative undefined-behaviour-is-assumed-not-to-occur semantics. Unfortunately as we see in the godbolt.org link above, compilers don’t always do so (Gcc 7.3 does but 8.1 does not, so this appears to be a regression).

Exploiting undefined behaviour semantics on overflow gives no significant benefit?

If true in all cases this would be a compelling argument for having compilers default to wrap-on-overflow, since it is probably better to allow the “detect overflow after it occurs” mechanism described above to work even if it is technically incorrect – if only because that mechanism may be in use in code which is arguably broken.

I suspect that with typical C programs the benefit of this particular optimisation (removing checks for mathematically impossible conditions) is usually negligible, because C programs tend to be written by programmers who are seeking good performance and who tend to hand-optimise their code anyway: that is, if it’s obvious that a particular “if” statement has a condition that can never be true, the programmer would likely have removed the statement themselves. Indeed, a search reveals a few studies where the effectiveness of this optimisation has been questioned, tested, and found to be mostly insignificant for the particular benchmarks under test. However, while in many cases there is no benefit for C, the code generation engines and optimisers in compilers are commonly general and could be used for other languages where the same might not be so generally true; consider C++, where it is somewhat idiomatic in templated code to rely on the optimiser to remove redundant code, rather than doing it manually. There is also the case of languages being transpiled to C and relying on the C compiler to optimise away redundant code.

Also, even without overflow check elimination, it is not necessarily correct to assume that wrapping integers has minimal direct cost even on machines which use 2’s complement representation. The Mips architecture, for example, can perform arithmetic operations only in registers, which are fixed size (32 bit). A “short int” is generally 16 bits and a “char” is 8 bits; if assigned to a register, the underlying width of a variable with one of these types will expand, and forcing it to wrap according to the limit of the declared type would require at least one additional operation and possibly the use of an additional register (to contain an appropriate bitmask). I have to admit that it’s been a while since I’ve had exposure to any Mips code and so I’m a little fuzzy on the precise cost involved, but I’m certain it is non-zero and other RISC architectures may well have similar issues.

The language standard requires signed integer overflow to wrap, if that’s what the underlying architecture does?

This argument is particularly weak when examined. It essentially claims that, by the text of the standard, “undefined behaviour” actually grants only limited license to the implementation (compiler). What the text that proponents latch on to says, precisely, is the following, as part of the definition of undefined behaviour:

NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, …

The claim is that “ignoring the situation completely” would not allow for assuming that the situation leading to the undefined behaviour – overflowing addition, for example – could not happen, but rather that, if it does happen, the implementation must carry on as if it did not happen but must respect the result it would obtain from asking the processor to perform such an operation (putting it another way: as if the translation from source to machine code was direct and naive).

First, we should observe that this text is in a NOTE and is therefore non-normative (it does not prescribe behaviour), according to the ISO directive mentioned in the foreword of the same document:

In accordance with Part 3 of the ISO/IEC Directives, this foreword, the introduction, notes, footnotes, and examples are also for information only.

Given that the “possible undefined behaviour” appears in such a note, it is not prescriptive. Note that the actual definition text for “undefined behavior” reads:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

The important part is the final clause: this standard imposes no requirements for undefined behaviour; the list of “possible undefined behaviors” in the note contains merely examples and cannot be definitive. “Imposes no requirements” is unambiguous.

Some extend the argument to say that, regardless of what the text actually says, the intention of the language committee when those words were drafted was that the behaviour should in general match that of the underlying hardware, as closely as possible, assuming a naive translation to machine code. This may be true, though I’ve not seen any evidence (such as historical text) that supports it. Even if it were true, however, it would not necessarily apply to the current incarnation of the text.

Final thoughts

The arguments for wrapping on overflow are mostly flawed. Probably the strongest argument that can be made is a combination: it is occasionally expected by less experienced programmers (who do not understand the nuances of C and of its undefined behaviour), and is not particularly harmful to performance – however, the latter is not true in all cases, and the former is somewhat dubious reasoning when considered by itself.

Personally, I feel that I would much rather have trap on overflow than wrap. That is, I would rather that a program crash instead of continuing with either undefined behaviour or a potentially incorrect value, either of which could be a security issue. This would certainly have a slight performance impact on most(?) architectures, particularly x86, but on the other hand it would immediately flag overflow bugs rather than allowing them to be exploited or to produce incorrect results further down the line. It could also in theory allow the compiler to safely remove redundant comparisons following a potential overflow, because it ensures that they really can’t happen, though I note that Clang and Gcc both apparently fail to take advantage of this.

Fortunately, both trapping and wrapping options are implemented by the compiler I use most often, which is Gcc. The “-ftrapv” and “-fwrapv” command line arguments can be used to enable each respectively.
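
As an aside, recent versions of Gcc and Clang also provide checked-arithmetic builtins, which can detect overflow reliably without depending on either flag; the earlier example could be rewritten as in this sketch:

int b;
if (__builtin_add_overflow(a, 1000, &b)) {   // true if a + 1000 overflowed
    puts("input too large!");
    return;
}
// b now holds a + 1000, known not to have overflowed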

There are of course a number of other causes of undefined behaviour; integer overflow is only one. I don’t necessarily think that all of these are useful and I do think that there are plenty of specific cases where the semantics should be defined by the language, or at least classified as implementation-defined. And I’m wary of compiler vendors being too liberal in their interpretation: if the compiler behaves in ways that are counter-intuitive, especially for someone who has read the language specification themselves, there is always the risk of real software bugs resulting; if the opportunities for optimisation that such an interpretation opens up are negligible, it is hardly worthwhile to adopt it. An examination of some issues around this area may be the topic of a future post.

Addendum (24 Aug 2018)

I’ve realised that much of the above could be better written. To briefly summarise, clarify, and add some minor points:

  • I was not trying to argue that undefined behaviour on overflow is preferable to wrapping, but rather that wrapping is not much better than undefined behaviour on overflow in practice. In particular, you can get security issues from wrapping behaviour in much the same way as you can with undefined behaviour – and I’d argue that many security issues resulting from unchecked integer overflow, other than those which come from the compiler removing erroneous post-overflow checks, actually come from the fact that the value has wrapped around rather than any other undefined behaviour associated with the overflow.
  • The only real benefit of wrap-on-overflow is that it doesn’t cause post-overflow checks to be removed. While that might eliminate some attack vectors, it leaves open the possibility that some overflows won’t be checked for at all (i.e. the programmer did not include an overflow check) and will be uncaught.
  • If security is not a concern but speed of execution is, undefined behaviour on overflow may allow better optimisation possibilities and provide a performance benefit, at least in some cases. On the other hand if security is a concern, wrap-on-overflow potentially leaves holes open.
  • This means that between trap-on-overflow, wrap-on-overflow, and undefined-behaviour-on-overflow, I see very few cases where wrap-on-overflow should be the preferred choice.
  • In regard to post-overflow checks, I have concerns that leaving them in place could lead to the wrong perception (that post-overflow checks work and are guaranteed to work). Trap-on-overflow avoids this problem. Good warnings help to alleviate it.
  • I think that any programmer writing security-sensitive code will ideally have a good grasp of the semantics of, and the potential pitfalls of, the language they are writing in. For C, this means understanding the overflow semantics and nuances of undefined behaviour. It’s unfortunate that some C programmers still don’t seem to have that level of understanding.
  • I have seen a claim that “most C programmers expect wrapping behaviour”, but I’m not aware of any evidence that this is true. (I’ve said “some” above since I’ve seen anecdotal evidence and I doubt this would be disputed anyway).
  • There are two separate issues: one is what the C language standard should require, and another is what the compilers should implement. I’m (somewhat) ok with the language standard specifying undefined behaviour for overflow. This post is arguing for what the behaviour of compilers should be.
  • Trap-on-overflow need not require that every operation is checked for overflow; ideally it would only mandate that the program either behaves in a mathematically consistent manner or aborts, which allows for “temporary overflow” that doesn’t generate an incorrect result. This would allow optimising “a + b - b” to “a” (which wrapping also does) and “(a * b) / b” to “a” (which wrapping doesn’t).

Addendum (21 Nov 2018)

It turns out that the GCC implementation (circa 8.2) of -ftrapv is terrible – it doesn’t detect all overflow cases, and it causes a large performance degradation due to farming out most arithmetic operations to support functions (i.e. it imposes a function call overhead for a simple addition). The Clang/LLVM implementation is much better.


Understanding the C/C++ memory model – part 2

My previous blog post, Understanding the C/C++ memory model, got more (and still receives more) attention than I ever thought it would, so I feel obliged to add a follow up to expand on the topic a little more.

First, a few things that might not have been entirely clear or were missing from the previous post:

  1. The post was focused on acquire and release memory ordering constraints for atomic operations (and memory barriers), while the language model provides various other memory ordering constraint options. While some of these should be reasonably obvious (memory_order_relaxed, which essentially implies no constraints, and memory_order_acq_rel, which combines both acquire and release semantics in a single operation), some others warrant explanation.
  2. In case it was not clear, the default memory ordering constraint for the atomic operations provided by the standard library is actually memory_order_seq_cst. Importantly, if this default is used exclusively, some of the issues applicable to the use of acquire/release orderings are no longer of concern. However, this ordering potentially imposes a performance penalty if used when it is not actually required.
  3. In some cases acquire/release is not enough and we do need to make use of a stronger ordering constraint.

However, with all that in mind, I mostly want to focus in this post on presenting a more detailed model for understanding exactly what acquire/release entail. In particular, I aim to present a comprehensible model of processor hardware – even if the model is not accurate to any particular hardware – that allows for understanding the C/C++ memory model without having to rely solely on language-defined semantics. I’m not claiming that this model is complete nor even entirely correct, but I believe it may help to understand acquire/release semantics.

A hardware model to explain acquire/release

So, bearing in mind that this is not what real processor architecture necessarily looks like, let’s suppose that each processor (or processor core) is connected directly to a pair of caches (or buffers if you prefer) which are then connected to the main memory of the system:

[Diagram (acquire-release.svg): each processor core connects to a store buffer and a read buffer, which in turn connect to the main memory of the system.]

Note that I’ve opted to call them “buffers” in the diagram, since I’ve learned that describing them as caches might lead to pedants claiming that the model is wrong, although it’s worth bearing in mind that (a) many people will best understand the behaviour of the buffers as being that of a cache and that (b) this model is not supposed to accurately represent any particular hardware.

I hope that the function of the store buffer and read buffer in the diagram is obvious: writes performed by the processor core go through the store buffer before reaching main memory, and reads need to go through the read buffer if the value to be read isn’t otherwise already available in either buffer. Both buffers can hold multiple values, each value corresponding to a different address. At arbitrary times and in unspecified order, values from the read buffer can be discarded, and values from the store buffer can be flushed to main memory.

Note that the buffers allow reads and writes to be observed in different orders by different threads (cores), but that a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.

What may be a little confusing is that the link between the processor core and each buffer is depicted as being two-way. In fact, this is necessary for maintaining consistency between the buffers: if a single core is to store a value to a particular address, and then read that particular address, then it shouldn’t be possible that the read loads a value from main memory before the prior store has actually reached main memory. So in fact, a store operation from the processor either updates an existing value or stores a new value in the store buffer, and also updates any existing value (for the corresponding address) in the read buffer.

With all this considered, how do acquire and release operations tie in? Quite simply: an acquire operation forcefully flushes the read buffer (discarding its contents), and a release operation forcefully flushes the store buffer (writing its contents to main memory). Importantly, this means that a value written must be followed by a release in the same thread, and then an acquire – in another thread – before it is reliably read by that second thread. This is what the language model itself requires.
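
In terms of this buffer model, the standard message-passing idiom looks like the following sketch (my example; the comments describe the model above, not real hardware):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    // release: flush the store buffer, so 'payload' reaches main
    // memory no later than 'ready' does
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // acquire: discard the read buffer, so no stale value of
    // 'payload' can be seen once 'ready' reads as true
    while (!ready.load(std::memory_order_acquire)) { }
    assert(payload == 42);   // guaranteed by the language model
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}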

What about atomic operations themselves? Unfortunately they are a bit hard to explain directly in terms of this simple model. We could say that atomic operations simply bypass the buffers and operate directly on main memory; the main issue with this is that it doesn’t allow for re-ordering of atomic operations on different addresses, which the language model does allow (although the orderings must still obey the associated constraints).

What happens if two processor cores have a store in their store buffer to the same address? Well, that’s most likely a race condition; similarly if a store in one processor buffer matches a read from another.

Sequential Consistency

The default ordering constraint for atomic operations provided by the C/C++ standard library is memory_order_seq_cst, where “seq_cst” is a shortening of “sequentially consistent”. Put simply, an atomic operation with sequential consistency implies both acquire semantics (if the operation is a load) and release semantics (if the operation is a store), as well as preventing other atomic operations (of any kind) from being re-ordered with respect to the sequentially consistent operation.

Why is this ever necessary, though? Well, consider the example in the previous post of a mutual exclusion device (mutex) implemented via a spinlock, using a compare-and-exchange operation to acquire the mutex and a store to release it. Previously I suggested that acquiring the mutex could use (naturally enough) acquire semantics and releasing it could use, similarly, release semantics. In the normal case of a single mutex protecting some data structure, this is generally correct.

Sometimes, however, more complex applications can require more than a single mutex to be locked at the same time. With careless design, this can lead to deadlocks: if there are two mutexes, A and B, and one thread locks them in [A,B] order while another does so in [B,A] order, you potentially get the situation where each thread holds one mutex and requires the other (but can never obtain it). A typical approach to avoiding such problems is to impose a consistent ordering of mutex acquisition: the program could be designed, for instance, to always acquire the mutexes in the order [A,B]. Of course it is still possible to acquire either mutex alone without causing deadlock, so long as the B mutex is released before acquiring the A mutex: the important thing is that the mutexes are never acquired in the order [B,A].

In our particular example, even though atomic operations may be re-ordered, the acquisition of the B mutex will not be re-ordered with respect to the acquisition of the A mutex, since no store or load (including via an atomic operation) can move prior to an acquire operation. However, what can happen is that the release of one mutex can be re-ordered with respect to the acquisition of another (the release moving from before the acquisition to after it), and this is problematic: suppose that we have the sequence, in one thread:

  1. acquire B
  2. release B
  3. acquire A
  4. release A

This should be ok – it doesn’t acquire the mutex A while B is held – but as has just been suggested, this sequence could be re-ordered (by the compiler or processor) as follows:

  1. acquire B
  2. acquire A
  3. release B
  4. release A

If another thread is obtaining both mutexes in the prescribed order, A and then B, this will lead to deadlock! The solution is to use sequentially consistent ordering, which prevents any atomic operation being re-ordered with respect to the sequentially consistent operation. Note that it’s not necessary to make every mutex operation sequentially consistent; I’ll leave it as an exercise to decide on the best solution.
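
As a concrete illustration (my sketch, deliberately heavy-handed – the exercise above still stands), making both operations sequentially consistent prevents the problematic re-ordering:

#include <atomic>

std::atomic<int> mutex_a{0}, mutex_b{0};   // 0 = unlocked, 1 = locked

void lock(std::atomic<int> &m) {
    int expected = 0;
    // seq_cst compare-and-exchange: cannot move before an earlier
    // seq_cst release of the other mutex
    while (!m.compare_exchange_weak(expected, 1,
                                    std::memory_order_seq_cst)) {
        expected = 0;
    }
}

void unlock(std::atomic<int> &m) {
    // seq_cst store: cannot move past a later seq_cst acquisition
    m.store(0, std::memory_order_seq_cst);
}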

Edit 3 April: Preshing believes that the problematic re-ordering detailed above isn’t allowed by the C++ language standard (he doesn’t mention C and I’ve not checked whether it has similar wording). I’m not entirely convinced myself either that the paragraph he quotes necessarily means that this is the case, nor that it is intended to do so, though I’ve not given it full consideration yet. Another example of where sequential consistency makes a difference is given in an answer to this stackoverflow question.

Conclusion

The hardware model I’ve presented is hopefully simple enough to understand while providing comprehensible rationale for the particulars of the release/acquire semantics provided by the language model. It may however not be correct in every aspect and is not complete; use it only for what it is. I’ve also discussed sequential consistency and why it is sometimes necessary to avoid deadlocks.

I haven’t discussed the “consume” memory ordering, and may leave that for another time, but I encourage anyone who’s interested to seek out explanations elsewhere. I hope that this post and the one preceding it are enough to give a good grounding in the issues addressed by the memory model.

Understanding the C/C++ memory model

I know that a lot of people struggle with understanding the memory model introduced in C11/C++11. In a recent conversation I was alerted to the existence of this blog post, which while correct (by my understanding) is in my opinion aimed at those who already have a good understanding of some of the underlying concepts. I’m going to try to set out my understanding of the model, which is hopefully correct, in more straightforward terms and with a more “ground up” explanation.

Edit: some redditors complained that the description of cache operation (such as it was) in the original version of this post contained technical inaccuracies, if interpreted as a description of current actual mainstream architecture. I believe those inaccuracies were not important for the understanding of the C/C++ memory model, however, they have now been rectified. Still: If you really want to understand anything about how processor architecture works, this is not the article for you. Rather, this is about a software model that is designed to abstract away much of that detail, and is deliberately light on hardware-level specifics.

Introduction: Program order versus memory order

You have your C/C++ program and it stores values (sets a variable or stores a value through a pointer) and it loads values (reads a variable or reads a value through a pointer). The order in which your program specifies these loads and stores is the “program order”. However, there are a number of things that can cause the order in which the stores and loads actually occur to your system’s main memory to be different to that of the program order. It’s worth noting that the effect of these can never be visible to a single-threaded program, because the compiler must make sure that the code behaves as if the loads/stores are in the order that the program specifies. Some examples:

  • the compiler will often cache values in registers rather than storing them to memory immediately;
  • the compiler may re-order reads and writes to memory when it can determine that such re-ordering won’t affect program behaviour (so long as the program obeys the language rules);
  • the processor may perform out-of-order execution in some cases;
  • even when the generated code stores values to memory rather than a register, the store might go into a processor cache and stay there for some time before it goes through to main memory;
  • similarly, loads may read values which are already in the cache, which means they don’t need to be read from main memory (at all);
  • finally, when a cache writes lines back out to main memory it may do so in an order that is different to the order in which the values entered the cache.

The good news is, you don’t really need to understand any of the above points: all you need to understand is that loads and stores can be re-ordered before they go to memory, but this won’t affect the actual behaviour of a single-threaded program.

Now for the important part: in a multi-threaded program, some of these re-orderings – together with some additional issues associated with the methods used by the hardware for maintaining cache coherency – also can cause different threads to see loads and stores made by other threads as occurring in a different order. It’s probably worth noting that a program in which this actually happened could well be exhibiting undefined behaviour – what should happen is that the program uses the memory consistency primitives to make sure that it sees a “correct” ordering. We’re coming to that now.

Atomic operations

If we forget about standard synchronisation tools such as mutexes, then there are two essential mechanisms to making ordering consistent between threads: atomic operations, and memory barriers.

Atomic operations are operations which complete indivisibly. One classic example is the “atomic increment”. A “classic” (non-atomic) increment can be broken into three separate parts: it loads a value from memory, it increments the value, and then it stores the value back to the same location in memory. In multi-threaded programs, the problem with this is that the value might be changed by another thread in between the load and store, and that results in storing the wrong value; the intermediate stored value is lost. In particular, consider two threads that are both trying to increment a value starting at 0:

  1. Thread A loads the value 0
  2. Thread B loads the value 0
  3. Thread A increments the value, calculating a result of 1
  4. Thread B also calculates a result of 1
  5. Thread A stores the value 1
  6. Thread B stores the value 1

The problem is that the value has gone from 0 to 1, despite there being two increments – one of the increments got lost. (Again, if this actually happened in a C/C++ program, that program would be exhibiting undefined behaviour). An atomic increment happens as one step, atomically (indivisibly). So, to avoid the problem above, you could use an atomic integer type and an atomic increment operation.
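
With C++11 atomics this looks like the following sketch (my example); note that fetch_add is the atomic increment, and that no increments can be lost:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void work() {
    for (int i = 0; i < 100000; i++) {
        counter.fetch_add(1);   // atomic increment: load, add and store
                                // happen as one indivisible step
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    std::cout << counter << "\n";   // always 200000: no increments lost
}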

Another classic atomic operation is “compare and exchange”. This can be used to implement a spin lock (a kind of mutex) – you treat a value of 0 as “unlocked” and a value of 1 as locked. Then, to lock the mutex, a thread can atomically compare the current value with 0 and, if it is equal to 0, set it to 1. The operation returns the original value – if 0, the lock was successfully acquired; if 1, then another thread has the lock and so the operation must be retried until the other thread releases the lock (typically this would be done in a loop, “spinning” on the lock until it becomes available).
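
A sketch (mine) of such a spin lock, using the default memory ordering for now (see the footnote at the end of this post):

#include <atomic>

std::atomic<int> lock_flag{0};   // 0 = unlocked, 1 = locked

void spin_lock() {
    int expected = 0;
    // atomically: if the value is 0, set it to 1; otherwise retry
    while (!lock_flag.compare_exchange_weak(expected, 1)) {
        expected = 0;   // compare_exchange updated 'expected'; reset it
    }
}

void spin_unlock() {
    lock_flag.store(0);
}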

Before we go any further, let’s note the following important points about atomic operations:

  • Atomic operations on a particular object (variable) always have a total ordering which is consistent with program ordering. This means that every atomic operation from every thread which operates on the same atomic variable happens either before or after every other operation, and if a thread does two operations one after the other then they will appear in the same sequence in the total ordering.
  • If an atomic operation is performed in one thread and another thread then performs another atomic operation on the same variable, the second thread will definitely see the effect of the operation performed by the first (assuming there are no other threads stepping in to perform another operation in the meantime).

What atomic operations do not necessarily do is enforce ordering on any loads and stores that occur around them (I say necessarily because, as you may be aware, you can specify memory ordering constraints when you perform an atomic operation*. We’re getting to that).

Consider the case of the spin lock that we discussed above: the lock is probably designed to protect some data structure. So you might have a sequence that looks like this:

  1. Thread A acquires lock
  2. Thread A stores values to data structure
  3. Thread A releases lock
  4. Thread B acquires lock
  5. Thread B reads values from data structure
  6. Thread B releases lock

The problem is that the atomic operations used to implement the lock by themselves don’t impose ordering on steps 2 and 5, so thread B might not see the values written by thread A, or even worse, might see some of the values but not all of them (this is a data race, and is not actually allowed to happen in a correct program; a correct program would use the appropriate memory ordering controls to prevent it from happening).

To prevent re-ordering of non-atomic loads and stores, we use barriers (aka fences).

Barriers

A barrier prevents loads and stores from being re-ordered past the barrier, in one or in both directions. Specifically: an acquire barrier operation prevents loads/stores which are after the barrier from being re-ordered to appear before the barrier, and a release barrier prevents loads/stores which are before the barrier from being moved after the barrier (there are other types of barrier operation, but I won’t cover them here right now).

Through the use of appropriate barriers, we can almost fix our spinlock example:

  1. Thread A acquires lock
  2. Thread A issues an acquire barrier
  3. Thread A stores values to the data structure
  4. Thread A issues a release barrier
  5. Thread A then releases the lock
  6. Thread B acquires the lock
  7. Thread B issues an acquire barrier
  8. Thread B reads values from the data structure
  9. Thread B issues a release barrier
  10. Thread B then releases the lock

There is still one minor issue, which is that the operations which actually acquire and release the lock could be moved past their corresponding barriers (e.g. step 5 could be re-ordered in front of step 4). There are two ways that C++11 resolves this:

First, you can combine atomic operations which acquire or release some lock with a barrier, so that the operation and the barrier are essentially indivisible (this is performed by passing the appropriate consistency constraint constant as a parameter to the atomic operation);

Second, a standalone barrier (std::atomic_thread_fence) actually acts as a stronger barrier than what is stated above. Specifically, a standalone release barrier actually also prevents subsequent stores – including atomic stores – from being moved in front of the barrier, and a standalone acquire barrier also prevents earlier loads from being moved behind the barrier.
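
A sketch (my example) of the standalone-barrier approach, pairing relaxed atomic operations with explicit fences:

#include <atomic>

std::atomic<bool> flag{false};
int data = 0;

void publish() {
    data = 123;
    // standalone release barrier: the store to 'data' cannot move
    // after it, and the following atomic store cannot move before it
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void consume() {
    while (!flag.load(std::memory_order_relaxed)) { }
    // standalone acquire barrier: the preceding atomic load cannot
    // move after it, and the following load of 'data' cannot move
    // before it
    std::atomic_thread_fence(std::memory_order_acquire);
    int d = data;   // safely reads 123
    (void)d;
}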

Conclusion and key take-aways

Some key points that I wanted to cover which some more advanced tutorials tended to gloss over:

  • Atomic operations and memory ordering constraints are really two separate things;
  • Memory barriers are useless without some kind of synchronisation (which atomic operations provide). Memory barriers do not themselves provide any synchronisation. Atomic operations themselves do not provide a memory barrier;
  • Atomic operations can have a memory ordering constraint specified as a parameter. This effectively creates a barrier of the specified type, and forces the atomic operation to stick on the appropriate side of that barrier.

There are of course a lot of things I didn’t cover here, but there’s plenty of other material out there which covers the concepts in more detail. I hope the explanations here, which I’ve tried to reduce to the bare essentials, are helpful to some.

* Edit: it should be noted that the default ordering constraint on atomic operations is actually memory_order_seq_cst, which implies both acquire and release semantics (depending on the operation) and provides additional guarantees beyond those. So, while an atomic operation doesn’t itself imply a memory barrier, an atomic operation with the default memory order constraint does. It may be preferable to use acquire/release semantics explicitly, however, since the default constraint can affect performance.

POSIX timer APIs are borked

I’m currently working on Dasynq, an event loop library in C++ (which is not yet in a state of being ready for use by external projects, though the functionality it currently exposes does work correctly as far as I know). It has come to the point where I want to add timer functionality, and this has been frustratingly tricky, mostly due to horribly designed APIs.

There are a few basic requirements to set out before I start:

  • There are essentially two types of timer – relative and absolute. I either want the timer to expire some given interval from now, or I want it to expire at some specific (“wall clock”) time. In the latter case, if the system time is changed the timeout should be suitably adjusted. (Example: if I set an alarm for 04:00, and the system time is changed by the user from 03:25 to 04:15, the alarm should expire immediately).
  • I want to be able to be sure that I can use a timer, at some point in the future. That is, I need to be able to allocate timers in advance (without necessarily arming them immediately) or at least to be able to re-set an existing timer to a specified timeout. I should be able to avoid the situation where I need a timer, which I knew I would need in advance, but am unable to create one due to resource limits / exhaustion.
  • I need a reasonable level of resolution. Timers should be usable for everything from running weekly tasks to animation timing.

With the above points in mind, let’s take a look at what POSIX provides.

POSIX timer APIs

First, the most basic timer-like call provided by POSIX is the alarm(…) function. It has second granularity, which rules it out immediately.

Then, there’s setitimer(…). This isn’t a particularly nice interface and delivers timer expiry via a signal. There is only one timer (well, there is one timer for each of several different kinds of clock), which means that to allow for multiple timers to be managed we need to essentially multiplex the single timer; by itself this isn’t such a huge problem, but other limitations of the API make it fundamentally difficult, and the API is pretty broken to begin with in several ways.

The first problem with setitimer is that the interval timers deliver timeout events via signals, which is awkward, especially since setitimer itself is not async-signal-safe, meaning you can’t call it from within the signal handler to set the next desired timeout; if you want to multiplex multiple timeouts over the interval timer interface, you’re forced to turn asynchronous event notifications into synchronous events (which of course is what libraries like Dasynq are all about, so by itself this isn’t a huge problem).

The next problem with setitimer is that it only allows setting a relative timeout. If I want to get a timer notification at an absolute time, I need to get the current clock time (clock_gettime), calculate the time remaining until that time, and then set the timer. Not allowing an absolute timeout means that setting an alarm for a wall-clock time is pretty much impossible – since if the system time is changed by the user, the timer’s timeout interval won’t be adjusted. However, there’s a more subtle issue here: time might elapse between the calculation and setting the timer – the process could be preempted just after calculating the interval, and in unusual cases might not be scheduled again for a significant period of time. By the time it finally arms the timer, the interval is significantly incorrect. The only way to work around this (that I can think of, other than pretending that the problem doesn’t exist) is to check the clock time immediately after setting the interval, to make sure that it’s within a certain window of tolerance of the original measurement – and if not, to re-calculate the interval and reset the timer.
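
A sketch (my own) of that dance; ‘target’ (an absolute CLOCK_REALTIME time) and ‘tolerance_usec’ are hypothetical parameters:

#include <sys/time.h>
#include <ctime>

void arm_timer_for(const timespec &target, long tolerance_usec) {
    for (;;) {
        timespec now;
        clock_gettime(CLOCK_REALTIME, &now);    // clock call #1

        // calculate the remaining interval
        long sec = target.tv_sec - now.tv_sec;
        long usec = (target.tv_nsec - now.tv_nsec) / 1000;
        if (usec < 0) { usec += 1000000; sec--; }
        if (sec < 0) { sec = 0; usec = 1; }     // already due: fire ASAP

        itimerval itv {};
        itv.it_value.tv_sec = sec;
        itv.it_value.tv_usec = usec;
        setitimer(ITIMER_REAL, &itv, nullptr);

        // clock call #2: if we were pre-empted for too long between
        // measuring and arming, re-calculate and re-arm
        timespec after;
        clock_gettime(CLOCK_REALTIME, &after);
        long drift = (after.tv_sec - now.tv_sec) * 1000000
                   + (after.tv_nsec - now.tv_nsec) / 1000;
        if (drift <= tolerance_usec) break;
    }
}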

The safety check just described requires a minimum of two clock_gettime calls (which generally means two calls into the kernel) just for setting a timer. If multiple timers are being managed over the top of a single interval timer, that’s going to mean that two clock_gettime calls are required on each timer expiry.

Generalised POSIX timer interface

The real-time POSIX extensions also define timer_create, which appears to solve some of the problems above:

  • It allows the creation of multiple independent timers
  • It allows for specifying absolute (as well as relative) timeouts, and when using the realtime clock the timeout interval will be adjusted appropriately (for absolute timeouts) if the system time is altered

However, notification is still either via a signal, or via a thread (SIGEV_THREAD); the latter is problematic for implementations, because there is usually no way to detect notification failure if a thread cannot be created due to resource limits, and because it requires userspace support; on Linux you need to link with -lrt (and thus also pull in the pthreads library) to use timer_create etc, even if you don’t use SIGEV_THREAD. On OpenBSD the situation is worse – the realtime extensions are generally not supported, and timer_create et al are not available at all.

Even using timers with SIGEV_SIGNAL notifications is less than ideal. Using such timers in different threads requires cooperation between all threads, to either choose different signal numbers for notification or otherwise to have a common signal handler somehow suitably dispatch notifications to the correct thread.

Non-POSIX solutions

On Linux, timerfds (timerfd_create et al) provide an apparently sane solution to the whole messy problem – it supports multiple timers, supports absolute and relative timeouts, and delivers events by file handle notifications (can be used with select/poll/epoll etc). A large number of timers could be feasibly multiplexed over a single timerfd, which is good from a resource management perspective.
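
A minimal sketch (mine, Linux-specific) of an absolute timerfd timer:

#include <sys/timerfd.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <ctime>

int main() {
    // a timer whose expiry is delivered via a file descriptor,
    // usable with select/poll/epoll
    int fd = timerfd_create(CLOCK_REALTIME, 0);
    if (fd < 0) { perror("timerfd_create"); return 1; }

    // absolute expiry: with TFD_TIMER_ABSTIME, it_value is a
    // wall-clock time, adjusted if the system clock is changed
    itimerspec its {};
    clock_gettime(CLOCK_REALTIME, &its.it_value);
    its.it_value.tv_sec += 5;    // an absolute time, 5 seconds from now
    timerfd_settime(fd, TFD_TIMER_ABSTIME, &its, nullptr);

    uint64_t expirations;
    read(fd, &expirations, sizeof expirations);   // blocks until expiry
    std::printf("timer expired %llu time(s)\n",
                (unsigned long long)expirations);
    return 0;
}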

On OpenBSD (and various other BSDs) there is the option of using kqueue timers. This supports multiple timers, and neatly solves the problem of notification; however it only allows relative timeouts, and the timeout/interval cannot be changed, meaning that multiple timers cannot be multiplexed over a single kqueue timer. Even worse, it is not possible to pre-allocate timers; once created, they begin countdown immediately. This makes it impossible to discover resource allocation failure until the point that the timer is actually needed.

In Conclusion

The POSIX timer APIs are awkward and clunky. The setitimer functions support only limited use cases, while the generalised interface is difficult to use in a library (since it either requires signal handling or multi-threading). On Linux, the timerfd interface is an ideal substitute. On other systems the general timer interface can be used, with some caveats and trade-offs, but it is not always available; on systems where it is not, and there is no system-specific replacement, the only option for wall-clock timers is to assume that the system clock does not change (other than by the usual tick) while the system is running.

5 Gripes with C++

First of all, let me say that I actually like C++ as a programming language. This makes me a rarity among my associates, but in terms of a systems programming language it is, in my opinion, currently strides ahead of any existing alternative (especially C). But that’s enough of that; this post isn’t about how great C++ is; this post is in fact about a few of the things that I don’t like about C++. Here they are, in order of most to least annoying:

1. The stupid “empty base class optimization”

This is that thing where, if you have some class A that is empty (contains no data members and no virtual functions, thus not requiring a vtable and not, theoretically, requiring any space at all), then you discover that it’s not really empty because if you include it as a member in some other class, it will take up space.

class A { };
class B { };
class C { A a; B b; };

Now, sizeof(A)? Yeah, not 0. It will come out as 1. Same with sizeof(B) (which should not be surprising). And sizeof(C) is 2, which again is not surprising. How about if we change the definition of C though:

class C : A { B b; };

Now we get sizeof(C) = 1. You see, it turns out that objects of the same type are required to have distinct addresses – specifically: “Two distinct objects that are neither bit-fields nor base class subobjects of zero size shall have distinct addresses” – and the sizeof() any class type must be at least 1; but because A and B are different types, and thanks to a special exception in the C++ language spec (C++11 1.8 para 5) that “Base class subobjects may have zero size”, it is possible to locate the A (base class) subobject and the B (member) object at the same location, and the overall size of the derived class is reduced.

In a language with such fantastic meta-programming capabilities, where empty classes often serve as a way of containing a set of type traits for use in a template, this is significant. (A cheap example: C++ container types are templates with an element type and an allocator type; the container contains a member that is of the allocator type. Often, the allocator is an empty object, since it has no state; for example it just allocates memory using malloc()/free()).
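
For illustration, a sketch (mine) of how the inheritance trick is typically applied:

#include <cstddef>

template <typename T, typename Alloc>
class vec_storage : private Alloc {   // empty base: contributes no size
public:
    Alloc &allocator() { return *this; }   // the "member" is recovered by upcast
private:
    T *data = nullptr;
    std::size_t capacity = 0;
};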

So, ok, there is a trick to optimize size of objects by using inheritance, as shown above. In standard library implementations, this trick tends to be used heavily, because it can have a significant impact; implementations of std::pair and std::tuple, for instance, will generally use it to collapse empty members to zero size. That seems like a good idea, so why am I calling it stupid?

Because it shouldn’t be necessary.

The problem is that applying it disfigures the structure of your types. You end up inheriting from some type just because you want to make use of the empty base-class optimization, and your code becomes a right mess where accessing what should have been a member is now done instead by casting the “this” pointer… to make matters worse, you have to be careful when you use it that the potentially “empty” class really is empty, since if it has virtual methods you run the risk of accidentally overriding them.

There might be some good reasons to ensure that objects of the same type are always allocated at different addresses, but those reasons often don’t apply to the sort of classes that tend to be empty. It would be so easy, so very easy, to have some attribute (either on the type, or on members, or even both) saying that “this (object/type) does not need a unique address”, but for years we’ve instead had to perform acrobatics with our code to make use of what should be a simple and straightforward optimization.

Edit 2018.06.04: Woot! [[no_unique_address]] is in C++20.
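
With that attribute, the acrobatics become unnecessary; a sketch:

struct Empty { };

struct D {
    [[no_unique_address]] Empty e;   // may share an address with 'x'
    int x;
};

// With a conforming compiler, sizeof(D) == sizeof(int) becomes possible
// without any inheritance tricks (the attribute permits, but does not
// require, the overlap).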

2. Broken encapsulation model

So “private inheritance is for is-implemented-in-terms-of” and “public inheritance is for is-a relationships” are claims you may have heard at some point or other. I have no beef with how public inheritance works, but private inheritance is another kettle of fish.

Essentially private inheritance of some class X says, “I will be implemented via X. I will not be seen as an X to outside observers, however, I may pass myself of as an X when I deem it necessary to do so”. This is I suppose good for things like listener interfaces, where you want to receive events from another source (and so you need to inherit the event-listener base class) but you don’t want to expose the listener methods elsewhere. You still need to override some of the base class methods (otherwise, you could’ve used composition instead of inheritance: that is, have a member of type X, rather than privately inheriting from X).

Right, so what’s the problem? The problem is that it is still possible to override virtual private methods, including methods which are private by virtue of private inheritance by a class further up the hierarchy. If you have a class A, and a class B that privately inherits A, and then a class C that inherits B (publicly or privately), C shouldn’t know or care about B’s relationship to A, right? But it so happens that if you accidentally name a method (with an appropriate signature) the same as a method from A, you will now override that method and suitably screw up everything. That’s the problem: private inheritance is not private enough (a sketch below makes this concrete). Although, to be honest, I could envisage other changes to the language that could do away with the need for private inheritance altogether, which brings me to my next point.
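
A sketch (my example) of the accidental override:

#include <cstdio>

struct A {
    virtual void refresh() { std::puts("A::refresh"); }  // part of A's interface
};

class B : private A {
    // B is implemented in terms of A; outsiders cannot see the relationship...
};

class C : public B {
public:
    // ...yet this method, intended as brand new, silently overrides
    // A::refresh (access control does not affect virtual overriding)
    void refresh() { std::puts("C::refresh"); }
};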

3. Container object from member subobject is non-standard

Suppose I have an object of type A with a member, b, of type B. Further suppose that I have a pointer to the member b; maybe it is even a “this” pointer, because I am implementing a method in the B class. Now, if I know my B object is a singular member of an A container object, I should be able to convert a pointer-to-B to a pointer-to-A which points at the container object easily enough, right? Something like:

char * c = reinterpret_cast<char *>(b_ptr);
A * a_ptr = reinterpret_cast<A *>(c - offsetof(A, b));

Easy, right? Now… hmm… I know C++’s private inheritance actually breaks encapsulation principles (see above), but could I use this little trick to overcome that problem? Let’s say I want A to “privately inherit” from some class C. Instead of using actual private inheritance in A, I use inheritance (public or private, doesn’t matter) in my member class B, and I make “B b;” a private member of A. This truly hides the relationship between A and C, since there’s no way I could subclass A and accidentally override one of C’s methods. If the overridden method (which is now in B) needs to access any of A’s data or methods, that’s fine, I can use the method above to do so; it’s a little ugly, but it works… right?

Well, yeah, it does work; it’s just that it’s not standard. “offsetof” is only required to work for plain-old-data types (which among other restrictions don’t contain any virtual methods, or any members that do). This amazingly-useful-in-the-real-world technique isn’t actually required to work by the language (in fact it explicitly classifies it as “undefined”).

The standards-compliant alternative of having an explicit pointer member in the sub-object which points to the containing object works but has a runtime cost. So, you’re faced with a choice: leaky encapsulation via private inheritance, or runtime penalty due to unnecessary extra pointer storage.

What I’d really like to see is a straightforward syntax which directly supported this technique, instead of having to jump through reinterpret_cast/offsetof-hoops to use it (only to be then warned by the compiler that your code is non-compliant). It would be easy enough to do this in such a way that it delivered the expected performance gain in real-world compilers while still behaving correctly in theoretical compilers which store objects via hashtables or something equally daft.

4. No proper mixins

What C++ programmers call “the mixin pattern” is inheritance-of-template-parameter, a technique that is occasionally useful to augment a class via another “mixin” class (usually designed for the purpose). So for example if I have:

template <class T> class A : public T { /* ... */ };

… then I can “mix in” any class that I like, causing the resulting template instantiation to include its methods. The main problem with this approach is that the mixed-in class is unable to call any methods from the class it is mixed into; it is, after all, not a true mix-in – it’s just plain old public inheritance, and that’s a one way street. The most direct way to work around this is to declare virtual methods in the mixin class which will then be overridden in the target class (see the sketch below), but this has a performance overhead and also has the unfortunate effect, potentially, of allowing these methods to be accidentally overridden in subclasses of A<T>.
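
A sketch (my example) of that workaround: the mixin declares a virtual hook which the class it is mixed into overrides:

#include <cstdio>

struct LoggingMixin {
    void log_twice() { log(); log(); }
    virtual void log() { std::puts("(no log method)"); }   // the hook
};

template <class T>
class Widget : public T {
public:
    void log() override { std::puts("widget"); }   // overrides the mixin's hook
};

int main() {
    Widget<LoggingMixin> w;
    w.log_twice();   // prints "widget" twice
}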

So, it would be sorta nice if there were real mixins – where I could just declare mixin classes specially, and then pull them into another class via some declaration (or even just overload the inheritance syntax). Obviously this would probably require the whole source of the mixin to be included in a header, but that’s already the case with templates anyway. The mixin classes would somehow need to declare members that they expect the mixed-to (or other mixed-in) class to provide.

5. There should be more flexibility in dealing with inherited members

We’re now scraping the bottom of the barrel a little, as the four points above are the main gripes I have with C++; but 5 is a nice round number.

Basically my complaint here is that names are fixed in the base class and can’t be changed in the derived class. If I have a class A with virtual method m and I publicly derive from A in another class B, then in B the method is also called m, and if I want to override it I have to use the same name, m, throughout the entire class hierarchy from that point. If I’m desperate enough I could implement a new method f which just delegated to m, and I could even make m final at the same time so that everyone’s forced to override f instead from that point, but of course there’s a runtime overhead.

Why can’t I just rename methods? Why can’t I say, “from this point in the hierarchy on, method m will now be called f”? (Or more accurately: method f overrides method m).

It seems like a small thing, but occasionally I’ve wanted something like this. There are other related issues: I can shadow a base class non-virtual method, why can’t I shadow a final method? How am I supposed to deal with multiple base classes declaring same-name same-signature methods that I need to override separately in a derived class (especially considering I need all the help I can get if I’m forced to use multiple inheritance, right?) Why can’t I remove a base-class method from visibility (causing it to be shadowed rather than overridden in further derived classes)? And of course, why can a class override a base class private method at all? (eh-hmm broken encapsulation model).

Conclusion

That about rounds it out. 5 things about C++ that I would like to see improved. Just throwing it out there… who knows, maybe someone on the committee will pay attention… pretty please?

OpenGL spec for glDrawRangeElementsBaseVertex is rubbish

The title says it all: the spec for the glDrawRangeElementsBaseVertex function is rubbish.

glDrawRangeElementsBaseVertex is a restricted form of glDrawElementsBaseVertex.

Ok, but:

mode, start, end, count and basevertex match the corresponding arguments to glDrawElementsBaseVertex, with the additional constraint that all values in the array indices must lie between start and end, inclusive, prior to adding basevertex.

glDrawElementsBaseVertex doesn’t have a start or end argument. Perhaps the above should say “mode, count, type, indices and basevertex”, since type and indices seem to have the same meaning for both functions?

Index values lying outside the range [start, end] are treated in the same way as glDrawElementsBaseVertex

But… you just said that all the index values must be inside that range. Perhaps substitute “outside” with “inside” to make this sentence make sense?

Does no-one proof-read this stuff? Bug submitted.

Update: so it turns out that ‘in the same way as glDrawElementsBaseVertex’ is supposed to mean ‘in an implementation-defined manner consistent with how similarly out-of-range indices are treated by glDrawElementsBaseVertex’. I feel like the wording could be much clearer but I’m not going to argue this one. The parameter specifications are clearly incorrect and this should be fixed in a revision.

Why is there no decent, simple, structured text format?

So I want a structured text format usable for configuration files and data interchange. My key requirements can be boiled down to:

  • Syntax allows concise expression (this alone rules out XML)
  • Simple to parse (this also rules out XML)
  • Suitable for human “consumption” (reading, editing). To some degree, this rules out XML.

As you can see, XML is definitely out. But oddly enough, I’m struggling to find anything I’m really happy with.

Oh yeah, I know, I know, JSON, right? I have three main problems with JSON:

1. Excessive quotation marks required

So for a simple set of key-values I need something like:

{
    "key1" : "overquoted",
    "key2" : "the perils of javascript"
}

This isn’t crazy bad, but why are the quotes even necessary? Wouldn’t it be nice if I could instead write:

{
    key1 : overquoted,
    key2 : "the perils of javascript"
}

Given that an alphabetical character may not otherwise legitimately appear at the point where key1 and key2 do, what would be the harm in allowing this? (Sure, if you want spaces or punctuation in your key then you should have to quote it, but not otherwise.) Similarly for values: it would be nice if those unnecessary quotes weren’t actually required.

2. No comments

This one really irks me. For a configuration file, comments are pretty much mandatory. Douglas Crockford gives an explanation for why there are no comments in JSON (why he removed comments from the spec, in fact), and it sucks. Basically: people weren’t using comments the way I wanted them to, so I removed comments. Yeah, I think we just hit ludicrous speed. There are so many things wrong with this argument that I barely know where to begin. At the very outset, anyone using comments as parsing directives was going to need a custom parser anyway – what they were dealing with wasn’t plain JSON – so removing comments from the spec doesn’t affect those people at all; they will continue to use their custom parsers. All that removing comments achieves is to make the standard less useful generally. The follow-up:

Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.

… is equally ridiculous. Sure, I could strip comments before handing the file off to the parser, but then my original data isn’t actually JSON, is it? And so interoperability is destroyed anyway, because I can no longer use any standard-JSON tools on my configuration files.

3. Unclear semantics

The current RFC for JSON has, in my opinion, a lot of guff about implementations that just doesn’t belong in a specification. Problematically, this discussion could be seen to legitimise limitations of implementations. Take for example:

An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings. When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.

What does that mean, exactly? That it’s allowed that objects with non-unique names cause software to behave “unpredictably”? This is meant to be a specification imposing requirements on implementations, but it’s hard to glean from this text precisely what the requirements are, in particular because (just above that):

The names within an object SHOULD be unique.

That’s SHOULD, not MUST (see RFC 2119), which implies that non-unique names are in fact permissible and must be handled by an implementation. I wonder how many JSON libraries properly represent such an object… not many, I’d guess.
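
To make the problem concrete: the following is syntactically valid JSON, yet the RFC gives no definitive answer as to which value “key” ends up mapped to:

{
    "key" : "first",
    "key" : "second"
}

In practice, many parsers silently keep just one of the pairs (typically the last), while others might report an error or expose both; as far as I can tell, the RFC permits all of these behaviours.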

So How About YAML?

YAML does solve the problems that I identified with JSON above: it doesn’t require superfluous quotation marks, it’s clear about map semantics, and it allows comments. On the other hand, its specification is quite large. Part of the complexity comes from the concept of tags, which are a way of identifying types. The most basic schema (the “failsafe” schema) deals only with maps, sequences and strings, but any value can be explicitly tagged as a particular type; a schema notionally defines how tags are mapped to actual types, and may also specify rules for determining the type of otherwise untagged ‘plain scalar’ (roughly: unquoted string) values. So, for instance, the JSON schema – which makes YAML a superset of JSON – maps sequences of digits to an integer type rather than the string type. The fact that different schemas yield different semantics, and that arbitrary types (which a given implementation may not know how to handle) can be assigned to values, in my opinion reduces YAML’s value as an interchange format.

(For instance, a tool which merges two YAML maps needs to know whether 123 and “123” are the same or not. If using the failsafe schema, they are strings and are the same; if using the JSON schema, one is a number and they are not the same).
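
A small (hypothetical) snippet showing how tags and schemas interact:

# Under the failsafe schema, both values below are just strings;
# under the JSON schema, the first is the integer 123.
count: 123
label: "123"
# An explicit tag forces the type regardless of schema:
forced: !!str 123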

In fact, the whole notion of schemas leads to the question of whether it is really up to the text format to decide what type plain nodes are. Certainly, maps and sequences have a distinct type and are usually unambiguous – even YAML doesn’t allow a schema to re-define those – and they are enough to represent any data structure (in fact, just sequences would be enough). I also think it’s worthwhile having a standard quoting mechanism for strings, which is necessary in some cases to disambiguate scalar values from structures. Beyond that, though, it seems best to let the application determine how to interpret each scalar string (potentially using regular expressions for this, as YAML schemas do), while for purposes of document structure scalars remain just strings. This is essentially what the YAML failsafe schema does (and it even allows distinguishing quoted strings from unquoted strings, since the latter are tagged with the ‘?’ unknown type).

It’s worth noting that YAML can handle recursive structures – sequences or maps that contain themselves as members, either directly or indirectly. This isn’t useful for my needs, but it could be for some applications. On the other hand, I imagine it greatly complicates the implementation of parsers, and it could be used for attacks on poorly coded applications (unbounded recursion leading to stack overflow, for instance).
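
For illustration, a self-containing sequence can be written using YAML’s anchor and alias syntax:

# A sequence whose single element is the sequence itself:
--- &self
- *self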

Or TOML?

TOML is a relative newcomer on the scene of simple structured text formats. It has a fixed set of supported types rather than allowing schemas as YAML does, and generally aims to be a simpler format; on the other hand, its syntax is much closer to YAML’s than to JSON’s, so it is much easier to read and edit by hand.

Among the supported types are the standard map / sequence / string, but also integer, float, boolean and date-time. This seems fine, though again I’m uncertain that having more than just the basic “string” scalar type is really necessary; on the other hand, having these types properly standardised is unlikely to cause any harm.

I think the one downside to TOML is the ungainly syntax for sequences of maps – it requires double-square brackets with the name of the sequence repeated for each element:

[[sequencename]]
key1 = "value1"
key2 = "value2"

[[sequencename]]
key1 = "value1"
key2 = "value2"
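
For comparison, the same structure expressed in JSON – the whole document is a map containing, under “sequencename”, a sequence of two maps:

{
    "sequencename" : [
        { "key1" : "value1", "key2" : "value2" },
        { "key1" : "value1", "key2" : "value2" }
    ]
}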

Nested maps are also a bit verbose, requiring the parent map name to be given as a prefix to the child map name:

[parent]
[parent.child]
key1 = "value1"  # this key and all following keys are in the child map

The top-level node of a TOML structure, if I understand correctly, must always be a map, since you specify key-value pairs. This is probably not a huge concern for my purposes, but it is certainly a limitation of the format. Once you’ve opened a map (“table” in TOML parlance) there’s also no way to close it, it seems, other than by opening another table.

I think the occasional ugliness of the syntax, together with the immaturity of the format, is a deal breaker.

And so the winner…

… is probably YAML, at this stage, with the failsafe schema, although the potential for recursive structures makes me a little uneasy, and it’d be nicer if I didn’t have to explicitly choose a schema. It’s also a shame that the spec is so wordy and complex, but the syntax itself is nice enough, I think, and seems like a better fit than either JSON or TOML.