The world is on fire, but not because of C23

New ad-free location: https://davmac.org/blog/the-world-is-on-fire-but-not-c23.html

An article was published on acmqueue last month entitled “Catch-23: The New C Standard Sets the World on Fire”. It’s a bad article; I’ll go into that shortly; it comes on the back of similar bad arguments that I’ve personally either stumbled across, or been involved in, lately, so I’d like to respond to it.

So what’s bad about the article? Let me quote:

Progress means draining swamps and fencing off tar pits, but C23 actually expands one of C’s most notorious traps for the unwary. All C standards from C89 onward have permitted compilers to delete code paths containing undefined operations—which compilers merrily do, much to the surprise and outrage of coders.16 C23 introduces a new mechanism for astonishing elision: By marking a code path with the new unreachable annotation,12 the programmer assures the compiler that control will never reach it and thereby explicitly invites the compiler to elide the marked path.

The complaint here seems to be that the addition of the “unreachable” annotation introduces another way that undefined behaviour can be introduced into a C program: if the annotation is incorrect and the program point is actually reachable, the behaviour is undefined.

The premise seems to be that, since undefined behaviour is bad, any new way of invoking undefined behaviour is also bad. While it’s true that undefined behaviour can cause huge problems, and that this has been a sore point with C for some time, the complaint here is fallacious: the whole point of this particular annotation is to allow the programmer to specify their intent that a certain path be unreachable, as opposed to the more problematic scenario where a compiler determines code is unreachable because a necessarily-preceding (“dominating”) instruction would necessarily have undefined behaviour. In other words it better allows the compiler to distinguish between cases where the programmer has made particular assumptions versus when they have unwittingly invoked undefined behaviour in certain conditions. This is spelled out in the proposal for the feature:

It seems, that in some cases there is an implicit assumption that code that makes a po-
tentially undefined access, for example, does so willingly; the fact that such an access is
unprotected is interpreted as an assertion that the code will never be used in a way that
makes that undefined access. Where such an assumption may be correct for highly spe-
cialized code written by top tier programmers that know their undefined behavior, we are
convinced that the large majority of such cases are just plain bugs

That is to say, while it certainly does add “another mechanism to introduce undefined behaviour”, part of the purpose of the feature is actually to limit the damage and/or improve the diagnosis of code which unintentionally invokes undefined behaviour. It’s concerning that the authors of the article have apparently not read the requisite background information, or have failed to understand it, but have gone ahead with their criticism anyway.

In fact, they continue:

C23 furthermore gives the compiler license to use an unreachable annotation on one code path to justify removing, without notice or warning, an entirely different code path that is not marked unreachable: see the discussion of puts() in Example 1 on page 316 of N3054.9

When you read the proffered example, you quickly notice that this description is wrong: the “entirely different code path” is in fact dominated by the code path that is marked unreachable and thus is, in fact, the same code path. I.e. it is not possible to reach one part without reaching the other, so they are on the same path; if either are unreachable, then clearly the other must also be; if one contains undefined behaviour then the entire code path does and (just as before C23) the compiler might choose to eliminate it. Bizarrely, the authors are complaining about a possible effect of undefined behaviour as if it was a new thing resulting from this new feature.

The criticism seems to stem at least partly from the notion that undefined behaviour is universally bad. I can sympathise with this to some degree, since the possible effects of UB are notorious, but at the same time it should be obvious that railing against all new undefined behaviour in C is unproductive. There will always be UB in C. The nature of the language, and what it is used for, all but guarantee this. However, it’s important to recognise that the UB is there for a reason, and also that not all UB is of the “nasal daemons” variety. While those who really understand UB can often be heard to decry “don’t do that, it’s UB, it might kill your dog” there is also an important counterpoint: most UB will not kill your dog.

“Undefined behaviour” does not, in fact, mean that a compiler must cause your code to behave differently than how you wanted it to. In fact it’s perfectly fine for a compiler (“an implementation”) to offer behavioural guarantees far beyond what is required by the language standard, or to at least provide predictable behaviour in cases that are technically UB. This is an important point in the context of another complaint in the article:

Imagine, then, my dismay when I learned that C23 declares realloc(ptr,0) to be undefined behavior, thereby pulling the rug out from under a widespread and exemplary pattern deliberately condoned by C89 through C11. So much for stare decisis. Compile idiomatic realloc code as C23 and the compiler might maul the source in most astonishing ways and your machine could ignite at runtime.16

To be clear, “realloc(ptr,0)” was previously allowed to return either NULL, or another pointer value (which is not allowed to be dereferenced). Different implementations can (and do) differ in their choice. While I somewhat agree that making this undefined behaviour instead is of no value, I’m also confident that it will have next to zero effect in practice. Standard library implementations aren’t going to change their current behaviour because they won’t want to break existing programs, and compilers won’t treat realloc with size 0 specially for the same reason (beyond perhaps offering a warning when such cases are detected statically). Also, calling code which relies on realloc returning a null pointer when given a 0 size “idiomatic” is a stretch, “exemplary” is an even further stretch, and “condoned by C89 through C11” is just plain wrong; the standard rationale suggests a practice for implementations, not for applications.

Later, the authors reveal serious misunderstandings about what behaviour is and is not undefined:

Why are such requests made? Often because of arithmetic bugs. And what is a non-null pointer from malloc(0) good for? Absolutely nothing, except shooting yourself in the foot.

It is illegal to dereference such a pointer or even compare it to any other non-null pointer (recall that pointer comparisons are combustible if they involve different objects).

This isn’t true. It’s perfectly valid to compare pointers that point to different objects. Perhaps the author is confusing this with the use of a pointer to an object whose lifetime has ended, or calculating the difference between pointers; it’s really not clear. Note that the requirement (before C23) for a 0-size allocation was that “either a null pointer is returned, or the behavior is as if the size were some nonzero value, except that the returned pointer shall not be used to access an object”; there’s no reason why such a pointer couldn’t be compared to another in either case.

It’s a shame that such hyperbolic nonsense gets published, even in on-line form.

Is C++ type-safe? (There’s two right answers)

I recently allowed myself to be embroiled in an online discussion regarding Rust and C++. It started with a comment (from someone else) complaining how Rust advocates have a tendency to hijack C++ discussions and suggesting that C++ was type-safe, which was responded to by a Rust advocate first saying that C++ wasn’t type-safe (because casts, and unchecked bounds accesses, and unchecked lifetime), and then going on to make an extreme claim about C++’s type system which I won’t repeat here because I don’t want to re-hash that particular argument. Anyway, I weighed in trying to make the point that it was a ridiculous claim, but also made the (usual) mistake of also picking at other parts of the comment, in this case regarding the type-safety assertion, which is thorny because I don’t know if many people really understand properly what “type-safety” is (I think I somewhat messed it up myself in that particular conversation).

So what exactly is “type-safety”? Part of the problem is that it is an overloaded term. The Rust advocate picked some parts of the definition from the wikipedia article and tried to use these to show that C++ is “not type-safe”, but they skipped the fundamental introductory paragraph, which I’ll reproduce here:

In computer science, type safety is the extent to which a programming language discourages or prevents type errors

https://en.wikipedia.org/wiki/Type_safety

I want to come back to that, but for now, also note that it offers this, on what constitutes a type error:

A type error is erroneous or undesirable program behaviour caused by a discrepancy between differing data types for the program’s constants, variables, and methods (functions), e.g., treating an integer (int) as a floating-point number (float).

… which is not hugely helpful because it doesn’t really say it means to “treat” a value of one type as another type. It could mean that we supply a value (via an expression) that has a type not matching that required by an operation which is applied to it, though in that case it’s not a great example, since treating an integer as a floating point is, in many languages, perfectly possible and unlikely to result in undesirable program behaviour; it could perhaps also be referring to type-punning, the process of re-interpreting a bit pattern which represents a value on one type as representing a value in another type. Again, I want to come back to this, but there’s one more thing that ought to be explored, and that’s the sentence at the end of the paragraph:

The formal type-theoretic definition of type safety is considerably stronger than what is understood by most programmers.

I found quite a good discussion of type-theoretic type safety in this post by Thiago Silva. They discuss two definitions, but the first (from Luca Cardelli) at least boils down to “if undefined behaviour is invoked, a program is not type-safe”. Now, we could extend that to a language, in terms of whether the language allows a non-type-safe program to be executed, and that would make C++ non-type-safe. However, also note that this form of type-safety is a binary: a language either is or is not type-safe. Also note that the definition here allows a type-safe program to raise type errors, in contrast to the introductory statement from wikipedia, and Silva implies that a type error occurs when an operation is attempted on a type to which it doesn’t apply, that is, it is not about type-punning:

In the “untyped languages” group, he notes we can see them equivalently as “unityped” and, since the “universal type” type checks on all operations, these languages are also well-typed. In other words, in theory, there are no forbidden errors (i.e. type errors) on programs written in these languages

Thiago Silva

I.e. with dynamic typing “everything is the same type”, and any operation can be applied to any value (though doing so might provoke an error, depending on what the value represents), so there’s no possibility of type error, because a type error occurs when you apply an operation to a type for which it is not allowed.

The second definition discussed by Silva (i.e. that of Benjamin C. Pierce) is a bit different, but can probably be fundamentally equated with the first (consider “stuck” as meaning “has undefined behaviour” when you read Silva’s post).

This notion of type error as an operation illegal on certain argument type(s) is also supported by a quote from the original wiki page:

A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data.

Vijay Saraswat

So where are we? In formal type-theoretic language, we would say that:

  • type safety is (confusingly!) concerned with whether a program has errors which result in arbitrary (undefined) behaviour, and not so much about type errors
  • in fact, type errors may be raised during execution of a type-safe program.
  • C++ is not type-safe, because it has undefined behaviour

Further, we have a generally-accepted notion of type error:

  • a type error is when an attempt is made to apply an operation to a type of argument to which it does not apply

(which, ok, makes the initial example of a type error on the wikipedia page fantastically bad, but is not inconsistent with the page generally).

Now, let me quote the introductory sentence again, with my own emphasis this time:

In computer science, type safety is the extent to which a programming language discourages or prevents type errors

This seems to be more of a “layman’s definition” of type safety, and together with the notion of type error as outlined above, certainly explains why the top-voted stackoverflow answer for “what is type-safe?” says:

Type safety means that the compiler will validate types while compiling, and throw an error if you try to assign the wrong type to a variable

That is, static type-checking certainly is designed to prevent operations that are illegal according to argument type from being executed, and thus have a degree of type-safety.

So, we have a formal definition of type-safety, which in fact has very little to do with types within a program and more to do with (the possibility of) undefined behaviour; and we have a layman’s definition, which says that type-safety is about avoiding type errors.

The formal definition explains why you can easily find references asserting that C++ is not type-safe (but that Java, for example, is). The informal definition, on the other hand, clearly allows us to say that C++ has reasonably good type-safety.

Clearly, it’s a bit of a mess.

How to resolve this? I guess I’d argue that “memory-safe” is a better understood term than the formal “type-safe”, and since in many cases lack of the latter results from lack of the former we should just use it as the better of the two (or otherwise make specific reference to “undefined behaviour”, which is probably also better understood and less ambiguous). For the layman’s variant we might use terms like “strongly typed” and “statically type-checked”, rather than “type-safe”, depending on where exactly we think the type-safety comes from.

Understanding Git in 5 minutes

Git it seems is known for being confusing and difficult to learn. If you are transitioning from a “traditional” versioning system such as CVS or Subversion, here are the things you need to know:

  • A “working copy” in Subversion is a copy of the various files in a subversion repository, together with metadata linking it back to the repository. When using Git, your working copy (sometimes referred to as “working directory”, apparently, in Git parlance) is actually hosted inside a local copy (clone) of the remote repository. To create this clone you use the “git clone” command. So, generally, you would use “git clone” where you would have used “svn checkout”.
  • A repository has a collection of branches, some of which may be remote-tracking branches (exact copies of the upstream repository), and the rest of which are local branches (which generally have an associated upstream and remote-tracking branch).
  • In Git you make changes by committing them to the local branch. You can later push these commits upstream, which (if successful) also updates the associated remote tracking branch in your repository.
  • But actually, commit is a two-stage operation in Git. First you stage the files you want to commit (“git add” command), then you perform the commit (“git commit”).
  • You can fetch any new changes from the remote repository into a remote tracking branch, and you can merge these changes into your local branch; You can combine both operations by doing a pull (“git pull”), which is the normal way of doing it.
  • A Git “checkout” is to replace your working copy with a copy of a branch. So you switch branches use the “git checkout” command. You cannot (or at least don’t normally) checkout remote tracking branches directly; you instead checkout the associated local branch.
  • So actually a Git repository contains these different things:
    • A collection of branches, both local and remote-tracking;
    • A working copy
    • A “stash” (of changes to be committed)
    • Some configuration
    • (a few other things not mentioned here).
  • Except for the working copy itself, everything in a Git repository is stored in the “.git” directory within your working copy.
  • A “bare” Git repository doesn’t have a working copy (or stash). The data that would normally be inside “.git” is instead contained directly inside the repository root folder. A repository hosted on a server is often a “bare” repository; when you clone it, your clone will not be bare, so that you can perform checkouts / make commits etc.
  • Because you make commits to a local branch, a commit operation does not contact the origin server. This means that commits may occur in the origin repository simultaneously. A merge will be needed before the local branch commits can be pushed to the remote repository.
  • Git versions represent a snapshot of the repository state. They are identified by an SHA-1 checksum (of both file contents and some metadata, including version creation date and author). A Git version has a preceding version, and may have multiple preceding versions if it is the result of a merge.
  • To avoid complicating the commit history, you can rebase commits made to your local branch, which means that the changes they make are re-applied after the changes from remote commits. This re-writes the history of your commits, effectively making it appear that your commits were performed after the remote changes, instead of interleaved with them. (You can’t rebase commits after you have pushed them).

C++ and pass-by-reference-to-copy

I’m sure many are familiar with the terms pass-by-reference and pass-by-value. In pass-by-reference a reference to the original value is passed into a function, which potentially allows the function to modify the value. In pass-by-value the function instead receives a copy of the original value. C++ has pass-by-value semantics by default (except, arguably, for arrays) but function parameters can be explicitly marked as being pass-by-reference, with the ‘&’ modifier.

Today, I learned that C++ will in some circumstances pass by reference to a (temporary) copy.

Interjection: I said “copy”, but actually the temporary object will have a different type. Technically, it is a temporary initialised using the original value, not a copy of the original value.

Consider the following program:

#include <iostream>

void foo(void * const &p)
{
    std::cout << "foo, &p = " << &p << std::endl;
}

int main (int argc, char **argv)
{
    int * argcp = &argc;
    std::cout << "main, &argcp = " << &argcp << std::endl;
    foo(argcp);
    foo(argcp);
    return 0;
}

What should the output be? Naively, I expected it to print the same pointer value three times. Instead, it prints this:

main, &argcp = 0x7ffc247c9af8
foo, &p = 0x7ffc247c9b00
foo, &p = 0x7ffc247c9b08

Why? It turns out that what we end up passing is a reference to a temporary, because the pointer types aren’t compatible. That is, a “void * &” cannot be a reference to an “int *” variable (essentially for the same reason that, for example, a “float &” cannot be a reference to a “double” value). Because the parameter is tagged as const, it is safe to instead pass a temporary initialised with the value of of the argument – the value can’t be changed by the reference, so there won’t be a problem with such changes being lost due to them only affecting the temporary.

I can see some cases where this might cause an issue, though, and I was a bit surprised to find that C++ will do this “conversion” automatically. Perhaps it allows for various conveniences that wouldn’t otherwise be possible; for instance, it means that I can choose to change any function parameter type to a const reference and all existing calls will still be valid.

The same thing happens with a “Base * const &” and “Derived *” in place of “void * const &” / “int *”, and for any types which offer conversion, eg:

#include <iostream>

class A {};

class B
{
    public:
    operator A()
    {
        return A();
    }
};

void foo(A const &p)
{
    std::cout << "foo, &p = " << &p << std::endl;
}

int main (int argc, char **argv)
{
    B b;
    std::cout << "main, &b = " << &b << std::endl;
    foo(b);
    foo(b);
    return 0;
}

Note this last example is not passing pointers, but (references to) object themselves.

Takeaway thoughts:

  • Storing the address of a parameter received by const reference is probably unwise; it may refer to a temporary.
  • Similarly, storing a reference to the received parameter indirectly could cause problems.
  • In general, you cannot assume that the pointer object referred to by a const-reference parameter is the one actually passed as an argument, and it may not exist once the function returns.

Maven in 5 minutes

Despite having been a professional Java programmer for some time, I’ve never had to use Maven until now, and to be honest I’ve avoided it. Just recently however I’ve found myself need to use it to build a pre-existing project, and so I’ve had to had to get acquainted with it to a small extent.

So what is Maven, exactly?

For some reason people who write the web pages and other documentation for build systems seem to have trouble articulating a meaningful description of what their project is, what it does and does not do, and what sets it apart from other projects. The official Maven website is no exception: the “what is Maven?” page manages to explain everything except what Maven actually is. So here’s my attempt to explain Maven, or at least my understanding of it, succintly.

  • Maven is a build tool, much like Ant. And like Ant, Maven builds are configured using an XML file. Whereas for Ant it is build.xml, for Maven it is pom.xml.
  • Maven is built around the idea that a project depends on several libraries, which fundamentally exist as jar files somewhere. To build a project, the dependencies have to be download. Maven automates this, by fetching dependencies from a Maven repository, and caching them locally.
  • A lot of Maven functionality is provided by plugins, which are also jar files and which can also be downloaded from a repository.
  • So, yeah, Maven loves downloading shit. Maven builds can fail if run offline, because the dependencies/plugins can’t be downloaded.
  • The pom.xml file basically contains some basic project information (title etc), a list of dependencies, and a build section with a list of plugins.
  • Dependencies have a scope – they might be required at test, compile time, or run time, for example.
  • Even basic functionality is provided by plugins. To compile Java code, you need the maven-compiler-plugin.
  • To Maven, “deploy” means “upload to a repository”. Notionally, when you build a new version of something with Maven, you can then upload your new version to a Maven repository.
  • Where Ant has build targets, Maven has “lifecycle phases”. These include compile, test, package and deploy (which are all part of the build “lifecycle”, as opposed to the clean lifecycle). Running a phase runs all prior phases in the same lifecycle first. (Maven plugins can also provide “goals” which will run during a phase, and/or which can be run explicitly in a phase via the “executions” section for the plugin in the pom.xml file).
  • It seems that the Maven creators are fans of “convention over configuration”. So, for instance, your Java source normally goes into a particular directory (src/main/java) and this doesn’t have to specified in the pom file.

Initial Thoughts on Maven

I can see that using Maven means you have to expend minimal effort on the build system if you have a straight-forward project structure and build process. Of course, the main issue with the “convention over configuration” approach is that it can be inflexible, actually not allowing (as opposed to just not requiring) configuration of some aspects. In the words of Maven documentation, “If you decide to use Maven, and have an unusual build structure that you cannot reorganise, you may have to forgo some features or the use of Maven altogether“. I like to call this “convention and fuck you (and your real-world requirements)”, but at least it’s pretty easy to tell if Maven is a good fit.

However, I’m not sure why a whole new build system was really needed. Ant seems, in many respects, much more powerful than Maven is (and I’m not saying I actually think Ant is good). The main feature that Maven brings to the table is the automatic dependency downloading (and sharing, though disk space is cheap enough now that I don’t know if this is really such a big deal) – probably something that could have been implemented in Ant or as a simple tool meant for use with Ant.

So, meh. It knows what it can do and it does it well, which is more than can be said for many pieces of software. It’s just not clear that its existence is really warranted.

In conclusion

There you have it – my take on Maven and a simple explanation of how it works that you can grasp without reading through a ton of documentation. Take it with a grain of salt as I’m no expert. I wish someone else had written this before me, but hopefully this will be useful to any other poor soul who has to use Maven and wants to waste as little time as possible coming to grips with the basics.