Mysql, and C99 aliasing rules. A detective story.

When running mythfrontend after upgrading Mysql (to 5.1.40) I got:

Driver error was [2/1210]:
QMYSQL3: Unable to execute statement
Database error was:
Incorrect arguments to mysql_stmt_execute

After a bit of Google-ing I found a Myth bug ticket which seems to refer to the same (or a very similar) problem. The problem is apparently due to a Mysql bug and has something to do with an invalid cast and gcc optimizations, which makes it easy enough to work around – except of course that the bug was apparently fixed many versions ago.

However, conveniently, changing my configure options for Mysql (from -O3 to -O1 in my CFLAGS and CXXFLAGS) does make the problem go away. I suspected alias analysis to be the problematic optimization, so I checked whether “-O3 -fno-strict-aliasing” was enough – and it was. I resolved to track down the bug.

I started by compiling Mysql twice, in two different trees – one with my regular options and one with -fno-strict-aliasing. Then, I copied object files from the first tree to the second and repeated “make install” until the problem appeared (actually, I had to copy the .o file, as well as the stupidly hidden .libs/*.o file that libtool creates, and then touch the *.lo file; the build system is broken enough that just changing an .o file won’t cause the library to rebuilt). Fortunately I found the right file first try: it was libmysql.o (or rather, libmysql.c).

Unfortunately, libmysql.c is over  5000 lines long, and I didn’t know which function was causing the problem, so I had no idea where to look in the code. I proceeded by compiling the file both with and without “-fno-strict-aliasing” and with the “-save-temps” option so I could grab the generated assembly (i.e. libmysql.s) for each case. Then I did a manual binary search by hand-editing functions from one file into the other, assembling it and re-building the library until I was able to find which function had the problem. It turned out to be “cli_stmt_execute”. At that point I decided to look through the generated assembly and see whether I could spot where it didn’t seem to match up with what was expected.  First, though, I got the size of the function down by marking various other static function that it was calling with __attribute__((noinline)), each time re-testing to make sure that the problem still occurred.

Anyway, enough with the suspense. Here is the problematic line of code (2553 in libmysql.c):

    for (param= stmt->params; param < param_end; param++)
        store_param_type((char**) &net->write_pos, param);

The problem of course is the cast “(char **)” which casts to an incompatible type, thus the value net->write_pos is accessed as a “char *” when it is declared as a “uchar *” (“uchar” is a typedef for “unsigned char”). Now “char” and “unsigned char” are compatible types, but “char *” and “unsigned char *” are not. That means, if you access some value via a “char *” reference then you must always do so and never through any incompatible reference including “unsigned char *”. Doing so breaks the strict aliasing rules.

In this case, store_param_type is actually meant to modify net->writepos:

    static void store_param_type(char **pos, MYSQL_BIND *param)
        uint typecode= param->buffer_type | (param->is_unsigned ? 32768 : 0);
        int2store(*pos, typecode);
        *pos+= 2;

But, as gcc inlines the call, it effectively sees the “*pos+=2” as updating a location that must be different to the source location (net->write_pos) seeing as the types are incompatible, in other words, it effectively sees “*pos = *otherpos + 2” (where otherpos is an “unsigned char **”). Then, it sees that *otherpos must be constant for the duration of the loop, seeing as a value of a compatible type is not written. This means it can defer the statement and collapse its multiple instances (one per iteration of the loop) to a single instance. The result? kaboom.

The solution? Change the type of the “pos” parameter for store_param_type() from “char **” to “unsigned char **” (and remove the cast from the call in cli_stmt_execute()). The function isn’t used anywhere else so this makes sense even if it wasn’t causing a bug.

And the lesson to be learnt from this: be very careful with pointer casts. Avoid them whenever possible. And learn, I mean really learn the aliasing rules. There are way too many pieces of software out there written by people who do not understand them and the result is a buggy software.

Mysql bug filed.


4 thoughts on “Mysql, and C99 aliasing rules. A detective story.

  1. This was great work. I ran into this bug today and it was nice to find a patch, and to find such a great description of what the problem was. You did some wonderful debugging and I really, really appreciate your efforts.

  2. The REAL problem is C. C’s aliasing rules just do not fit the the low level programming model. That’s what you get from building hack onto hack onto hack in a programming language. C is dangerous and unintuitive and yet optimizers are hindered by the crazy pointer semantics and the very weak type system. Ada is just superior in every way… Unfortunately nobody uses it, thanks to the unprofessional UNIX hacker culture. I hate computing.

  3. One wouldn’t need to change much to make C89’s aliasing rules suitable for low-level programming. The main requirement would be to recognize that quality implementations intended to be suitable for low-level programming, it must do a couple of things:

    Allow a pointer or lvalue which is freshly derived from one of another type to be used to access the object from which it is derived, and
    Recognize any effects that “volatile” accesses may have on the target machine state which could be accommodated by synchronizing the abstract machine state with the logical memory state seen by the target CPU, before and after the volatile access.

    Unfortunately, some compiler writers have interpreted the fact that the Standard doesn’t require that special-purpose compilers support low-level constructs that aren’t needed for their specialized purposes as an indication that low-level code that requires such constructs is somehow “broken”.

Leave a Reply to davmac Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.