Mysql and C99 aliasing rules. A detective story.

When running mythfrontend after upgrading Mysql (to 5.1.40) I got:

Driver error was [2/1210]:
QMYSQL3: Unable to execute statement
Database error was:
Incorrect arguments to mysql_stmt_execute

After a bit of Google-ing I found a Myth bug ticket which seems to refer to the same (or a very similar) problem. The problem is apparently due to a Mysql bug, something to do with an invalid cast and gcc optimizations, which makes it easy enough to work around; except, of course, that the bug was supposedly fixed many versions ago.

However, conveniently, changing my configure options for Mysql (from -O3 to -O1 in my CFLAGS and CXXFLAGS) does make the problem go away. I suspected alias analysis to be the problematic optimization, so I checked whether “-O3 -fno-strict-aliasing” was enough – and it was. I resolved to track down the bug.

I started by compiling Mysql twice, in two different trees – one with my regular options and one with -fno-strict-aliasing. Then, I copied object files from the first tree to the second and repeated “make install” until the problem appeared (actually, I had to copy the .o file, as well as the stupidly hidden .libs/*.o file that libtool creates, and then touch the *.lo file; the build system is broken enough that just changing an .o file won’t cause the library to be rebuilt). Fortunately I found the right file on the first try: it was libmysql.o (or rather, libmysql.c).

Unfortunately, libmysql.c is over 5000 lines long, and I didn’t know which function was causing the problem, so I had no idea where to look in the code. I proceeded by compiling the file both with and without “-fno-strict-aliasing” and with the “-save-temps” option so I could grab the generated assembly (i.e. libmysql.s) for each case. Then I did a manual binary search by hand-editing functions from one file into the other, assembling it and re-building the library until I found which function had the problem. It turned out to be “cli_stmt_execute”. At that point I decided to look through the generated assembly and see whether I could spot where it didn’t match up with what was expected. First, though, I got the size of the function down by marking various other static functions that it was calling with __attribute__((noinline)), each time re-testing to make sure that the problem still occurred.
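
For reference, the trick looks like this (a standalone sketch; “helper” is a placeholder, not one of the real libmysql functions):

    #include <stdio.h>

    /* __attribute__((noinline)) stops gcc from folding a static function
       into its caller, keeping it out of the caller's generated assembly: */
    static int __attribute__((noinline)) helper(int x)
    {
        return x * 2;
    }

    int main(void)
    {
        printf("%d\n", helper(21));
        return 0;
    }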

Anyway, enough with the suspense. Here is the problematic line of code (2553 in libmysql.c):

    for (param= stmt->params; param < param_end; param++)
        store_param_type((char**) &net->write_pos, param);

The problem of course is the cast “(char **)”, which casts to an incompatible type: the value net->write_pos is accessed as a “char *” when it is declared as a “uchar *” (“uchar” is a typedef for “unsigned char”). Now, “char” and “unsigned char” may freely alias each other (any character type may be used to access any object), but “char *” and “unsigned char *” are distinct pointer types with no such exception. That means an object declared as “unsigned char *” must not be accessed through a “char *” lvalue; doing so breaks the strict aliasing rules.
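
To make the rule concrete, here is a minimal illustration (my own example, not from the Mysql source):

    unsigned char *q;       /* the object has type "unsigned char *" */

    void ok(void)
    {
        unsigned char **pp = &q;
        *pp += 1;           /* fine: q is accessed via its declared type */
    }

    void bad(void)
    {
        char **pp = (char **) &q;
        *pp += 1;           /* undefined: a "char *" lvalue accesses an
                               object declared "unsigned char *" */
    }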

In this case, store_param_type is actually meant to modify net->write_pos:

    static void store_param_type(char **pos, MYSQL_BIND *param)
    {
        uint typecode= param->buffer_type | (param->is_unsigned ? 32768 : 0);
        int2store(*pos, typecode);
        *pos+= 2;
    }

But, as gcc inlines the call, it effectively sees the “*pos += 2” as updating a location that must be different from the source location (net->write_pos), since the types are incompatible; in other words, it effectively sees “*pos = *otherpos + 2” (where otherpos is an “unsigned char **”). Then, it concludes that *otherpos must be constant for the duration of the loop, since no value of a compatible type is ever written to it. This means it can defer the statement and collapse its multiple instances (one per iteration of the loop) into a single instance. The result? Kaboom.
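
Here is a self-contained miniature of the same failure mode (my own reduction, not code from Mysql; whether it actually misbehaves depends on the compiler version and optimization level):

    unsigned char *write_pos;      /* stands in for net->write_pos */

    static void bump(char **pos)
    {
        *pos += 2;                 /* a store through a "char *" lvalue */
    }

    long moved(int n)
    {
        unsigned char *start = write_pos;
        int i;
        for (i = 0; i < n; i++)
            bump((char **) &write_pos);   /* undefined behaviour */
        /* Under strict aliasing, gcc may assume the stores above cannot
           have modified write_pos, and so may compute 0 here: */
        return (long) (write_pos - start);
    }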

The solution? Change the type of the “pos” parameter for store_param_type() from “char **” to “unsigned char **” (and remove the cast from the call in cli_stmt_execute()). The function isn’t used anywhere else so this makes sense even if it wasn’t causing a bug.
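
Applied to the code above, the patch is simple (my reconstruction of the fix just described, not the exact upstream commit):

    static void store_param_type(unsigned char **pos, MYSQL_BIND *param)
    {
        uint typecode= param->buffer_type | (param->is_unsigned ? 32768 : 0);
        int2store(*pos, typecode);
        *pos+= 2;
    }

    /* ... and the call in cli_stmt_execute() loses its cast: */
    for (param= stmt->params; param < param_end; param++)
        store_param_type(&net->write_pos, param);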

And the lesson to be learnt from this: be very careful with pointer casts. Avoid them whenever possible. And learn, I mean really learn, the aliasing rules. There are way too many pieces of software out there written by people who do not understand them, and the result is buggy software.

Mysql bug filed.


POSIX write()

Take a look at:

http://www.opengroup.org/onlinepubs/000095399/functions/write.html

What a mess – different requirements for regular files, pipes, and “other devices supporting non-blocking operation”. For pipes, there are reasons for this (atomic writes), but I think they should have been abstracted out (why can’t other devices have atomic writes? Why isn’t there a general mechanism to determine the maximum size of an atomic write for a given file descriptor?).
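
For pipes, at least, the atomic-write limit is queryable per descriptor; a minimal sketch (nothing equivalent is defined for arbitrary devices):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) != 0)
            return 1;
        /* _PC_PIPE_BUF: the largest write guaranteed atomic on this pipe */
        long limit = fpathconf(fds[0], _PC_PIPE_BUF);
        printf("atomic write limit for this pipe: %ld bytes\n", limit);
        return 0;
    }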

I also notice, and I think this is particularly stupid, that if write() is interrupted by a signal before transferring any data it returns -1 (with errno set to EINTR) rather than 0. If it transfers some data before being interrupted, it returns the number of bytes transferred. Why make a special case out of 0?!! This forces increased complexity on the application, which cannot simply treat the return value as a count of bytes written, and for which -1 is in almost any other case an abortive error.
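
Concretely, every careful caller ends up with a loop along these lines (my sketch of the usual idiom, not code from the standard):

    #include <errno.h>
    #include <unistd.h>

    /* Write the whole buffer, retrying on the special-cased -1/EINTR and
       continuing after short writes: */
    ssize_t write_all(int fd, const char *buf, size_t count)
    {
        size_t done = 0;
        while (done < count) {
            ssize_t n = write(fd, buf + done, count - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;    /* interrupted before any data: retry */
                return -1;       /* a real error */
            }
            done += (size_t) n;  /* short write: go around again */
        }
        return (ssize_t) done;
    }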

Unfortunately there is no discussion of the topic I was most interested in: atomicity/ordering of reads and writes to regular files. Consider:

  1. Process A issues a “read()” call to read a particular part of a file. The block device is currently busy so the request is queued.
  2. Process B issues a “write()” request which writes to the same part of the file as process A requested.

The question is now: can the data written by process B be returned to process A, or must process A see the data that was in the file at the time the read() call was issued? Also, is it allowed that process A might see part of the data that process B wrote, and part of what was in the file at the time of the read request?


C Compiler Benchmarks

Edit 1/5/09: I’ve kind of re-done this. I wasn’t happy with it and some of the results were incorrect. I’ve also added two new benchmark programs since the first version of this post.

Well, gcc 4.4.0 has been released now (with a few bugs, of course…) and I thought it’d be interesting to do some benchmarks, especially as there are now a few open source compilers out there (gcc, llvm, libfirm/cparser, pcc, ack; I’m not so interested in the latter two just now). I don’t have access to SPEC, so I cooked up a few little programs specifically designed to test a compiler’s ability to perform certain types of optimization. Eventually, I might add some more tests and a proper harness; for now, here are the results (smaller numbers are better):

[Results chart: run times for each benchmark and compiler; smaller is better.]

Thankfully, it looks like gcc 4.4.0 performs relatively well (compared to the other compilers) in most cases. Noticeable exceptions are bm5 (where gcc-3.4.6 does better), and bm3 and bm6 (where llvm-2.5 does better, especially in bm6).

llvm-2.5 does pull ahead in a few of the tests, but fails dismally in the 4th, which tests how well the compiler manages a naive memcpy implementation. llvm-2.5’s generated code is more than four times slower than gcc’s in this case, which is ridiculously bad. If it weren’t for this result, llvm would be looking like the better compiler.
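
A naive memcpy here means something along these lines (a sketch, not the exact bm4 source):

    #include <stddef.h>

    /* A byte-at-a-time copy; a good optimizer can vectorize this or
       recognise the idiom as memcpy: */
    void naive_memcpy(char *dst, const char *src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    }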

Sadly, none of the compilers I tested did particularly well at optimising bm2. In theory bm2 run times could be almost identical to bm3 times, but no compiler yet performs the necessary optimizations. This is a shame because the potential speedup in this case is obviously huge.

It’s surprising how badly most compilers do at bm4, also.

So, nothing too exciting in these results, but it’s been an interesting exercise. I’ll post benchmark source code soon (ok, probably I won’t. So just yell if you want it).


Mplayer version 1.0… ever?

The guys that develop MPlayer have some serious hang-up about doing an actual 1.0 release. I mean, they’ve been through 1.0pre1, pre2, pre3, pre4, pre5, -rc1 (2006-10-22, that’s over 2 years ago) and -rc2 (2007-10-7, over 1 year ago) and still we don’t have a 1.0 release.

Now, there are mailing list posts like this one which say “it’ll be ready when it’s ready” and call for the user who dared venture forth with the question (of when the next rc or release might be forthcoming) to be the release manager. I mean, look, I understand that these guys are doing a lot of hard work developing MPlayer and it is, after all, a pretty good product by open source standards (I mean, it has some actual documentation, for one thing) BUT geez, if you’re not going to actually release 1.0 in some reasonable timeline then why have “release candidates”? Was -rc2 so bad that it can’t be called 1.0? (I mean, that is the point of a “release candidate”, right? It is the candidate for the release, yes?) And if -rc2 has problems, wouldn’t it be prudent to at least do another release candidate, i.e. rc3?

Sigh. I guess I’m complaining not so much about the absence of MPlayer 1.0 (although I would certainly like it to arrive) but about the fact that these “release candidates” exist and did so for so long without any actual release occurring. I mean, you shouldn’t have a release candidate if you’re not planning (or able) to do an actual release – it kind of gives the wrong impression. (Let’s be clear. The developers aren’t at fault, at least, not most of them. I’m just kind of irked at whoever had the bright idea to bundle a tarball and call it “Mplayer 1.0pre1” in the first place…)

Comcast is screwing its customers…

Comcast is screwing its customers by blocking “spam” which isn’t spam and also preventing mail servers which are accused of “spamming” from being removed from Comcast’s blacklist. I’ve tried to respond to two support requests from Comcast subscribers recently and in both cases, my reply bounced.

I got this back from my mail server:

Reporting-MTA: dns; *********

Final-Recipient: rfc822;********@comcast.net
Action: failed
Status: 5.0.0 (permanent failure)
Diagnostic-Code: smtp; 5.4.7 - Delivery expired (message too old) 554-'IMTA06.emeryville.ca.mail.comcast.net comcast ***.***.***.*** Comcast BL004 Blocked for spam.  Please see http://help.comcast.net/content/faq/BL004' (delivery attempts: 0)

(Of course “***” are places where I’ve removed identifying information).

On following the link to http://help.comcast.net/content/faq/BL004, I get a page which suggests I can go to http://www.comcastsupport.com/rbl to get my mailserver unblocked. However, following that link redirects me to http://www.comcastsupport.com/browserupgrade/ which gives me a 404 Not Found.

Way to go, Comcast.

Update 25/8/2010: Seems like this has been fixed.

Multiple pkg-config incarnations?

Why the heck does ftp.gnome.org host (apparently two strains of) pkg-config?

http://ftp.gnome.org/pub/GNOME/sources/pkg-config/ (version 0.18 and 0.19)

http://ftp.gnome.org/pub/GNOME/sources/pkgconfig/ (version 0.8 through 0.18, though skipping 0.10)

It’s irresponsible to host out-of-date versions like that, as if pkg-config was part of the Gnome project and didn’t exist independently. At least put a README or something which mentions that this is a possibly-out-of-date mirror!

The real pkg-config distribution is here:

http://pkg-config.freedesktop.org/wiki/

with releases (including the current 0.23) here:

http://pkgconfig.freedesktop.org/releases/

Using $* in shell scripts is Nearly Always Wrong

(Bourne and Bash) Shell script writers bandy ‘$*’ about like it’s a stick of celery, but the truth is $* is nearly always the wrong thing to use. (For the clueless, $* expands to all the “positional parameters”, that is, all the parameters to the current script or function; it’s useful in cases where you want to pass the argument values for those parameters on to another program).

So why is $* wrong? Because it is expanded before word splitting, so word splitting then applies to its result. In other words, if any of the arguments to your script contains a space or other whitespace, that one argument will become two (or more) when $* is expanded. The following shell script can be used to verify this:

#!/bin/sh
# Call this script “mytouch”
touch $*

… Now try to “mytouch” a file with a space in its name (yeah, well done, you just created two files with one command).

Does quoting it work? Will “$*” solve your problems? It will fix the above example, right up to the point where you try to “mytouch” two or more files at once, and then you’ll see the fallacy (hint: “$*” always expands to a single “word”, with all the arguments joined together).

So what is the correct solution? It’s simple: use “$@” instead. Note, that’s $@, with quotes around it. Hence “$@”. This specifically expands to one word for each actual argument. It works. Hallelujah. (And while I’m blathering on about shell scripts, note that any unquoted variable expansion is also Nearly Always Wrong, for pretty much the same reasons).
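
So the fixed version of “mytouch” is just:

#!/bin/sh
# Call this script "mytouch" - "$@" passes each argument through intact.
touch "$@"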