Edit 1/5/09: I’ve kind of re-done this. I wasn’t happy with it and some of the results were incorrect. I’ve also added two new benchmark programs since the first version of this post.
Well gcc 4.4.0 has been released now (with a few bugs, of course…) and I thought it’d be interesting to do some benchmarks, especially as there are now a few open source compilers out there (gcc, llvm, libfirm/cparser, pcc, ack; I’m not so interested in the latter two just now). Now I don’t have access to SPEC so I cooked up just a few little programs specifically designed to test a compiler’s ability to perform certain types of optimization. Eventually, I might add some more tests and a proper harness; for now, here are the results (smaller numbers are better):
Thankfully, it looks like gcc 4.4.0 performs relatively well (compared to other compilers) in most cases. Notable exceptions are bm5 (where gcc-3.4.6 does better), bm3 and bm6 (where llvm-2.5 does better, especially in bm6).
llvm-2.5 does pull ahead in a few of the tests, but fails dismally in bm4, which tests how well the compiler manages a naive memcpy implementation. llvm-2.5 generated code is more than four times slower than gcc generated code in this case, which is ridiculously bad. If it weren’t for this result, llvm would be looking like the better compiler.
Sadly, none of the compilers I tested did particularly well at optimising bm2. In theory bm2 run times could be almost identical to bm3 times, but no compiler yet performs the necessary optimizations. This is a shame because the potential speedup in this case is obviously huge.
It’s surprising how badly most compilers do at bm4, also.
So, nothing too exciting in these results, but it’s been an interesting exercise. I’ll post benchmark source code soon (ok, probably I won’t. So just yell if you want it).
General notes on the benchmarks
Firstly, I’m trying to test the compilers’ ability to optimize specific constructs. These benchmarks are small and self-contained; they are not complete programs. Some may argue that this is not a good way to do a benchmark, as the tests are artificial. Well, firstly, these tests allow us to see specifically what the weaknesses in certain compilers are, and secondly, it’s my contention that a compiler that does well in most of these tests will do well in most real-world tests.
Note that there is no floating-point in these benchmark programs. I might add some in at some point.
So what do the various benchmarks actually test?
bm1 just tests that case labels in a switch…case statement whose values lie outside the range of the switch’d datatype are optimized away. gcc 3.4.6 doesn’t do this, which is why it compares badly against gcc 4.3.3 and 4.4.0 (which do).
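To make the bm1 idea concrete, here is a minimal sketch of the kind of construct involved. It is not the actual benchmark source: the function name classify and the case values are made up for illustration. The unsigned char is widened to an int before the switch so that the out-of-range labels compile without a range warning; a compiler that tracks value ranges can still see that only 0..255 is possible.

```c
/* Hypothetical stand-in for bm1; names and values are illustrative only. */
int classify(unsigned char c)
{
    int v = c;               /* v can only ever be in the range 0..255 */

    switch (v) {
    case 1:
        return 10;
    case 2:
        return 20;
    case 300:                /* can never match: 300 > 255 */
        return 30;
    case 400:                /* can never match: 400 > 255 */
        return 40;
    default:
        return 0;
    }
}
```

A compiler that knows the range of v can drop the 300 and 400 cases (and any comparisons generated for them) entirely, leaving only the two reachable cases and the default.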
bm2 tests alias analysis of the compiler. It has a function which takes an argument pointer to a certain large structure s1, but then allocates space for a slightly smaller structure s2 on the stack (as a local variable). It copies memory (using memcpy) from s1 to s2, then makes a modification of one field of s2, before finally copying s2 back to s1. This could be optimized by just changing the field directly in s1 without doing the two memory copy operations; however, none of the tested compilers manage to do this.
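The bm2 pattern described above looks roughly like the following. This is a sketch under assumptions, not the benchmark itself: the struct layouts, the sizes, the field name and the functions update and update_ideal are all invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layouts: s2 is "slightly smaller" than s1 and both start
   with the same field, so the leading bytes line up. */
struct s1 { int field; char payload[4096]; };
struct s2 { int field; char payload[4000]; };

/* What the benchmark does: copy out, modify one field, copy back. */
void update(struct s1 *p)
{
    struct s2 tmp;

    assert(sizeof tmp <= sizeof *p);  /* the copy must fit inside s1 */
    memcpy(&tmp, p, sizeof tmp);      /* s1 -> local s2 */
    tmp.field = 42;
    memcpy(p, &tmp, sizeof tmp);      /* local s2 -> s1 */
}

/* What an ideal optimizer could reduce update() to. */
void update_ideal(struct s1 *p)
{
    p->field = 42;
}
```

Both functions leave the structure in the same state; the difference is that the first one moves several kilobytes around twice to get there, which is why the potential speedup is so large.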
bm3 is like bm2, but the s2 structure is much smaller. Most compilers then seem to be able to perform decent optimization; gcc 4.3.3 notably fails.
bm4 is a simple memory copy written using a while loop with two char pointers. It’s amazing how much difference there is between the different compilers. Notably, a simple hand-coded assembly “rep movsb” does way, way better than any of the compilers (0.036 seconds). [Edit 2012/2/29: That just can’t be right. Need to re-check].
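For reference, a naive copy loop of the sort bm4 describes can be sketched as below. The function name naive_copy and its exact signature are my own choices here; the original benchmark may differ in detail.

```c
#include <stddef.h>

/* Byte-by-byte copy using a while loop and two char pointers. */
void naive_copy(char *dst, const char *src, size_t n)
{
    while (n--) {
        *dst++ = *src++;
    }
}
```

The interesting question is what the compiler turns this into: a plain byte loop, a wider word-at-a-time or vectorized loop, or a call to the library memcpy. Since dst and src are not declared restrict, the compiler presumably has to allow for the two regions overlapping, which may limit what it can safely do.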
bm5 is a small testcase taken from a gcc PR. It contains branches and potential for common subexpression elimination; it probably puts some pressure on the register allocator. The regression affects the gcc 4.3/4.4 series, which is reflected in the benchmark results (gcc 3.4.6 really shines).
bm6 is another gcc PR. It’s essentially testing common subexpression elimination in combination with loop invariant hoisting. gcc-3.4.6 does shockingly badly, but gcc-4.3.3 and gcc-4.4.0 still get whipped by llvm-2.5.
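I’m not reproducing the PR testcase here, but the general combination of optimizations that bm6 is after can be illustrated with a made-up loop like this one (the function sum_scaled and the expression inside it are purely hypothetical):

```c
/* Hypothetical illustration only; this is not the code from the gcc PR. */
int sum_scaled(const int *a, int n, int x, int y)
{
    int total = 0;
    int i;

    for (i = 0; i < n; i++) {
        total += a[i] * (x * y + 1);  /* x * y + 1 does not change in the loop */
        total += (x * y + 1);         /* same subexpression again */
    }
    return total;
}
```

Common subexpression elimination should notice that x * y + 1 appears twice, and loop invariant hoisting should then compute it once before the loop rather than on every iteration. A compiler that manages only one of the two still leaves redundant work inside the loop.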
bm7 calls a function which returns a (large) structure and passes the result straight into another function. The ABI allows for returning the structure directly into the stack location from which it is then used as a parameter, thus avoiding the need to copy the structure; however, not all compilers manage this.
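The shape of bm7 is roughly as follows. Again this is a sketch with invented names (struct big, make_big, consume, run) and an arbitrary size, not the benchmark source.

```c
/* Hypothetical "large" structure; the real size in bm7 is not given. */
struct big { int data[256]; };

/* Returns a large structure by value. */
struct big make_big(int seed)
{
    struct big b;
    int i;

    for (i = 0; i < 256; i++) {
        b.data[i] = seed + i;
    }
    return b;
}

/* Takes a large structure by value. */
int consume(struct big b)
{
    int sum = 0;
    int i;

    for (i = 0; i < 256; i++) {
        sum += b.data[i];
    }
    return sum;
}

/* The result of one call is passed straight into the other. */
int run(int seed)
{
    return consume(make_big(seed));
}
```

On common ABIs a structure this large is returned through a hidden pointer to memory supplied by the caller, so in run the compiler is free to point that at the outgoing argument slot for consume. A compiler that doesn’t do this returns into a temporary and then copies the whole structure into the argument area.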
Conclusion: free compilers still suck a bit, but at least gcc 4.4.0 sucks less than earlier versions, and llvm-2.5 is looking halfway decent as well.