With several new compiler releases out, it’s time to update my compiler benchmark results (last round of results here, description of the benchmark programs here). Note that the tests were run on a different machine this time and the number of iterations was tweaked, so the numerical results aren’t comparable with the previous round.
Without further ado:
So, what’s interesting in this set of results? In general, LLVM and GCC are now in serious competition for dominance. GCC canes LLVM in bm4, where it appears to generate very good “memmove” code, whereas LLVM generates much more concise but apparently also much slower code. GCC also wins in bm8 (trivial loop removal): neither compiler actually performs this optimization, but GCC apparently generates faster code. On the other hand, LLVM beats GCC quite handily in bm6 (a common subexpression elimination problem).
GCC 4.6.2 improves quite a bit over 4.5.2 in bm5 (essentially a common subexpression refactoring test). However, it’s slightly worse in bm3 and for some reason there is a huge drop in performance for the bm7 test (stack placement of returned structure).
The Firm suite, surprisingly, mostly loses ground with 1.20.0 and remains uncompetitive.
Edit 25/03/12: I’ve noticed a flaw in bm6, which when corrected causes GCC to perform much worse – about 0.5 seconds rather than the 0.288 reported above.

Would you publish the tests on github maybe?
Ok, here: https://github.com/davmac314/ccomp-microbench
Hello Dav, would you consider publishing these compiler tests?
I’d be curious about the code and new compiler versions. Is it published?
Did you read the reply above… the code is on github. I haven’t run it with newer compiler versions, but you can do that yourself 🙂
Hello Dav, excuse the spam. I had sent these posts from my phone, which failed to show the already posted comments. Greetings.
Ok. My name is not Dav, thanks.
Hello Davin (I hope that is your name?),
for the record, some results as of 2018:
I ran clang 3.8.1-24 against gcc 6.3.0 on Debian x64:
wall time in seconds (five runs each)
bm1.c-gcc: 1.27 1.27 1.28 1.27 1.27
# clang seems to do (unnecessary?) stuff before returning
bm1.c-clang: 1.59 1.59 1.60 1.59 1.64
# shift not simplified!
bm10.c-gcc: 0.25 0.25 0.25 0.24 0.24
bm10.c-clang: 0.01 0.01 0.01 0.01 0.01
bm2.c-gcc: 0.98 0.99 0.99 1.00 1.05
bm2.c-clang: 1.00 0.99 1.00 0.98 0.98
bm3.c-gcc: 0.49 0.49 0.49 0.49 0.49
bm3.c-clang: 0.49 0.49 0.49 0.49 0.49
bm4.c-gcc: 0.56 0.45 0.57 0.55 0.54
bm4.c-clang: 0.49 0.49 0.64 0.66 0.51
# “NumSift”
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21485
# still open as of 201803
bm5.c-gcc: 0.14 0.13 0.13 0.13 0.13
bm5.c-clang: 0.08 0.08 0.08 0.08 0.08
# redundant && || not eliminated
# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32306
# still open as of 201803
bm6.c-gcc: 0.31 0.31 0.31 0.31 0.31
bm6.c-clang: 0.17 0.17 0.17 0.17 0.17
# gcc’s issues seem to be resolved
bm7.c-gcc: 0.94 0.90 0.90 0.94 0.93
bm7.c-clang: 0.93 0.90 0.93 0.93 0.94
bm8.c-gcc: 0.25 0.25 0.25 0.25 0.25
bm8.c-clang: 0.24 0.25 0.24 0.25 0.24
bm9.c-gcc: 0.92 0.93 0.92 0.92 0.92
bm9.c-clang: 0.87 0.87 0.86 0.86 0.88
It seems many of gcc’s issues are still open.
Great! Yes, that’s my name, although on this blog I usually go by “davmac”.
It’s interesting that for bm10, clang seems to partially unroll the loop (by 16 iterations) – which makes it much faster, but it still doesn’t perform the ultimate optimisation of removing the loop altogether.
For bm7 gcc is as good as clang now but as far as I can tell neither compiler optimises fully by storing the result of the call to foo() directly “in place” for where it can be passed to bar() – both actually copy the result (gcc emits a call to memcpy, clang uses “rep movsq”).
The benchmarks as a whole are a little unfair to GCC, because some of them come from GCC bugs. It is much more difficult to find good Clang bugs to make tests out of because their bug database is not as well organised.
Incidentally your results for bm2 appear wrong. In my own tests clang/llvm successfully performs the tested optimisation, where gcc fails miserably:
https://godbolt.org/g/QQWqT8
Indeed, the results for bm2 are wrong.
That’s because I wrongly assumed bm2 and bm3 should use the same structs.
In bm1 I don’t understand why clang 3.8 does anything more than mov eax, 100; retq
https://godbolt.org/g/4bPsh5
But that seems fixed by more recent clang (4.0+)
In bm10 both compilers can improve, as you said, by realising that after a certain number of shifts the value is directly compile-time computable.