I'm not a compiler expert but if it's a "very simple loop" is it still too complex for the compiler to make good machine code? Did they use a bad compiler on purpose? Or are computers just not yet fast enough to do a good job with very simple loops in practical compilers?
> are computers just not yet fast enough to do a good job with very simple loops in practical compilers?
The short answer to this question is 'yes', but there are some extenuating factors:
- Although we could do interesting things with unlimited computational resources, the current crop of c compilers is simply not very good, compared with what's possible today.
- Performance is always workload-dependent; the compiler has been somewhat shafted here because it doesn't know what sorts of inputs the function usually receives. The compiler output is better than the 'improved' code for some inputs. (It's possible you could get a better result from the existing compilers and c code just by using profile-guided optimisation.)
- The difference tends to be more pronounced in simple loops than in large ones, and this is a contrived use-case. There is not a factor of 6 of performance hiding in optimised c code that could be recovered by doing the sorts of optimisations done by the op; probably something more like 10-20%.
One prominent example: the use of intermediate representations based on basic blocks introduces redundancies that increase the complexity of the compiler, which then needs matching redundancies in its optimisation passes. You can see the redundancy manifest here https://godbolt.org/z/8o3oe39hh as different code generation for f and g. (They may change the result of this particular test in the future, but it seems unlikely that the disease—rather than the symptom—can be treated without a complete rearchitecture.)
E-graphs ameliorate phase ordering issues and allow for exploring the space of non-monotonic rewrites; recent research makes them computationally viable.
Put simply: it's legacy. Gcc and llvm are millions of lines of code, and they assume a particular architecture. Changing that is not easy.
Another issue, which I did not mention (but which is pertinent) is that c is a poor language for compilation. (Fran allen famously said 'c has destroyed our ability to advance the state of the art'.) In some respects, the optimisations performed automatically by modern high-performance cpus are more sophisticated than those done by c compilers, howbeit with less reach; the only reason they are able to do this is that they have direct control of the execution and hence have a greater ability to abstract over the side effects which are rampant in most c code.
E-graphs are interesting, but one still has to deal with combinatorial explosions. Are you alluding to some powerful search heuristic?
Your example touches on the problems of an inflexible ABI, namely caller-saved registers and the unknowability of side effects of external functions. Very weird that it can't reorder `r = x+y`, since the assignment has no "observable" effect until `return r` (which the assignment dominates), and there's no real dependency between the (assignment, return) pair and eff().
I looked at it closer. In C, assigning to a variable is a side effect. For an extern function not annotated __attribute__((pure)), the compiler has to assume the call generates side effects, which prevents it from reordering the assignment and the function call. And since the x86-64 ABI has caller-saved registers, in the case where it calls eff() first, it has to save x and y and restore them after the call.
My example has nothing whatsoever to do with abi, and everything to do with ir. f and g are exactly semantically equivalent, and this equivalence is trivial to show; that the compilers generate different code for each demonstrates redundancies in their ir.
The problem is the author of the article is making some huge implicit assumptions that the compiler can't possibly know about.
Consider this statement: "However, we know some things about this loop. We know that the only time we break out of it is when we hit the null terminator (’\0’). The code clang generates checks for the null terminator first, but this makes no sense."
This statement contains huge assumptions about the lengths of the input strings and the frequency of the letters 's' and 'p' in the input. And then he has the chutzpah to call the compiler's failure to read his mind about this "making no sense."
Good first effort by the author, but has not sufficiently thought through the problem.
That's the thing: a C compiler has all the information it needs to know that '\0' can be processed by the loop at most once per call (because the function returns), but there's no upper bound on the number of times other characters are seen in the loop.
I might be missing a reason that this information is opaque to the compiler though, in which case this section of the article is indeed lacking, but I'm happy to learn :)
It's not just that the C compiler lacks the information... but the reader of this article also lacks this information.
String length tells you the frequency with which nul terminators will be found. Without knowing frequency of occurrence of the nul terminator, 's', and 'p' then you cannot know which one occurs most often.
Consider two benchmark cases:
(1) every string tested contains exactly one character
(2) every string tested is 1MB long and is composed entirely of 's' and 'p'.
The author's first "optimization" assumes nul is rare. It would make benchmark (1) worse, and (2) better.
The article is a good example of "specification is hard, code is easy." He insufficiently specified the problem to be solved, and his test cases contained information not in the code and not in the text of the article.
It's not the upper bound that matters but the frequency. How frequently should the compiler assume an 's' appears in the dataset, or any other character?
We know that E[# of '\0' in a string] == 1.
But what is E[# of 's' in a string]? Is it greater or less than E[# of '\0' in a string], and how should the compiler know this?
You haven't given the compiler any reason to assume that 's' or 'p' will appear more often than '\0'.
OK, so they were abusing the benchmark, i.e. the compiler's output would be faster on less contrived test data? Do I have to look up what FDO, PGO, or cmov are to understand the answer?
The compiler would generate different code if it knew the rates at which branches were taken.
If a branch is almost always taken or almost never taken, a compiler will want to emit a jump. The frontend will be able to predict the jump with high probability, and a successfully predicted jump is "free." The cost of an occasional misprediction is amortised by the near-zero cost of the many successful predictions.
If a branch is hard to predict (and taking versus not taking it would load a different value into a register/memory location), the compiler wants to emit a conditional move (cmov). A conditional move is slightly "more expensive" in the backend because the CPU has to wait for the condition to resolve before it can execute instructions dependent on the output. However, it is much cheaper than a branch that mispredicts around half the time.
FDO (feedback-directed optimization) or PGO (profile-guided optimization) means "run the code on some sample input and profile how often branches are taken/not taken." It gives the compiler more information to generate better code.
The problem with the blog post is that the compiler has no idea what the function's input data will look like. It (arbitrarily) chose to generate branches instead of cmovs. However, if the benchmark input is better suited for cmovs, then the benchmark will (wrongly) show that the compiler generates "slow" assembly. But that's not a fair test, because with PGO/FDO the compiler would generate equivalent assembly to the "fast" assembly (actually, probably faster). Finally, the human (OP) is using their knowledge of the benchmark data "unfairly" to write better assembly than the compiler.
The takeaway is: most of the time, one can't optimize code/assembly in a vacuum. You also need to know what the input data and access patterns look like. FDO/PGO gives the compiler more data to understand what the input data/access patterns look like.
Thank you, this is an amazingly comprehensive answer! Now I wonder what the workflow for using these compiler features would be. Like, if I am a normal or bad C programmer, and I write my program, use valgrind to check that it doesn't have obvious problems, and compile it with -march=native or whatever, can I then add some step to the workflow to somehow re-compile it in a way that uses the branching or access patterns of some examples that I let it process for that purpose?