As a long-time C programmer (20+ years), this quiz -- despite its flaws -- is a good example of how difficult it really is to write 100% portable C code.
I often hear developers talking about how C is a simple language, but it may or may not be, depending on how you define "simple". For example, a written alphabet with only two letters -- A and B -- is "simple" to grasp (there are only two letters to learn!), but it would be very difficult, in the real world, to read and write such a language.
I think C is similar: "simple" in one sense, but very, very complex in another. C could use some massive clean-up, in my opinion. Much of its design exists so that it could be ported to CPUs that haven't existed in the wild, in great enough numbers to warrant all the undefined and implementation-dependent behavior, for a long time now.
The language could be hugely simplified, in the usage sense, if much of the cruft were jettisoned.
That's a huge problem with programming languages: Designers can always add things, but removing stuff is hard. JavaScript is a great example of this. It still has automatic semicolon insertion, variable hoisting, and a broken comparator (==). Practically everyone agrees these are bad things, but they can't be changed. Doing so would break backwards compatibility, and there's a huge ecosystem of JavaScript programs that would need to be updated.
Also, language users often rebel when language designers make breaking changes. Python 3 tried to remove cruft, and look how slowly it's been adopted. Instead of adopting 3.x, people backported the features they wanted to 2.x.
I would love it if C had less cruft, but when I say that I mean, "I want C with less cruft, but with the same huge ecosystem of documentation and libraries and tools and debuggers and profilers that crufty-C has."
That's why I think the language Go is such a great development. The authors seem to have added only the minimum set of features that make the language workable for their initial needs.
As a result, these features are largely orthogonal, and the rules are simple to state and learn -- even if they at first seem a bit odd (you have to cast all numeric types to each other before they can interact).
In contrast, Scala -- another modern language I investigated recently -- seemed like quite a thicket of features, some of which seemed to be there just to mediate the interaction of other features.
Even languages that are explicitly designed to be minimal can be anything but. For example, Scheme. The call/cc function is a lot more complicated than it sounds, and it sounds complicated. People are talking about removing it from the "minimal" Scheme. (There are two versions of R7RS, the big version and the small version. If you think that call/cc is actually simple, then you haven't tried to write higher order library functions such as map.)
And Scheme's syntax for numbers was invented by Cthulhu himself.
(Python 3 was supposed to be adopted this slowly, last time I checked.)
> Much of its design was so that it could be ported to CPUs which haven't really existed in the wild in a long time in great enough numbers to warrant all the undefined and implementation dependent behavior.
A lot of the implementation-dependent behavior is now expected. Not too long ago, I was taking part here in a brief debate in the comments for a Go language related post, and the programmer was up in arms that the Go compiler didn't transparently cast things for him. I pointed out that this would result in implementation-dependent behavior, and that maybe a design goal of Go was to eliminate that.
His reply: That's what people should expect. >Of course< you should expect all these arcane things to happen when you port from one platform to another.
> I was taking part here in a brief debate in the comments for a Go language related post, and the programmer was up in arms that the Go compiler didn't transparently cast things for him
I was that programmer, and I've now changed my mind.
The whole 'less design is better' is always more palatable when it doesn't touch the things one is used to taking for granted (in my case, having a nice numeric tower).
But the union of all the common such features gives you something like Scala, which is to me obviously over-designed. Go is more like the intersection.
I agree, but I also think that C isn't as complicated as these kinds of quizzes suggest. I've been using C for over 20 years as well, and I do know most of these arcane rules, but I don't rely on that knowledge much.
I decided a long time ago that it makes no sense to trust myself and everyone who might have to maintain my code to always remember all these rules. It's a lot simpler and safer to remember a much smaller set of rules like don't mix signed and unsigned and just enable all the warnings.
One thing I do that may be controversial is to prefer fixed size types most of the time. In my opinion, the decision to use a 16, 32 or 64 bit int type is not primarily a portability or performance issue. It's a decision driven by application requirements, and I want to express these requirements as explicitly as possible in my code.
I don't think I understand... On one hand, you say that C isn't that complicated, but on the other, you say you just use a small subset because you don't trust yourself and your colleagues to understand all the C quirks...
I think C is complicated, with its 191 undefined behaviors and 52 unspecified behaviors (the source for these numbers is the paper that the quiz's author wrote describing Csmith). One must master C before knowing with certainty that the code one writes doesn't fall into one of those two black holes (undefined and unspecified behavior). For instance, almost everyone I ask tells me that dereferencing a NULL pointer yields a segmentation fault, but that's just not true. It's in fact undefined behavior.
The other thing is the rules for strict aliasing... It's pretty much impossible to convert one pointer type to another without breaking some C rule. The experts say type punning is the way to go, but type punning relies on another undefined behavior: reading a union member other than the last one written is undefined behavior.
I didn't say anything about a small subset. The rules I use are not a subset of all those special cases, they are broader rules that cover lots of potential pitfalls and are easier to remember. But C is definitely more complicated than it should be, no doubt about it.
Wait, wait -- the questions in this quiz aren't really about C, but about the form in which numbers are stored in computers and the consequences of that. C just expresses them in a raw way, but other languages also suffer from these quirks, either directly or via performance/accuracy jumps.
I was really surprised how many errors there were in the quiz.
Errors in the quiz:
3. (unsigned short)1 > -1: This will evaluate to 0 on systems where sizeof(short)<sizeof(int), and 1 on systems where sizeof(short)==sizeof(int). Yes, such systems exist. Old Cray supercomputers and various DSP processors don't have byte-addressable memory, only word-addressable, so they make all primitive types a word long.
5. SCHAR_MAX == CHAR_MAX: The person who wrote the quiz even ACKNOWLEDGES that the quiz is incorrect here and apologizes. This is implementation-dependent, it is 1 when char is signed and 0 when char is unsigned. Both types of systems exist. You can use -funsigned-char on GCC, for example.
11: int x; x << 31: This is only defined for some values on platforms where int has at least 32 bits. There exist systems where int has 16 bits, in which case this is undefined for all values. Old DOS PCs often used 16-bit ints, and I believe that ints were 16 bits when C was invented.
12: int x; x << 32: This is only undefined on systems where int has no more than 32 bits. There exist systems where int has more bits. Do a search for ILP64 if you wish to hear about such systems.
14: unsigned x; x << 31: Again, this is only defined on systems where int has at least 32 bits. See #11.
15: unsigned short x; x << 31: This one is tricky. There are four different cases, depending on the size of int and whether sizeof(short)==sizeof(int).
15a: sizeof(short) == sizeof(int): defined for all x, since x gets promoted to unsigned.
15b: sizeof(short) < sizeof(int), int has less than 32 bits: defined for no x. I don't think any such systems exist.
15c: sizeof(short) < sizeof(int), int has at least 32 bits but less than 32 more than a short: defined for some x. This is the most common, with 16-bit short and 32-bit int.
15d: sizeof(short) < sizeof(int), int has at least 32 more bits than a short: defined for all x. This is the uncommon ILP64 system.
18: int x; (short)x + 1: This is outright incorrect. Casting int to short results in undefined behavior if the value cannot be represented as a short. Truncation is only guaranteed to occur for unsigned types.
The intro text explicitly defines those datatype sizes:
> In other words, please answer each question in the context of a C compiler whose implementation-defined characteristics include two's complement signed integers, 8-bit chars, 16-bit shorts, and 32-bit ints. The long type is 32 bits on x86, but 64 bits on x86-64 (this is LP64, for those who care about such things).
I don't think it's intended as a portability test, but rather an implementation-defined vs undefined behavior test. And in practice it's safe to make assumptions about implementation-defined behavior as long as you're programming for general-purpose computers.
Regehr clarifies this in the comments section: "Regarding signed overflow being defined or not, compiler developers generally draw a sharp distinction between undefined behavior and implementation-defined behavior. 32-bit ints, 2's complement, etc. are examples of the latter and signed overflow is an example of the former. A lot of developers do not draw such a sharp distinction, which is why I made a point of asking questions about this issue."
For an example of signed overflow versus unsigned overflow, certain compilers are known to assume that int loop variables won't overflow. So "for (int i = 0; i != -1; ++i)" is transformed into "for (int i = 0; ; ++i)", since both are equal in the eyes of the C standard (both will iterate through all non-negative values that fit in an int, then both will invoke undefined behavior, so they are the same).
The funny part is that it's often better to use int exactly because of the undefined behavior on overflow. By signaling to the compiler that you don't intend to overflow a particular variable, it can optimize appropriately.
Things are pretty bad when even on the most well-known platform, with a fairly straightforward implementation of two's complement integers, the answers are not intuitive.
Fortunately, with a few simple coding guidelines you can avoid major pitfalls.
1. Always do bit operations on unsigned types.
2. Never overflow signed types.
3. Always shift less than the type width. I can name architectures which implement large shifts in three distinct ways.
4. Know that most integers get promoted to int or unsigned, so you need an explicit cast to get 64 bits on most platforms. So 1ULL << 48 is okay, 1U << 48 is bad.
5. Treat unsigned int / unsigned long as contagious, just like float and double.
Still ambiguous. Unsigned short to int or long expansion uses the internal 'widen' operator, which can either sign-extend or not (unspecified). Would have to specify a particular compiler to make the quiz fair (or add "depends upon compiler" to the choices).
No, expanding a short to int or long is never unpredictable. The C standard guarantees that the expansion will only ever occur if the resulting type can represent all values in the original type, and that the result will always have the same value as the input.
If int is 32 bits and short is 16 bits, then converting unsigned short to int will always give a value in the range [0, 0xffff]. No exceptions.
I believe that the reason people need to know these things is not because they should be playing Russian roulette with them, but because they need to know what's going on when things go accidentally wrong.
In theory, yes, but there are often cases where you can't avoid mixing signed and unsigned, for example. Think about the write() and read() syscalls, which return a signed type but are often used in contexts where buffer lengths are expressed in unsigned types.
Then you probably want to mark it with an explicit function named something like "int32_to_uint32()" which you have defined (and debugged) in one place that performs the appropriate conversion with sanity checks.
Basically, IMO, C's implicit conversions are a flaw in the spec. A more sensible language would require every conversion (aside from those involving un-dimensioned constant data) to be spelled out explicitly. Since you can't make C sensible, the next best thing is to scrupulously isolate the stupid parts.
I believe that the answers are incorrect. They assume that all true boolean expressions evaluate to 1, whereas the C standard only guarantees that they don't evaluate to 0.
It's not wrong. If you read the intro, they note that those traits are implementation defined, so they say to assume you're using an implementation with those characteristics.
> 2. To make C usable for those worried about erroneous overflows in their code.
I think you're saying that the rule allows compilers to implement -fwrapv if they choose, which sounds reasonable.
> 3. To make it easier to write C compilers for CPUs that trap on integer overflow.
> 4. To allow for performant C compilers on CPUs that use one's complement arithmetic.
I wonder how much these matter nowadays.
For optimization, the blog post you cite describes the following examples:
- "X+1 > X" to true
- "X*2/2" to "X"
- "<= loops" and "int" induction variables
On the first two, presumably these usually only come up after macro expansion and inlining, however they're still suspicious. If a function is scaling its return value and its caller is de-scaling it, it's usually a sign that the API isn't designed quite right. I'd be curious to know how often these come up.
Code using "int" induction variables to step through arrays on 64-bit targets is often sloppy. Such code won't handle very large arrays properly, due to the limited range of "int", which is a bug that may not be quickly noticed.
And for the "<= loop" itself:
for (i = 0; i <= N; ++i) { ... }
it's really unlikely that the code is actually intended to be an infinite loop in the case where N happens to be INT_MAX. Code like this would usually be clearer written as "i < N + 1" to emphasize that it really does intend to iterate N+1 times rather than just N times, and it just so happens that this form makes optimizers happier as well.
Instead of having compiler writers sit around and think up clever ways to repurpose anachronistic language rules, I might prefer to have them focus instead on ways they can help me write better code instead :-).
In response to the last paragraph, there's a tension between two different visions for the C language. One of the visions is what's sometimes called "high-level assembly language". In that case, undefined behavior lets implementors choose the behavior that makes the most sense in that situation. Someone writing software for a DSP is going to have different expectations for the behavior of integers than someone writing software for a general purpose processor. Forcing both implementations to use the exact same semantics is going to slow both of them down, because extra checks will have to be added to work around the differences in the underlying hardware.
The other vision for C is a portable systems programming language. When trying to write portable code, undefined or implementation-defined behavior is a big problem. I'm glad that compiler writers employ a take-no-prisoners approach here. People shouldn't be relying on undefined behavior when trying to write portable code, so compiler writers shouldn't be bound to implement consistent behavior in those cases. That's especially the case if code ever needs to be compiled with a different compiler, which might decide to implement different consistent behavior.
It's also worth noting that in some cases, the exploitation of undefined behavior doesn't always happen in a single place in a compiler. Instead, it can be a combination of applying a few different rules in different optimization passes that produces surprising results.
With that being said, it sure would be nice if the compiler writers figured out how to give more warnings when undefined behavior is detected. If x + 1 > x is optimized to true, tell me! If a dereference of a pointer that could be null leads to potentially dead code being eliminated, tell me about that too! It's the silently surprising behavior that causes the most problems.
Virtually all CPUs today implement two's complement signed integers natively. Even the few C-capable DSPs that I know about do too (and have separate types and operations for saturation etc.). The main loss would be the compiler optimizations, and most of those can be recovered if programmers can avoid a few pitfalls, such as the ones I discussed.
The problem with undefined behavior is not people ignoring portability. It's that it's actually really easy to accidentally misuse it. For example, Regehr's group has found quite a few such bugs in widely ported code written by smart people [0].
I have to agree with your parting comment. The standard even suggests that undefined things could behave "in a documented manner characteristic of the environment" -- why, goodness, that sounds useful. Though I speak only for myself; perhaps taking full advantage of every potential inch of leeway and confusing everybody except the very best-read of language lawyers really is what people would prefer.
In practice, this has resulted in compilers like gcc eliminating security-critical integer overflow checks in code on the basis that, since they can only be triggered if the integer overflows, they're invoking undefined behaviour and can be deleted.
Another behaviour it covers is chips with hardware guard bits: you might have some (inaccessible) bits of headroom (e.g. "(INT_MAX + 1) / 2 > 0").
At the time of the introduction of C, there were plenty of different architectures that used different kinds of signed integer representations. C was evolved to be the "lowest common denominator", so things that operated differently on different hardware were made undefined to preserve portability.
That would be true for unsigned data types. The C standard specifies overflow for signed types as 'undefined behavior'. The C compiler can do whatever it wants...
That is true but it's not the reason... INT_MAX + 1 must be 0 for unsigned int because the C standard defines wrap-around behavior for unsigned data types. Saturated data types are only defined in Embedded C. E.g.: Sat Fract
That is the entire reason. Because the representation of signed numbers could vary from implementation to implementation, making INT_MAX + 1 (or any signed integer overflow) undefined allows for more efficient generated code. If it were defined in some way, the generated code would need checks for overflow wherever it was possible, which would lead to bigger and slower code.
INT_MAX + 1 is never 0 for unsigned int. (unsigned)INT_MAX + 1 is equal to INT_MAX + 1. You're thinking of UINT_MAX + 1, which is always 0.
It's not true that (unsigned)INT_MAX + 1 is never 0 - it's allowed for UINT_MAX == INT_MAX (leaving the sign bit unused in the unsigned representation). I've never seen such an implementation, though.
So in question 5, the C Standard answer is rejected because the x86-specific answer is required. But then in question 7, the x86 (and all normal archs) answer is rejected and the C Standard answer is required.
Would it really kill you to decide on the questionnaire rules before writing the questions? :-)
There's a difference between 'implementation-defined' and 'undefined'. Question 5 is implementation-defined behaviour, while question 7 is undefined behaviour.
While I didn't expect to do well on this (if for no other reason than the lack of parentheses to demarcate evaluation order!), could someone please explain why CHAR_MAX == SCHAR_MAX? My understanding is that no such guarantee exists for the ranges of these types.
A few of the questions seemed odd, mixing implementation specifics with the standard.
On the example platforms described at the top of the quiz, CHAR_MAX (the char maximum) is 127, while SCHAR_MAX (the signed char maximum) is also 127. This is because char is signed by default on those platforms.
The point being made is that the default signedness of char can be different from platform to platform, and the default can be confusing to people who associate the char type with binary data. Better to use int8_t and uint8_t when you want a portable signed or unsigned 8-bit number. int8_t is always signed, and uint8_t is always unsigned.
That tripped me up too, and I'm pretty sure that answer is wrong. It's implementation-defined, since whether 'char' is signed is implementation-defined.
It is, but this quiz is in the context of x86 and x86-64.
> Also assume that x86 or x86-64 is the target. In other words, please answer each question in the context of a C compiler whose implementation-defined characteristics include two's complement signed integers, 8-bit chars, 16-bit shorts, and 32-bit ints. The long type is 32 bits on x86, but 64 bits on x86-64 (this is LP64, for those who care about such things).
On Question 12 ("Assume x has type int. Is the expression x<<32..."), why is this considered an error? Why do we want a compiler to prefer this over "x<<n means shift x by n modulo the bit width of int"?
One possible reason: it makes little sense for 1 << 33 to be less than 1 << 31 but not zero (assuming 32-bit ints -- replace 31 and 33 with 63 and 65 for 64-bit ints). In other words, if you keep shifting a number to the left, it should keep growing, until all the bits are gone.
Another possible reason: CPUs have instructions for implementing << as it now stands.
On x86, the shift instruction only looks at the bottom 5 bits of the shift distance. x<<32 effectively is interpreted as x<<0. (I'm sure other architectures have the same restriction.)
In fact the answer was 5 bits for 32bit and 7 bits for 64bit[1]. x86-64 uses 5 bits for 32bit and for 64bit (shld/shrd can't shift more than 31 bits).
7 bits is problematic because some code may assume that x << -y is an optimization for x << (32-y). I thought there was some architecture where this assumption didn't work on 32bit either, which led to my post above.
Ah, I see... 'Shift' doesn't mean what I thought it meant. Somehow I thought that a left shift was supposed to rotate the number. E.g. shifting by 1 would shift the 31st bit into the 0th bit.