I thought is was about this: https://bellard.org/otcc/otcc.c (a tiny, obfuscated C compiler, winner of the 2001 IOCCC). It has led to TinyCC (https://bellard.org/tcc/), not as tiny, but also more complete, not obfuscated and actually useful.
It turns out the compiler in the article is the opposite of that. It it a simple toy transpiler and the code is very clear and mostly made of comments. The former is a feat of optimization, the latter is a tutorial.
Some compilers tokenize while parsing, but for a different reason: it's faster and uses less memory to generate the AST (and sometimes even do analyses) while you're reading the input than to allocate and store a giant list of tokens and then parse that.
Most compilers try not to fail when they encounter a syntax error, they try to "recover" and parse the remaining document, usually starting from the next valid statement or declaration. This lets them report more syntax errors and moreover, is very important for good IDE support (if you've ever used an IDE/language combo where you only get completion after you've written the code, you know why). Although if the syntax error messes up tokenization (e.g. missing quote or end comment) it usually screws up the rest of the document anyways.
> Some compilers tokenize while parsing, but for a different reason: it's faster and uses less memory
Rather legit reasons.. The one I'm writing does this. It seems akin to natural language processing. You'd interrupt a speaker early if you can't make sense of his uttering.
> Most compilers try not to fail when they encounter a syntax error, they try to "recover" and parse the remaining document
This seems ardous for you'd maybe have to keep parallel explanations of what you read ?
"> It seems akin to natural language processing. You'd interrupt a speaker early if you can't make sense of his uttering."
But if you're reading a book and can't understand a sentence, you'll probably glance at the rest of the page to see if there are clues in the context.
Maybe it's a printing error and a paragraph has been repeated. A human who sees the entire page will detect that problem immediately and simply skip over the repeated paragraph. A computer that gives up at the first sight of incongruency has no idea that recovery was so easy.
The analogy here might be relevant to error messages. A compiler that "sees the whole page" can potentially offer more useful suggestions to the programmer about how to fix a problem.
Wrote a C-subset compiler for a compilers course back in university.
In general it's useful for a compiler to not just hit the first error it can find in the source code and immediately error out. Instead, if you keep parsing after encountering an error, you can often encounter more errors, so that you can give the user a list of errors they need to fix, not just the first one. In that sense, a compiler's job isn't over the instant it encounters a syntax error, so the extra tokenization would not be useless.
In a mature product, you would apply heuristics to help you provide the most useful feedback to the user while optimizing the time it takes to do so.
Users love consolidated, efficient feedback without having to peel back errors one at a time. It’s a distinguishing feature when you can find a way to do it that suits your input, and specifically so because it can be a hard problem to solve well!
Strings typically have a token state context. Antlr allows pushing and popping a context state. Not a CFG and won't work with simple tools or book approaches.
Back in the day my Intro To Programming course taught the CLU programming language. I remember one cool feature of the compiler; when it hit a syntax error, it would show the error, skip a few lines until it found a stable place to resume parsing, and then continue compiling in order to catch more syntax errors. Pretty neat!
"skip a few lines until it found a stable place to resume parsing"
That's interesting! I always assumed it was always more complex than that, I love it. How reliable is this method btw? I'm specifically curious about the risk of catching false positives due to failing to parse some sections, such as missing goto labels, or unused variables.
Tokenizing everything in one go is usually the right approach. The typical case is the error-free case, so you won't actually tokenize more characters than necessary, but you will simplify the implementation and avoid jumping back and forth between parsing and tokenizing (thus improving locality).
You got many answers, but I'll give another brutally honest answer: I do this because it's simply much much easier to implement. I would much rather have my AST generator's input be a token stream. It just simplifies everything.
The purpose of tokenizing everything is to report all errors upfront. Imagine a big source file with two syntax errors. Tokenizing and parsing everything and reporting both errors with one compilation is a better user experience.
If you mean storing the whole tokenised input in memory before processing it again, then I'd say never, and go as far as calling it hugely wasteful and inefficient.
The other comments here about catching all errors is valid, but you don't need to store any of the previous output to do so. The tokeniser can provide a token at a time as necessary, and thus also provide errors in the same way.
For the year 1970 and after: never. This should always be part of the parser for reasons of being confidently correct.
Thinking otherwise just leads to suffering. The problem with tutorials like Tiny Compiler or Crafting Interpreters is that the authors do not run into problems with the code shown in their teaching materials, but as soon as a student wants to apply it to modestly complex grammars, it stops working. The authors traded conciseness for correctness which IMO is a bad trade-off, especially since a reasonably complete implementation of a parsing algorithm that has no shortcomings is perhaps only four to six times longer than a short and flawed one.
The point of this critique is to raise awareness; each student should not have to figure out that they got the bad end of said trade-off by trial and error, instead the teaching material should make this clear initially.
I am a professional compiler writer and nope you are wrong. I never tokenise everything up front and instead read tokens on demand. It is much faster and enables me to change how tokens are interpreted on the fly. Which enables me to parse different languages with different token definitions in the same file. Something that I need for the production work I do.
I am a pro compiler writer and I always use scanner-less parsers (no tokens). It allows me to change how tokens are interpreted on the fly. Very useful for the kind of problems I work on.
Better off pulling tokens from semantic analysis through parsing top down, LALR(1), SLR(1), or alternative context-sensitive parsing. Derivation parsing is scannerless.
It still has to separate tokens, understand a nested structure, and output a different format. It's a small compiler, but is a compiler, and honestly understanding parsers can help a lot with text processing tasks.
That's on lines 83/84. I'm not entirely sure that qualifies as the "top" of the source file. It's on the 4th page of the source file, given a traditional 80x25 terminal. (Assuming no wrapping, which the ascii art banner definitely could. A lot.)
It turns out the compiler in the article is the opposite of that. It it a simple toy transpiler and the code is very clear and mostly made of comments. The former is a feat of optimization, the latter is a tutorial.