
In my first C compiler, I combined the lexer and preprocessor. That was a mistake. The (much) later one I wrote, Warp, tokenized and then preprocessed. It was much easier to work with.


It is also about performance.

I recently watched a video from someone who compiled the entire Linux kernel in under a second (I believe) using TinyCC. He noticed that something like 90% of compilation time goes into tokenizing headers that are included many times; some headers get included thousands of times across almost every C file, so he ended up caching tokens. A big reason to have a separate tokenizer, then, is that tokenization is a simpler task and can be optimized with all the low-level tricks: perfect hashes, hand-crafted nested switch/if tries, branchless algorithms, compiler intrinsics, etc.
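
To give a flavor of those low-level tricks, here is a minimal sketch (purely illustrative, not from the video or from any particular compiler) of keyword recognition done with a hand-written switch trie instead of a generic hash table lookup:

    #include <string.h>

    enum kw { KW_NONE, KW_IF, KW_INT, KW_INLINE, KW_RETURN };

    /* Classify an identifier that has already been scanned (s, len).
       Branching on the first character, then checking length and the
       remaining bytes, avoids a general-purpose hash table lookup. */
    static enum kw keyword_lookup(const char *s, size_t len)
    {
        switch (s[0]) {
        case 'i':
            if (len == 2 && s[1] == 'f')                 return KW_IF;
            if (len == 3 && memcmp(s, "int", 3) == 0)    return KW_INT;
            if (len == 6 && memcmp(s, "inline", 6) == 0) return KW_INLINE;
            break;
        case 'r':
            if (len == 6 && memcmp(s, "return", 6) == 0) return KW_RETURN;
            break;
        }
        return KW_NONE; /* ordinary identifier */
    }

Generators like gperf produce similar code automatically; the point is just that keyword dispatch can be a handful of predictable branches rather than a hash-and-probe.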

A good tokenizer is about as fast as simply writing the output array of records into memory. That means it is important to choose the right memory layout for the tokenized data, so that when the parser reads tokens it incurs as few cache misses and indirect memory accesses as possible. Tokenization can be thought of as a sort of in-memory compression.
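
As a rough illustration of such a layout (the names and field sizes are made up, not taken from any particular compiler): a flat array of small fixed-size records that refer back into the source buffer by offset, rather than carrying copied strings, so the parser walks the token stream as sequential, cache-friendly memory.

    #include <stdint.h>

    /* One token = 8 bytes. Lexemes are not copied; they are
       (offset, length) slices into the original source buffer. */
    typedef struct {
        uint32_t src_offset;   /* byte offset of the lexeme in the source */
        uint16_t length;       /* lexeme length in bytes */
        uint8_t  kind;         /* TOK_IDENT, TOK_NUMBER, TOK_LPAREN, ... */
        uint8_t  flags;        /* e.g. "preceded by whitespace/newline" */
    } Token;

    typedef struct {
        const char *src;       /* source text the offsets point into */
        Token      *tokens;    /* one contiguous allocation, grown as needed */
        uint32_t    count;
        uint32_t    capacity;
    } TokenBuffer;

Eight bytes per token is the "compression": the parser rarely touches the raw text again except when it actually needs a lexeme.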


Excellent summary. A couple other reasons for a separate tokenizer:

1. Sometimes all you need is a tokenizer - such as for highlighting in a code editor

2. D has a construct called a token string - where a string literal consists of tokens

3. A separate tokenizer means the lexer and parser can run in separate threads
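
For the third point, here is a minimal sketch of how that hand-off can be wired up (assumed names, pthreads, and a toy bounded queue; this is not how any particular compiler does it): the lexer thread produces tokens into a ring buffer while the parser thread consumes them.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define QUEUE_CAP 1024

    typedef struct { uint8_t kind; uint32_t offset, length; } Token;

    /* Bounded single-producer/single-consumer queue guarded by a mutex.
       The lexer thread pushes tokens; the parser thread pops them. */
    typedef struct {
        Token buf[QUEUE_CAP];
        size_t head, tail, count;
        bool done;                      /* lexer reached end of input */
        pthread_mutex_t mu;
        pthread_cond_t not_empty, not_full;
    } TokenQueue;

    void queue_push(TokenQueue *q, Token t)
    {
        pthread_mutex_lock(&q->mu);
        while (q->count == QUEUE_CAP)
            pthread_cond_wait(&q->not_full, &q->mu);
        q->buf[q->tail] = t;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->mu);
    }

    void queue_finish(TokenQueue *q)    /* called by the lexer at EOF */
    {
        pthread_mutex_lock(&q->mu);
        q->done = true;
        pthread_cond_broadcast(&q->not_empty);
        pthread_mutex_unlock(&q->mu);
    }

    /* Returns false once the lexer is finished and the queue is drained. */
    bool queue_pop(TokenQueue *q, Token *out)
    {
        pthread_mutex_lock(&q->mu);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->mu);
        if (q->count == 0) {            /* done and empty */
            pthread_mutex_unlock(&q->mu);
            return false;
        }
        *out = q->buf[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->mu);
        return true;
    }

The lexer thread calls queue_push per token and queue_finish at end of input; the parser loops on queue_pop until it returns false. In practice you would hand over whole batches of tokens per lock (or use a lock-free ring) to keep synchronization overhead from eating the gains.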



