
Get a lexer (like, just steal one if you don't think you'll be able to write one without getting bored, they're usually fairly self-contained) and just start writin' code. They're basically that simple.

The lovely thing about a nicely written recursive descent parser is that the code pretty much is the grammar (topologically), so any grammar you can actually imagine that can be parsed by recursive descent doesn't take much thought to make work.
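To make the "code is the grammar" point concrete, here's a minimal sketch for a hypothetical toy arithmetic grammar (names and grammar are my own, not from the comment): each nonterminal becomes one function, and each function's body reads like its production rule.

```python
# Toy grammar, one function per rule:
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'

def parse_expr(toks):
    val = parse_term(toks)
    while toks and toks[0] in ('+', '-'):
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == '+' else val - rhs
    return val

def parse_term(toks):
    val = parse_factor(toks)
    while toks and toks[0] in ('*', '/'):
        op = toks.pop(0)
        rhs = parse_factor(toks)
        val = val * rhs if op == '*' else val / rhs
    return val

def parse_factor(toks):
    tok = toks.pop(0)
    if tok == '(':
        val = parse_expr(toks)
        assert toks.pop(0) == ')', 'expected closing paren'
        return val
    return float(tok)

print(parse_expr(['2', '+', '3', '*', '(', '4', '-', '1', ')']))  # -> 11.0
```

This one evaluates as it parses; swapping the arithmetic for AST-node construction is the only change needed to produce a tree instead.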



IMVLE, a lexer is just a big regular expression with a branch at the top level for each token type. Are there cases where this doesn't work?
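That "big regex with one branch per token type" idea can be sketched directly with Python's `re` module and named groups (the token set here is hypothetical, just for illustration):

```python
import re

# One alternation branch per token type; m.lastgroup tells us which branch hit.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+(\.\d+)? )
  | (?P<IDENT>  [A-Za-z_]\w* )
  | (?P<OP>     [+\-*/()=] )
  | (?P<SKIP>   [ \t]+ )
  | (?P<ERROR>  . )
""", re.VERBOSE)

def tokenize(src):
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise SyntaxError(f'unexpected character {m.group()!r}')
        yield (kind, m.group())

print(list(tokenize('x = 3.5 + y')))
# -> [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3.5'), ('OP', '+'), ('IDENT', 'y')]
```

Note that branch order matters: the first alternative that matches wins, so keywords or longer operators would need to come before the patterns that could swallow their prefixes.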


Python and Haskell are interesting because the indentation is crucial to figuring out the correct nesting. Haskell lets you use curly braces and semicolons to remove indentation entirely, so the lexer only has to insert virtual tokens into the token stream. I don't know how Python deals with it.

C/C++ have plenty of difficult syntax edge cases. For example, >> can be a single shift operator in an expression, or two successive > tokens closing nested template argument lists.
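Since C++11, compilers resolve this by splitting a >> token into two > tokens when template argument lists are open. A toy sketch of that rewrite (my own simplification: it assumes every < opens a template list, which is exactly the ambiguity real compilers must work harder to resolve):

```python
def split_shift(toks, depth=0):
    """While at least two template argument lists are open, rewrite a
    single '>>' token as two '>' tokens so a parser can close the
    nested lists one at a time. Toy version: treats every '<' as
    opening a template list."""
    out = []
    for t in toks:
        if t == '<':
            depth += 1
        if t == '>>' and depth >= 2:
            out.extend(['>', '>'])
            depth -= 2
        else:
            if t == '>':
                depth -= 1
            out.append(t)
    return out

# Hypothetical token stream for "vector<vector<int>> v":
print(split_shift(['vector', '<', 'vector', '<', 'int', '>>', 'v']))
# -> ['vector', '<', 'vector', '<', 'int', '>', '>', 'v']
```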


Python has some preprocessing to inject INDENT and DEDENT pseudo-tokens into the lexing stream:

https://docs.python.org/3/reference/lexical_analysis.html#in...
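A simplified sketch of the stack-based algorithm those docs describe: track a stack of open indentation levels and emit INDENT/DEDENT pseudo-tokens as lines go deeper or shallower (real Python also rejects dedents that don't match any open level, which this toy skips):

```python
def indent_tokens(lines):
    stack = [0]              # open indentation widths, outermost first
    out = []
    for line in lines:
        if not line.strip():         # blank lines don't affect indentation
            continue
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:        # deeper: open one level
            stack.append(width)
            out.append('INDENT')
        while width < stack[-1]:     # shallower: close levels until we match
            stack.pop()
            out.append('DEDENT')
        out.append(('LINE', line.strip()))
    while stack[-1] > 0:             # EOF closes anything still open
        stack.pop()
        out.append('DEDENT')
    return out

src = ['if x:', '    y = 1', '    if z:', '        w = 2', 'done = True']
for tok in indent_tokens(src):
    print(tok)
```

After this pass, the parser proper can treat INDENT/DEDENT exactly like explicit open/close braces.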


Some things require (or are more easily implemented with) state-based lexing, such as identifying escaped characters in JSON strings or XML attributes, or documentation comments. You can think of this as a finite state machine (tracking the lexer mode) driving different sub-lexer regex token matchers.
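A toy version of that mode-switching idea, assuming just two states (DEFAULT and STRING) and a couple of escapes, loosely in the spirit of JSON strings:

```python
def lex(src):
    """Mode-switching lexer: DEFAULT mode emits single characters,
    STRING mode accumulates a string body and handles escapes."""
    tokens, i, state, buf = [], 0, 'DEFAULT', []
    while i < len(src):
        c = src[i]
        if state == 'DEFAULT':
            if c == '"':                      # opening quote: switch modes
                state = 'STRING'
                buf = []
            elif not c.isspace():
                tokens.append(('CHAR', c))
            i += 1
        else:                                 # STRING mode
            if c == '\\':                     # escape: consume the next char
                esc = src[i + 1]
                buf.append({'n': '\n', 't': '\t'}.get(esc, esc))
                i += 2
            elif c == '"':                    # closing quote: switch back
                tokens.append(('STRING', ''.join(buf)))
                state = 'DEFAULT'
                i += 1
            else:
                buf.append(c)
                i += 1
    return tokens

print(lex(r'["a\"b", 1]'))
# -> [('CHAR', '['), ('STRING', 'a"b'), ('CHAR', ','), ('CHAR', '1'), ('CHAR', ']')]
```

The point is that `\"` means something entirely different inside a string than outside one, which a single stateless top-level regex alternation can't easily express.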



