
Get a lexer (like, just steal one if you don't think you'll be able to write one without getting bored, they're usually fairly self-contained) and just start writin' code. They're basically that simple.

The lovely thing about a nicely written recursive descent parser is that the code pretty much is the grammar (topologically), so any grammar you can actually imagine that can be parsed by recursive descent doesn't take much thought to make work.
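To make the "code is the grammar" point concrete, here's a minimal sketch for a hypothetical toy arithmetic grammar (names and grammar are my own, not from the comment): each nonterminal becomes one function, and each function's body reads like its production rule.

```python
# Toy grammar, one function per rule:
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'

def parse_expr(toks):
    val = parse_term(toks)
    while toks and toks[0] in ('+', '-'):
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == '+' else val - rhs
    return val

def parse_term(toks):
    val = parse_factor(toks)
    while toks and toks[0] in ('*', '/'):
        op = toks.pop(0)
        rhs = parse_factor(toks)
        val = val * rhs if op == '*' else val / rhs
    return val

def parse_factor(toks):
    tok = toks.pop(0)
    if tok == '(':
        val = parse_expr(toks)
        assert toks.pop(0) == ')', 'expected closing paren'
        return val
    return float(tok)

print(parse_expr(['2', '+', '3', '*', '(', '4', '-', '1', ')']))  # -> 11.0
```

This one evaluates as it parses; swapping the arithmetic for AST-node construction is the only change needed to produce a tree instead.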



IMVLE, a lexer is just a big regular expression with a branch at the top level for each token type. Are there cases where this doesn't work?
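That "big regex with one branch per token type" idea can be sketched directly with Python's `re` module and named groups (the token set here is hypothetical, just for illustration):

```python
import re

# One alternation branch per token type; m.lastgroup tells us which branch hit.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+(\.\d+)? )
  | (?P<IDENT>  [A-Za-z_]\w* )
  | (?P<OP>     [+\-*/()=] )
  | (?P<SKIP>   [ \t]+ )
  | (?P<ERROR>  . )
""", re.VERBOSE)

def tokenize(src):
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise SyntaxError(f'unexpected character {m.group()!r}')
        yield (kind, m.group())

print(list(tokenize('x = 3.5 + y')))
# -> [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3.5'), ('OP', '+'), ('IDENT', 'y')]
```

Note that branch order matters: the first alternative that matches wins, so keywords or longer operators would need to come before the patterns that could swallow their prefixes.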


Python and Haskell are interesting because the indentation is crucial to figuring out the correct nesting. Haskell lets you use curly braces and semicolons to remove indentation entirely, so the lexer only has to insert virtual tokens into the token stream. I don't know how Python deals with it.

C/C++ have plenty of difficult syntax edge cases. For example, >> can be a single shift operator in an expression, or two successive > tokens closing nested template argument lists.
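Since C++11, compilers resolve this by splitting a >> token into two > tokens when template argument lists are open. A toy sketch of that rewrite (my own simplification: it assumes every < opens a template list, which is exactly the ambiguity real compilers must work harder to resolve):

```python
def split_shift(toks, depth=0):
    """While at least two template argument lists are open, rewrite a
    single '>>' token as two '>' tokens so a parser can close the
    nested lists one at a time. Toy version: treats every '<' as
    opening a template list."""
    out = []
    for t in toks:
        if t == '<':
            depth += 1
        if t == '>>' and depth >= 2:
            out.extend(['>', '>'])
            depth -= 2
        else:
            if t == '>':
                depth -= 1
            out.append(t)
    return out

# Hypothetical token stream for "vector<vector<int>> v":
print(split_shift(['vector', '<', 'vector', '<', 'int', '>>', 'v']))
# -> ['vector', '<', 'vector', '<', 'int', '>', '>', 'v']
```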


Python has some preprocessing to inject INDENT and DEDENT pseudo-tokens into the lexing stream:

https://docs.python.org/3/reference/lexical_analysis.html#in...
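A simplified sketch of the stack-based algorithm those docs describe: track a stack of open indentation levels and emit INDENT/DEDENT pseudo-tokens as lines go deeper or shallower (real Python also rejects dedents that don't match any open level, which this toy skips):

```python
def indent_tokens(lines):
    stack = [0]              # open indentation widths, outermost first
    out = []
    for line in lines:
        if not line.strip():         # blank lines don't affect indentation
            continue
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:        # deeper: open one level
            stack.append(width)
            out.append('INDENT')
        while width < stack[-1]:     # shallower: close levels until we match
            stack.pop()
            out.append('DEDENT')
        out.append(('LINE', line.strip()))
    while stack[-1] > 0:             # EOF closes anything still open
        stack.pop()
        out.append('DEDENT')
    return out

src = ['if x:', '    y = 1', '    if z:', '        w = 2', 'done = True']
for tok in indent_tokens(src):
    print(tok)
```

After this pass, the parser proper can treat INDENT/DEDENT exactly like explicit open/close braces.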


Some things require (or are more easily implemented with) state-based lexing, such as identifying escaped characters in JSON strings or XML attributes, or documentation comments. You can think of this as a finite state machine (tracking the lexer mode) driving different sub-lexer regex token matchers.
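A toy version of that mode-switching idea, assuming just two states (DEFAULT and STRING) and a couple of escapes, loosely in the spirit of JSON strings:

```python
def lex(src):
    """Mode-switching lexer: DEFAULT mode emits single characters,
    STRING mode accumulates a string body and handles escapes."""
    tokens, i, state, buf = [], 0, 'DEFAULT', []
    while i < len(src):
        c = src[i]
        if state == 'DEFAULT':
            if c == '"':                      # opening quote: switch modes
                state = 'STRING'
                buf = []
            elif not c.isspace():
                tokens.append(('CHAR', c))
            i += 1
        else:                                 # STRING mode
            if c == '\\':                     # escape: consume the next char
                esc = src[i + 1]
                buf.append({'n': '\n', 't': '\t'}.get(esc, esc))
                i += 2
            elif c == '"':                    # closing quote: switch back
                tokens.append(('STRING', ''.join(buf)))
                state = 'DEFAULT'
                i += 1
            else:
                buf.append(c)
                i += 1
    return tokens

print(lex(r'["a\"b", 1]'))
# -> [('CHAR', '['), ('STRING', 'a"b'), ('CHAR', ','), ('CHAR', '1'), ('CHAR', ']')]
```

The point is that `\"` means something entirely different inside a string than outside one, which a single stateless top-level regex alternation can't easily express.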



