Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
The Super Tiny Compiler (github.com/jamiebuilds)
126 points by stefankuehnel on March 3, 2023 | hide | past | favorite | 42 comments


I thought is was about this: https://bellard.org/otcc/otcc.c (a tiny, obfuscated C compiler, winner of the 2001 IOCCC). It has led to TinyCC (https://bellard.org/tcc/), not as tiny, but also more complete, not obfuscated and actually useful.

It turns out the compiler in the article is the opposite of that. It it a simple toy transpiler and the code is very clear and mostly made of comments. The former is a feat of optimization, the latter is a tutorial.


One of my favorites is the self-hosting Haskell compiler from the 2019 IOCCC: https://www.ioccc.org/2019/lynn/hint.html

There is a series of articles on how it was built here: https://crypto.stanford.edu/~blynn/compiler/


Whenever the subject of tiny learning-oriented compilers comes up, I always recommend C4: https://news.ycombinator.com/item?id=8558822

It's a tiny self-compiling compiler which also includes a bytecode interpreter, and terse but also very understandable.


Lots of tiny, self-hosting compilers here: http://t3x.org

For example:

Compiler for a procedural language, compiling to binary in 1500 lines: http://t3x.org/t3x/comp.html

Super-portable tiny compiler for DOS, CP/M, Unix, and a VM: http://t3x.org/t3x/index.html#0

Self-hosting LISP compiler in 400 lines: http://t3x.org/lfn/liscmp.lisp.html


Related:

A dynamic tutorial about a compiler in one JavaScript file (2016) - https://news.ycombinator.com/item?id=30129911 - Jan 2022 (7 comments)

An ultra-simplified example of a modern compiler written in JavaScript - https://news.ycombinator.com/item?id=22522208 - March 2020 (27 comments)

Super Tiny Compiler - https://news.ycombinator.com/item?id=11395656 - March 2016 (100 comments)


Question to compiler writers : when is it useful, if ever, to tokenize the whole input beforehand (as done here) ?

You wouldn't catch an early syntax error and would go on tokenizing till the end for nothing.


Some compilers tokenize while parsing, but for a different reason: it's faster and uses less memory to generate the AST (and sometimes even do analyses) while you're reading the input than to allocate and store a giant list of tokens and then parse that.

Most compilers try not to fail when they encounter a syntax error, they try to "recover" and parse the remaining document, usually starting from the next valid statement or declaration. This lets them report more syntax errors and moreover, is very important for good IDE support (if you've ever used an IDE/language combo where you only get completion after you've written the code, you know why). Although if the syntax error messes up tokenization (e.g. missing quote or end comment) it usually screws up the rest of the document anyways.


> Some compilers tokenize while parsing, but for a different reason: it's faster and uses less memory to generate the AST

And some don't even generate an AST. :) Just read in and emit or interpret.

https://briancallahan.net/blog/20210814.html


  > Some compilers tokenize while parsing, but for a different reason: it's faster and uses less memory 
Rather legit reasons.. The one I'm writing does this. It seems akin to natural language processing. You'd interrupt a speaker early if you can't make sense of his uttering.

  > Most compilers try not to fail when they encounter a syntax error, they try to "recover" and parse the remaining document
This seems ardous for you'd maybe have to keep parallel explanations of what you read ?


"> It seems akin to natural language processing. You'd interrupt a speaker early if you can't make sense of his uttering."

But if you're reading a book and can't understand a sentence, you'll probably glance at the rest of the page to see if there are clues in the context.

Maybe it's a printing error and a paragraph has been repeated. A human who sees the entire page will detect that problem immediately and simply skip over the repeated paragraph. A computer that gives up at the first sight of incongruency has no idea that recovery was so easy.

The analogy here might be relevant to error messages. A compiler that "sees the whole page" can potentially offer more useful suggestions to the programmer about how to fix a problem.


Wrote a C-subset compiler for a compilers course back in university.

In general it's useful for a compiler to not just hit the first error it can find in the source code and immediately error out. Instead, if you keep parsing after encountering an error, you can often encounter more errors, so that you can give the user a list of errors they need to fix, not just the first one. In that sense, a compiler's job isn't over the instant it encounters a syntax error, so the extra tokenization would not be useless.


I'm struggling with this idea. You could end up tokenizing the inside of an unopened string literal for (contrived) ex :

  String s = hello world";
  [-> typename:'String' id's' op'=' id'hello' id'world' ...]
And it would cascade if then

  foo("blah"); (...)
  [-> strlit';\nfoo(' id'blah' ...]
  
A parser-first approach will stop with

  unknown id 'hello'
and avoid the cascade.


In a mature product, you would apply heuristics to help you provide the most useful feedback to the user while optimizing the time it takes to do so.

Users love consolidated, efficient feedback without having to peel back errors one at a time. It’s a distinguishing feature when you can find a way to do it that suits your input, and specifically so because it can be a hard problem to solve well!


Precise diagnostics usually follow one of 2 approaches:

1. Hand-written, recursive descent

2. Error matching expanded grammar beyond minimal gramar


Check out this section of Crafting Interpreters on Error Recovery: https://craftinginterpreters.com/parsing-expressions.html#pa...


Strings typically have a token state context. Antlr allows pushing and popping a context state. Not a CFG and won't work with simple tools or book approaches.


Back in the day my Intro To Programming course taught the CLU programming language. I remember one cool feature of the compiler; when it hit a syntax error, it would show the error, skip a few lines until it found a stable place to resume parsing, and then continue compiling in order to catch more syntax errors. Pretty neat!


"skip a few lines until it found a stable place to resume parsing"

That's interesting! I always assumed it was always more complex than that, I love it. How reliable is this method btw? I'm specifically curious about the risk of catching false positives due to failing to parse some sections, such as missing goto labels, or unused variables.


I think most programming languages tokenize first. Tokenizing is absurdly fast, like 100x faster than printing to a terminal.


> Tokenizing is absurdly fast

is rather a very good reason to interleave tokenizing and parsing. Otherwise your tokenizer will just get stuck in iowait for no good reason.


Sure, if shaving a millisecond off the execution time of your compiler is your idea of a good time.


Tokenizing everything in one go is usually the right approach. The typical case is the error-free case, so you won't actually tokenize more characters than necessary, but you will simplify the implementation and avoid jumping back and forth between parsing and tokenizing (thus improving locality).


You got many answers, but I'll give another brutally honest answer: I do this because it's simply much much easier to implement. I would much rather have my AST generator's input be a token stream. It just simplifies everything.


The purpose of tokenizing everything is to report all errors upfront. Imagine a big source file with two syntax errors. Tokenizing and parsing everything and reporting both errors with one compilation is a better user experience.


If you mean storing the whole tokenised input in memory before processing it again, then I'd say never, and go as far as calling it hugely wasteful and inefficient.

The other comments here about catching all errors is valid, but you don't need to store any of the previous output to do so. The tokeniser can provide a token at a time as necessary, and thus also provide errors in the same way.


For the year 1970 and after: never. This should always be part of the parser for reasons of being confidently correct.

Thinking otherwise just leads to suffering. The problem with tutorials like Tiny Compiler or Crafting Interpreters is that the authors do not run into problems with the code shown in their teaching materials, but as soon as a student wants to apply it to modestly complex grammars, it stops working. The authors traded conciseness for correctness which IMO is a bad trade-off, especially since a reasonably complete implementation of a parsing algorithm that has no shortcomings is perhaps only four to six times longer than a short and flawed one.

The point of this critique is to raise awareness; each student should not have to figure out that they got the bad end of said trade-off by trial and error, instead the teaching material should make this clear initially.


I am a professional compiler writer and nope you are wrong. I never tokenise everything up front and instead read tokens on demand. It is much faster and enables me to change how tokens are interpreted on the fly. Which enables me to parse different languages with different token definitions in the same file. Something that I need for the production work I do.


>>> when is it useful, if ever, to tokenize the whole input beforehand

>> never

> nope you are wrong. I never tokenise everything up front

Explain how you could possibly arrive at the complete opposite meaning of what I wrote when you even use the same word "never" as I did?

Quick, get another downvote in, that'll teach me! Peak HN moment here.


Sorry I clearly misunderstood your point. My bad.

I upvoted your correction.


I am a pro compiler writer and I always use scanner-less parsers (no tokens). It allows me to change how tokens are interpreted on the fly. Very useful for the kind of problems I work on.


Sounds like a 2 to 3 pass compiler or LL(k).

Better off pulling tokens from semantic analysis through parsing top down, LALR(1), SLR(1), or alternative context-sensitive parsing. Derivation parsing is scannerless.


This is useful when your language is simple enough that you don’t need a real parser.

Fast lexer + cheap validator is a winning combo.


Shout out for https://github.com/jneen/parsimmon if you want to build a parser in JS


> * We're going to compile some lisp-like function calls into some C-like

> * function calls.

Isn't this a transpiler, doing mostly string replacement?


It still has to separate tokens, understand a nested structure, and output a different format. It's a small compiler, but is a compiler, and honestly understanding parsers can help a lot with text processing tasks.


The file is 1053 lines long and the first 380 lines are comments. Including a big ascii banner. That seems deliberate.


> The Super Tiny Compiler

Written in JS, runs in the (super big) web browser.


it wasn't clear at all from a 5 second skim of the code and looking at the readme what the source and targets were


Ten seconds probably would have done it.


* We're going to compile some lisp-like function calls into some C-like * function calls.

From the top of the source file.


Yep, from test.js:

> const input = '(add 2 (subtract 4 2))';

> const output = 'add(2, subtract(4, 2));';


That's on lines 83/84. I'm not entirely sure that qualifies as the "top" of the source file. It's on the 4th page of the source file, given a traditional 80x25 terminal. (Assuming no wrapping, which the ascii art banner definitely could. A lot.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: