Let's create a Tree-sitter grammar

mingodad · on March 20, 2024

For people interested on grammars there is also:

- Convert tree-sitter grammar.json to EBNF https://meimporta.eu/tree-sitter/json2ebnf.html

- An alternative syntax to write tree-sitter grammars https://mingodad.github.io/lua-wasm-playground/ (select "Tree-sitter-ebnf-generator" from "Examples" then click "Run" to see the tree-sitter grammar for the content in "Input Text (arg[1])")

- An Yacc/Lex online editor/tester with several example grammars https://mingodad.github.io/parsertl-playground/playground/

gcr · on March 20, 2024

Is there any easy way to make tree-sitter produce smaller files? I was hoping to add support for nim to difftastic [1], but tree-sitter produces a 60MB .c file that’s too big to add to the repository.

[1]: https://github.com/Wilfred/difftastic/pull/633

diffxx · on March 20, 2024

Abandon all hope, ye who enter here.

Tree-sitter is great until you inevitably run into a problem with it and find that it is impossible to fix properly if you can even figure out why the problem even exists.

The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea.

dumbo-octopus · on March 20, 2024

Yes. It also presents itself as some sort of bastion of Rust stability, but segfaults constantly^. More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.

The wasm bindings are a necessity if you want to be anywhere near "stable", but they run 3x slower.

^ the JS/TS parser when invoked via the Node bindings, at least. Others are potentially better. The GH issues are rife with similar complaints.

nbadg · on March 20, 2024

Maybe I'm missing something here, but isn't tree sitter already used by a bunch of stuff? Last I heard, github was moving all of their syntax highlighting to it, and a bunch of editors support it. Are they just dealing with the segfaults? Genuine question, because I'm considering using tree sitter for two different projects I'm working on (a LML and a full-blown language).

Or is it an issue with poorly-written grammars? The second you need to write an external parser with tree sitter, you have to start writing C, so I can see how that could lead to segfaults very quickly.

I dunno, I guess I'm just a bit surprised to hear a bunch of negative feedback towards tree-sitter, because up until now most of what I've heard has been pretty positive.

diffxx · on March 20, 2024

Even if tree-sitter had no bugs it can never really be trusted because there is always a divergence between the language's official parser and the tree-sitter grammar -- unless the language actually uses tree-sitter as its parser, which no popular language does and it seems unlikely a future popular language will.

For syntax highlighting and code folding, what I said above is probably generally fine. Except for the fact that it will probably be wrong from time to time and then downstream users will waste hours of their lives trying to fix the broken highlighting instead of actually working on their real work.

On the implementation side, the whole thing has bizarre choices. It has created a culture in which checking in enormous generates files is common. It abdicates common and important features in popular languages like significant whitespace to error-prone external scanners. The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation). It bizarrely uses rust and node. The js dsl for grammars is needlessly verbose. The codegen is fairly slow in spite of using rust and the generated c files are insanely huge (probably a consequence of the GLR algorithm) which leads to horrific compile times which makes incremental development a nightmare. It also uses (unless it's been updated in the last six months) a terrible algorithm for unicode lexing.

In my opinion, the whole premise of tree-sitter is kind of wrong. What would be better would be a uniform dsl for generating an AST (which can easily be represented with c structs or classes in any oo language). Trying to make a universal glr parser for all languages is a bad idea. It would generally be much easier, more accurate and faster to write the AST in the dsl that external tools recognize and then handwrite the parser that generates the AST. In the best case, the compiler already exposes an api to get the ast for some source code and you just need to write a translation program from compiler format to dsl format. Worst case, you rip out the parser in the compiler and have it generate the dsl ast format. Most handwritten parsers are only a few thousand lines of codes (maybe 10k at the high end).

maxbrunsfeld · on March 20, 2024

> It has created a culture in which checking in enormous generates files is common.

You're not required to do that, and never have been. We're going to move the grammars away from checking the generated `parser.c` in, but early in the project, it was a pretty pragmatic solution, given that many consumers of the grammars weren't using a particular package registry like NPM or crates.io.

> It abdicates common and important features in popular languages like significant whitespace to error-prone external scanners.

External scanners are the right tool for this. Nobody has proposed a better solution for this that actually works, and I don't think there is one, because every programming language with significant implements it differently.

> The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation)

It sounds like you personally have had trouble with debugging a grammar you've tried to develop, but it solves a problem.

> It bizarrely uses rust and node.

Tree-sitter isn't very coupled to node. We just shell out to a JS engine because we use JS (the most popular programming language, period) as a grammar DSL, rather than inventing our own language. Node is the most common one.

> the generated c files are insanely (probably a consequence of the GLR algorithm)

No, it's not because of GLR. A lot of the reason for the C code size is that we want to generate nice, readable, multi-line C expressions to represent data structures like parse tables, rather than encoding it in some cryptic, unreadable way. Also, incremental parsing requires that a bit more data be present in the parse table than would be required for a batch-only parser. What really matters is the code size of the final binaries. The generated `.so` or `.wasm` files for a Python parser are 503k, and 465k, respectively. The wasm gzips down to 69k. I would love to make it smaller, but there are pretty good reasons why it occupies the size that it does, and it's currently pretty manageable.

> It also uses (unless it's been updated in the last six months) a terrible algorithm for unicode lexing.

I'm not sure what you're referring to here. UTF8 and UTF16 are decoded into code points using standard routines from `libicu`.

diffxx · on March 22, 2024

I don't understand how you can say the code size has nothing to do with GLR and the explain that it is due to the formatting of the parse tables. The point is that table driven parsers are almost always huge compared to their recursive descent based alternatives (and, yes, recursive descent does have problems). The code size of the final binary is not all that matters because the huge c programs cause insanely slow compile times that make development incredibly painful. So that statement only holds if the time of grammar writers doesn't matter.

I was too vague on unicode. I meant unicode character class matching. Last I checked, tree-sitter seemed to use a binary search on the code point to validate whether or not a character was in a unicode character class like \p{Ll} rather than a table lookup. This lead to a noticeable constant slowdown of like 3x at runtime if I recall correctly for the grammar I was working on compared to an ascii set like [a-z].

nbadg · on March 20, 2024

Thanks! That's super helpful!

dumbo-octopus · on March 20, 2024

I have no idea what other folks are doing to stabilize it. For me, I needed a ton (hundreds of thousands) of JS/TS(X) files parsed and ingested into other data pipelines and the segfaults were constant^. If you had a system where single files were only parsed on-demand via separate worker thread and the consequence of a segfault was the text appearing monotone, I'm sure you'd be alright.

^One example: https://github.com/wasp-lang/wasp/blob/main/waspc/data/Gener...

maxbrunsfeld · on March 20, 2024

Did you open an issue for the segfault you got, on the Tree-sitter repo or the repo for the particular parser you were using?

It's not really accurate to say that Tree-sitter segfaults constantly. The Tree-sitter CI performs randomized fuzz testing based on a bunch of popular grammars, randomly mutating entries from their test corpus. If you have a reproduction case for the segfault, it'd be useful to report.

Since you mention NPM, it sounds like you may be talking about a segfault that's specific to the Tree-sitter Node.js bindings, but it's hard to be sure.

dumbo-octopus · on March 21, 2024

No, I did not open an issue. When I checked the issues for tree-sitter-javascript, I saw the most recent issue was one reporting the same thing (segfault) from some weeks ago, with no attention. It still has had no attention, so I don't think reporting would have done much.

Anyways, since you're here, I dug a bit more into this to make a more useful report. Starting with v0.20.2, the following file will cause a segfault when parsed using tree-sitter-javascript: https://github.com/tursodatabase/libsql/blob/main/libsql-sql... . It worked fine in v0.20.1, and it's still broken with the latest v0.20.4. Based on the diff here: https://github.com/tree-sitter/tree-sitter-javascript/compar..., I don't on the surface see a way to dig deeper into this than trying to read through a 170,000 line (!!!) diff to parser.c.

And looking at that diff raises another complaint: the names of parser nodes must be considered part of the public API, as they are exposed in descendantsOfType, .type, etc., but they are 100% not documented anywhere, and are liable to change without notice in patch version bumps. This makes developing against it a massive pain, as any version increase is liable to break any code expecting a particular nomenclature.

I don't mean to dunk on your project, I'm sure it solves some problems very well. But it is remarkably difficult to confidently depend on in a production environment.

duped · on March 20, 2024

Do you mean the parser generator itself (which is written in Rust) segfaults constantly, or the generated parsers (which are in C)?

It's not surprising to me that there are issues with generated parsers, partly because many languages need external scanners to work, which have to be written in C and use a rather clunky API. tree-sitter-javascript also uses an external scanner so you might want to look there for whatever is causing the problem.

dumbo-octopus · on March 20, 2024

It may well have been the scanner that crashed, I was working at a startup and had about a million other things to deal with so I didn't dig into it very long, I just switched to wasm and kept going. (Of course, the wasm bindings are subtly different and similarly lacking in type definitions, so that bit me down the road as well...).

But in general, I'd have to argue that if your "very stable" software requires interfacing with very unstable external tools to work with a very popular language... eliminating that dependency is certainly an area for improvement!

duped · on March 21, 2024

> I'd have to argue that if your "very stable" software requires interfacing with very unstable external tools to work with a very popular language... eliminating that dependency is certainly an area for improvement!

I think you may misunderstand parser generation and what tree sitter does. You can't eliminate external scanners, and there isn't another choice for them than C.

If you were just using an existing parser than you never touched any program written in Rust.

dumbo-octopus · on March 21, 2024

If the only job of the Rust is to create broken C, you'll excuse me if I don't find its stability particularly relevant to the tree-sitter proposition.

And the crash in this case does appear to be a result of a broken Rust-generated parser.c file, not some external tool, but I don't know enough about the project to say for sure.

bad_login · on March 20, 2024

Maybe tweaking the rules of the parser and check if it reduce the size of : #define STATE_COUNT 20290 #define LARGE_STATE_COUNT 10594

in parser.c

gcr · on March 20, 2024

That's a good idea, thanks!

dugmartin · on March 20, 2024

If you want or need to stay totally within the JavaScript ecosystem, Marijn Haverbeke (author of CodeMirror, ProseMirror, and more) created Lezer which is tree-sitter like and is grammar driven with optional external tokenizers also:

Announcement: https://marijnhaverbeke.nl/blog/lezer.html

Homepage: https://lezer.codemirror.net/

mdaniel · on March 20, 2024

https://lezer.codemirror.net/docs/guide/#token-specializatio... is pretty cool, and they make use of it in the python grammar for the new "match" and "case" soft keywords: https://github.com/lezer-parser/python/blob/ead67a397f7c7cb6...

robjwells · on March 20, 2024

This is a great post. I’d previously started to write a tree-sitter grammar for djot but bounced off when I realised I’d have to use an external scanner. Nice to have an article that sets out how to do so.

Link in the post is currently broken, but here’s the author’s grammar on GitHub: https://github.com/treeman/tree-sitter-djot

heresie-dabord · on March 20, 2024

In the interest of making the page the best version of itself, the following phrase makes a mistake often made by native US English speakers:

> But, you might wonder how does a Tree-sitter grammar look like,

How does $thing look?

or

What does $thing look like?

Similarly, one often hears "Where are we at?" It is common enough to be seen as a regionalism.

-- Your friendly tree-sitting grammarian

Joker_vD · on March 20, 2024

Interestingly enough, in German and Polish the analogous expression is in fact "how does X look like?" So maybe it's an interference from those two languages? Or maybe it's even a relict of the older English expression, displaced in England by "what does X look like" after the Am/Br language split had happened.

VyseofArcadia · on March 20, 2024

Anecdotally, I hear the how/what confusion from nearly every non-native English speaker I've ever met, regardless of native language.

I wonder if it's one of those little things that's surprisingly rare, like the th sound.

airstrike · on March 20, 2024

I don't think that's necessarily better... The two alternatives I think parse best are:

1. But you might wonder: what does a Tree-sitter grammar look like, and how do you create one for a language?

2. But you might wonder what a Tree-sitter looks like and how to create one for a language.

In either case, I agree the right word is "what" not "how". The comma after "but" bugs me tremendously. The colon can be replaced by an em dash (—) if the writer prefers it—but I would not use a question mark for the second clause without separating it from the first "but you might wonder" in some way.

The comma left in alternative #1 is to needed (recommended?) because the "and" is joining two clauses with different subjects. I think it's not strictly needed in alternative #2 because it's really "(but you might wonder (what a tree sitter looks like and how to create one for a language))" where the inner term is a list of things the reader might wonder, and lists with only two items do not take a comma before "and"

I would also prefer "Where are we?" over "Where are we at?"

Not an English speaker, so just my $0.02.

ngai_aku · on March 20, 2024

> Similarly, one often hears "Where are we at?" It is common enough to be seen as a regionalism.

I’m a native English speaker who says “Where are we at”. I think it’s informal, but I never considered it to be wrong or awkward like the first example. Perhaps I’m in the region where it’s common.

lawn · on March 20, 2024

Huh, I'm not a native English speaker, but I changed it to "what does a Tree-sitter grammar look like". It does sound better.

samatman · on March 20, 2024

> "Where are we at?"

Often elided to "Where we at?".

It's dialect, I wouldn't use either in formal/technical writing, but I would unselfconsciously say the latter, as in "ok, moving down the agenda, where we at with item 3?".

dunham · on March 20, 2024

As an exercise, I did one two years ago for pi-forall. That had a small scanner for the off-sides stuff. I lost steam after I got it working (and needed to add a bunch of rules). I think I was mainly trying to see if I could get tree-sitter to do off-sides (haskell-like indentation).

One issue that I remember was the web application to test the grammar needed an exact version of the emscripten stuff. And I had to sort out what that version was myself.

lawn · on March 20, 2024

Oh shoot, fixed the link. Thanks!

tillcarlos · on March 20, 2024

(unrelated to content) Wow, that site speed!

In the footer there’s a link to OPs source code. The whole site gets compiled with Rust. Interesting gem to find.

prabir · on March 20, 2024

I did the same with my blog which is written in rust https://github.com/prabirshrestha/rblog. While it could be static I wanted it to be server. But I don't think rust is the reason for it. Not depending on external resources such as fonts and javascript for critical path along with compression is the most critical. Since it is a blog focus on text instead of fancy images and only include when absolutely necessary. Testing the site in Fast 3G or Slow 3G is the best way to find the places where you can optimize your site.

emmanueloga_ · on March 20, 2024

The only nitpick is that the site loads 1.2Mb of fonts... apparently through CloudFront. It seems it serves them uncompressed, perhaps because the fonts are woff2? [1]

Also, the bold version of iosevka is an order of magnitud heavier than the other fonts :-).

--

1: https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...

juanrgon · on March 20, 2024

It’s a static site. It would be strange if it wasn’t fast

codethatwerks · on March 20, 2024

It is hard to make a slow one. Even shared hosting wordpress with the cache turned on is fast.

viraptor · on March 20, 2024

You'd think so... There's so many versions of WP caching and so many ways it's deployed, that (with a fairly small experience) I've seen sites taking 5s+ to load after enabling the relevant option because suddenly minimisation and bundling kicked in on every request and kept the page fresh. WP is trivial to make accidentally slow.

lawn · on March 20, 2024

Static sites are a breath of fresh air in this day and age.

antonpuz · on March 20, 2024

Thanks for sharing this interesting post!

I'm wondering why isn't there any language agnostic AST, commonly referred to as intermediate representation, and higher level logic such as data flow graph built on Tree-sitter. I mean writing the Grammer is of course something that is mandatory but it seems like programming languages are too much different in the architectural level of Tree-sitter

zelphirkalt · on March 20, 2024

There is probably no such thing, because it would be hard to map programming language concepts onto each other perfectly. OK, you could have a union of those concepts in the IR, but the the benefit of IR would disappear, because you will have to deal with all the things on the next layer.

dcreager · on March 22, 2024

This is exactly right. You either end up with something very low-level (on the level of LLVM IR, for instance)—which means you aren't constructing and analyzing high-level language constructs anymore—or with something high-level but with many language-specific special cases grafted on.

Where we've found success is in stepping back and creating formalisms that are language-agnostic to begin with, and then using tree-sitter to manage the language-specific translations into that formalism. A good example is how we use stack graphs for precise code navigation [1], which uses a graph structure to encode a language's name binding semantics, and which we build up for several languages from tree-sitter parse trees [2].

[1] https://dcreager.net/talks/stack-graphs/

[2] https://github.com/github/stack-graphs/tree/main/languages

barrkel · on March 20, 2024

Languages are generally only similar at a superficial level and have a lot of fractal detail with high variance.

For your data flow graph, are you going to handle copy constructors, assignment operator overloads, implicit conversions etc. like you see in C++? Or how about overload resolution: figuring out which method wins out over all candidates requires details about the convertibility of types, and how the language ranks candidates based on conversions required. And let's not forget Koenig lookup.

Things like Tree-sitter can discover definitions, declarations and invocations, but matching the right declaration for an invocation isn't trivial in the presence of overloads.

duped · on March 20, 2024

To be a bit pedantic, an AST is a kind of IR that represents syntactic constructs. Languages have different syntaxes so you can't really have a language agnostic syntax tree.

Consider three languages, one with binary operators but no function application, and another with only function application, and a third with both. What would the "agnostic" AST be, except a union of the AST of the first two?

riwsky · on March 21, 2024

Why, Lisp of course!

andrewla · on March 20, 2024

I'm not abreast of all the latest buzz in NeoVim, but I've been hearing and seeing good things about Treesitter in general.

Is the plan to at some point deprecate the existing vim-style syntax highlighting (syn match/region etc.) in favor of treesitter-based syntax highlighting? Where can I find this discussion if it is happening?

lsllc · on March 20, 2024

@lawn, There's a typo towards the end (about 80%) down in the section on nvim-treesitter, should be `TSUpdate`, not `TSUdate`:

  :TSUdate sdjot

Excellent write up though!

lawn · on March 20, 2024

Thank you, fixed.