Abandon all hope, ye who enter here.

Tree-sitter is great until you inevitably run into a problem with it and find that it is impossible to fix properly, if you can even figure out why the problem exists.

The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea.



Yes. It also presents itself as some sort of bastion of Rust stability, but segfaults constantly^. More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.

The wasm bindings are a necessity if you want to be anywhere near "stable", but they run 3x slower.

^ the JS/TS parser when invoked via the Node bindings, at least. Others are potentially better. The GH issues are rife with similar complaints.


Maybe I'm missing something here, but isn't tree sitter already used by a bunch of stuff? Last I heard, github was moving all of their syntax highlighting to it, and a bunch of editors support it. Are they just dealing with the segfaults? Genuine question, because I'm considering using tree sitter for two different projects I'm working on (a LML and a full-blown language).

Or is it an issue with poorly-written grammars? The second you need to write an external parser with tree sitter, you have to start writing C, so I can see how that could lead to segfaults very quickly.

I dunno, I guess I'm just a bit surprised to hear a bunch of negative feedback towards tree-sitter, because up until now most of what I've heard has been pretty positive.


Even if tree-sitter had no bugs it can never really be trusted because there is always a divergence between the language's official parser and the tree-sitter grammar -- unless the language actually uses tree-sitter as its parser, which no popular language does and it seems unlikely a future popular language will.

For syntax highlighting and code folding, the divergence I described above is probably fine in general. Except that it will be wrong from time to time, and then downstream users will waste hours of their lives trying to fix the broken highlighting instead of actually working on their real work.

On the implementation side, the whole thing has bizarre choices. It has created a culture in which checking in enormous generated files is common. It abdicates common and important features in popular languages like significant whitespace to error-prone external scanners. The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation). It bizarrely uses rust and node. The js dsl for grammars is needlessly verbose. The codegen is fairly slow in spite of using rust, and the generated c files are insanely huge (probably a consequence of the GLR algorithm), which leads to horrific compile times, which makes incremental development a nightmare. It also uses (unless it's been updated in the last six months) a terrible algorithm for unicode lexing.

In my opinion, the whole premise of tree-sitter is kind of wrong. What would be better would be a uniform dsl for describing an AST (which can easily be represented with c structs or classes in any oo language). Trying to make a universal glr parser for all languages is a bad idea. It would generally be much easier, more accurate and faster to describe the AST in the dsl that external tools recognize and then handwrite the parser that generates the AST. In the best case, the compiler already exposes an api to get the ast for some source code and you just need to write a translation program from compiler format to dsl format. Worst case, you rip out the parser in the compiler and have it generate the dsl ast format. Most handwritten parsers are only a few thousand lines of code (maybe 10k at the high end).


> It has created a culture in which checking in enormous generated files is common.

You're not required to do that, and never have been. We're going to move the grammars away from checking the generated `parser.c` in, but early in the project, it was a pretty pragmatic solution, given that many consumers of the grammars weren't using a particular package registry like NPM or crates.io.

> It abdicates common and important features in popular languages like significant whitespace to error-prone external scanners.

External scanners are the right tool for this. Nobody has proposed a better solution for this that actually works, and I don't think there is one, because every programming language with significant whitespace implements it differently.

> The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation)

It sounds like you personally have had trouble with debugging a grammar you've tried to develop, but the GLR algorithm solves a real problem.

> It bizarrely uses rust and node.

Tree-sitter isn't very coupled to node. We just shell out to a JS engine because we use JS (the most popular programming language, period) as a grammar DSL, rather than inventing our own language. Node is the most common one.

> the generated c files are insanely huge (probably a consequence of the GLR algorithm)

No, it's not because of GLR. A lot of the reason for the C code size is that we want to generate nice, readable, multi-line C expressions to represent data structures like parse tables, rather than encoding them in some cryptic, unreadable way. Also, incremental parsing requires that a bit more data be present in the parse table than would be required for a batch-only parser. What really matters is the code size of the final binaries. The generated `.so` and `.wasm` files for a Python parser are 503k and 465k, respectively. The wasm gzips down to 69k. I would love to make it smaller, but there are pretty good reasons why it occupies the size that it does, and it's currently pretty manageable.

> It also uses (unless it's been updated in the last six months) a terrible algorithm for unicode lexing.

I'm not sure what you're referring to here. UTF8 and UTF16 are decoded into code points using standard routines from `libicu`.


I don't understand how you can say the code size has nothing to do with GLR and then explain that it is due to the formatting of the parse tables. The point is that table-driven parsers are almost always huge compared to their recursive descent alternatives (and, yes, recursive descent has problems of its own). The code size of the final binary is not all that matters, because the huge c programs cause insanely slow compile times that make development incredibly painful. So that statement only holds if the time of grammar writers doesn't matter.

I was too vague on unicode. I meant unicode character class matching. Last I checked, tree-sitter seemed to use a binary search on the code point to validate whether or not a character was in a unicode character class like \p{Ll}, rather than a table lookup. This led to a noticeable constant slowdown at runtime, roughly 3x if I recall correctly, for the grammar I was working on, compared to an ascii set like [a-z].


Thanks! That's super helpful!


I have no idea what other folks are doing to stabilize it. For me, I needed a ton (hundreds of thousands) of JS/TS(X) files parsed and ingested into other data pipelines and the segfaults were constant^. If you had a system where single files were only parsed on-demand via separate worker thread and the consequence of a segfault was the text appearing monotone, I'm sure you'd be alright.

^One example: https://github.com/wasp-lang/wasp/blob/main/waspc/data/Gener...


Did you open an issue for the segfault you got, on the Tree-sitter repo or the repo for the particular parser you were using?

It's not really accurate to say that Tree-sitter segfaults constantly. The Tree-sitter CI performs randomized fuzz testing based on a bunch of popular grammars, randomly mutating entries from their test corpus. If you have a reproduction case for the segfault, it'd be useful to report.

Since you mention NPM, it sounds like you may be talking about a segfault that's specific to the Tree-sitter Node.js bindings, but it's hard to be sure.


No, I did not open an issue. When I checked the issues for tree-sitter-javascript, I saw the most recent issue was one reporting the same thing (segfault) from some weeks ago, with no attention. It still has had no attention, so I don't think reporting would have done much.

Anyways, since you're here, I dug a bit more into this to make a more useful report. Starting with v0.20.2, the following file will cause a segfault when parsed using tree-sitter-javascript: https://github.com/tursodatabase/libsql/blob/main/libsql-sql... . It worked fine in v0.20.1, and it's still broken with the latest v0.20.4. Based on the diff here: https://github.com/tree-sitter/tree-sitter-javascript/compar..., I don't on the surface see a way to dig deeper into this than trying to read through a 170,000 line (!!!) diff to parser.c.

And looking at that diff raises another complaint: the names of parser nodes must be considered part of the public API, as they are exposed in descendantsOfType, .type, etc., but they are 100% not documented anywhere, and are liable to change without notice in patch version bumps. This makes developing against it a massive pain, as any version increase is liable to break any code expecting a particular nomenclature.

I don't mean to dunk on your project, I'm sure it solves some problems very well. But it is remarkably difficult to confidently depend on in a production environment.


Do you mean the parser generator itself (which is written in Rust) segfaults constantly, or the generated parsers (which are in C)?

It's not surprising to me that there are issues with generated parsers, partly because many languages need external scanners to work, which have to be written in C and use a rather clunky API. tree-sitter-javascript also uses an external scanner so you might want to look there for whatever is causing the problem.


It may well have been the scanner that crashed, I was working at a startup and had about a million other things to deal with so I didn't dig into it very long, I just switched to wasm and kept going. (Of course, the wasm bindings are subtly different and similarly lacking in type definitions, so that bit me down the road as well...).

But in general, I'd have to argue that if your "very stable" software requires interfacing with very unstable external tools to work with a very popular language... eliminating that dependency is certainly an area for improvement!


> I'd have to argue that if your "very stable" software requires interfacing with very unstable external tools to work with a very popular language... eliminating that dependency is certainly an area for improvement!

I think you may misunderstand parser generation and what tree-sitter does. You can't eliminate external scanners, and there isn't any choice for writing them other than C.

If you were just using an existing parser, then you never touched any program written in Rust.


If the only job of the Rust is to create broken C, you'll excuse me if I don't find its stability particularly relevant to the tree-sitter proposition.

And the crash in this case does appear to be a result of a broken Rust-generated parser.c file, not some external tool, but I don't know enough about the project to say for sure.



