Thanks for sharing this interesting post! I'm wondering why isn't there any lang...

zelphirkalt · on March 20, 2024

There is probably no such thing, because it would be hard to map programming language concepts onto each other perfectly. OK, you could have a union of those concepts in the IR, but then the benefit of the IR would disappear, because you would still have to deal with all of those things on the next layer.

dcreager · on March 22, 2024

This is exactly right. You either end up with something very low-level (on the level of LLVM IR, for instance)—which means you aren't constructing and analyzing high-level language constructs anymore—or with something high-level but with many language-specific special cases grafted on.

Where we've found success is in stepping back and creating formalisms that are language-agnostic to begin with, and then using tree-sitter to manage the language-specific translations into that formalism. A good example is how we use stack graphs for precise code navigation [1], which uses a graph structure to encode a language's name binding semantics, and which we build up for several languages from tree-sitter parse trees [2].

[1] https://dcreager.net/talks/stack-graphs/

[2] https://github.com/github/stack-graphs/tree/main/languages

barrkel · on March 20, 2024

Languages are generally only similar at a superficial level and have a lot of fractal detail with high variance.

For your data flow graph, are you going to handle copy constructors, assignment operator overloads, implicit conversions etc. like you see in C++? Or how about overload resolution: figuring out which method wins out over all candidates requires details about the convertibility of types, and how the language ranks candidates based on conversions required. And let's not forget Koenig lookup.

Things like Tree-sitter can discover definitions, declarations and invocations, but matching the right declaration for an invocation isn't trivial in the presence of overloads.
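To make that concrete, here is a minimal Python sketch of why matching by name alone is not enough. Everything in it is hypothetical: the `Decl` type, the `CONVERSION_COST` table, and the "lowest total cost wins" rule are a toy model, far simpler than real C++ overload resolution, and none of it is part of Tree-sitter.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical conversion costs: 0 is an exact match, larger is a worse fit.
# (argument type, parameter type) pairs that are absent are not convertible.
CONVERSION_COST = {
    ("int", "int"): 0,
    ("double", "double"): 0,
    ("int", "double"): 1,
    ("double", "int"): 2,
}


@dataclass(frozen=True)
class Decl:
    """A declaration as a syntax-only tool might record it: name plus parameter types."""
    name: str
    params: Tuple[str, ...]


def call_cost(decl: Decl, arg_types: Sequence[str]) -> Optional[int]:
    """Total conversion cost of calling decl with arg_types, or None if not viable."""
    if len(decl.params) != len(arg_types):
        return None
    total = 0
    for arg, param in zip(arg_types, decl.params):
        cost = CONVERSION_COST.get((arg, param))
        if cost is None:
            return None
        total += cost
    return total


def resolve(decls: Sequence[Decl], name: str, arg_types: Sequence[str]) -> Optional[Decl]:
    """Return the unique cheapest viable candidate; None if there is none or a tie."""
    scored = []
    for decl in decls:
        if decl.name != name:
            continue
        cost = call_cost(decl, arg_types)
        if cost is not None:
            scored.append((cost, decl))
    if not scored:
        return None
    best = min(cost for cost, _ in scored)
    winners = [decl for cost, decl in scored if cost == best]
    return winners[0] if len(winners) == 1 else None


decls = [
    Decl("f", ("int",)),
    Decl("f", ("double",)),
    Decl("g", ("int", "double")),
    Decl("g", ("double", "int")),
]

# Both declarations of f match the call by name; only type information picks one.
print(resolve(decls, "f", ("double",)))
# Both declarations of g are equally good under this toy ranking, so the call is ambiguous.
print(resolve(decls, "g", ("int", "int")))
```

Even this toy needs argument types and a conversion ranking, neither of which comes out of a parse tree on its own. A real language adds many more rules on top.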

duped · on March 20, 2024

To be a bit pedantic, an AST is a kind of IR that represents syntactic constructs. Languages have different syntaxes, so you can't really have a language-agnostic syntax tree.

Consider three languages: one with binary operators but no function application, another with only function application, and a third with both. What would the "agnostic" AST be, except a union of the ASTs of the first two?
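A rough Python sketch of what that union looks like, with hypothetical node types (`Name`, `BinOp`, `Call`) invented purely for illustration. The same addition arrives in two different shapes depending on the source language, so any analysis on top still has to handle both:

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class Name:
    ident: str


@dataclass(frozen=True)
class BinOp:
    """Node needed for the language that only has binary operators."""
    op: str
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Call:
    """Node needed for the language that only has function application."""
    func: "Node"
    args: Tuple["Node", ...]


# The "agnostic" AST is just the union of both languages' node types.
Node = Union[Name, BinOp, Call]


def node_kinds(node: Node) -> Set[str]:
    """Collect the node kinds used in a tree; every consumer needs a case per kind."""
    if isinstance(node, Name):
        return {"Name"}
    if isinstance(node, BinOp):
        return {"BinOp"} | node_kinds(node.left) | node_kinds(node.right)
    if isinstance(node, Call):
        kinds = {"Call"} | node_kinds(node.func)
        for arg in node.args:
            kinds |= node_kinds(arg)
        return kinds
    raise TypeError(f"unknown node type: {type(node).__name__}")


# "a + b" as the operator-only language would produce it.
infix = BinOp("+", Name("a"), Name("b"))
# The same computation as the application-only language would produce it.
prefix = Call(Name("add"), (Name("a"), Name("b")))

print(sorted(node_kinds(infix)))
print(sorted(node_kinds(prefix)))
```

Nothing in the shared type says that `infix` and `prefix` mean the same thing. That knowledge still has to live in whatever consumes the tree.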

riwsky · on March 21, 2024

Why, Lisp of course!