Kythe has one schema, whereas with Glean each language has its own schema with arbitrary amounts of language-specific detail. You can get a language-agnostic view by defining an abstraction layer as a schema. Our current (work in progress) language-agnostic layer is called "codemarkup" https://github.com/facebookincubator/Glean/blob/main/glean/s...
For wiring up the indexer there are various methods; it tends to depend very much on the language and the build system. For Flow, for example, Glean output is built into the typechecker: you run it with some flags to spit out the Glean data. For C++, you need to get the compiler flags from the build system to pass to the Clang frontend. For Java the indexer is a compiler plugin; for Python it's built on libCST. Some indexers send their data directly to a Glean server; others generate files of JSON that get sent using a separate command-line tool.
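As a rough illustration, a JSON batch of facts is a list of predicates with their facts (the predicate name here is made up; see the Glean docs for the exact write format):

```json
[
  { "predicate": "example.FileDefines.1",
    "facts": [
      { "key": { "file": "src/Foo.js", "symbol": "foo" } }
    ]
  }
]
```

A file like this is what the separate command-line tool sends to the server.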
References use different methods depending on the language. For Flow for example there is a fact for an import that matches up with a fact for the export in the other file. For C++ there are facts that connect declarations with definitions, and references with declarations.
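As a rough sketch of the idea (these types and field names are hypothetical, not the real Glean schema), matching an import fact in one file against the export fact in another comes down to a join on a shared key:

```haskell
import Data.List (find)

-- Hypothetical, simplified fact types (not the real Glean schema):
-- an export fact in one file and an import fact in another file are
-- joined on the (module, name) pair to resolve a cross-file reference.
data ExportFact = ExportFact { exModule :: String, exName :: String, exDefLine :: Int }
  deriving (Eq, Show)

data ImportFact = ImportFact { imModule :: String, imName :: String, imUseLine :: Int }
  deriving (Eq, Show)

-- Resolve an import by finding the export fact for the same module/name.
resolve :: [ExportFact] -> ImportFact -> Maybe ExportFact
resolve exports imp =
  find (\e -> exModule e == imModule imp && exName e == imName imp) exports
```

In the real system this join is expressed as a query over the facts rather than a list scan, but the shape is the same: the import and export facts share a key that connects them.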
If using Kythe was an option, what was the rationale for not using it?
One major limitation of Kythe is handling different versions. For example, Kythe can produce a well-connected index of Stackage, but an index of Hackage would have many holes (not all references would be found, since a unique reference name needs the library version).
How does Glean handle different library versions?
EDIT: the language agnostic view is already mentioned.
There will be more indexers: we have Python, C++/Objective C, Rust, Java and Haskell. It's just a case of getting them ready to open source. You can see the schemas for most of these already in the repo: https://github.com/facebookincubator/Glean/tree/main/glean/s...
The graph is sorted by performance, with the worst performing (not necessarily the most common) on the left. We've also done more profiling and optimisation since we took those measurements.
FXL employed some tricks that were sometimes beneficial but often weren't: for example, it memoized much more aggressively than we do in Haskell. Mostly that's a loss, but just occasionally it's a win. When a profile turns up one of these cases, we can squash it by fixing the original code.
What matters most is overall throughput for the typical workload, and we win comfortably there.
For example, let's say that one of the things you want to compute is the number of friends of the current user. This value is used all over the codebase, but it only makes sense in the context of the current request (because every request has a different idea of "the current user"). So this is a memoized value, even though in the language it looks like a top-level expression.
Memoization only stores results during a request. It starts empty at the beginning of the request and is discarded at the end, and it is not shared with any other requests. It's just a map that's passed around (inside the monad) during a request.
Thanks for the response. Just trying to expand my brain here =), so I have a followup question.
I always thought of memoization as storing the parameters to, and result of, a function call in a memotable. Doing some quick research, I came across this definition of memoization from NIST that sounds more general "Save (memoize) a computed answer for possible later reuse, rather than recomputing the answer." What I understand from what you said is that when a request is processed, it produces a map that is passed around for the duration of the request.
The memo table (map) is a bit of state that is maintained throughout the request's lifetime. When we compute a memoized value, it is inserted into the map, and if we need the value again we can just grab it from the map instead of recomputing it.
The "automatic" bit is that we insert the code that consults the map so the programmer doesn't have to write it. The map itself is already invisible, because it's inside the monad. So the overall effect is a form of automatic memoization.
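A minimal sketch of the mechanism, assuming a simplified request monad built on State (the real implementation lives inside Haxl's monad, and `Request`, `memo`, and `runRequest` here are made-up names for illustration):

```haskell
import qualified Data.Map as Map
import Control.Monad.Trans.State (State, evalState, get, modify)

-- The memo table is just a map threaded through the request. It starts
-- empty, is never shared between requests, and is dropped at the end.
type Request a = State (Map.Map String Int) a

-- Consult the map first; otherwise run the computation and record the
-- result for the rest of the request. In the real system this lookup
-- and insert is generated automatically, so the programmer never writes it.
memo :: String -> Request Int -> Request Int
memo key compute = do
  table <- get
  case Map.lookup key table of
    Just v  -> return v
    Nothing -> do
      v <- compute
      modify (Map.insert key v)
      return v

-- Each request starts with an empty memo table.
runRequest :: Request a -> a
runRequest req = evalState req Map.empty
```

So if two parts of a request both ask for `memo "numFriends" expensive`, the expensive computation runs once and the second caller just reads the map.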
Is it a cultural reference? I found it in bestcomments, so it looks like many people get it, but I don't. Genuinely interested, as English is my second language.
Segmented stacks are better even if you have a relocatable stack. GHC switched from monolithic copy-the-whole-thing-to-grow-it to segmented stacks a while ago, and it was a big win. Not just because we waste less space, but because the GC knows when individual stack segments are dirty so that it doesn't need to traverse the whole stack. To avoid the thrashing issue we copy the top 1KB of the previous stack chunk into the new stack chunk when we allocate a new chunk.