IIUC, the main thing that Google's internal codesearch does that makes it superior to external systems outside of an IDE (like GitHub code search) is that Google actually builds everything, and so it can incorporate that information into its index. There's only so much text search can do when you have macros generating code.
Internally, we use variants of our pipeline to index a variety of open source repos, and some non-blaze/bazel internal repos. Those are often non-Google repos. But we're using some internal postprocessing and serving logic to actually create and host the final index.
Unfortunately I don't know if there's any significant use of Kythe outside of Google. We get a handful of questions on the open source repo from time to time, but that's all I know about.
Before git, most people in my larger circle used RCS, a UNIX version control system from the early 80's. It was very limited (basically each file had its own side-file that contained revision data, and there was no project-wide file) but did its job. Many people moved over to CVS, which used RCS files but added project-wide files so you could manage a dir tree.
After that, I think many people moved to subversion, which had a lot more functionality for distributed VC; for example, there was a server. svn was popular for a while but building it was painful (due to Berkeley DB) and it sort of never grew. I invested a lot of time in it (specifically apache with mod_dav and mod_dav_svn) but lost interest in VC after fighting with subversion.
git came along and from what I can tell it mainly had "it's by linus, and the kernel uses it" and "it's fast" and "something about reflogs". I use git day-to-day but I still can't explain how git became so ubiquitous; I find using it outright painful.
As someone who has used all of them in different companies since the 90s: RCS was ancient, even in the early 90s. Most widely found in things like UNIX source trees.
CVS came along later (mid/late 90s), and was much more widely used.
SVN came on the scene around the late 90s -- it was a massive improvement on CVS, and spread across open source and most professional shops like wildfire. Major sites like sourceforge were built on SVN, but also supported CVS.
Git only became prevalent starting around 2006-2008, and adoption was actually really slow because of its inherent complexity. When GitHub appeared, that was really when the shift started in earnest.
There were others along the way: MS SourceSafe, a moment when everyone was toying with Bitkeeper, etc., but these were as marginal as RCS.
Subversion -was- a massive improvement over CVS but the effort to create it only started in 2000 and it was early 2004 when it went 1.0. I think it was at the 2003 OSCON when Fitz et al introduced it? Everyone in the audience was stoked, but it only had a few good years before git eclipsed it.
Lightweight branches were a huge selling point. If you didn't do them often enough that they were rote, branches in RCS/CVS/SVN required ritual sacrifice.
Mercurial (aka hg) was also gaining popularity at the same time as git. The interface was a lot nicer and more sane than git, but it had some serious performance limitations that hampered it.
Yes, after using git for a few years I was introduced to Mercurial and it was like a breath of fresh air, although I'm also told hg added a number of things that made it much more usable, "right before I started using it".
Since I have limited brain capacity I focus my efforts on being able to use git, not hg, merely because it has so much marketshare.
I mean I learned a lot back in the rcs days- I worked on a small team with a few million lines of code under RCS in a shared FS. The devs taught me about social locking protocol- basically, talking with your coworkers about what parts of code not to work on, because you'd have your hands down in the guts.
One of my first real accomplishments was migrating that codebase from RCS to CVS- which was relatively easy as CVS used RCS under the hood.
Git stepped into a source control ecosystem that was well-served (albeit contentious). People knew (or at least thought they knew) what they wanted from bk/svn/CVS/p4/rcs/sccs.
So git essentially was the "final form" that integrated all the various workflows and topped it off with a maximally-scaled use case (linux) that proved out the tool, drove innovation in integration/scripting/gadgetry, and provided a clear beacon for everyone else to adopt it. So it won.
But in 2004, if you asked around, everyone would have told you that a tool like this was coming at some point (even if they probably wouldn't have described it as very git-like!).
I think it is odd that the story mentions git at all. Git5, the mentioned wrapper around piper, had only a niche audience when I last used it 5 years ago, and it was a demonstrated fact that the users of it were less productive than perforce users. Whether that was causal or not was unknown.
The story mentions git because git5 got me into developer tooling. More in particular, it put me in touch with Shawn Pearce who ran the Git/Gerrit team at Google. When I went to work for him, Shawn wanted to have codesearch support in Gerrit, and Zoekt was ultimately the outcome of my explorations in this space.
IIRC, Git5 was deleted approximately 5 years ago because Fig (the Hg based replacement) had taken over all the use cases.
Is Fig still kicking? I worked on it a bit as an intern back in 2014, but Mercurial has also been on its way out of mainstream support over the last few years. It was a really neat project!
I agree there doesn't seem to be a good connection between work on version control and work on code search.
However, I don't think it makes sense to downplay git5. Anecdotally, basically everybody knew about it, and I'd constantly run into people using it (which is by itself noteworthy since nobody was exactly talking about version control all the time).
Git5 was at the time the most robust solution to chain commits, which was tedious bordering on impossible without some tool. Without definitive data, I wouldn't say users were less productive with git5: it definitely was a useful tool that people at least recommended for chain commits. I was definitely more productive with it.
There were a lot of footguns though, and I do think the hg wrapper that superseded it was way better.
Zoekt was heavily inspired by Google's internal code search, as mentioned in the blog post. The original version of the internal code search is described in the rsc post. Zoekt keeps some of the foundational ideas (e.g., trigram index), but was a from-scratch implementation. We probably should link to the rsc post for completeness, will update.
At the time that I started Zoekt (2016), Google's internal codesearch used suffix arrays for the string matching, which the team wasn't happy with, presumably because of the algorithmic complexity and indexing slowness. The Codesearch team was exploring alternatives, one of them the technique described in https://link.springer.com/article/10.1007/s11390-016-1618-6. The positional trigrams were a simplification of this, that they didn't mind me open sourcing.
So, in terms of algorithms, Zoekt wasn't actually inspired by Google's internal code search.
The precise query syntax of Zoekt is mostly copied from Google's internal syntax, though.
Russ Cox's trigram approach uses document IDs for the posting list, which makes the index much smaller, but gives less precise (i.e. slower) matching. This is mentioned in the design doc at https://github.com/sourcegraph/zoekt/blob/main/doc/design.md....
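To make the difference concrete, here is a minimal Python sketch of the two kinds of posting lists. It is not Zoekt's or Google's actual code; the function names, the toy documents, and the "check the first and last trigram at the right distance" rule are all assumptions made for illustration. Both lookups only produce candidates, and a real engine would still verify each candidate against the file contents.

```python
from collections import defaultdict


def trigrams(text):
    """Yield (offset, trigram) for every 3-character window in text."""
    for i in range(len(text) - 2):
        yield i, text[i:i + 3]


def build_doc_index(docs):
    """Doc-ID posting lists: trigram -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, tri in trigrams(text):
            index[tri].add(doc_id)
    return index


def build_positional_index(docs):
    """Positional posting lists: trigram -> set of (doc ID, offset)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for off, tri in trigrams(text):
            index[tri].add((doc_id, off))
    return index


def doc_candidates(index, query):
    """Docs that contain every trigram of the query, at any position."""
    sets = [index.get(tri, set()) for _, tri in trigrams(query)]
    return set.intersection(*sets) if sets else set()


def positional_candidates(index, query):
    """(doc, offset) pairs where the query's first and last trigram
    occur at the distance the query requires (hypothetical simplification)."""
    tris = list(trigrams(query))
    if not tris:
        return set()
    first = tris[0][1]
    last_off, last = tris[-1]
    last_hits = index.get(last, set())
    return {(d, o) for d, o in index.get(first, set())
            if (d, o + last_off) in last_hits}


# Toy corpus (hypothetical): a.txt has both trigrams of "abcd" but not "abcd".
docs = {"a.txt": "bcd abc", "b.txt": "xabcdx"}
by_doc = doc_candidates(build_doc_index(docs), "abcd")
by_pos = positional_candidates(build_positional_index(docs), "abcd")
```

In this toy corpus the doc-ID index keeps a.txt as a candidate even though it never contains "abcd", so it has to be scanned and rejected later. The positional index drops it up front, at the cost of storing an offset for every trigram occurrence.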
Is Zoekt actually in use at Google and if so how does it relate to Kythe? I know the Zoekt instance for Bazel exists, but the Kythe index also exists (https://cs.opensource.google/bazel/bazel)
I'm on the Kythe team, and I don't know off the top of my head what Zoekt is. Looking it up, I see it's some sort of trigram search, which means if it's used at all (I have no idea), it's codesearch proper, not Kythe.
The Kythe index is the semantic index of the codebase, Codesearch does all of the text/regex/etc searching.
Are you sure? There is find definition and references in https://cs.opensource.google/bazel/bazel and I'm quite sure it's thanks to the Kythe indexing job Bazel team is running in CI.
The refs & jump to def in bazel/bazel are using Kythe, yes. But that is Kythe's semantic index from running the indexing job (also, it's the Kythe team running it, not the Bazel team). It's not the Codesearch trigram/text search (which again, I have no idea if it uses Zoekt).