Code Search at Google: Han-Wen and Zoekt

j2kun · on Nov 21, 2023

IIUC, the main thing that Google's internal codesearch does that makes it superior to external systems (outside of an IDE, like GitHub code search) is that Google actually builds everything, and so it can incorporate that information into its index. There's only so much text search can do when you have macros generating code.

dmoy · on Nov 21, 2023

Yea that would be Kythe. We build almost everything, across 44-45 different programming languages, and postprocess that into a giant semantic graph.

Most major parts are open sourced at kythe.io, and there's a somewhat dated talk given by Luke here: https://youtu.be/VYI3ji8aSM0

sa46 · on Nov 21, 2023

Do you have any case studies or success stories for non-Google repos? I miss code search but I’m not sure how close Kythe is to code-search-in-a-box.

dmoy · on Nov 21, 2023

Internally, we use variants of our pipeline to index a variety of open source repos, and some non-blaze/bazel internal repos. Those are often non-Google repos. But we're using some internal postprocessing and serving logic to actually create and host the final index.

Unfortunately I don't know if there's any significant use of Kythe outside of Google. We get a handful of questions on the open source repo from time to time, but that's all I know about.

dmoy · on Nov 21, 2023

> macros

Corollary: while we can do a lot with indexing generated code (even cross language) in Kythe, there are limits. Macros may be one, I forget atm

beyang · on Nov 21, 2023

Great call out! We've built this code navigation infra on top of Zoekt into Sourcegraph. Example: https://sourcegraph.com/github.com/golang/go/-/blob/src/net/...

Docs: https://docs.sourcegraph.com/code_navigation/explanations/pr...

JohnMakin · on Nov 21, 2023

I didn't start my tech journey til late 00's, so it's constantly surprising to me that something as ubiquitous as git only came out in 2005.

Is it possible at all this story helped spur the widespread adoption of git (the early implementation of this tool)?

dekhn · on Nov 21, 2023

Before git, most people in my larger circle used RCS, a UNIX version control system from the early 80's. It was very limited (basically each file had its own side-file that contained revision data, and there was no project-wide file) but did its job. Many people moved over to VCS, which used RCS files but added project-wide files so you could manage a dir tree.

After that, I think many people moved to subversion, which had a lot more functionality for distributed VC; for example, there was a server. svn was popular for a while but building it was painful (due to berkeley db) and it sort of never grew. I invested a lot of time in it (specifically apache with mod_dav and mod_dav_svn) but lost interest in VC after fighting with subversion.

git came along and from what I can tell it mainly had "it's by linus, and the kernel uses it" and "it's fast" and "something about reflogs". I use git day-to-day but I still can't explain how git became so ubiquitous; I find using it outright painful.

timr · on Nov 21, 2023

You have a weird/selective perspective.

As someone who has used all of them in different companies since the 90s: RCS was ancient, even in the early 90s. Most widely found in things like UNIX source trees.

CVS came along later (mid/late 90s), and was much more widely used.

SVN came on the scene around the late 90s -- it was a massive improvement on CVS, and spread across open source and most professional shops like wildfire. Major sites like sourceforge were built on SVN, but also supported CVS.

Git only became prevalent starting around 2006-2008, and adoption was actually really slow because of its inherent complexity. When Github appeared, that was really when the shift started in earnest.

There were others along the way: MS SourceSafe, a moment when everyone was toying with Bitkeeper, etc., but these were as marginal as RCS.

jonstewart · on Nov 22, 2023

Subversion -was- a massive improvement over CVS but the effort to create it only started in 2000 and it was early 2004 when it went 1.0. I think it was at the 2003 OSCON when Fitz et al introduced it? Everyone in the audience was stoked, but it only had a few good years before git eclipsed it.

timr · on Nov 24, 2023

Sounds right. I didn't look up the exact dates. Tons of folks were using it before 1.0, though.

zem · on Nov 22, 2023

don't forget arch (aka "tla" for "tom lord's arch")! I played with it briefly back in the day but for some reason did not stick with it

https://en.m.wikipedia.org/wiki/GNU_arch

paulryanrogers · on Nov 22, 2023

Bazaar was my first DVCS, used after SVN, and it became clear very quickly why Git used SHA revision IDs.
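For readers wondering what a "SHA revision ID" buys you, here is a minimal sketch, assuming the contrast is with per-branch sequential revision numbers. It shows how git derives a blob's object ID from the content itself, so the same content gets the same ID in every clone. The helper name `git_blob_id` is made up for this illustration; the header format (`blob <size>\0`) is the one git uses for SHA-1 blob objects.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content.

    Git hashes a small header ("blob <size>" followed by a NUL byte)
    and then the raw bytes of the file.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two independent "clones" hashing the same bytes agree on the ID
# without ever talking to each other.
id_in_clone_a = git_blob_id(b"hello\n")
id_in_clone_b = git_blob_id(b"hello\n")
id_of_other_content = git_blob_id(b"hello, world\n")
```

A sequential number like "revision 42" can mean different things in two diverged branches; a content-derived ID cannot.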

timr · on Nov 22, 2023

Forgot about Bazaar!

dws · on Nov 21, 2023

Lightweight branches were a huge selling point. If you didn't do them often enough that they were rote, branches in RCS/CVS/SVN required ritual sacrifice.

reportingsjr · on Nov 21, 2023

Mercurial (aka hg) was also gaining popularity at the same time as git. The interface was a lot nicer and more sane than git, but it had some serious performance limitations that hampered it.

Both were definitely way better than SVN/CVS/etc.

dekhn · on Nov 21, 2023

Yes, after using git for a few years I was introduced to Mercurial and it was like a breath of fresh air, although I'm also told hg added a number of things that made it much more usable, "right before I started using it".

Since I have limited brain capacity I focus my efforts on being able to use git, not hg, merely because it has so much marketshare.

cpach · on Nov 21, 2023

Nit-pick: Did you mean CVS?

dekhn · on Nov 21, 2023

Yes, CVS.

xarope · on Nov 22, 2023

ah yes, I still remember the rcs ci / co workflow

dekhn · on Nov 22, 2023

I mean I learned a lot back in the rcs days- I worked on a small team with a few million lines of code under RCS in a shared FS. The devs taught me about social locking protocol- basically, talking with your coworkers about what parts of code not to work on, because you'd have your hands down in the guts.

One of my first real accomplishments was migrating that codebase from RCS to CVS- which was relatively easy as CVS used RCS under the hood.

ajross · on Nov 21, 2023

Git stepped into a source control ecosystem that was well-served (albeit contentious). People knew (or at least thought they knew) what they wanted from bk/svn/CVS/p4/rcs/sccs.

So git essentially was the "final form" that integrated all the various workflows and topped it off with a maximally-scaled use case (linux) that proved out the tool, drove innovation in integration/scripting/gadgetry, and provided a clear beacon for everyone else to adopt it. So it won.

But in 2004, if you asked around, everyone would have told you that a tool like this was coming at some point (even if they probably wouldn't have described it as very git-like!).

pgeorgi · on Nov 21, 2023

If you squint a little, https://web.archive.org/web/20030629114010/http://www.venge.... is a fair approximation of some of the core ideas behind git (and Linus played with it and wrote a critique of its short-comings before starting git)

mettamage · on Nov 21, 2023

That is so cool to see where Linus got some of his inspiration from. It made a few things more clear to me as to why git uses certain things.

jeffbee · on Nov 21, 2023

I think it is odd that the story mentions git at all. Git5, the mentioned wrapper around piper, had only a niche audience when I last used it 5 years ago, and it was a demonstrated fact that the users of it were less productive than perforce users. Whether that was causal or not was unknown.

hanwenn · on Nov 21, 2023

Hi, I'm the Han-Wen from the title.

The story mentions git because git5 got me into developer tooling. More in particular, it put me in touch with Shawn Pearce who ran the Git/Gerrit team at Google. When I went to work for him, Shawn wanted to have codesearch support in Gerrit, and Zoekt was ultimately the outcome of my explorations in this space.

IIRC, Git5 was deleted approximately 5 years ago because Fig (the Hg based replacement) had taken over all the use cases

dag11 · on Nov 22, 2023

Is Fig still kicking? I worked on it a bit as an intern back in 2014, but Mercurial has also been on its way out of mainstream support over the last few years. It was a really neat project!

hanwenn · on Nov 22, 2023

See https://www.youtube.com/watch?v=bx_LGilOuE4. I think the plan is to replace it with something based off JJ.

billllll · on Nov 21, 2023

I agree there doesn't seem to be a good connection between work on version control and work on code search.

However, I don't think it makes sense to downplay git5. Anecdotally, basically everybody knew about it, and I'd constantly run into people using it (which is by itself noteworthy since nobody was exactly talking about version control all the time).

Git5 was at the time the most robust solution to chain commits, which was tedious bordering on impossible without some tool. Without definitive data, I wouldn't say users were less productive with git5: it definitely was a useful tool that people at least recommended for chain commits. I was definitely more productive with it.

There were a lot of footguns though, and I do think the hg wrapper that superseded it was way better.

justrealist · on Nov 21, 2023

Oh yeah. I remember merging SVN branches into production in 2010 or so.

It was a... special time. Let's not reminisce.

frutiger · on Nov 21, 2023

I’m a bit confused as to how https://swtch.com/~rsc/regexp/regexp4.html isn’t mentioned at all.

beyang · on Nov 21, 2023

Zoekt was heavily inspired by Google's internal code search, as mentioned in the blog post. The original version of the internal code search is described in the rsc post. Zoekt keeps some of the foundational ideas (e.g., trigram index), but was a from-scratch implementation. We probably should link to the rsc post for completeness, will update.

hanwenn · on Nov 21, 2023

At the time that I started Zoekt (2016), Google's internal codesearch used suffix arrays for the string matching, which the team wasn't happy with, presumably because of the algorithmic complexity and indexing slowness. The Codesearch team was exploring alternatives, one of them the technique described in https://link.springer.com/article/10.1007/s11390-016-1618-6. The positional trigrams were a simplification of this, that they didn't mind me open sourcing.

So, in terms of algorithms, Zoekt wasn't actually inspired by Google's internal code search.

The precise query syntax of zoekt is mostly copied from google's internal syntax, though.

hanwenn · on Nov 21, 2023

Russ Cox' trigram approach uses document IDs for the posting list, which makes the index much smaller, but gives less precise (i.e. slower) matching. This is mentioned in the design doc at https://github.com/sourcegraph/zoekt/blob/main/doc/design.md....
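To make the trade-off above concrete, here is a toy sketch in Python (not Zoekt's Go code, and not Russ Cox's implementation). It builds both kinds of posting list over two made-up documents and compares the candidates each produces for a substring query. All names (`DOCS`, `build_doc_index`, `positional_candidates`, and so on) are invented for the illustration, and the "first and last trigram at the right distance" check is a simplified reading of the positional idea.

```python
from collections import defaultdict

# Hypothetical corpus: "miss.txt" contains every trigram of "abcd"
# but never the string "abcd" itself.
DOCS = {
    "hit.txt": "xxabcdxx",
    "miss.txt": "abc_bcd",
}

def trigrams(text):
    """Yield (offset, trigram) for every 3-character window of text."""
    for i in range(len(text) - 2):
        yield i, text[i:i + 3]

def build_doc_index(docs):
    """Trigram -> set of document IDs (small index, coarse answers)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def build_positional_index(docs):
    """Trigram -> set of (doc_id, offset) pairs (bigger index, precise answers)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for offset, tri in trigrams(text):
            index[tri].add((doc_id, offset))
    return index

def doc_candidates(index, query):
    """Documents that contain every trigram of the query somewhere."""
    postings = [index.get(tri, set()) for _, tri in trigrams(query)]
    return set.intersection(*postings) if postings else set()

def positional_candidates(index, query):
    """(doc_id, offset) pairs where the query's first and last trigrams
    occur exactly as far apart as they are in the query."""
    if len(query) < 3:
        return set()
    first, last = query[:3], query[-3:]
    gap = len(query) - 3
    last_postings = index.get(last, set())
    return {(doc, off) for doc, off in index.get(first, set())
            if (doc, off + gap) in last_postings}

def confirm(docs, doc_id, offset, query):
    """Final check against the document text; candidates are only hints."""
    return docs[doc_id].startswith(query, offset)

doc_index = build_doc_index(DOCS)
pos_index = build_positional_index(DOCS)
coarse = doc_candidates(doc_index, "abcd")
precise = positional_candidates(pos_index, "abcd")
```

With document-ID postings, both files come back as candidates and each must be scanned in full. With positional postings, only one (document, offset) pair survives, and `confirm` needs to look at just that spot. The price is storing an offset for every trigram occurrence.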

IshKebab · on Nov 21, 2023

It is mentioned.

frutiger · on Nov 21, 2023

It is now, but wasn’t earlier.

tromp · on Nov 21, 2023

Wondering how this tool got named after the Dutch verb for seek, I found this quote on its github page [1].

> "Zoekt, en gij zult spinazie eten" - Jan Eertink

> ("seek, and ye shall eat spinach" - My primary school teacher)

[1] https://github.com/sourcegraph/zoekt

consp · on Nov 22, 2023

Note, "zoekt" is the imperative, which makes sense asking a tool to do something.

est · on Nov 22, 2023

author https://github.com/hanwen

jeffbee · on Nov 21, 2023

Is Zoekt actually in use at Google, and if so, how does it relate to Kythe? I know the Zoekt instance for Bazel exists, but the Kythe index also exists (https://cs.opensource.google/bazel/bazel)

dmoy · on Nov 21, 2023

It has nothing to do with Kythe.

I'm on the Kythe team, and I don't know off the top of my head what Zoekt is. Looking it up, I see it's some sort of trigram search, which means if it's used at all (I have no idea), it's codesearch proper, not Kythe.

The Kythe index is the semantic index of the codebase, Codesearch does all of the text/regex/etc searching.
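A toy sketch of that split, in Python. It does not reproduce Kythe's schema or Codesearch's internals: the three "files", the edge kinds `defines` and `ref`, and the symbol names are all made up for the example. The point is only that a regex search matches text, while a semantic graph can tell two same-named functions apart.

```python
import re

# Hypothetical mini codebase: two unrelated functions are both called `close`.
FILES = {
    "net.py": "def close(sock):\n    sock.shutdown()\n",
    "db.py": "def close(conn):\n    conn.commit()\n",
    "main.py": "import net\nnet.close(s)\n",
}

def text_search(pattern):
    """Text/regex search: return (file, line number) for every matching line."""
    hits = []
    for name, text in sorted(FILES.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if re.search(pattern, line):
                hits.append((name, lineno))
    return hits

# Toy semantic graph, written by hand here. A build-aware indexer would emit
# edges like these because it knows which `close` each call resolves to.
# Each edge is ((file, line), edge kind, fully qualified symbol).
EDGES = [
    (("net.py", 1), "defines", "net.close"),
    (("db.py", 1), "defines", "db.close"),
    (("main.py", 2), "ref", "net.close"),
]

def find_definition(symbol):
    """Jump to definition: locations with a `defines` edge to the symbol."""
    return [loc for loc, kind, target in EDGES
            if kind == "defines" and target == symbol]

def find_references(symbol):
    """Find references: locations with a `ref` edge to the symbol."""
    return [loc for loc, kind, target in EDGES
            if kind == "ref" and target == symbol]

text_hits = text_search(r"\bclose\b")
net_refs = find_references("net.close")
db_refs = find_references("db.close")
```

The text search returns all three lines that mention `close`, including the unrelated one in `db.py`. The graph lookup returns only the call in `main.py` for `net.close`, and nothing for `db.close`.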

sluongng · on Nov 21, 2023

Are you sure? There is find definition and references in https://cs.opensource.google/bazel/bazel and I'm quite sure it's thanks to the Kythe indexing job Bazel team is running in CI.

dmoy · on Nov 21, 2023

The refs & jump to def in bazel/bazel are using Kythe, yes. But that is Kythe's semantic index from running (also it's Kythe team running it, not bazel team). It's not the Codesearch trigram/text search (which again, I have no idea if it uses zoekt).

hanwenn · on Nov 21, 2023

Not in use that I know of.

IshKebab · on Nov 21, 2023

The same algorithm is also used in Hound (https://github.com/hound-search/hound) though I have to say the best implementation of code search by far that I've seen is https://grep.app

You really should check it out if you haven't already. It's incredibly useful; I used it all the time. Not open source though.