I think we are supposed to be impressed, but my first reaction to the headline was “Oh God, that engineering department must be a complete disaster”. If a deployment takes 20 seconds, that’s 2 hours per day spent deploying!
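Back-of-the-envelope, assuming the deploys are serialized and spread evenly over 365 days:

    125,000 deploys × 20 s = 2,500,000 s ≈ 694 hours/year ≈ 1.9 hours/day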
A few extra minutes spent planning and thinking could really cut the number of deployments down to something much more manageable.
The DevOps mantra is that you shouldn't be trying to manage deployments at all, except in aggregate. They should be seamless enough to become non-events that happen frequently and with maximal automation. Time spent doing deploys becomes irrelevant since it's a hands-off, low-risk process.
The DevOps philosophy holds that getting your code and infra to that point pays off twice: first-order benefits to code and infra quality, since you're demanding more from them, and second-order benefits to the business from releasing many times per day and going from ticket to prod quickly.
Under this philosophy, 125k annual deployments suggests the engineering department is likely exemplary rather than disastrous, since only an exemplary department could pull this off without frequent or severe mistakes damaging the business.
It's not totally clear from reading this whether it's actually referring to production deployments. The article says so, but then devotes a fair amount of space to explaining how they developed a subtyping system for clusters that lets them deploy multiple development clusters which individual teams can play with without stepping on each other. I understand why you would update those 125,000 times a year: it's your developer feedback loop, the same as compile/test/rewrite on a single machine but for distributed systems development. I have to do the same thing. But that isn't a production deployment.
I'm really curious what could possibly change frequently enough to necessitate that many production deployments. Something I noticed while trying to work out why my homelab network is constantly getting contention throttling when I run a minimal local cluster: a FluxCD HelmRelease whose source is a GitRepository rather than a HelmRepository pulls in changes on every reconcile, even when they aren't changes to the chart and don't do anything to the cluster. You need to be really disciplined about having a git repo be a Helm chart and only a Helm chart. Otherwise you pull in documentation updates, typo fixes, and all kinds of stuff that doesn't actually change your deployment state but kills your network anyway.
One of the downsides of a polling-based GitOps deployment model.
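For anyone who hasn't run into this, here's a minimal sketch of the two source types (names and URLs are made up, and the API versions may differ depending on your Flux release). With a GitRepository source, any commit to the repo produces a new artifact for the HelmRelease to reconcile against; with a HelmRepository, only a newly published chart version does.

    # Git-backed source: every commit to the repo (docs, CI tweaks, typo fixes)
    # produces a new artifact that gets fetched on each reconcile interval.
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: GitRepository
    metadata:
      name: my-chart-git          # hypothetical name
    spec:
      interval: 1m
      url: https://example.com/my-org/my-chart.git
      ref:
        branch: main
    ---
    # Helm repository source: only a newly published chart version changes anything.
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: HelmRepository
    metadata:
      name: my-charts             # hypothetical name
    spec:
      interval: 1m
      url: https://example.com/charts
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    metadata:
      name: my-app
    spec:
      interval: 5m
      chart:
        spec:
          chart: my-chart         # chart name in the Helm repo (or a path if sourced from git)
          sourceRef:
            kind: HelmRepository  # switch to GitRepository and you get the polling churn described above
            name: my-charts

Either keeping the git repo down to just the chart, or publishing the chart to a Helm repository, avoids reconciling on every unrelated commit.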
Does anyone know why Altoros is publishing a compilation of Airbnb's engineering talks and blog posts? I'm curious whether there's a connection between Altoros, which appears to be a Kubernetes consultancy, and Airbnb.
This is impressive but I still believe Kubernetes is overkill for most organizations. Not everyone is running a Google / Netflix / Airbnb. Introducing complexity for no reason other than “Well, Airbnb is doing it” (or shiny-object syndrome) has devastating effects long term. When things go wrong (and things will go wrong), fixing and cleaning up the mess tends to be more costly, in both time and money.
If you think deploying often (not 125k times per year), load balancing, and a fully declarative operational process built on what is essentially an industry standard is overkill, then yeah, to some it might be. But Kubernetes is a standard, and if you know it you might as well use it, even at smaller companies that don't really need its brawny features.
In the end, there's nothing guaranteeing that you'd save a lot of money by throwing it out of the equation.