I think we are supposed to be impressed, but my first reaction to the headline was “Oh God, that engineering department must be a complete disaster”. If a deployment takes 20 seconds, that’s 2 hours per day spent deploying!
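Back-of-the-envelope, assuming the deploys are serialized and spread evenly over 365 days:

    125,000 deploys × 20 s = 2,500,000 s ≈ 694 hours/year ≈ 1.9 hours/day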
A few extra minutes spent planning and thinking could really cut the number of deployments down to something much more manageable.
The DevOps mantra is that you shouldn't be trying to manage deployments at all, except in aggregate. They should be seamless enough to become non-events that happen frequently and with maximal automation. Time spent doing deploys becomes irrelevant since it's a hands-off, low-risk process.
The DevOps philosophy holds that getting your code and infra to that point pays off twice: first-order benefits to code and infra quality, since you're demanding more from them, and second-order benefits to the business from releasing many times per day and going from ticket to prod quickly.
Under this philosophy, 125k annual deployments suggests the engineering department is likely exemplary rather than disastrous, since only an exemplary department could pull this off without frequent or severe mistakes damaging the business.
It's not totally clear from reading this whether it's actually referring to production deployments. The article says so, but then devotes a fair amount of space to explaining how they developed a subtyping system for clusters that lets them deploy multiple development clusters which individual teams can play with without stepping on each other. I understand why you would update those 125,000 times a year: it's your developer feedback loop, the same as compile/test/rewrite on a single machine but for distributed systems development. I have to do the same thing. But that isn't a production deployment.
I'm really curious what could possibly change frequently enough to necessitate that many production deployments. Something I noticed while trying to work out why my homelab network is constantly getting contention throttling when I run a minimal local cluster: a FluxCD HelmRelease whose source is a GitRepository rather than a HelmRepository pulls in changes on every reconcile, even when they aren't changes to the chart and don't do anything to the cluster. You need to be really disciplined about having a git repo be a Helm chart and only a Helm chart. Otherwise you pull in documentation updates, typo fixes, and all kinds of stuff that doesn't actually change your deployment state but kills your network anyway.
One of the downsides of a polling-based GitOps deployment model.
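For anyone who hasn't run into this, here's a minimal sketch of the two source types (names and URLs are made up, and the API versions may differ depending on your Flux release). With a GitRepository source, any commit to the repo produces a new artifact for the HelmRelease to reconcile against; with a HelmRepository, only a newly published chart version does.

    # Git-backed source: every commit to the repo (docs, CI tweaks, typo fixes)
    # produces a new artifact that gets fetched on each reconcile interval.
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: GitRepository
    metadata:
      name: my-chart-git          # hypothetical name
    spec:
      interval: 1m
      url: https://example.com/my-org/my-chart.git
      ref:
        branch: main
    ---
    # Helm repository source: only a newly published chart version changes anything.
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: HelmRepository
    metadata:
      name: my-charts             # hypothetical name
    spec:
      interval: 1m
      url: https://example.com/charts
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    metadata:
      name: my-app
    spec:
      interval: 5m
      chart:
        spec:
          chart: my-chart         # chart name in the Helm repo (or a path if sourced from git)
          sourceRef:
            kind: HelmRepository  # switch to GitRepository and you get the polling churn described above
            name: my-charts

Either keeping the git repo down to just the chart, or publishing the chart to a Helm repository, avoids reconciling on every unrelated commit.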
Does anyone know why Altoros is publishing a compilation of Airbnb's engineering talks and blog posts? I'm curious whether there's a connection between Altoros, which appears to be a Kubernetes consultancy, and Airbnb.
This is impressive but I still believe Kubernetes is overkill for most organizations. Not everyone is running a Google / Netflix / Airbnb. Introducing complexity for no reason other than “Well, Airbnb is doing it” (or shiny-object syndrome) has devastating effects long term. When things go wrong (and things will go wrong), fixing and cleaning up the mess tends to be more costly, in both time and money.
If you think deploying often (not 125k times per year), load balancing, and a fully declarative operational process built on what is essentially an industry standard is overkill, then yeah, to some it might be. But Kubernetes is a standard, and if you know it you might as well use it, even at smaller companies that don't really need its brawny features.
In the end, there's nothing guaranteeing that you'd save a lot of money by throwing it out of the equation.