Openly, Inc | Backend Engineer (mostly Go) | Ann Arbor, Boston | Remote (US) or Onsite, Full Time | Salary, Benefits, Equity | https://openlyinsured.com
We’re currently a 7-person insurtech startup with 3 engineers: me in Ann Arbor, the CTO (one of the cofounders) in Boston, and someone who works remotely from Southern California. We’re building an insurance company that uses technology from this century to make insurance agents’ lives easier. We're obviously a small team, so you'd be able to have a large impact quickly.
We intend to stay a remote-first team, but we're also trying to start a small Ann Arbor office if we can find the right person to join me here. Having a physical presence in Ann Arbor will let us hire interns/juniors and mentor them, which is more difficult to do remotely.
RBAC itself is enabled by default on new GKE clusters as far as I know, and has been enabled on ours for months. We're a small team, so we haven't tried to do anything particularly complex with RBAC inside the cluster, but I can't say I've run into anything particularly troublesome.
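For context, the "nothing complex" end of RBAC is just a namespaced Role plus a RoleBinding. A minimal sketch (all names here are made up for illustration, not from our cluster):

```yaml
# Grant one user read-only access to pods in a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: read-pods
subjects:
- kind: User
  name: jane@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```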
One thing I didn't put in here that's also turned out to be useful: We can prerelease things relatively easily this way too. Each deployment has a git sha, and we can have a canary/beta/dogfood version that points at an entirely different sha.
In our case it's not the websockets that are the problem, it's the XMPP connection that each websocket connection creates. Logging in thousands of users takes several minutes, and while a user reconnects, any conversations they're having with their website visitors are disrupted.
Yeah, I think if restartPolicy were changeable at runtime, we could simply have the pods exit once their connections are drained enough. If we were to exit under the current strategy, they'd just be restarted by Kubernetes.
Yeah, I don't know why the terminationGracePeriodSeconds hacks didn't work. It could have been a different, unrelated factor that we didn't discover — it could well have been service-loadbalancer/haproxy's fault rather than the termination grace period itself. I'm certainly happy to be proven wrong there.
Not 100% sure about your scenario, but if you set a preStop hook with an exec handler, you can arbitrarily delay shutdown inside the grace period, because the kubelet won’t terminate the container until preStop returns.
So if you set a 5-hour grace period, and a preStop hook that invokes a script that doesn’t return until all connections are closed (but which tells the container process to stop accepting new ones), you can control the drain rate.
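Concretely, that combination might look like this in a pod spec (the script path, image, and numbers are illustrative, not from this thread):

```yaml
# Sketch: long grace period plus a preStop exec hook that blocks until drained.
# /drain.sh is a hypothetical script that tells the app to stop accepting new
# connections, then exits only once the existing ones have closed.
spec:
  terminationGracePeriodSeconds: 18000   # 5 hours
  containers:
  - name: app
    image: example/app:latest
    lifecycle:
      preStop:
        exec:
          command: ["/drain.sh"]
```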
There are some app-level smarts required — having new connections rejected and having any proxies rebalance you. HAProxy does this in most cases, but the service proxy won’t (in iptables mode).
If that’s not the behavior you’re seeing, please open a bug on Kube and assign me (this is something I maintain).
Yeah I think that there is still some potential in the terminationGracePeriod strategy, but we found this other way that worked reliably and stopped exploring that path. If I can repro the issue I'll let you know.
One extra thing I remember being somewhat problematic: when a pod was Terminating, it was removed from the Endpoints object, so any tooling that relied on the API info to keep an eye on connections was basically unusable at that point.
We force reconnects eventually, there just aren't that many people affected at that point. There's a very long tail of people keeping their browsers open for days, but it's only a handful of people.
We actually have this functionality as well: we can send a signal to the process, which causes it to display a message asking the user to reload to upgrade. I've seen similar features in Slack and Riot as well.
Making this handoff automatic is definitely possible as well, though we do want people to reload occasionally to get new client-side code.
The way this Rainbow Deploy works, if I only deploy once a month, I run a single "color" for that whole month, plus a couple of days of overlap where there are 2. If I have blue/green, I have 2 colors running all month. If I have more fixed colors, even more. The sha thing is just a convenient way of creating "colors" dynamically whenever we need to do a new deploy, without having to use a meaningless representation like "blue", "green", "taupe", "chartreuse", etc.
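As a sketch, the moving parts are just a Deployment whose name and pod labels embed the sha, plus a Service whose selector points at whichever sha is live. All names and shas below are illustrative:

```yaml
# Hypothetical manifests for a sha-based "rainbow" deploy: a new deploy
# creates chat-abc1234 alongside the previous deployment, and flipping the
# Service selector sends new connections to the new sha while the old pods
# keep serving their existing connections until they drain.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-abc1234
spec:
  replicas: 3
  selector:
    matchLabels:
      app: chat
      sha: abc1234
  template:
    metadata:
      labels:
        app: chat
        sha: abc1234
    spec:
      containers:
      - name: chat
        image: example/chat:abc1234
---
apiVersion: v1
kind: Service
metadata:
  name: chat
spec:
  selector:
    app: chat
    sha: abc1234   # flip this to cut new traffic over to a new deploy
  ports:
  - port: 80
    targetPort: 8080
```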