
When I'm in charge of an on-call rotation I always try to make it very clear that this is not the expectation.

In my preferred model of on-call, you have a primary, then after 5min an escalation to secondary, then after 5min an escalation to something drastic (sometimes "everyone", sometimes a manager).
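A rough sketch of that escalation chain in Python, for illustration only. The names `EscalationStep`, `POLICY`, `paged_targets`, and `time_to_first_response` are made up here and don't correspond to any real paging tool; the 5-minute steps come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str         # who gets paged at this step
    after_minutes: int  # minutes since the alert fired


# Hypothetical policy: primary at once, secondary after 5 min,
# something drastic ("everyone" here) after another 5 min.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 5),
    EscalationStep("everyone", 10),
]


def paged_targets(minutes_unacked: float, policy=POLICY) -> list[str]:
    """Everyone who has been paged once the alert has gone unacked this long."""
    return [s.target for s in policy if minutes_unacked >= s.after_minutes]


def time_to_first_response(ack_delays: dict, policy=POLICY):
    """Minutes from alert to first ack.

    ack_delays maps target -> minutes that target needs to ack after
    being paged, or None if they never ack. Returns None if nobody acks.
    """
    times = [
        s.after_minutes + ack_delays[s.target]
        for s in policy
        if ack_delays.get(s.target) is not None
    ]
    return min(times) if times else None
```

So if the primary is at a movie and never acks while the secondary acks a minute after being paged, `time_to_first_response({"primary": None, "secondary": 1, "everyone": 0})` gives 6 minutes: the occasional 5-minute delay plus the secondary's own reaction time.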

The expectation is that most of the time you should be able to respond within 5 minutes, but if you can't then that's what the secondary role is for - to catch you. This means it's perfectly acceptable to go for a run, go to a movie, etc.

You relax the responsibility on the individual and let a sensible amount of redundancy solve the problem instead. Everyone is less stressed, and sure you get the occasional 5min delay in response but I'm willing to bet that the overall MTTR is lower since people are well rested and happier to be on call to begin with.



We have a primary/backup setup and I would be pretty pissed if my primary just started going out for movies or a date night during their shift tbh. My job as a backup is to be there for unexpected events, i.e. they did not wake up or had an accident. Not to be effectively on call 2 weeks in a row just because the primary doesn't take it seriously.


Yeah, going for a run or a dinner where you might be able to ack but not actually at keys for 10-20 minutes is one thing. Going to a movie or date where you might not even ack and won't be at keys for hours? Not cool at all.


I don’t see how this changes the problem where there is an expected guarantee of a rapid response, except that now two people are expected to be available and need to coordinate directly so that one person’s swim doesn’t interfere with the other’s WoW raid.


That's more or less what my team does. It works well. At least much better than saying you can't go for a swim at all.


I guess to me that seems worse because that’d effectively double the amount of off-hours accountability per teammate. Not only do you need to be first on call for your primary hours, severely restricting the quality of your “free time”, but now you ALSO have to be secondary on call for that irresponsible coworker who goes afk for 2 hours without properly communicating, dipping twice into your actual free time.


Out of 168 hours in a week, there are maybe up to 8 where I want to do something that interferes with being oncall. There's no real downside to being oncall for the other 160 hours. But I would get a lot of disutility from losing my freedom during those 8.


This is pretty much how it should be done. If the business demands more, they should have a properly manned 24x7 NOC.

You also need *ownership*. There is nothing worse than having to support somebody else's work and not being allowed (either via time or other restrictions) to do things "right" so that you're not always paged for fixable problems. Everywhere I've worked where the techs had ownership (which varied from ops people being allowed to override the backlog to fix issues, to developers being given enough free rein to fix technical debt), oncall has barely been an issue. At my current gig I often forget I'm even on call at all, and the main issues that do crop up are usually external.


Almost all the reliability issues I encounter are due to constraints imposed by people who don't have to deal with on-call.

Things like running in AWS but having to use a custom K8s install so that they aren't dependent on AWS.

Using self managed Kafka so that you aren't dependent on proprietary tech.

It all sucks because they are always less reliable and generate their own errors and noise for on-calls.

If they had to deal with phone calls every time there's a firewall issue that had absolutely nothing to do with the application, they would soon change their tune.


So it takes 10 min until you've gone to the drastic solution? With this time-frame it would be risky to go to the bathroom, let alone go to a movie. Also even the backup sounds like a primary in this scenario.


Sure, but the assumption here is that primary and backup (edit: probably, i.e. they're not coordinating this) aren't going to the bathroom at the same time. It's also based on the idea that alerts are extremely rare to begin with. If you're expecting at least one page every rotation, that's way, way too often. Step one is to get alerts under control, step two is a sane on-call rotation.


We want to ack within five minutes, and be at a laptop within 30. So long as I'm within mobile signal when the page goes off, it doesn't really matter what I'm doing — an ack is a button press on a push notification. And I can stay within 30 minutes of my laptop and an Internet connection by carrying said laptop and my phone (with "unlimited" data).

If the primary (paid) on-call doesn't catch the notification, the secondary (unpaid) will be paged. And so on, down a couple more steps, to a senior manager. There's no expectation that anyone other than the primary would actually be available to ack the alert.


Having the primary/secondary rotation is arguably worse. In that model, from the perspective of any one participant, now they're on-call for two weeks each time around instead of one.
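To put a number on that, here is a back-of-the-envelope sketch. It assumes an even rotation with one-week shifts where nobody holds two roles in the same week; `on_duty_fraction` and the team size of 6 are made up for illustration.

```python
from fractions import Fraction


def on_duty_fraction(team_size: int, roles_per_week: int) -> Fraction:
    """Fraction of weeks a given person holds some on-call role.

    Assumes an even rotation, one-week shifts, and that no one holds
    two roles in the same week (all assumptions, not from the thread).
    """
    if team_size <= 0 or roles_per_week <= 0:
        raise ValueError("team_size and roles_per_week must be positive")
    if roles_per_week > team_size:
        raise ValueError("not enough people to fill every role")
    return Fraction(roles_per_week, team_size)


# Hypothetical team of 6.
primary_only = on_duty_fraction(6, 1)           # 1 week in 6
primary_and_secondary = on_duty_fraction(6, 2)  # 2 weeks in 6
```

The count of weeks with a pager doubles, as the comment says. What this sketch can't capture is how restrictive a secondary week actually is compared to a primary week, which is the part the thread disagrees on.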


> The expectation is that most of the time you should be able to respond within 5 minutes

That's an unreasonable expectation unless it's clearly said in writing and is billable hours.



