
I'm going through something like this as a SWE at a startup. There's a lot of noise in our alerts and logging, so alert fatigue is a real problem. Do you have any advice on navigating this scenario (esp. negotiating with product to get monitoring and ops into a usable state)?


Sure, just give your manager a copy of the bible: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...


Thanks, this was a very enlightening read. Getting product on board with the labor involved in implementing this is going to be a different story though.


Another good piece about negotiating with Product is written up here:

https://sre.google/sre-book/introduction/#:~:text=Pursuing%2...

Ultimately, it's Product's job to decide how they want to balance reliability and feature-shipping speed. Work with them to define an SLO (for example: in 99.995% of five-minute timeslices of any given month, 99% of all queries will complete within 250 ms) and then graph how well you're doing at hitting it.

If you're failing to keep things above that line, Product either needs to accept lower reliability standards or invest engineering time in improving reliability. Again, it's Product's call to make. If they do want to invest in reliability, though, that's when you get to present your wish list, work out an agreement on its ranking, and find time to get the work done, even if it means slowing down the rate at which new features are shipped.
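
To make the SLO arithmetic concrete, here is a rough Python sketch of how you might score a month of latency data against the example target above. The names (SloTarget, slice_is_good, slo_report) and the rule that an empty timeslice counts as good are my own assumptions for illustration, not anything from the SRE book; in practice you would compute this from whatever your metrics system exports.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass
class SloTarget:
    """Hypothetical container for the example SLO from the comment above."""
    latency_ms: float = 250.0         # per-query latency threshold
    query_fraction: float = 0.99      # share of queries in a slice that must be fast
    slice_fraction: float = 0.99995   # share of slices in the month that must be good


def slice_is_good(latencies_ms: Sequence[float], target: SloTarget) -> bool:
    """A five-minute slice is good if enough of its queries met the threshold."""
    if not latencies_ms:
        # Assumption: a slice with no traffic counts as good.
        return True
    fast = sum(1 for ms in latencies_ms if ms <= target.latency_ms)
    return fast / len(latencies_ms) >= target.query_fraction


def slo_report(
    slices: Sequence[Sequence[float]], target: SloTarget
) -> Dict[str, Union[int, float, bool]]:
    """Summarize how many slices were good and whether the SLO was met."""
    total = len(slices)
    good = sum(1 for s in slices if slice_is_good(s, target))
    compliance = good / total if total else 1.0
    return {
        "good": good,
        "total": total,
        "compliance": compliance,
        # The error budget, expressed as a number of bad slices.
        "allowed_bad_slices": total * (1 - target.slice_fraction),
        "met": compliance >= target.slice_fraction,
    }


if __name__ == "__main__":
    # A 30-day month has 30 * 24 * 12 = 8640 five-minute slices.
    healthy = [[100.0] * 100 for _ in range(8639)]
    one_bad = [[100.0] * 98 + [300.0] * 2]  # only 98% of queries were fast
    print(slo_report(healthy + one_bad, SloTarget()))
```

Worth noticing when you run the numbers: with 8640 slices in a 30-day month, 99.995% leaves a budget of about 0.43 bad slices, so a single bad five-minute window already misses that particular target. That is exactly the kind of thing to put in front of Product when you negotiate the number.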


You may have luck if you frame it in terms of an investment. Spend the time now to fix your alerts, add playbooks, and improve process, because you immediately start enjoying the benefits. Less time spent on support means higher velocity. The longer you wait, the more engineering time you've wasted. It just takes a little patience up front, as well as product and engineering collaborating.



