> All of this is done at Google - there's extensive monitoring for all production systems, with alerts firing once parameters move outside of their statistical "normal" range. In practice the range tends to get set more by "How tight can we make it before we get really annoyed by these pagers?" than any rigorous statistical method, but the run-time operation carefully measures average & extreme values and determines when a spike is just a momentary anomaly vs. when it's worth paging someone.
Really, "All"?
First question: 'parameters'. Usually, and likely at Google, they are monitoring just one parameter at a time. That is a really big bummer. Even if you do this for several parameters, you are forced into assuming that the 'healthy and well' geometry of the parameters is just a rectangle, one interval per parameter, and that assumption hurts the combination of false alarm rate and detection rate.
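To make the rectangle point concrete, here is a minimal sketch in Python; the metric names, the numbers, and the Mahalanobis-distance score are all made up for illustration, not anything Google or anyone else is claimed to run. Two correlated metrics each stay inside their individual 'normal range', so per-metric monitoring sees nothing, while the pair together is far from anything healthy:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical healthy history: CPU utilization and request latency
    # move together (more load, more latency).
    n = 10_000
    cpu = rng.normal(50, 10, n)                 # percent
    latency = 2.0 * cpu + rng.normal(0, 5, n)   # ms, tracks CPU
    healthy = np.column_stack([cpu, latency])

    # Per-metric "rectangle": each metric gets its own interval.
    lo, hi = np.percentile(healthy, [0.5, 99.5], axis=0)

    # Joint view: Mahalanobis distance from the healthy cloud.
    mean = healthy.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))

    def score(x):
        d = x - mean
        return float(np.sqrt(d @ cov_inv @ d))

    # A sick reading: low CPU yet high latency.  Each value is inside its
    # own interval, so the rectangle raises no alarm ...
    sick = np.array([35.0, 130.0])
    print("inside rectangle:", bool(np.all((lo <= sick) & (sick <= hi))))

    # ... but jointly it is far outside the healthy cloud.
    healthy_scores = np.array([score(p) for p in healthy[:1000]])
    print("sick score:", round(score(sick), 1),
          "healthy 99.5th percentile:", round(np.percentile(healthy_scores, 99.5), 1))

The point is not Mahalanobis distance in particular; it is that the alarm region is built from the joint geometry of the metrics instead of one interval per metric.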
Next, the "normal range" is a bummer because it assumes that what is healthy and well is just a 'range', that is, an interval, and this is not always the case. The result, again, is a poor combination of false alarm rate and detection rate.
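A similar toy illustration of the interval point, again with made-up numbers: suppose a latency metric is bimodal when healthy, a fast cache-hit mode and a slow cache-miss mode. The healthy set is two separated intervals, so any single [lo, hi] range has to swallow the gap between the modes, and a value in that gap, which healthy traffic essentially never produces, sails right through:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical bimodal healthy latency: about 20% slow cache misses.
    fast = rng.normal(5, 0.5, 8_000)    # ms, cache hit
    slow = rng.normal(40, 3.0, 2_000)   # ms, cache miss
    healthy = np.concatenate([fast, slow])

    # A single "normal range" must span both modes ...
    lo, hi = np.percentile(healthy, [0.5, 99.5])
    print(f"single range: [{lo:.1f}, {hi:.1f}] ms")

    # ... so 20 ms, sitting in the empty gap between the modes, passes.
    suspect = 20.0
    print("inside single range:", lo <= suspect <= hi)
    print("healthy points within 5 ms of it:",
          int(np.sum(np.abs(healthy - suspect) < 5)))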
Again, as I wrote very clearly, to do well this just must be multi-dimensional. I doubt that there is so much as a single large server farm or network in the world doing anything at all serious with multidimensional data for monitoring. Not one.
Next, your remark about false alarm rate points to a problem the method solves: with meager assumptions, it permits knowing the false alarm rate in advance and setting it, actually setting it exactly.
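Whatever the specific method, the property itself is attainable with meager assumptions; plain distribution-free order statistics already give it. If a new healthy anomaly score is exchangeable with n healthy reference scores (continuous, so ties have probability zero), then alarming whenever the new score exceeds the k-th largest reference score has false alarm probability exactly k/(n+1), no matter what the score's distribution is. A sketch, with exponential draws standing in for whatever multidimensional score one actually computes:

    import numpy as np

    rng = np.random.default_rng(2)

    # Each measurement is reduced to one anomaly score, larger = more
    # suspicious.  Its distribution under healthy operation is unknown,
    # and it does not need to be known.
    n = 199              # healthy reference scores per monitoring window
    k = 10               # alarm when a new score beats the k-th largest of them
    alpha = k / (n + 1)  # exact false alarm probability, by exchangeability alone
    print(f"designed false alarm rate: {alpha:.3f}")

    # Empirical check: many independent windows, each with a fresh healthy
    # reference sample and one new healthy score.
    trials = 20_000
    reference = rng.exponential(1.0, size=(trials, n))
    new_score = rng.exponential(1.0, size=trials)
    thresholds = np.sort(reference, axis=1)[:, -k]   # k-th largest per window
    print(f"observed false alarm rate: {np.mean(new_score > thresholds):.3f}")

To tighten or loosen the rate, just change k and n; no distributional model of the scores is needed.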
For how many and which variables to monitor, yes, that is a question that can need some overview of the server farm or network and some judgment, but there is some analytical work that should help.
For "rigorous" statistics, the point is not 'rigor' but useful power. Being multidimensional, knowing the false alarm rate, and being able to adjust it, etc., are powerful.