Hacker News

On a slightly deeper level, standard deviation only has meaning if the distribution is presumed to be normal

Are you completely sure about that?

I suppose many readers of this thread are more knowledgeable about statistics than I am. I would appreciate hearing from the knowledgeable readers whether or not variance in the observed values makes a difference in the cases discussed in the submitted article.



The statement "only has meaning if the distribution is presumed to be normal" is wrong. The SD is a summary of the spread of a distribution. In fact, for most centrally concentrated distributions (including a uniform one) +/- 1 sigma corresponds to about 60% of the mass of the distribution. This is an amazingly useful thing to know.
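For what it's worth, that claim is easy to check numerically. The normal fraction is exactly erf(1/√2) ≈ 0.683, and the uniform fraction works out to 1/√3 ≈ 0.577, which a quick simulation confirms (a stdlib-only sketch, not from the article):

```python
import math
import random
import statistics

# Mass within +/- 1 sigma of the mean, for two distributions.

# Standard normal: analytic, P(|Z| <= 1) = erf(1/sqrt(2))
normal_frac = math.erf(1 / math.sqrt(2))

# Uniform on [0, 1): estimate by simulation
random.seed(0)
xs = [random.random() for _ in range(100_000)]
mu = statistics.fmean(xs)
sd = statistics.pstdev(xs)
uniform_frac = sum(mu - sd <= x <= mu + sd for x in xs) / len(xs)

print(f"normal:  {normal_frac:.3f}")   # ~0.683
print(f"uniform: {uniform_frac:.3f}")  # ~0.577
```

Both land in the neighborhood of 60%, which is the point: the SD gives you a usable spread estimate even when the distribution is nowhere near normal.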

As the above trivia factoid points out, the standard deviation is an important summary statistic. More interestingly, by using the mean, variance (or SD), skew, and kurtosis, you can describe almost any centrally concentrated distribution, even distributions with heavy tails.
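To illustrate the four-moment summary, here is a toy sketch (the `moments` helper is my own, not anything from the thread) comparing a Gaussian sample with an exponential one, whose theoretical skew is 2 and excess kurtosis is 6:

```python
import random
import statistics

def moments(xs):
    """Mean, SD, skew, and excess kurtosis of a sample."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    skew = statistics.fmean(((x - mu) / sd) ** 3 for x in xs)
    kurt = statistics.fmean(((x - mu) / sd) ** 4 for x in xs) - 3.0
    return mu, sd, skew, kurt

random.seed(1)
gauss_m = moments([random.gauss(0, 1) for _ in range(50_000)])
expo_m = moments([random.expovariate(1) for _ in range(50_000)])

print("gaussian:   ", gauss_m)  # skew and excess kurtosis near 0
print("exponential:", expo_m)   # skewed, heavy-tailed (theory: 2 and 6)
```

Two numbers (mean, SD) already tell you a lot; four tell you whether the tails should worry you.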

I think what the OP meant is that most 3+ sigma results are not truly 3+ sigma, because most distributions in this world are not Gaussian, but instead have large wings. The SD is most useful when you know what the underlying distribution is. Currently it's more in fashion to communicate spread using confidence intervals, because they presume less about the underlying distribution.
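To put a rough number on the "large wings" point: take a Laplace distribution (only mildly heavy-tailed) scaled to the same SD as a standard normal. Its 3-sigma tail is about five times fatter, so "3-sigma events" show up five times as often as the Gaussian calculation predicts (a back-of-the-envelope sketch, not from the thread):

```python
import math

# Probability of a 3+ sigma event under two distributions with equal SD.

# Standard normal: P(|Z| > 3) = erfc(3/sqrt(2))
normal_tail = math.erfc(3 / math.sqrt(2))

# Laplace(0, b): SD = b*sqrt(2), and P(|X| > t) = exp(-t/b),
# so P(|X| > 3 sigma) = exp(-3*sqrt(2))
laplace_tail = math.exp(-3 * math.sqrt(2))

print(f"normal 3-sigma tail:  {normal_tail:.4f}")   # ~0.0027
print(f"laplace 3-sigma tail: {laplace_tail:.4f}")  # ~0.0144
```

For genuinely fat-tailed distributions (power laws, for instance) the discrepancy is far worse than 5x.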


You're right. I was being sloppy.

I should have said something more like "the standard deviation calculated from a sample set is only generally applicable insofar as one is willing to assume the sample set is representative of the distribution as a whole". The default assumption in traditional statistics (such as quoting p-values) is that the distribution is normal, and in real-world situations that is often not the case.

Your restatement is right on, although I'd go further and say that standard deviations (and confidence intervals) are only useful metrics with regard to the particular assumptions one is willing to make about the underlying distribution. Yes, you can calculate these measures, but they won't help you if your assumptions are irreparably flawed.


Are you completely sure about that?

You could quibble about my exact phrasing, but yes, I'm completely sure about that. This is the 'black swan' problem writ small. I don't mean that a high standard deviation should be ignored for real-world distributions, but I do mean that a low standard deviation carries very little weight unless a normal distribution is presumed.

I'm hard pressed to relate this to the cases discussed in the article, as those cases are shy on detail, but the DB2 example seems most applicable. Although he points to the standard deviation as the tell-tale flag here, that is somewhat misleading. The exact numerical value of the standard deviation across all queries is meaningless here, because not every query is equally likely to be slow. As he states, the real problem was the terrible performance of a single query.

How many similar queries exist? Will a new query added to the system trigger a similar bug? We don't know, and standard statistics isn't going to help us unless we have an understanding of the underlying mechanism. The key here is not to test a statistically significant subset of all possible queries, but to check the performance of the actual queries executed (as he did).
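The difference is easy to see in miniature. With entirely hypothetical per-query timings (the query names and threshold below are made up for illustration), the aggregate mean and SD tell you something is off, but only a per-query check names the culprit:

```python
import statistics

# Hypothetical per-query timings in milliseconds: one pathological
# query hiding among many fast ones.
timings = {
    "q_orders":    12.0,
    "q_customers": 10.0,
    "q_inventory": 11.5,
    "q_report":    900.0,  # the one terrible query
}

values = list(timings.values())
mean = statistics.fmean(values)
sd = statistics.pstdev(values)
print(f"aggregate: mean={mean:.1f}ms sd={sd:.1f}ms")

# The aggregate SD flags *something*; checking each actual query
# (as the author did) finds the offender.
slow = {name: t for name, t in timings.items() if t > 100.0}
print("slow queries:", slow)
```

The aggregate numbers depend entirely on the query mix, which is exactly why the SD's numerical value carries no meaning here on its own.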



