The problems associated with NHST are well documented (three that are especially awful: multiple comparisons lead to an uncontrolled rate of false positives; large sample sizes lead to statistically significant but substantively irrelevant effects being "accepted"; hard thresholds of statistical significance are arbitrary, and the difference between a significant effect and an insignificant effect is rarely itself significant). In addition, there are problems that emerge from using NHST as part of an A/B test: namely, using live, rolling samples to identify significant results rather than prespecifying sample sizes; and treating atomic changes as though they were additive without re-testing the joint hypothesis of multiple changes combined.
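That "live, rolling samples" problem is easy to demonstrate with a toy simulation (all numbers here are illustrative, not from any real test): run an A/A test where both arms have identical conversion rates, peek at the p-value as data accumulates, and stop the moment it dips below 0.05. Any "win" is by construction a false positive, and peeking makes those wins far more common than the nominal 5%.

```python
import random
import math

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    # two-sided tail probability via the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def run_aa_test(peek_every, max_n, rate=0.05, alpha=0.05):
    """Both arms share the same true rate, so 'significance' is a false positive.
    Returns True if the test ever crosses the significance threshold."""
    conv_a = conv_b = 0
    for i in range(1, max_n + 1):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if i % peek_every == 0 and z_test_pvalue(conv_a, i, conv_b, i) < alpha:
            return True  # stopped early on a spurious "win"
    return False

random.seed(0)
trials = 500
# Peek every 100 users vs. testing once at a prespecified n.
peeking = sum(run_aa_test(peek_every=100, max_n=5000) for _ in range(trials)) / trials
fixed_n = sum(run_aa_test(peek_every=5000, max_n=5000) for _ in range(trials)) / trials
print(f"false positive rate, peeking: {peeking:.2f}")  # well above alpha
print(f"false positive rate, fixed n: {fixed_n:.2f}")  # roughly alpha
```

The fixed-n test holds near the nominal 5% rate; the peeking test, which is what an always-on dashboard amounts to, inflates it several-fold, because each look is another chance for noise to cross the threshold.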
But when you come down to it, the question you want to ask is "Is B better than A?", while the question you actually ask is "If B were truly no better than A, how often would a sample of size n, drawn using the sampling procedure I think I'm using, give the impression that B is as much better than A as it appears to be in my observed data?" Two problems: these aren't the same question, and almost no one realizes they're asking the latter.
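The question NHST actually answers can be made concrete with a permutation test on made-up data (the conversion counts below are purely illustrative): pool both arms, reshuffle the A/B labels many times, and count how often a relabeling produces a lift at least as large as the one observed. That frequency is the p-value, and note what it is not: the probability that B is better than A.

```python
import random

random.seed(1)
# Toy observed data: 1 = converted, 0 = did not (illustrative numbers only).
a = [1] * 40 + [0] * 160   # arm A: 20% conversion
b = [1] * 55 + [0] * 145   # arm B: 27.5% conversion
observed_lift = sum(b) / len(b) - sum(a) / len(a)

# Under the null ("B truly no better than A"), the labels are exchangeable,
# so shuffling them shows how often chance alone produces the observed lift.
pooled = a + b
count = 0
reps = 2000
for _ in range(reps):
    random.shuffle(pooled)
    fake_a, fake_b = pooled[:len(a)], pooled[len(a):]
    if sum(fake_b) / len(fake_b) - sum(fake_a) / len(fake_a) >= observed_lift:
        count += 1
p_value = count / reps  # frequency of a lift this large when there is no real difference
print(f"p ≈ {p_value:.3f}")
```

The loop literally enacts the long quoted question: it answers "how often would noise look this good?", and nothing in it ever computes "how likely is B to be better?" — that would require a different (e.g. Bayesian) calculation entirely.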
To be totally fair, one of the most common problems with NHST (the null hypothesis is patently absurd) isn't necessarily a problem in the A/B UX case.
Not sure which of these in specific the grandparent is referring to, but I suspect they and I are on the same page in general.