3 May 2012

A/B tests and Weldon’s dice

Statistical testing with a focus on data volume in the internet industry. This is a translation of my article for the Czech online magazine for web developers zdrojak.cz (ISSN 1803-5620).

The vast majority of your webpage users have more than just an average number of fingers.
This claim is true (with high probability). To verify it, we could run extensive research and evaluate the results with statistical software and analytical tools. Or we can figure it out without any research at all. But is anybody actually interested in it?
Statistics is a beautiful science that can do a lot. Statistics is also used to evaluate A/B tests (and that is a very good thing).

Too little data
The A/B test is nicely described in this article (sorry, only in Czech). However, the interpretation of the results is wrong (see the discussion below the article); it is an example of how not to do it, or rather how not to evaluate the results. The problem here is too little data for this type of test. From a small data set you may still be able to pick the better of two options. With six options (as the article considers), the situation is considerably more complicated: merely showing that the options are not all the same (see goodness-of-fit tests) requires more data than the two-option case, and if you want to pick the single best of the six, the sample must be larger still.
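As a rough illustration, here is a chi-square test of homogeneity on hypothetical counts for six variants; it only answers whether the variants differ at all, not which one is best (Python with scipy assumed):

    # Hypothetical A/B/.../F test: visitors and conversions per variant.
    from scipy.stats import chi2_contingency

    visitors    = [1000, 1000, 1000, 1000, 1000, 1000]
    conversions = [ 110,   95,  105,   98,  120,  102]

    # Contingency table: converted vs. not converted, per variant.
    table = [conversions,
             [v - c for v, c in zip(visitors, conversions)]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
    # With six variants dof = 5; the same spread of conversion rates
    # needs considerably more data to reach significance than with two.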

Too much data
The other extreme is a big volume of real data. It is an extreme, not a problem. Statistical tests are considerably older than modern computing: the t-test, for example, was introduced by a Guinness brewery employee at the beginning of the 20th century. At that time nobody had such large volumes of data to analyze as we do today. Tables of critical values show that "infinity begins very soon": already at a sample size of 100 the critical value differs by less than 2 percent from the "infinite sample". Likewise, most statistics textbooks work with examples where the sample size is in the single digits, at most in the tens.
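You can check that claim directly; a minimal sketch (Python, scipy assumed):

    # Two-sided 5% critical values: t-test with a sample of 100
    # versus the limiting ("infinite sample") normal distribution.
    from scipy.stats import norm, t

    t_100 = t.ppf(0.975, df=100)   # about 1.984
    z_inf = norm.ppf(0.975)        # about 1.960
    print(f"t(df=100) = {t_100:.3f}, normal = {z_inf:.3f}, "
          f"difference = {100 * (t_100 / z_inf - 1):.1f} %")
    # Prints a difference of about 1.2 % -- infinity begins very soon.
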
With a big volume of data, statistical tests tend to reject the hypothesis of equality (i.e., the statistics will show that the two groups differ in the tested parameter). The reason is that no two things in the universe are exactly the same (ok, protons are, and so are gas atoms, but we will not be testing those). Moreover, even a small deviation from the theoretical assumptions (namely that the random variables are independent and identically distributed) can lead to an incorrect conclusion, especially when the test runs on a large volume of data.

For example, suppose you run an online store with a huge number of visitors (millions of hits per week, counting also the visitors who do not purchase anything). You test whether a contemplated change has an impact on the number of items customers purchase and on the total amount of money they spend. The test shows that after the change customers purchase 0.3% more items, which is statistically significant, and at the same time spend about 0.4% less, which is not statistically significant (the standard deviation of the amount spent is large relative to its mean). Is this a good result for us?
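To see how a practically negligible difference becomes statistically significant once the sample is large enough, here is a two-proportion z-test sketch; the purchase rates and sample sizes are invented, only the 0.3% lift comes from the example above:

    # A 0.3% relative lift in purchase rate, measured on 20 million
    # visitors per variant (hypothetical numbers).
    from math import sqrt
    from scipy.stats import norm

    n_a = n_b = 20_000_000
    conv_a = 1_000_000             # 5.000% purchase rate before the change
    conv_b = 1_003_000             # 5.015% after it: a 0.3% relative lift

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    print(f"z = {z:.2f}, two-sided p = {2 * norm.sf(abs(z)):.3f}")
    # z is about 2.17, p about 0.03: significant, yet practically negligible.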

In the 19th century, Walter Frank Raphael Weldon threw 12 dice 26,306 times and recorded the frequency of fives and sixes. The results of the experiment suggest that the dice were not fair. Scientists reached the same conclusion when they repeated the experiment (using a machine to throw the dice and count the results).
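Pearson's classic analysis of Weldon's throws reduces to a simple binomial test. The sketch below uses the commonly cited total of 106,602 fives and sixes among the 26,306 × 12 = 315,672 individual dice:

    # Weldon's dice: for a fair die P(5 or 6) = 1/3.
    from math import sqrt
    from scipy.stats import norm

    n = 26_306 * 12                # 315,672 individual dice
    observed = 106_602             # fives and sixes, as cited by Pearson
    p0 = 1 / 3
    expected = n * p0              # 105,224

    z = (observed - expected) / sqrt(n * p0 * (1 - p0))
    print(f"expected {expected:.0f}, observed {observed}, "
          f"z = {z:.1f}, p = {2 * norm.sf(abs(z)):.1e}")
    # z is about 5.2 -- almost certainly not fair dice.
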
A perfect die, coin or roulette wheel simply does not exist. Nor do two different marketing campaigns ever produce exactly the same result (provided at least one of them has any impact at all). But proving that with statistics can take a long time.
Random number generators, by contrast, follow the theoretical assumptions well even at very large sample sizes. Even after generating 2,000,000,000 throws of an imaginary coin, I could not show that the algorithm worked poorly (Oracle 11, dbms_random). Heads came up 1,000,003,718 times.
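The same binomial test applied to the coin-flip experiment finds nothing suspicious:

    # dbms_random experiment: 2,000,000,000 flips, 1,000,003,718 heads.
    from math import sqrt
    from scipy.stats import norm

    n = 2_000_000_000
    heads = 1_000_003_718
    z = (heads - n / 2) / sqrt(n * 0.25)
    print(f"z = {z:.2f}, two-sided p = {2 * norm.sf(abs(z)):.2f}")
    # z is about 0.17, p about 0.87: no evidence against the generator.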

How to get the best from A/B tests?

Perform the test correctly. A nonrandom split into groups can lead to surprising conclusions: campaign A works better for males, campaign A also works better for females, and yet for the union of men and women campaign B works better. See Simpson's paradox. It is better to have less data of good quality than a lot of bad data.
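A minimal numeric sketch of the paradox; all counts are invented for illustration:

    # Campaign A wins in every segment yet loses overall, because A was
    # shown mostly to the low-converting segment (a nonrandom split).
    groups = {
        #           shown_A, conv_A, shown_B, conv_B
        "males":   (    200,     60,     800,    200),
        "females": (    800,     80,     200,     16),
    }

    totals = [0, 0, 0, 0]
    for name, counts in groups.items():
        sa, ca, sb, cb = counts
        print(f"{name:8s} A: {ca / sa:5.1%}   B: {cb / sb:5.1%}")
        totals = [t + x for t, x in zip(totals, counts)]

    sa, ca, sb, cb = totals
    print(f"{'overall':8s} A: {ca / sa:5.1%}   B: {cb / sb:5.1%}")
    # males    A: 30.0%   B: 25.0%
    # females  A: 10.0%   B:  8.0%
    # overall  A: 14.0%   B: 21.6%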

Get more data. It is true that proving a small difference requires much more data than proving a big one (as a simplification: an n-times more accurate estimate needs n-squared times more data). However, even with large samples the methodology must be followed: a survey of 2.4 million respondents famously picked the wrong future U.S. president (the Literary Digest poll of 1936).
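A sketch of that square law, using the standard normal-approximation sample-size formula and a hypothetical 5% baseline conversion rate:

    # Visitors needed per variant for a two-proportion z-test
    # (normal approximation, 5% significance level, 80% power).
    from scipy.stats import norm

    def n_per_group(p, lift, alpha=0.05, power=0.80):
        """Sample size per variant to detect a relative lift over rate p."""
        delta = p * lift
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return 2 * p * (1 - p) * (z / delta) ** 2

    for lift in (0.10, 0.05, 0.025):
        print(f"relative lift {lift:5.1%}: ~{n_per_group(0.05, lift):,.0f}")
    # Halving the detectable lift roughly quadruples the required sample.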

Use the data you already have to the fullest. For example, if you have historical data about the test participants, you can use Bayesian statistics to squeeze more out of the analysis.
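A minimal Beta-Binomial sketch of that idea, with invented numbers: the historical data become an informative prior that the test data then update:

    # Encode historical conversions as a Beta prior, update with the test.
    from scipy.stats import beta

    # History: 300 conversions in 6,000 visits -> Beta(300, 5700) prior.
    prior_a, prior_b = 300, 5_700

    # The tested variant: 40 conversions in 600 visits.
    conv, n = 40, 600
    post = beta(prior_a + conv, prior_b + (n - conv))

    print(f"posterior mean conversion rate = {post.mean():.4f}")
    print(f"P(rate > 5%) = {post.sf(0.05):.2f}")
    # The informative prior keeps a small, lucky sample from looking
    # like a huge win on its own.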

And above all, before you test, think about what your data mean and what you want to achieve. The quantity of purchased beer and the customer's gender are dependent. The quantity of purchased hair conditioner and the customer's gender are dependent. Consequently, the quantity of purchased beer and the quantity of purchased hair conditioner are dependent as well. Which of these dependencies is useful for marketing?