3 May 2012

A/B tests and Weldon’s dice

Statistical testing with a focus on data volume in the internet industry. This is a translation of my article for the Czech on-line magazine for web developers zdrojak.cz (ISSN 1803-5620).

The vast majority of your webpage users have more than the average number of fingers.
 This statement is true (with high probability): a few people have fewer than ten fingers and almost nobody has more, so the average is slightly below ten and most people are above it. We could verify this with extensive research and evaluate the results with statistical software and analytical tools. Or we can figure it out without any research at all. But is anybody actually interested in it?
 Statistics is a beautiful science that can do a lot. Statistics is also used to evaluate A/B tests (and that is a good thing).

Too little data
 The A/B test is nicely described in this article (sorry, Czech only). However, the interpretation of the results is wrong (see the discussion below the article); it is an example of how not to evaluate such a test. The problem here is too little data for this type of test. From a small data set you may be able to pick the better of two options. With six options (as considered in the article) the situation is considerably more complicated: merely showing that the options are not all the same (see the goodness-of-fit tests) requires more data than comparing two options, and picking the single best of six requires an even larger sample.

Too much data
The opposite extreme is a large volume of real data. It is an extreme, not a problem. Statistical tests, however, are significantly older than modern computing. The t-test, for example, was introduced by an employee of the Guinness brewery (William Sealy Gosset, publishing as "Student") in the early 20th century. At that time, nothing like today's data volumes was available for analysis. In the tables of critical values we can see that "infinity begins very soon": already with a sample size of 100, the critical value differs by less than 2 percent from the "infinite sample". Also, most statistics textbooks use examples where the sample size is in the units, at most in the tens.
With a large data volume, a statistical test will tend to reject the hypothesis of equality (i.e., the statistics will show that the two groups have different parameters). The reason is that no two things in the universe are exactly the same (all right, protons are, and so are gas atoms, but we will not be testing those). Moreover, even a small deviation from the theoretical assumptions (namely that the random variables are independent and identically distributed) can lead to an incorrect conclusion, especially when the test runs on a large volume of data. For example, suppose you run an online store with a huge number of visitors (millions of hits per week, counting also visitors who never purchase anything). You test whether a contemplated change has an impact on the number of items customers purchase and on the total amount of money they spend. The test shows that customers purchase 0.3% more items than before the change, which is statistically significant, and at the same time spend about 0.4% less, which is not statistically significant (because the spend has a larger standard deviation relative to its mean). Is this a good result for us?
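To see how sample size alone drives significance, here is a minimal Python sketch with made-up conversion counts (nothing from the article's store): the very same tiny lift is statistically invisible at thousands of visitors and "highly significant" at millions.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test statistic: is the difference in rates significant?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# A 0.3-point lift (10.3% vs 10.0%) with 2,000 visitors per variant: no signal.
z_small = two_proportion_z(206, 2000, 200, 2000)
# The same lift with 2,000,000 visitors per variant: far beyond the 1.96 cutoff.
z_big = two_proportion_z(206000, 2000000, 200000, 2000000)
print(round(z_small, 2), round(z_big, 2))
```

The effect size never changed; only the sample size did, which is exactly why "statistically significant" must not be read as "practically important".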

In the 19th century, Walter Frank Raphael Weldon threw 12 dice 26,306 times and recorded the frequency of fives and sixes. The results of the experiment suggest that the dice were not fair. Scientists reached the same conclusion when they repeated the experiment (using a machine to throw the dice and count the results).
A perfect die, coin or roulette wheel simply doesn't exist. Likewise, two different marketing campaigns never produce exactly the same result (provided at least one campaign has any impact at all). But proving that with statistics can take a very long time.
Random number generators, by contrast, match the theoretical assumptions well even at large sample sizes. Even after generating 2,000,000,000 throws of an imaginary coin, I could not show that the algorithm worked poorly (Oracle 11, dbms_random). Heads came up 1,000,003,718 times.
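A quick back-of-the-envelope check in Python (using the normal approximation of the binomial) shows why this figure is consistent with a fair coin: the surplus of 3,718 heads is only a small fraction of one standard deviation.

```python
import math

n = 2_000_000_000          # coin flips
heads = 1_000_003_718      # observed heads (the figure from the text)

# Under a fair coin, heads ~ Binomial(n, 0.5); normal approximation:
mean = n * 0.5
sd = math.sqrt(n * 0.5 * 0.5)    # standard deviation of the head count
z = (heads - mean) / sd
print(round(sd), round(z, 3))    # a 3,718-head surplus is well under 1 sd
```

With |z| far below the usual 1.96 cutoff, there is no evidence against fairness, even at two billion throws.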

How to get the best from A/B tests?

Perform the test correctly. A non-random split into groups can lead to surprising conclusions: campaign A works better for men, campaign A also works better for women, and yet for the combined population of men and women campaign B works better. See Simpson's paradox. It is better to have less data of good quality than a lot of bad data.
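Simpson's paradox is easy to reproduce with made-up numbers. In this Python sketch (hypothetical conversion counts, deliberately imbalanced group sizes), campaign A wins for men and for women separately, yet campaign B wins on the combined totals:

```python
# Hypothetical (conversions, visitors) per campaign and gender;
# the counts are invented purely to exhibit the paradox.
campaigns = {
    "A": {"men": (81, 87),   "women": (192, 263)},
    "B": {"men": (234, 270), "women": (55, 80)},
}

def rate(conv, total):
    return conv / total

for group in ("men", "women"):
    a = rate(*campaigns["A"][group])
    b = rate(*campaigns["B"][group])
    print(group, "A better" if a > b else "B better")   # A wins in each group

# Aggregate ("unisex") totals flip the conclusion:
tot = {c: tuple(map(sum, zip(*campaigns[c].values()))) for c in campaigns}
print("overall", "A better" if rate(*tot["A"]) > rate(*tot["B"]) else "B better")
```

The flip happens because campaign A was shown mostly to the harder-to-convert group; a proper random split prevents exactly this distortion.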

Bigger data volume helps. To prove a small difference you need far more data than to prove a big one (as a rule of thumb, an n-times more precise estimate requires roughly n-squared times more data). However, even with large samples you must follow the methodology: a famous survey of 2.4 million respondents picked the wrong future U.S. president (the Literary Digest poll of 1936).
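The n-squared rule of thumb follows from the standard error of the mean shrinking as 1/sqrt(n). A small Monte Carlo sketch (fair-coin samples, arbitrary seed) illustrates it: quadrupling the sample size roughly halves the standard error, i.e. doubles the precision.

```python
import random

random.seed(42)

def se_of_mean(n, trials=2000):
    """Empirical standard error of the sample mean of n fair-coin flips."""
    means = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(trials)]
    mu = sum(means) / trials
    return (sum((m - mu) ** 2 for m in means) / trials) ** 0.5

# 4x the data -> roughly 2x the precision:
ratio = se_of_mean(100) / se_of_mean(400)
print(round(ratio, 2))   # close to 2.0
```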

Use the data you have to the fullest. For example, if you have historical data about the test participants, Bayesian statistics can extract more and better analysis from it.

And above all, before you test, think about what your data mean and what you want to achieve. The quantity of purchased beer and the client's gender are dependent. The quantity of purchased hair conditioner and the client's gender are dependent. Consequently, the quantity of purchased beer and the quantity of purchased hair conditioner are also dependent (through gender). Which of these dependencies is actually useful for marketing?




20 Mar 2012

About miTSakes

A short introduction to the problem of comparing strings - fuzzy matching - with examples (Levenshtein distance and Jaro-Winkler distance).
Only the Czech version is available: http://zdrojak.root.cz/clanky/jak-na-prelkepy/

31 Oct 2011

Estimate gender from full name with Oracle

The article is about estimating a parameter from a text string. I explain the algorithm and show detailed results for the Chicago Marathon 2011 runners and aggregate results for real Czech data.

Czech version published - http://zdrojak.root.cz/clanky/odhad-pohlavi-z-celeho-jmena/


 One of the problems of data quality is incomplete data. For example, for some clients we have their full name and address, but no information about gender (male/female), legal type (individual/corporate) or country (domestic/foreign). A company needs this data to correctly calculate the client's risk level, for reporting, to assign a segment, etc.
 
Let's consider the situation where we need to estimate a missing parameter with 2 possible values for clients who have some text information, and we have Oracle and some data with the parameter already filled in (a standard situation in a DWH).

Algorithm

Without loss of generality let's estimate parameter gender from full name.

From the filled-in data we create 2 dictionaries (one for females and one for males) containing words obtained after data standardization (hereinafter referred to as "words"). To create the words we use the Oracle functions upper, trim, translate, convert and regexp_replace [1], and we treat space, comma and dot as word separators (you can adjust this to your data - add dash, question mark, etc.):

words := upper(convert(trim(regexp_replace(translate(full_name, ',.', '  '), '( ){2,}', ' ')), 'US7ASCII'))
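For readers without an Oracle instance at hand, the same standardization can be sketched in Python (my rough analog of the expression above; unicodedata stands in for Oracle's convert to US7ASCII):

```python
import re
import unicodedata

def standardize(full_name: str) -> list:
    """Rough Python analog of the Oracle expression:
    treat ',' and '.' as separators, collapse repeated spaces,
    uppercase, and fold accented letters to plain ASCII."""
    s = full_name.translate(str.maketrans(",.", "  "))   # translate(..., ',.', '  ')
    s = re.sub(r" {2,}", " ", s).strip()                 # regexp_replace + trim
    s = s.upper()                                        # upper
    # convert(..., 'US7ASCII') analog: decompose accents, drop non-ASCII marks
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    return s.split(" ") if s else []

print(standardize("Novák, Jiří"))   # ['NOVAK', 'JIRI']
```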

We compute a score for each word in a dictionary using the Oracle percent_rank function [3]. The most typical words (like Isabella for women and Jacob for men [2]) get high scores and influence the decision rule the most.

Select word, PERCENT_RANK () OVER (order by count(*)) score
….
group by word

For better understanding see attachment chicago_marathon.xls.

To estimate gender from a full name, we join the full name through its words to both dictionaries and sum the scores over the male and the female dictionary. If the difference between the male and female score sums is bigger than epsilon (epsilon is 0 or a small positive number), we make the estimate according to the bigger score sum.
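The whole decision rule can also be sketched outside the database. This Python sketch uses tiny made-up name lists and a simple frequency-rank score standing in for percent_rank (common words score near 1, rare words near 0):

```python
from collections import Counter

def build_scores(names):
    """Score each word by its frequency rank, a rough analog of
    Oracle's PERCENT_RANK over count(*): common words score near 1."""
    counts = Counter(w for name in names for w in name.upper().split())
    ranked = sorted(counts, key=counts.get)          # rarest first
    n = len(ranked)
    return {w: (i + 1) / n for i, w in enumerate(ranked)}

def estimate_gender(full_name, men_scores, women_scores, epsilon=0.0):
    """Sum each dictionary's word scores and decide by the bigger sum."""
    words = full_name.upper().split()
    men = sum(men_scores.get(w, 0) for w in words)
    women = sum(women_scores.get(w, 0) for w in words)
    if men - women > epsilon:
        return "male"
    if women - men > epsilon:
        return "female"
    return "no result"

# Tiny invented training samples standing in for the filled-in DWH data:
men_scores = build_scores(["Jacob Smith", "Jacob Brown", "Peter Smith"])
women_scores = build_scores(["Isabella Smith", "Emma Brown", "Isabella Novak"])
print(estimate_gender("Isabella Smith", men_scores, women_scores))  # female
```

A shared surname like Smith contributes to both sums, so the decision is driven by the gender-typical first name, which is exactly the intended behavior.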

Preconditions for using the algorithm are:

  • A strong dependency between the string we use and the parameter we want to estimate (there is a correlation between name and gender)
  • The same distribution of words and the parameter in both groups (the one with the parameter filled in and the one without it - it's not a good idea
    to use a dictionary built from English full names to estimate gender from Russian full names)
  • Enough data of good quality to generate the dictionaries (more than 10,000 records for each dictionary)

This algorithm doesn't consider the meaning of the data, structural dependencies within the string, combinations with other entities, experience, or knowledge from other sources. Because of that, its results are worse than those of a full analysis with a systematic method. Thus I recommend using standard methods whenever possible.

 On the other hand, there are situations when we have no information about the data's meaning or we have to work in an untypical environment (no customization for such data, no good dictionaries, an unusual way of data entry, an untypical data distribution, no special tool, etc.). Here the algorithm still works, and we can get relatively good results without exceptionally hard work, while the actual distribution of our data is respected as well!

Results and comparison with specialized tools

For the test I used data with the parameter filled in (gender or legal type) and divided this group into 2 random disjoint groups. One was used to generate the dictionary, the other to estimate gender from the full name and compare the estimate with the filled-in value.

For the gender test on the Chicago Marathon results, I used 80% of the 35,747 runners' names to create the dictionaries and 20% (every 5th place - see the attachment) to check the results.

For the best match we want the biggest positive rate (male estimated for a man, female for a woman) and the smallest negative rate (male estimated for a woman, female for a man).

The algorithm gives a positive result for 6,543 of the 7,157 runners (for epsilon = 0 - see the attachments for details).




Chicago marathon        Algorithm
                        epsilon = 0             epsilon = 0.2
                        Positive    Negative    Positive    Negative
Gender                  91.421%     6.371%      72.684%     2.124%

For the other test I used real data from a Czech bank. The data volume was much bigger than for the Chicago Marathon. Most clients are from the Czech Republic, but there are also significant groups of clients from Slovakia, Russia, Ukraine, Germany, etc. I tested gender and legal type estimation. These combinations meet the algorithm's preconditions.

On the same test data I also ran SAS DataFlux for the Czech Republic [4], to compare with a specialized tool.

 For the gender test I used data with gender filled in and divided this group into 2 random disjoint groups. One was used to generate the dictionary, the other to estimate gender from the full name and compare the estimate with the filled-in value.

 For the legal type test I used data with legal type filled in and divided this group into 2 random disjoint groups (one for the dictionary, the other for estimation and checking).

Every group contained tens of thousands of records.


For more detail, I measured the results separately for Czech and for foreign clients.




                        Algorithm                                       DataFlux
                        epsilon = 0             epsilon = 0.2
                        Positive    Negative    Positive    Negative    Positive    Negative
Gender - Total          98.278%     0.236%      95.055%     0.117%      94.050%     0.210%
Gender - Czech          99.597%     0.052%      97.095%     0.026%      99.007%     0.057%
Gender - foreign        88.993%     1.531%      80.695%     0.754%      59.156%     1.292%
Legal type - Total      98.676%     0.400%      88.232%     0.211%      98.749%     0.890%
Legal type - Czech      99.397%     0.233%      89.392%     0.121%      98.881%     0.842%
Legal type - foreign    91.611%     2.037%      76.869%     1.091%      97.450%     1.363%


The algorithm becomes more conservative with a bigger epsilon (smaller positive and smaller negative rates). It meets its preconditions best for the majority group (a better dictionary and a stronger dependency, hence better results for Czech clients). On the other hand, DataFlux shows its greater power on foreign legal types. But the Czech DataFlux does not work for foreign gender (under 60% positive).

There are possibilities to modify the algorithm: ignore words with very small scores in the dictionary, find the best epsilon as a function of the number of dictionaries and the number of words in the full name, apply Bayes' theorem for different gender cardinalities, allow more than 2 possible results, etc. But these steps also reduce the algorithm's strongest quality - relatively good results in a short time, without the need for analysis.


PL/SQL

Here I only want to show the basic principles; for details see the attachments.

For brevity I don't consider exceptions and special situations.

We have tables TABLE_OF_WOMEN_FULL_NAMES and TABLE_OF_MEN_FULL_NAMES, each with one varchar2 column. We use these tables to create the dictionaries.

Run

create_dictionary('TABLE_OF_WOMEN_FULL_NAMES', 'DICTIONARY_WOMEN') and create_dictionary('TABLE_OF_MEN_FULL_NAMES', 'DICTIONARY_MEN')

where

CREATE OR REPLACE Procedure create_dictionary(
    table_name_in  In user_tables.table_name%Type,
    table_name_out In varchar2)
Is
    Type tabtype Is Table Of Varchar2(2000);
    input_table    tabtype;
    words          Varchar2(2000);
    count_of_words Number;
Begin
    -- create the table for the dictionary
    Execute Immediate 'create table ' || table_name_out || ' (word VARCHAR2(200))';
    Execute Immediate 'select * from ' || table_name_in
        Bulk Collect Into input_table;
    For i In input_table.first .. input_table.last
    Loop --(1)
        -- space, comma and dot are word separators: replace comma and dot with space using translate
        -- collapse multiple spaces into a single space using regexp_replace
        -- standardize the words using the upper and convert Oracle functions
        words := upper(convert(Trim(regexp_replace(translate(input_table(i), ',.', '  '), '( ){2,}', ' ')), 'US7ASCII')) || ' ';
        If words = ' ' Then
            count_of_words := 0;
        Else
            count_of_words := length(words) - length(Replace(words, ' '));
        End If;
        For j In 1 .. count_of_words
        Loop --(2)
            -- insert the j-th standardized word from the full name into the dictionary
            If j = 1 Then
                Execute Immediate 'insert into ' || table_name_out || ' values (''' ||
                    substr(words, 1, instr(words, ' ', 1, 1) - 1) || ''')';
            Else
                Execute Immediate 'insert into ' || table_name_out || ' values (''' ||
                    substr(words,
                           instr(words, ' ', 1, j - 1) + 1,
                           instr(words, ' ', 1, j) - instr(words, ' ', 1, j - 1) - 1) || ''')';
            End If;
        End Loop; --(2)
        Commit;
    End Loop; --(1)
End;

To the values from DICTIONARY_MEN we assign a score using percent_rank. I also use union all twice, plus one null row, so that every real word in the dictionary gets a positive score (percent_rank assigns 0 to the lowest-ranked row, which here is the null row).

select word, PERCENT_RANK () OVER (order by count(*)) score
from
(select word from DICTIONARY_MEN union all select word from DICTIONARY_MEN union all select null from dual)
group by word

To estimate gender we use the following select:

select t1.full_name,
       case when t1.men_score - t2.women_score > &epsilon then 'male'
            when t2.women_score - t1.men_score > &epsilon then 'female'
            else 'no result'
       end result
from
-- full_name_without_gender is the table of full names whose gender we need to estimate
-- men and women are inline views with the dictionary words and their scores
 (select t.full_name, sum(nvl(men.score, 0)) men_score
  from full_name_without_gender t
  left join (select word, PERCENT_RANK() OVER (order by count(*)) score
             from (select word from DICTIONARY_MEN union all
                   select word from DICTIONARY_MEN union all
                   select null from dual)
             group by word) men
    on ' ' || upper(convert(translate(t.full_name, ',.-', '   '), 'US7ASCII')) || ' ' like '% ' || men.word || ' %'
  group by t.full_name) t1
join
 (select t.full_name, sum(nvl(women.score, 0)) women_score
  from full_name_without_gender t
  left join (select word, PERCENT_RANK() OVER (order by count(*)) score
             from (select word from DICTIONARY_WOMEN union all
                   select word from DICTIONARY_WOMEN union all
                   select null from dual)
             group by word) women
    on ' ' || upper(convert(translate(t.full_name, ',.-', '   '), 'US7ASCII')) || ' ' like '% ' || women.word || ' %'
  group by t.full_name) t2
on t1.full_name = t2.full_name

This select with a like join condition works really slowly for big data volumes, but it takes only a few minutes for the Chicago Marathon results (see the attachments).

To speed it up, separate the words from the full name (the same way as in the create_dictionary procedure) and left join them to the dictionaries with scores, or tune the select in other ways.

Attachments:
chicago_marathon.xls
inserting script.sql
chicago.sql


Sources:

[1] Oracle Database SQL Reference 10g Release 1 (10.1), Chapter SQL Functions

http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions001.htm#i88893

[2] Popular baby names

http://www.ssa.gov/OACT/babynames/

[3] Oracle Database Data Warehousing Guide 10g Release 2 (10.2), Chapter 21 - SQL for Analysis and Reporting

http://download.oracle.com/docs/cd/B19306_01/server.102/b14223.pdf

[4] DataFlux Technology Overview Course Notes

[5] Bank of America Chicago Marathon 2011 results

http://results.public.chicagomarathon.com/2011/