Peter Brejcak: 2012

21 Jun 2012

About miTSakes 2: phonetic algorithms

Using phonetic algorithms, with examples: Soundex, D-M Soundex, Metaphone, NYSIIS.

The full version is available only in Czech at Zdrojak.cz, an on-line magazine for web developers. ISSN 1803-5620

13 Jun 2012

Social network and sport betting

The bookmaker used to be a very important person in fixed-odds betting. His challenge was to make the best possible estimate of the probability of an event. Do IT, the internet, and social networks have any impact on sports betting? This article focuses on sports betting, the Czech Television “emociogram”, and the EURO 2012 match Czech Republic – Russia.
The full version is available only in Czech at Zdrojak.cz, an on-line magazine for web developers. ISSN 1803-5620.

A little bit of history

Wagering and games of chance have been with people for thousands of years (dice games, for example). The use of modern probability and statistics is significantly younger. Today, 12-year-old children are able to calculate the probability of at least one head when tossing two coins (it’s 75%: three of the four equally likely outcomes head-head, head-tail, tail-head, tail-tail contain a head). Yet the famous mathematicians Leibniz and D’Alembert were not able to find the correct answer (Jiri Andel: Mathematics of Chance). The history of modern fixed odds dates back to the 19th century and the origins of football gambling.
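The two-coin calculation can be checked by brute-force enumeration. This is only a minimal sketch; the variable names are illustrative and not part of the original article.

```python
from fractions import Fraction
from itertools import product

# Enumerate all equally likely outcomes of tossing two fair coins.
outcomes = list(product(["head", "tail"], repeat=2))

# Keep the outcomes that contain at least one head.
favourable = [outcome for outcome in outcomes if "head" in outcome]

p_at_least_one_head = Fraction(len(favourable), len(outcomes))
print(p_at_least_one_head)         # 3/4
print(float(p_at_least_one_head))  # 0.75
```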

Internet impact on sport betting

IT and the internet give an advantage to both bookmakers and customers. Customers can get more, and more recent, information about athletes. They also have a significantly bigger choice of sport events and providers.

Providers can analyze the online behavior of gamblers and offer easy on-line betting during a live match.

1st view: Bookmaker rules

The bookmaker holds an advantage (the overround) over the customers, so the bookmaker makes a profit in the long term. Even clever customers and rival providers can’t change that. Every long-term customer is going to end up in the red. Of course, a large number of short-term losses is highly probable. There are betting systems such as the martingale or D’Alembert, but probability and the expected value work in the bookmaker’s favor. The advantage of this view is its simplicity.

Let’s compute the bookmaker’s expected profit E(z_i) on result i:

E(z_i) = v_i - p_i k_i v_i

where v_i is the stake on result i, p_i is the probability of result i, and k_i is the decimal odds on result i. There is just one unknown parameter: p_i. So the bookmaker’s only problem is to set the odds so that k_i p_i < 1 for all possible results i.
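As a rough illustration of this first view, the expected profit v_i - p_i k_i v_i and the condition k_i p_i < 1 can be evaluated directly. The market below (three equally likely results at odds 2.7, a stake of 100) is a hypothetical example, and the overround is computed in the usual way as the sum of the implied probabilities 1/k_i minus 1.

```python
def expected_profit(stake, probability, odds):
    """Bookmaker's expected profit on one result: v_i - p_i * k_i * v_i."""
    return stake - probability * odds * stake


def overround(odds_list):
    """Sum of implied probabilities 1/k_i minus 1 (positive = bookmaker margin)."""
    return sum(1.0 / k for k in odds_list) - 1.0


# Hypothetical three-way market with equally likely results.
probabilities = [1 / 3, 1 / 3, 1 / 3]
odds = [2.7, 2.7, 2.7]

for p, k in zip(probabilities, odds):
    # k * p < 1 is the condition the bookmaker has to satisfy.
    print(k * p < 1, round(expected_profit(100.0, p, k), 2))

print(round(overround(odds), 4))
```

With these numbers the bookmaker expects to keep about 10 out of every 100 staked on each result.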

The problem with this view arises when customers have better information than the bookmaker. Also, updating all the odds during a live match for on-line betting is a big problem for a human bookmaker.

Roulette is an example of a fixed-odds betting game where the odds are set correctly and only a lucky short-term customer can win.

2nd view: Sport result is useless

Let’s think differently. Is it possible for the bookmaker to be in profit after every game? That would mean that the stakes are bigger than the payouts for every possible result.

Let’s consider a football game with probability 1/3 for each possible result (win, draw, loss) and two providers. Each provider takes 3 000 000 EUR in stakes on this match. The odds of the first provider are 2.7-2.7-2.7 and those of the second provider are 2.9-2.9-2.9. So the expected profit of the first provider is 300 000 EUR and that of the second one is 100 000 EUR.

But what if the stakes are distributed as follows?

Result	Win	Draw	Loss	Sum of stakes (EUR)	Expected profit (EUR)
Stakes provider 1	1 400 000	200 000	1 400 000	3 000 000
Provider 1 profit	-780 000	2 460 000	-780 000		300 000
Stakes provider 2	1 000 000	1 000 000	1 000 000	3 000 000
Provider 2 profit	100 000	100 000	100 000		100 000

The first provider is going to be in profit only in case of a draw. The second is going to be in profit in every case. Which provider is better off?
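The figures in the table can be reproduced with a few lines of code. This is a sketch only: the function and variable names are illustrative, while the stakes, odds, and probabilities are the ones given above.

```python
def profit_by_result(stakes, odds):
    """Provider's profit for each result: all stakes minus the payout on that result."""
    total = sum(stakes.values())
    return {result: total - stakes[result] * odds[result] for result in stakes}


def expected_profit(stakes, odds, probabilities):
    """Probability-weighted average of the per-result profits."""
    profits = profit_by_result(stakes, odds)
    return sum(probabilities[result] * profits[result] for result in profits)


results = ("win", "draw", "loss")
probabilities = {result: 1 / 3 for result in results}

stakes_1 = {"win": 1_400_000, "draw": 200_000, "loss": 1_400_000}
stakes_2 = {"win": 1_000_000, "draw": 1_000_000, "loss": 1_000_000}
odds_1 = {result: 2.7 for result in results}
odds_2 = {result: 2.9 for result in results}

profits_1 = profit_by_result(stakes_1, odds_1)
profits_2 = profit_by_result(stakes_2, odds_2)

# Round to whole EUR to hide floating-point noise.
print({r: round(p) for r, p in profits_1.items()},
      round(expected_profit(stakes_1, odds_1, probabilities)))
print({r: round(p) for r, p in profits_2.items()},
      round(expected_profit(stakes_2, odds_2, probabilities)))
```

The first provider has the higher expected profit but loses on two of the three results; the second earns less on average but never loses.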

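If the provider wants a profit after every game, the provider needs to ensure that

\sum_{i,t} v_{i,t} - \max_i ( \sum_t v_{i,t} k_{i,t} ) > 0

where v_{i,t} is the stake placed on result i at time t and k_{i,t} is the decimal odds on result i at time t. The time parameter t lets us consider odds that change over time.

In other words, the provider doesn’t care about the result of the sport game. The provider cares only about the distribution of stakes.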
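The condition can be sketched in code as a worst-case profit: the sum of all stakes minus the largest payout any single result could trigger. The bet lists below are hypothetical; each bet carries the odds that were offered at the moment it was placed, which is how the time parameter enters.

```python
def worst_case_profit(bets):
    """Sum of all stakes minus the largest possible payout.

    bets: non-empty list of (result, stake, odds) tuples, where odds are the
    decimal odds offered when the bet was placed.
    """
    total_stakes = 0.0
    payout_by_result = {}
    for result, stake, odds in bets:
        total_stakes += stake
        payout_by_result[result] = payout_by_result.get(result, 0.0) + stake * odds
    return total_stakes - max(payout_by_result.values())


# Hypothetical balanced book: profit is guaranteed whatever the result.
balanced = [("win", 1000, 2.7), ("draw", 1000, 2.9), ("loss", 1000, 2.8)]
print(worst_case_profit(balanced) > 0)    # True

# One more late bet on "win" at lower odds breaks the guarantee.
unbalanced = balanced + [("win", 500, 2.4)]
print(worst_case_profit(unbalanced) > 0)  # False
```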

Social network

So the provider should estimate the behavior (mood) of the customers in order to estimate the stake distribution well. But how do you measure the behavior of a group of customers during the game? Social networks give customers the possibility to “like” something.

Czech Television used a social network to let viewers “Like” or “Dislike” the performance of the Czech team during a EURO 2012 match. I’ve written a new article focused on the time series of fixed odds on a Russian victory and the outputs of Czech Television during the EURO 2012 match Russia-Czech Republic.

Here is the emociogram (a time series graph of “Likes” and “Dislikes”) from Czech Television.

Here is the graph of the time series (k_t - 1)/(k_{t-1} - 1), where k_t is the odds on a Russian victory at time t.

They fit together. It is interesting, isn’t it?

So we are able to detect a change during the game just from the likes! There is also a strong dependency between the likes and the changes in odds.
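For readers who want to compute a similar series from their own odds data, here is a minimal sketch of the ratio (k_t - 1)/(k_{t-1} - 1) for consecutive decimal odds. The odds values are made up for illustration and are not the odds from the match.

```python
def odds_ratio_series(odds):
    """Return (k_t - 1) / (k_{t-1} - 1) for consecutive decimal odds (all > 1)."""
    return [(curr - 1) / (prev - 1) for prev, curr in zip(odds, odds[1:])]


# Hypothetical decimal odds on one result, sampled at consecutive times.
odds = [2.0, 1.5, 1.5, 3.0]
print(odds_ratio_series(odds))  # [0.5, 1.0, 4.0]
```

A value of 1.0 means the odds did not move; values far from 1.0 mark moments when the market reacted.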

Conclusion

The problem of sports betting is more complicated than this. We haven’t considered rival providers, the dependency between odds and the customers’ willingness to bet, different limits, etc.

Online betting on live games is a big, modern business with a large amount of information for bookmakers and customers. Every change during the match has a big impact on supply and demand. Every change also has a big impact on the emotions of spectators.

So what about combining liking and betting?

9 Jun 2012

Development of the odds during the Russia-Czech Republic match at EURO 2012

I tracked the development of the odds during the Russia-Czech Republic match at EURO 2012 on http://www.oddsportal.com/ for comparison with the Czech Television (ČT) emociogram.

3 May 2012

A/B tests and Weldon’s dice

Statistical testing with a focus on data volume in the internet industry. This is a translation of my article for the Czech on-line magazine for web creators, zdrojak.cz (ISSN 1803-5620).

The vast majority of your webpage users have more than the average number of fingers.

This statement is true (with high probability). To verify it, we could conduct extensive research and evaluate the results with statistical software and analytical tools. We can also figure it out without any research. But is anybody interested in it?

Statistics is a beautiful science that can do a lot. Statistics is also used to evaluate A/B tests (and that's really good).

Little data

The A/B test is nicely described in this article (sorry, only in Czech). However, the interpretation of the results is bad (see the discussion below the article). It is an example of how not to do it (or rather, how not to evaluate the results). In this case the problem lies in the small volume of data (for this type of test). From a small data set you may be able to choose the better of two options (i.e. the best option). With six options (as considered in the article), however, the situation is considerably more complicated: just showing that the options are not all the same (see goodness-of-fit tests) requires more data than in the case of two options. And if you want to choose the best of the six options, the sample has to be even larger.

Too much data

The other extreme is a big volume of real data. It is an extreme, not a problem. However, statistics and statistical tests are significantly older than IT equipment. For example, the t-test was first used by an employee of the Guinness brewery in the early 20th century. At that time, data volumes as large as today's were not available for analysis. And in the tables of critical values we see that "infinity begins very soon" (already with a sample size of 100, the critical value differs by less than 2 percent from that of the "infinite sample"). Also, most statistics textbooks contain examples where the sample size is in single digits, or in the tens at most.

With a big volume of data, statistical tests tend to reject the hypothesis of equality (i.e., statistics lets us show that two groups differ in some parameter). The reason is that no two things in the universe are exactly the same (OK, protons are the same, and so are atoms of a gas, but we will not test those). Also, a small deviation from the theoretical assumptions (i.e. that the random variables are independent and identically distributed) might lead to an incorrect conclusion, especially in a test on a large volume of data. For example, suppose you have an online store with a huge number of visitors (millions of hits per week, counting also visitors who do not purchase anything). You run a test to determine whether a contemplated change has an impact on the number of items purchased by customers and on the total amount of money they spend. The test shows that customers purchase 0.3% more items than before the change, which is statistically significant, and at the same time spend about 0.4% less than before the change, which is not statistically significant (because the standard deviation is larger relative to the mean). Is this a good result for us?

In the 19th century, Walter Frank Raphael Weldon threw 12 dice 26,306 times and recorded the frequency of fives and sixes. The results of the experiment suggest that the dice were not fair. Scientists who repeated the experiment (using a machine to throw the dice and count the results) came to the same conclusion.

A perfect die, coin, or roulette wheel simply doesn’t exist. Likewise, two different marketing campaigns do not produce the same result (if at least one of them has any impact). But proving it with statistics can take a long time.
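The effect of sample size on significance can be illustrated with a simpler test than the one in the store example: a pooled two-proportion z-test on conversion rates. Everything below is an assumption made for illustration (the 10.0% and 10.1% rates, the group sizes, the choice of test); it only shows that the same tiny difference goes from "not significant" to "highly significant" once the groups are large enough.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical conversion rates of 10.0% (A) and 10.1% (B) per group size n.
for n in (10_000, 10_000_000):
    conversions_a = round(0.100 * n)
    conversions_b = round(0.101 * n)
    print(n, two_proportion_p_value(conversions_a, n, conversions_b, n))
```

With 10,000 visitors per group the difference is nowhere near significant; with 10 million per group the p-value is practically zero, although the difference itself is just as small.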

Random number generators follow the theoretical assumptions well even for large samples. Even after generating 2,000,000,000 tosses of an imaginary coin, I could not show that the algorithm worked poorly (Oracle 11, dbms_random). Heads came up 1,000,003,718 times.
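One way to check the coin result is a normal approximation to the binomial distribution. This is a sketch of such a check, not necessarily the test used for the original article; only the two counts come from the text.

```python
from math import sqrt
from statistics import NormalDist

tosses = 2_000_000_000
heads = 1_000_003_718

expected = tosses / 2
std_dev = sqrt(tosses) / 2  # sqrt(n * 0.5 * 0.5) for a fair coin

# Distance of the observed count from the expected count in standard deviations.
z = (heads - expected) / std_dev
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 3), round(p_value, 2))
```

The observed count lies well under one standard deviation from the expected value, so there is no evidence against the fairness of the simulated coin.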

How to get the best from A/B tests?

Perform the test correctly. A nonrandom assignment to groups might lead to surprising conclusions. Campaign A works better for males. Campaign A also works better for females. Yet for men and women combined, campaign B works better. See Simpson's paradox. It is better to have less data of good quality than a lot of bad data.

Bigger data volume. It is true that proving a small difference takes much more data than proving a big difference (as a simplification: an n-times more accurate estimate needs n-squared times more data). However, even with large samples it is necessary to follow the methodology. For example, a survey of 2.4 million respondents picked the wrong future U.S. president.

Make the best use of the data you have. For example, if you have historical data on the test participants, you can use Bayesian statistics to get a richer and better analysis.

And above all, before you test, think about what your data mean and what you want to achieve. The quantity of beer purchased and the client’s gender are dependent. The quantity of hair conditioner purchased and the client’s gender are dependent. The quantity of beer purchased and the quantity of hair conditioner purchased are dependent. Which of these dependencies can be useful for marketing?
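Simpson's paradox is easy to reproduce with numbers. The conversion counts below are hypothetical and chosen so that the groups are very unevenly split between the two campaigns, which is exactly what a nonrandom assignment can cause.

```python
# Hypothetical (conversions, visitors) per campaign and group.
data = {
    "A": {"men": (81, 87), "women": (192, 263)},
    "B": {"men": (234, 270), "women": (55, 80)},
}


def rate(conversions, visitors):
    """Conversion rate of a single group."""
    return conversions / visitors


def overall_rate(groups):
    """Conversion rate of all groups of one campaign pooled together."""
    conversions = sum(c for c, _ in groups.values())
    visitors = sum(v for _, v in groups.values())
    return conversions / visitors


for group in ("men", "women"):
    print(group, round(rate(*data["A"][group]), 3), round(rate(*data["B"][group]), 3))
print("all", round(overall_rate(data["A"]), 3), round(overall_rate(data["B"]), 3))
```

Campaign A wins within each group, yet campaign B wins on the pooled data, because A was shown mostly to the group that converts less.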
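The n-squared rule of thumb can be seen in the standard sample-size formula for estimating a mean with a given margin of error. The sketch below assumes a normal approximation, a 95% confidence level, and a standard deviation of 0.5 (the worst case for a proportion); the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(margin, std_dev=0.5, confidence=0.95):
    """Sample size so that the confidence interval half-width is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * std_dev / margin) ** 2)


# Halving the margin of error roughly quadruples the required sample.
for margin in (0.04, 0.02, 0.01):
    print(margin, sample_size_for_margin(margin))
```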

20 Mar 2012

About miTSakes

A short introduction to the problem of comparing strings: fuzzy matching with examples (Levenshtein distance and Jaro-Winkler distance).
Only a Czech version is available: http://zdrojak.root.cz/clanky/jak-na-prelkepy/