DATa-ing

Thoughts on Data & Science

The Science of A/B Testing

01 Sep 2015

The modern web and mobile ecosystem thrives on data-driven decisions, based on data obtained from well-designed experiments. Let's take a look at the science behind one of the most popular experiments on the web - A/B tests.

This post is part of a series about A/B tests. I will post the links as I write the subsequent posts.

What is A/B testing?

A/B testing, or split testing, is a randomized experiment that compares the performance of a metric across alternate versions of the same web page, simultaneously.

Put simply, in A/B testing different users are shown different variants of the same web page, and the performance of a given metric, say clicking a button, is compared. How we select the users and how we measure that performance is key to a successful test.

To understand an A/B test, let's break the definition down piece by piece:

a randomized experiment

What is a randomized experiment and why do we need it?
A randomized experiment is one in which the subjects or objects are selected in a randomized fashion. This is done so that the selection process represents a chance model and any bias is reduced.
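
As a minimal sketch of such random selection, the snippet below assigns each visitor to one of two groups. The 50/50 split and the hash-based bucketing by user id are assumptions of mine, used only because hashing keeps the assignment stable across repeat visits.

```python
import hashlib

def assign_group(user_id: str) -> str:
    """Assign a user to 'control' or 'treatment' (assumed 50/50 split)."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 2              # behaves like a fair coin flip
    return "control" if bucket == 0 else "treatment"

print(assign_group("user-42"))                # the same user always lands in the same group
```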

performance of a metric

A metric connects the maths to the business.
A metric can be any business question that the test seeks to answer. For example: does the new search box work? Or does changing the color of the Buy button increase sales?

alternate versions of same page

An A/B test must always be done on the same page, with only a slight variation. Keeping the variation slight helps eliminate confounding variables, which may hide the cause of a particular result or action. However, contextualizing the results of a test may sometimes bring wonderfully surprising results, as mentioned here on Airbnb's blog.

simultaneously

This is perhaps the most important characteristic of A/B tests. The test must be performed on all subjects at the same time. This is done to create a notion of independence among the subjects of the test.
For example, running part of the test on a Sunday and the rest on a Monday will lead to biased results.

The Basics

In A/B testing, the users of an application are first randomly placed into several buckets or groups, and then the result is measured. The process can be summarized as follows (see the sketch after the list):

  • Categorize visitors into groups
  • Provide a variant of the web page to each group, with all users in the same group getting the same variant
  • Measure and compare the responses
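
Here is a rough sketch of the measure-and-compare step, assuming we have already logged, for each group, how many users saw the page and how many clicked; the group names and counts below are made up for illustration.

```python
# Hypothetical logged data: users who were shown the page and users who clicked
observations = {
    "control":   {"shown": 1000, "clicked": 110},
    "treatment": {"shown": 1000, "clicked": 135},
}

# Compare the response: click-through rate per group
for group, counts in observations.items():
    rate = counts["clicked"] / counts["shown"]
    print(f"{group}: {rate:.1%}")
```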

The Maths

Just as probability theory developed alongside gamblers, this statistical theory developed alongside the medical field of clinical trials, and so it borrows much of its convention from there. For simplicity, let us assume that we want to generate only two groups.

In an A/B test, the incoming users are randomly placed into two groups.
One group is shown the current web page and the other is shown a variant.

The users shown the current page are called the Control Group, and the users who get the variant are called the Treatment Group.

The reason for using a control group is to have a baseline against which to compare the result of the variant.

Statistically, A/B tests are simply two-sample hypothesis tests.

What is hypothesis testing?
A hypothesis is simply a statement that hasn't been proven true. Hypothesis testing is the process of verifying whether a given hypothesis is true.

Since a hypothesis can be too complex to deduce its probability directly, we use a different approach, known as null-hypothesis testing. In null-hypothesis testing, we start from a baseline claim called the null hypothesis and try to reject it in favor of a competing claim called the alternate hypothesis.

In A/B testing, the null hypothesis states that “the variant has no effect on the user”, while the alternate hypothesis states that “the variant has a significant effect on the user”.

Mathematically,
$$H_0 : p_c - p_t = 0$$ where $H_0$ is the null hypothesis and $p_c$ and $p_t$ are the rates of the metric for the control and treatment groups respectively.

To reject the null hypothesis, we have to show that
$$p_c - p_t < 0$$ i.e. the performance of the metric is higher in the treatment group than in the control group, by more than chance alone would explain.

But why do we need all this hypothesis machinery?
A common line of thinking is that $p_c - p_t$ is simply the difference we need, so why bother with hypothesis testing?
The answer is that we want to account for chance variation. Even if both groups are shown exactly the same page (something called dummy testing), the rates will still differ due to chance variation. We don't want to be misled by chance; we want to make sure the results reflect an actual difference.
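
A quick simulation makes this concrete. The sketch below is an assumption-laden illustration: both groups share a made-up true click rate of 10%, yet the observed rates almost never match exactly.

```python
import random

random.seed(7)                       # fixed seed, only for a reproducible illustration

TRUE_RATE = 0.10                     # both groups see exactly the same page
N = 1000                             # assumed number of users per group

def simulate_clicks(n, rate):
    """Simulate n users, each clicking with probability `rate`."""
    return sum(random.random() < rate for _ in range(n))

p_c = simulate_clicks(N, TRUE_RATE) / N
p_t = simulate_clicks(N, TRUE_RATE) / N
print(p_c, p_t, p_c - p_t)           # the difference is non-zero purely by chance
```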

How to be confident: the two-sample Z-test
To account for chance variation, we use a little statistics and something called the Z-test.
The Z-test falls within the domain of significance testing, allowing us to judge how significant the observed result is compared to the null hypothesis.

Put simply, the Z-score is:
$$ Z = \frac{\text{difference of rates}}{\text{standard error of the difference}}$$ and the complete equation thus becomes
$$Z = \frac{p_c - p_t}{\sqrt{\frac{p_c(1 - p_c)}{N_c} + \frac{p_t(1 - p_t)}{N_t}}}$$ Here, $N_c$ and $N_t$ are the numbers of users in the control and treatment groups respectively.
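
Translating that formula directly into Python gives a small helper like the one below; it is only a sketch of the equation above, with argument names and example counts of my own choosing.

```python
from math import sqrt

def z_score(clicks_c, n_c, clicks_t, n_t):
    """Two-sample Z-score for the difference of two proportions."""
    p_c = clicks_c / n_c                       # metric rate in the control group
    p_t = clicks_t / n_t                       # metric rate in the treatment group
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return (p_c - p_t) / se

print(z_score(110, 1000, 135, 1000))           # roughly -1.7 for these made-up counts
```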

But what does the Z-score actually tell us?

The Z-score itself is just the test statistic. From it we compute the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true. This probability is denoted by P and called the p-value.

If P < 0.05, we reject the null hypothesis and call the observed difference statistically significant; otherwise, the observed difference is just due to chance variation.
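
As a minimal sketch of that last step, assuming a two-sided test and a Z-score like the one in the earlier sketch, the p-value can be read off the normal curve with scipy:

```python
from scipy.stats import norm

z = -1.71                                  # hypothetical Z-score from the earlier sketch
p_value = 2 * norm.sf(abs(z))              # two-sided p-value, roughly 0.087 here

if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Fail to reject: the difference may just be chance variation.")
```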

In an upcoming post, we will take a look at some data and code to perform an A/B experiment.

Stay Curious !!
