Intro to probability theory – part 2

This post was translated from its original French version: Probabilités – Partie 2.

After the previous post about probability theory, here’s the second part, in which I’ll talk about random variables.

The idea of random variables is to have some way of dealing with events for which we do not know exactly what happens (for instance, we roll a die), but still want to have some idea of what can happen. The die example is pretty simple, so using random variables may be a bit overkill, but let’s keep examples simple for now.

For a given experiment, we consider a variable, called X, and look at all the values it can take, along with their associated probabilities. If my experiment is “rolling a die and looking at its value”, I can define a random variable on the value of a 6-sided die and call it X. For a full definition of X, I need to provide all the possible values of X (what we call the random variable’s domain) and their associated probabilities. For a 6-sided die, the values are the numbers from 1 to 6; for a non-loaded die, the probabilities are all equal to \displaystyle \frac 1 6. We can write that as follows:

\displaystyle \forall i \in \{1,2,3,4,5,6\}, \Pr[X = i] = \frac 1 6

and read “for all i in the set of values {1,2,3,4,5,6}, the probability that X takes the value i equals \displaystyle \frac 1 6”.
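To make this concrete, here’s how one might write this distribution down and sanity-check it in code. This is a minimal Python sketch; representing the variable as a dictionary mapping values to probabilities is just one convenient choice, not standard notation:

    from fractions import Fraction

    # The distribution of X: each value in the domain, mapped to its probability.
    X = {i: Fraction(1, 6) for i in range(1, 7)}

    # Sanity check: the probabilities of all outcomes must sum to 1.
    assert sum(X.values()) == 1

    print(X[4])  # Pr[X = 4] = 1/6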

One of the basic ways to have an idea about the behaviour of a random variable is to look at its expectation. The expectation of a random variable can be seen as its average value, or as “suppose I roll my die 10000 times, and I average all the results (summing all the results and dividing by 10000), what result would I typically get?”

This expectation (written E[X]) can be computed with the following formula:

E[X] = \displaystyle \sum_{i \in \text{dom}(X)} \Pr[X = i] \times i

which can be read as “sum, for all elements i in the domain of X, of the probability that X takes the value i, times i”. In the die example, since the domain is all the integers from 1 to 6, I can write

\displaystyle E[X] = \sum_{i=1}^6 \Pr[X = i] \times i

which I can in turn expand as follows:

\begin{aligned}E[X] &= 1 \times \Pr[X = 1] + 2 \times \Pr[X = 2] + 3 \times \Pr[X = 3]  \\ &\quad + 4 \times \Pr[X = 4] + 5 \times \Pr[X = 5] + 6 \times \Pr[X = 6]\end{aligned}

Since, for my die, all the probabilities are equal to \displaystyle \frac 1 6, I can conclude with

\displaystyle E[X] = \frac 1 6 \times (1 + 2+3+4+5+6) = \frac{21}{6} = 3.5

So the average value of a die over a large number of experiments is 3.5, as most tabletop gamers would know 😉
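If you’d rather see that number come out of an actual experiment, here is a quick simulation, as a Python sketch (the 10,000 rolls are an arbitrary choice):

    import random

    # Estimate E[X] by rolling a fair die many times and averaging the results.
    n_rolls = 10_000
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    print(sum(rolls) / n_rolls)  # typically close to 3.5

    # Exact value, straight from the formula E[X] = sum of Pr[X = i] * i:
    print(sum(range(1, 7)) / 6)  # 21 / 6 = 3.5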

Now let’s look at a slightly more complicated example. Suppose that I have n dice, and that I want to know how many 6s I can expect among my n dice. From a handwavy point of view, we know that we will not get the exact same answer every time we roll n dice, but that we can get a rough answer. There’s no reason there should be more or fewer 6s than 1s, 2s, 3s, 4s or 5s, so generally speaking the dice should be distributed approximately equally among the six numbers, and there should be approximately \displaystyle \frac n 6 6s among n dice. (The exception to that being me playing Orks in Warhammer 40k, in which case the expected number is approximately 3 6s over 140 dice.) Let us prove that intuition properly.

I define Y as the random variable representing the number of 6s over n dice. The domain of Y is all the numbers from 0 to n. It’s possible to compute the probability of getting, for example, exactly 3 6s over n dice, and even to get a general formula for k 6s, but I’m way too lazy to compute all that and sum over k and so on. So let’s be clever.

There’s a very neat trick called linearity of expectation that says that the expectation of the sum of several random variables is equal to the sum of the expectations of said random variables, which we write

E[A + B] = E[A] + E[B]

This is true for all random variables A and B. Beware, though: in general, this only holds for addition. We cannot say in general that E[A \times B] = E[A] \times E[B]: that equality does hold when the variables are independent, but it’s not true in general.
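Here is a small numerical illustration of both points, as a Python sketch using exact fractions (the dependent pair B = A, the same die counted twice, is my own choice of example):

    from fractions import Fraction
    from itertools import product

    p = Fraction(1, 6)
    dom = range(1, 7)

    # Two independent dice: E[A + B] = E[A] + E[B] = 7/2 + 7/2 = 7.
    print(sum(p * p * (a + b) for a, b in product(dom, dom)))  # 7

    # A dependent pair: B = A. Linearity still holds: E[A + A] = 2 E[A] = 7.
    print(sum(p * (a + a) for a in dom))  # 7

    # But the product rule fails here: E[A * A] = 91/6, while E[A] * E[A] = 49/4.
    print(sum(p * a * a for a in dom))  # 91/6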

Now we’re going to define n variables, called Y_1, Y_2, ..., Y_n, so that Y is the sum of all these variables. We define, for the variable Y_1, the domain {0,1}, and we say that Y_1 is equal to 1 if and only if die number 1 shows a 6. The other variables Y_i are defined similarly, one for each die. Since I have n variables, each taking value 1 when its associated die shows a 6, I can write

\displaystyle Y = \sum_{i = 1}^n Y_i

This is where I use linearity of expectation:

\displaystyle E[Y] = E\left[\sum_{i=1}^n Y_i\right] = \sum_{i=1}^n E[Y_i]

The main trick here is that variables Y_i are much simpler to deal with than Y. With probability \displaystyle \frac 1 6, they take value 1; with probability \displaystyle \frac 5 6, they take value 0. Consequently, the expectation of Y_i is also much easier to compute:

\displaystyle E[Y_i] = 1 \times \frac 1 6 + 0 \times \frac 5 6 = \frac 1 6

Plugging that into the previous result, we get the expectation of Y:

\displaystyle E[Y] = E\left[\sum_{i=1}^n Y_i\right] = \sum_{i=1}^n E[Y_i] = \sum_{i=1}^n \frac 1 6 = \frac n 6

which is the result we expected.
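For the skeptical (or for Ork players hoping the dice will feel generous), here is a simulation sketch in Python; n = 140 and the number of trials are arbitrary choices:

    import random

    # Roll n dice, count the 6s, repeat many times, and average the counts.
    n, trials = 140, 10_000
    counts = [sum(1 for _ in range(n) if random.randint(1, 6) == 6)
              for _ in range(trials)]
    print(sum(counts) / trials)  # typically close to n / 6, about 23.3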

Now, my examples are pretty simple, but we can use these kinds of tools in much more complicated situations. And there’s a fair number of other tools that allow us to estimate things about random variables, and to get a fairly good idea of what’s happening… even if we involve dice in the process.

Intro to probability theory – part 1

This post is a translation of a post originally written in French here: Probabilités – partie 1.

I’m eventually going to talk a bit about randomized algorithms, but before I do that, I need to talk about probability theory so that everyone is speaking the same language.

Probability theory is a bit annoying, because probability is something that many people use fairly often in their daily life, but it can be awfully counter-intuitive. That, or human beings suck at probability theory – also a possibility.

I kind of don’t want to formally define probability (“given a universe \Omega, events”, blah blah blah… yeah, no, don’t want to do that), so I’m going to try to explain things in a handwavy way, and hope for the best.

First, let me assume that everything I’m talking about here – unless mentioned otherwise explicitly – has “usual” properties: dice are not loaded, coins are real coins, etc.

Let us start with a classical example. If I flip a coin, what is the probability that it shows heads when it lands? The answer is 1/2: I have two possible events (the coin can show heads or tails), and they both have the same probability of occurring. Same thing if I roll a 6-sided die: the probability that it shows 4 is 1/6; I have six possible events that all have the same probability of occurring. In general, I’ll say that if I do an experiment (flipping a coin, rolling a die) that has k possible outcomes, all of which have the same probability (“chance of happening”), then the probability of each of these outcomes (“events”) is \displaystyle \frac 1 k.

It can happen that the possible outcomes don’t all have the same probability, but there are a few non-negotiable rules. A probability is always a number between 0 and 1. An event that never occurs has probability 0; an event that always occurs has probability 1. If I find a coin with two “heads” sides, and I flip it, it shows “heads” with probability 1 and “tails” with probability 0. Moreover, the probabilities of all the possible outcomes of a given experiment sum to 1. If I have k events that have the same probability, we indeed have \displaystyle k \times \frac 1 k = 1. If I have a die with three sides 1, two sides 2 and one side 3, the probability that it shows 1 is \displaystyle \frac 3 6, the probability that it shows 2 is \displaystyle \frac 2 6, and the probability that it shows 3 is \displaystyle \frac 1 6; the sum of all these is \displaystyle \frac 3 6 + \frac 2 6 + \frac 1 6 = 1.
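That bookkeeping can be checked mechanically; here is a Python sketch for the odd die above (the dictionary representation is just for illustration):

    from fractions import Fraction

    # The odd die: three sides showing 1, two sides showing 2, one side showing 3.
    loaded_die = {1: Fraction(3, 6), 2: Fraction(2, 6), 3: Fraction(1, 6)}

    # Non-negotiable rule: the probabilities of all possible outcomes sum to 1.
    assert sum(loaded_die.values()) == 1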

Now it gets a bit trickier. What is the probability that a (regular) 6-sided die shows 3 or 5? Well, the probability that it shows 3 is \displaystyle \frac 1 6; the probability that it shows 5 is \displaystyle \frac 1 6; and the probability that it shows 3 or 5 is \displaystyle \frac 1 6 + \frac 1 6 = \frac 1 3. But summing the individual probabilities only works if the events are exclusive, that is to say that they cannot both occur at the same time: if I rolled a 5, then I can’t also have a 3, and vice versa.

It does not work if two events can both occur at the same time (non-exclusively). For instance, suppose that I flip a coin and roll a die, and I look at the probability that the coin lands on tails or that the die lands on 6. Then I CANNOT say that it’s equal to the probability of the coin landing on tails – \displaystyle \frac 1 2 –, plus the probability that the die lands on 6 – \displaystyle \frac 1 6 – for a total of \displaystyle \frac 2 3. A way to see that is to tweak the experiment a bit and observe that this reasoning doesn’t work in general. For example, the probability that the die lands on 1, 2, 3 or 4 is \displaystyle \frac 4 6 = \frac 2 3. The probability that the coin lands on tails is \displaystyle \frac 1 2. And it’s not possible that the combined probability of these two events (the die lands on 1, 2, 3 or 4, or the coin lands on tails) is the sum of both probabilities, because that would yield \displaystyle \frac 2 3 + \frac 1 2 = \frac 7 6, which is greater than 1. (And a probability is always less than or equal to 1.)

What is true, though, is that if I have events A and B, then

\Pr(A \cup B) \leq \Pr(A) + \Pr(B)

where \Pr(A) is “the probability that A occurs”, \Pr(B) is “the probability that B occurs” and \Pr(A \cup B) is “the probability that A occurs or B occurs”. Or, in a full sentence: the probability of the union of two events (event A occurs or event B occurs) is always less than or equal to the sum of the probabilities of both events. You can extend this to more events: the probability of the union of several events is less than or equal to the sum of the probabilities of the individual events. It may not seem like a very deep result, but in practice it’s used quite often under the name “union bound”. The union bound does not necessarily yield interesting results. For the case “the die lands on 1, 2, 3 or 4, or the coin lands on tails”, it yields a bound of \displaystyle \frac 7 6… and we already had a better bound from the fact that a probability is always at most 1. It is slightly more useful for bounding the probability that the die lands on a 6 or that the coin lands on tails: we know that this probability is at most \displaystyle \frac 2 3. When analyzing randomized algorithms, the union bound is often pretty useful, because the probabilities that we consider are often pretty small, and we can add a lot of them together before the bound stops making sense. The union bound is often not enough to reach a useful conclusion on its own, but it’s very often a handy tool as a step of a proof.
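To see the bound in action on the coin-and-die example, here is a brute-force enumeration of the combined experiment, as a Python sketch (the helper names prob, tails and six are mine):

    from fractions import Fraction
    from itertools import product

    # The combined experiment: a (coin, die) pair, 12 equally likely outcomes.
    outcomes = list(product(["heads", "tails"], range(1, 7)))
    p = Fraction(1, len(outcomes))

    def prob(event):
        return sum(p for o in outcomes if event(o))

    def tails(o):
        return o[0] == "tails"

    def six(o):
        return o[1] == 6

    print(prob(lambda o: tails(o) or six(o)))  # 7/12: the actual probability
    print(prob(tails) + prob(six))             # 2/3: the union bound, indeed >= 7/12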

For the example that I’ve been using so far, the best tool (because it gives an exact result) is the inclusion-exclusion principle. For two events A and B, it’s stated as follows:

\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)

with the same notation as previously, and \Pr(A \cap B) representing the probability that both events A and B occur. In a sentence: the probability that A or B occurs is equal to the sum of the probabilities of each of them, minus the probability that both A and B occur at the same time. The idea is that, if two events can occur at the same time, the probability of that overlap is counted twice when summing the individual probabilities. It may be clearer with a small drawing (also known as a Venn diagram):

If I count everything that is in the green area (corresponding to event A), and everything that is in the pink area (corresponding to event B), I count twice what is in both the green and the pink areas; if I want the correct result, I need to subtract what is common to both areas once.

Now the question becomes: how do we compute the probability of two events occurring at the same time? There’s the easy case, and the not-so-easy case. In the easy case, the events are independent: the probability that one occurs has no impact on the probability that the other one occurs. This is true (although not necessarily intuitive) if one is working with dice and coins and considering their individual results, but it’s a notion that requires a lot of care to apply (it no longer holds if I’m interested in both the individual results and the sum of two dice, for example). Proving that two events are independent can be pretty complicated, and handling computations when they are not… can be as well.

When two events are independent, we have:

\Pr(A \cap B) = \Pr(A) \times \Pr(B)

that is to say that the probability that both events occur is equal to the product of the probabilities of both events. If I’m flipping a coin and rolling a die, the coin landing on tails has no impact on the die landing on 6. Both events are independent; the probability that both events occur is consequently \displaystyle \frac 1 2 \times \frac 1 6 = \frac{1}{12}. Observe that this probability is smaller than \displaystyle \frac 1 2 and smaller than \displaystyle \frac 1 6. It’s “obvious”, since a probability is at most 1, and multiplying two probabilities together consequently yields a result that is smaller than both. Another way to see it is that two independent events occurring at the same time is less probable than just one of them occurring.
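Plugging that \displaystyle \frac{1}{12} back into the inclusion-exclusion principle settles the earlier “tails or 6” question exactly:

\displaystyle \Pr(A \cup B) = \frac 1 2 + \frac 1 6 - \frac{1}{12} = \frac{7}{12}

which is comfortably below both the union bound of \displaystyle \frac 2 3 and the trivial bound of 1.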

An example of two events that are not independent would be event A, “the die lands on 1”, and event B, “the die lands on an odd number”. In that case, the events are not independent: if the die lands on 1, then it landed on an odd number; if the die lands on 2, it cannot have landed on an odd number as well. In that specific case, event A is included in event B, so it’s easy: the probability that both events occur is equal to the probability of the included event, because A \cap B is the same thing as A. For a subtler example, one can define A as “the die lands on 1, 4 or 6” and keep B as “the die lands on an odd number”. These events are not independent either. The probability of A and B both occurring is \displaystyle \frac 1 6, corresponding to the die landing on 1 (it’s the only outcome that is both in {1, 4, 6} and odd). Multiplying the probabilities of A – \displaystyle \frac 1 2 – and B – \displaystyle \frac 1 2 – without paying attention yields \displaystyle \frac 1 4, and that’s how we get a mistake in our proof.
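The same enumeration trick exposes the mistake directly; a Python sketch in the same style as before:

    from fractions import Fraction

    p = Fraction(1, 6)
    dom = range(1, 7)

    def prob(event):
        return sum(p for v in dom if event(v))

    def A(v):
        return v in {1, 4, 6}

    def B(v):
        return v % 2 == 1  # the die lands on an odd number

    print(prob(lambda v: A(v) and B(v)))  # 1/6: only the outcome 1 qualifies
    print(prob(A) * prob(B))              # 1/4: the careless product, which is wrong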

Last one for the road: I still want to talk about conditional probabilities. Conditional probabilities are a way to handle dependencies between events. We write \Pr(A \mid B), say “probability of A given B”, and understand “probability of A knowing that B occurs”. If A and B are independent, then \Pr(A \mid B) = \Pr(A): the fact that B occurs has no impact on the occurrence of A. For the previous case, where A is the event “the die lands on 1” and B is the event “the die lands on an odd number”, we can see the “given B” clause as a restriction of the possible outcomes. We know that the die landed on an odd number, so it is equally likely to have landed on 1, 3 or 5, and has probability 0 of having landed on 2, 4 or 6. Consequently, we have \displaystyle \Pr(A \mid B) = \frac 1 3.

There’s a general formula for conditional probabilities:

\displaystyle \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}

From this formula, it’s possible to recover the fact that if A and B are independent, then \Pr(A \mid B) = \Pr(A), because then:

\displaystyle \Pr(A \mid B) = \frac{\Pr(A) \times \Pr(B)}{\Pr(B)} = \Pr(A)

That formula is also pretty useful in the other direction:

\displaystyle \Pr(A \cap B) = \Pr(A \mid B) \times \Pr(B) = \Pr(B \mid A) \times \Pr(A)

because it’s sometimes easier to understand what happens in the context of conditional probabilities than in the context of two non-independent events occurring.
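As a last sanity check, the formula can be verified on the running example; this is a Python sketch in the same style as the earlier ones (cond_prob is a name of my choosing):

    from fractions import Fraction

    p = Fraction(1, 6)
    dom = range(1, 7)

    def prob(event):
        return sum(p for v in dom if event(v))

    def cond_prob(a, b):
        # Pr(A | B) = Pr(A and B) / Pr(B)
        return prob(lambda v: a(v) and b(v)) / prob(b)

    def lands_on_1(v):
        return v == 1

    def odd(v):
        return v % 2 == 1

    print(cond_prob(lands_on_1, odd))              # 1/3, as computed above
    print(cond_prob(lands_on_1, odd) * prob(odd))  # 1/6: Pr(A and B), the other direction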

In the next blog post, we’ll talk about random variables 🙂