A Probability/Bayes’ Theorem Primer [1]
Paul De Palma
A.1 Some Preliminaries
We begin with five definitions:

Definition A.1, Random Experiment
A random experiment is one in which “the outcome cannot be predicted with certainty”
(p. 1).

Definition A.2, Sample Space
The sample space, S, is the set of all possible outcomes of a collection of random
experiments.

Definition A.3, Event
If S designates a sample space and A is a subset of S, then A is an event.

Definition A.4, Relative Frequency
Suppose we conduct a random experiment n times. If event A is a set of outcomes, we
can count the number of times event A occurs, denoted |A|. Then (|A| / n) is referred to as
the relative frequency of event A in the n random experiments.

Definition A.5, Probability
The relative frequency tends to stabilize as n grows in size. If we associate a number, p,
with event A that is approximately equal to the value around which the relative frequency
stabilizes, that number is called the probability of A, denoted P(A). Said another way,
“P(A) represents the fraction of times that the outcome of a random experiment results in
the event A in a large number of trials of that experiment” (p. 4).

[1] This material is intended to make the several discussions of Bayes’ Theorem clearer. A fuller discussion is available in Hogg and Tanis (1988).
The following example illustrates these definitions.
Example A.1 Suppose we have a fair six-sided die that we roll repeatedly. The sample space S =
{1,2,3,4,5,6}. Let A = {1,2}, i.e., the outcome of a roll is either 1 or 2. P(A) = 2/6 = 1/3. Can
this be verified experimentally? The following table indicates the result of a computer
simulation.

n     |A|   |A|/n
50    16    .32
100   34    .34
250   80    .32
500   163   .326
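A simulation like the one behind the table can be sketched in a few lines; the exact counts depend on the random seed, so a run will not match the table digit for digit, but the relative frequency stabilizes near 1/3 as n grows:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def relative_frequency(n, event=(1, 2)):
    """Roll a fair six-sided die n times; return |A|/n for the event A."""
    hits = sum(1 for _ in range(n) if random.randint(1, 6) in event)
    return hits / n

for n in (50, 100, 250, 500, 100_000):
    print(n, relative_frequency(n))
```
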
In fact, probability can be defined more precisely using set theory. Here are a few useful
definitions from set theory.

Definition A.6, Set
A set is an unordered collection of objects.

Definition A.7, Union
The union of two sets, A and B, is the set that contains those elements that occur in either
A or B. The union of two sets is denoted A ∪ B.

Definition A.8, Intersection
The intersection of two sets, A and B, is the set that contains those elements that occur in
both A and B. The intersection of two sets is denoted A ∩ B.

Definition A.9, Complement
If A is a set of elements drawn from S, A′, read “A complement,” denotes the set of all
elements of S with the elements of A removed.

Definition A.10, Subset
If every element of set A is also an element of set B, then A is a subset of B, written A ⊆ B.

Definition A.11, Null Set
The null set is the set containing no elements, written ∅. For all sets X, ∅ ⊆ X.

Definition A.12, Cardinality
The number of elements in a set is called its cardinality. The cardinality of set A is
denoted |A|.
We can now define probability more precisely.

Definition A.13
Probability is a function that maps a number P(A) to each event A in the sample space S
such that the following properties are satisfied:
1. P(A) ≥ 0
2. P(S) = 1
3. If A1, A2, A3, … are events such that Ai ∩ Aj = ∅ for i ≠ j, then
P(A1 ∪ A2 ∪ … ∪ Ak) = P(A1) + P(A2) + … + P(Ak)
for each positive integer k, and similarly for an infinite sequence of events.
Less formally, item 1 tells us that probability can never be negative, since it is defined as
the number of times an event occurs divided by the number of trials. Item 2 says that the
probability of the sample space is 1. Since the sample space contains all possible outcomes, one
of the outcomes has to occur for every random experiment, so the relative frequency of S in any
number of trials is n/n = 1. Item 3 tells us that set union is closely related to addition.
Suppose we have a six-sided fair die. If A = {1,2} and B = {3,4}, the probability that you will
roll a 1, 2, 3, or 4 is just the probability that you will roll a 1 or 2 plus the probability that you will roll
a 3 or 4, namely 2/3.
Several basic theorems follow from the definitions. The proofs are quite simple and may be
found in Hogg and Tanis (1988).

Theorem A.1
For each event A, P(A) = 1 – P(A′)

Theorem A.2
P(∅) = 0

Theorem A.3
If events A and B exist such that A ⊆ B, then P(A) ≤ P(B).

Theorem A.4
For each event A, P(A) ≤ 1.

Theorem A.5
If A and B are any two events, then
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
Notice that this meets the conditions of item 3 in Definition A.13: if A ∩ B is the null set,
then its probability is 0, and the formula reduces to P(A) + P(B).

Theorem A.6
If A, B, and C are any three events, then
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C)
Theorem A.6 can be generalized to any number of events.
Example A.2
A man is meeting three friends who are flying in from cities A, B, and C. We know the
following probabilities, where P(Q) is the probability that a flight from city Q is on time.
P(A) = .93
P(B) = .89
P(C) = .91
P(A ∩ B) = .87, i.e., the probability that the flights from city A and city B are both on time.
P(A ∩ C), P(B ∩ C), and P(A ∩ B ∩ C) are given as well.
What is the probability that at least one of the flights is on time? That is, what is the probability
that a flight from city A, or a flight from city B, or a flight from city C is on time?
We are seeking P(A ∪ B ∪ C). The result may be obtained by inserting the given probabilities
directly into the right-hand side of Theorem A.6, obtaining .96.
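Theorem A.6 can also be checked by brute-force enumeration. A minimal sketch using the die sample space of Example A.1 and three hypothetical overlapping events (not the flight data above, whose intersection probabilities are given rather than derived):

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event), len(S))

# Hypothetical events, chosen only to exercise the formula.
A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
print(lhs == rhs)  # the two sides agree
```
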
A.2 Conditional Probability
So far we have been considering events that are subsets of the sample space S, for
example, the probability that a 1 or 2 will be rolled using a fair, six-sided die. In this case, the
event is {1,2}, while S = {1,2,3,4,5,6}. In problems of conditional probability—of which
Bayes’ Theorem is the most obvious instance in speech recognition—we want to define the
probability of an event A considering only those outcomes of a random experiment that are
elements of a prior event B. Consider the following example.
Example A.3 Suppose we are given a list of 20 student IDs that contain no information about
gender or state of residence. We are also given the following summary table:
               Male (M)   Female (F)   Total
Washington (W)     5          8          13
Oregon (O)         3          4           7
Total              8         12          20
If one student is selected at random, the probability that that student lives in Oregon is given by
P(O) = 7/20. Now suppose we are interested only in Oregon residents who are also male. This
time the probability is 3/8, because we are limiting ourselves first to males. We denote this by
P(O|M), read “the probability of O given M,” or “the probability that a student is from Oregon
given that he is male.”
Notice that
P(O|M) = |O ∩ M| / |M| = (|O ∩ M| / n) / (|M| / n) = P(O ∩ M) / P(M) = (3/20) / (8/20) = 3/8
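The counting argument can be reproduced directly from the summary table; a sketch, where the dictionary of counts is just a transcription of the table:

```python
from fractions import Fraction

# Counts from the summary table of Example A.3 (20 students).
counts = {("W", "M"): 5, ("W", "F"): 8, ("O", "M"): 3, ("O", "F"): 4}
n = sum(counts.values())  # 20

def P(pred):
    """Probability that a randomly chosen student satisfies pred."""
    return Fraction(sum(c for k, c in counts.items() if pred(k)), n)

P_M = P(lambda k: k[1] == "M")            # 8/20
P_O_and_M = P(lambda k: k == ("O", "M"))  # 3/20
print(P_O_and_M / P_M)  # P(O|M) = 3/8
```
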
This leads to a definition of conditional probability.
Definition A.14, Conditional Probability
If P(B) > 0, the conditional probability of an event A, given that an event B has occurred,
is given by:
P(A|B) = P(A ∩ B) / P(B)
Multiplying both sides by P(B) gives us the multiplication rule:
Definition A.15, Multiplication Rule
P(A ∩ B) = P(B) * P(A|B)
or
P(A ∩ B) = P(A) * P(B|A)
The following example illustrates the use of conditional probability.
Example A.5
Suppose we are given an opaque box containing 3 red marbles and 7 green marbles. We are
asked to close our eyes and choose two marbles at random. What is the probability that the first
marble chosen was red and the second green? Let R be the event defined by the choice of a red
marble and G be the event defined by the choice of a green marble. Then,
P(R) = 3/10
Now, having chosen a marble, there are only nine marbles left in the box, so
P(G|R) = 7/9
The probability of picking a red marble first followed by a green marble is given by the
probability of the intersection of the event R and the event G: P(R ∩ G).
Invoking Definition A.15, we have:
P(R ∩ G) = P(R) * P(G|R) = 3/10 * 7/9 = 7/30
Notice that the results are the same if the green marble is chosen first. That is, the probability of
first choosing a red marble followed by a green marble equals the probability of first choosing a
green marble followed by a red marble, and both are designated P(R ∩ G).
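A short simulation of the two draws without replacement agrees with the exact value of 7/30 ≈ .2333; a sketch:

```python
import random
from fractions import Fraction

random.seed(1)  # fixed seed for reproducibility

def first_two():
    """Shuffle 3 red and 7 green marbles; return the first two drawn."""
    box = ["R"] * 3 + ["G"] * 7
    random.shuffle(box)
    return box[0], box[1]

trials = 100_000
red_green = sum(1 for _ in range(trials) if first_two() == ("R", "G"))
print(red_green / trials, float(Fraction(3, 10) * Fraction(7, 9)))  # both near .2333
```
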
The multiplication rule can be extended to three events by using the fact that sets are
associative under intersection, namely that A ∩ B ∩ C = (A ∩ B) ∩ C.
For convenience, let D = A ∩ B. Then,
P(A ∩ B ∩ C) = P(D ∩ C) = P(D) * P(C|D)
But P(D) = P(A ∩ B) = P(A) * P(B|A)
So,
P(A ∩ B ∩ C) = P(A) * P(B|A) * P(C|(A ∩ B))
The general formula for more than three events can be proven by mathematical induction. For k
events, it looks like this:
P(A1 ∩ A2 ∩ … ∩ Ak) = P(A1) * P(A2|A1) * P(A3|(A1 ∩ A2)) * … * P(Ak|(A1 ∩ A2 ∩ … ∩ Ak−1))
To see how this works in practice, consider the following example.
Example A.6
Four cards are to be dealt from a deck of ordinary playing cards. What is the probability of the
four cards being a spade, a heart, a diamond, and a club, in that order?
Call the four events S, H, D, C.
P(S ∩ H ∩ D ∩ C) = P(S) * P(H|S) * P(D|(S ∩ H)) * P(C|(S ∩ H ∩ D))
The four terms on the right-hand side mean, in order:
• The probability that a spade is drawn from a deck of 52 cards. Since there are 13 spades
(2 through Ace), P(S) = 13/52.
• The probability that a heart is drawn from a deck from which a spade has been removed.
Since there are now only 51 cards, P(H|S) = 13/51.
• The probability that a diamond is drawn from a deck after both events S and H have
occurred. Since events S and H each resulted in removing a card from the deck,
P(D|(S ∩ H)) = 13/50.
• The probability that a club is drawn from a deck after events S, H, and D have occurred.
Since events S, H, and D each resulted in removing a card from the deck,
P(C|(S ∩ H ∩ D)) = 13/49.
So,
P(S ∩ H ∩ D ∩ C) = 13/52 * 13/51 * 13/50 * 13/49
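The chain multiplies out to roughly .0044, which a Monte Carlo deal confirms; a sketch (the deck representation is mine, chosen only to make suits easy to compare):

```python
import random
from fractions import Fraction

random.seed(7)  # fixed seed for reproducibility

DECK = [(suit, rank) for suit in "SHDC" for rank in range(13)]

def suits_in_order():
    """Deal four cards; True if they fall spade, heart, diamond, club."""
    return [card[0] for card in random.sample(DECK, 4)] == ["S", "H", "D", "C"]

exact = Fraction(13, 52) * Fraction(13, 51) * Fraction(13, 50) * Fraction(13, 49)
trials = 200_000
hits = sum(suits_in_order() for _ in range(trials))
print(float(exact), hits / trials)  # both near .0044
```
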
A.3 Independent Events
Given a pair of events, it is important to know if the occurrence of one of them changes
the probability of the occurrence of the other. In a coin flipping experiment where we are
interested in the order of heads and tails on two consecutive flips, the sample space is
S = {HH, HT, TH, TT}.
The probability of each of the outcomes in the sample space is ¼. Now, let’s define three events:
C = {TT}
B = {heads flipped second} = {HH, TH}
A = {tails flipped first} = {TT, TH}
By Theorem A.5, P(A) = P(TH ∪ TT) = P(TH) + P(TT) – P(TH ∩ TT) = ¼ + ¼ – 0 = ½.
But if we are told that C has already occurred, since C is a subset of A, there is only one
possibility left. So, P(A|C) = 1. In essence, we are asking the probability of two consecutive
flips where a tail occurred on the first flip, having already been told that an event occurred
which included a tail on the first flip. More formally,
P(A|C) = P(A∩C)/P(C). But A∩C = C. So P(A|C) = P(C)/P(C) = 1. In effect, the prior
occurrence of C affected the probability of A.
Now suppose we are told that B has occurred.
P(A|B) = P(A ∩B)/P(B). A∩B = {TH}. Its probability is ¼. So, P(A|B) = (1/4)/(1/2) = ½.
Notice that the prior occurrence of B did not affect the probability of A. When this is the case,
we say that the two events are independent. That is, two events are independent if the
occurrence of one of them has no effect on the probability that the other will occur.
That is, two
events are independent if at least one of the following equalities holds:
P(A|B) = P(A) if P(B) > 0, or P(B|A) = P(B) if P(A) > 0.
It is easy to show that the second of the equalities also holds using the data from the example.
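Both equalities, and the contrast with the dependent event C, can be confirmed by enumerating the four-outcome sample space; a sketch:

```python
from fractions import Fraction
from itertools import product

# Sample space for two flips of a fair coin.
S = set(product("HT", repeat=2))

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {("T", "H"), ("T", "T")}  # tails flipped first
B = {("H", "H"), ("T", "H")}  # heads flipped second
C = {("T", "T")}

print(P(A & B) / P(B) == P(A))  # True: B tells us nothing about A
print(P(B & A) / P(A) == P(B))  # True: the second equality also holds
print(P(A & C) / P(C) == P(A))  # False: C changes the probability of A
```
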
Now, from the definition of conditional probability (Def. A.14), we have the following
relationships:
P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(B ∩ A) / P(A)
So,
P(A ∩ B) = P(A|B) * P(B) = P(A) * P(B)   (1)
P(B ∩ A) = P(B|A) * P(A) = P(B) * P(A)   (2)
Because intersection and multiplication are commutative, (2) can be rewritten:
P(A ∩ B) = P(A) * P(B)
which leads to the following definition:
Definition A.16, Independence
Two events, A and B, are said to be independent if and only if
P(A∩B) = P(A) * P(B)
The definition of independence can be extended to more than two events as follows.
Definition A.17
Any number of events A, B, C, … N are said to be mutually independent if and only if
both of the following conditions hold:
a. Each pair meets the definition of independence
b. P(A ∩B ∩C ∩ … ∩ N) = P(A) * P(B) * P(C) * … * P(N)
Example A.7
A crucial component of an airliner has three levels of redundancy built in. If level A fails,
control is passed to level B. If level B fails, control is passed to level C. The probability of
failure of each component is .02. The levels are mutually independent. What is the probability
that the system will not fail?
Let F be the event that all systems fail. P(F) = P(A∩B∩C). F’ is the event that the system will
not fail. P(F’) = 1 – P(F) = 1 - .02 * .02 * .02 = .999992
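The arithmetic is a one-liner; a sketch:

```python
# Three mutually independent levels, each failing with probability .02.
p_level = 0.02
p_all_fail = p_level ** 3        # Definition A.17, condition (b)
p_no_failure = 1 - p_all_fail    # Theorem A.1, complement rule
print(p_no_failure)              # approximately .999992
```
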
Example A.8
Suppose we have a bowl with three red balls, two blue balls, and four green balls. We are to draw
three balls, replacing each after it has been drawn. Clearly, drawing one ball does not affect the
drawing of the next, since the first ball is replaced. Then the probability of drawing a red, blue,
and green ball, in that order, or two reds and a green, in that order, is given by:
P({RBG} or {RRG}) = 3/9 * 2/9 * 4/9 + 3/9 * 3/9 * 4/9
A.4 Bayes’ Theorem
Bayes’ theorem, named after an 18th century English cleric challenged to defend the gospels, is
concerned with how the probability of a past event affects the conditional probability of a current
event. Consider the following example. Suppose we have three bowls, A, B, C each containing
marbles as follows:
• A contains two red and four white marbles
• B contains one red and two white marbles
• C contains five red and four white marbles.
The bowls are organized in such a way that the probability of selecting each is given by:
• P(A) = 1/3
• P(B) = 1/6
• P(C) = ½
In this experiment we select a bowl and then draw a marble from that bowl. What is the
probability of drawing a red marble, P(R)? Clearly, the outcome is affected by which bowl we
select, the probability of selecting a red marble from C being higher (5/9) than that of either B
(1/3) or A (2/6). So, selecting a red marble depends first on which bowl we select and then
on the probability of selecting a red marble from that bowl. Notice that once we have selected a
bowl, the probability of selecting a red marble depends only on the contents of that bowl.
Further, the bowl selections are mutually exclusive.
Though this is intuitively obvious, let’s refer to the definition of conditional probability
and represent the probability of selecting a red marble from bowl A like this:
P(R|A) = P(R ∩ A) / P(A)
The probability of selecting a red marble from bowls B and C may be represented in a similar
fashion.
So,
P(R ∩ A) = P(R|A) * P(A)
We can do this for each bowl, giving:
• P(R ∩ A) = P(R|A) * P(A)
• P(R ∩ B) = P(R|B) * P(B)
• P(R ∩ C) = P(R|C) * P(C)
Since we want a red marble from either A or B or C, we can write the probability this way:
R is the union of the three intersections whose probabilities we have just represented:
R = (R ∩ A) ∪ (R ∩ B) ∪ (R ∩ C)
Since these sets are non-intersecting, we have:
P(R) = P(R ∩ A) + P(R ∩ B) + P(R ∩ C)
= P(A) * P(R|A) + P(B) * P(R|B) + P(C) * P(R|C)
(3)
= 1/3 * 2/6 + 1/6 * 1/3 + ½ * 5/9 = 4/9
Since we were given the probabilities of choosing the various bowls and the number of marbles
by color in each bowl, we have been able to compute:
• The probability of drawing a red marble from bowl A
• The probability of drawing a red marble from bowl B
• The probability of drawing a red marble from bowl C
• The probability of drawing a red marble from bowl A or bowl B or bowl C
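The total-probability computation can be written out with exact fractions; a sketch using the priors and per-bowl probabilities given in the text:

```python
from fractions import Fraction

# P(bowl) and P(R|bowl) for each bowl, as given in the text.
bowls = {
    "A": (Fraction(1, 3), Fraction(2, 6)),
    "B": (Fraction(1, 6), Fraction(1, 3)),
    "C": (Fraction(1, 2), Fraction(5, 9)),
}

# Total probability of a red marble, equation (3).
P_R = sum(prior * likelihood for prior, likelihood in bowls.values())
print(P_R)  # 4/9
```
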
Suppose we change things around a bit. Someone tells us that he has drawn a red marble.
What is the probability that this red marble was drawn from bowl A? That is, instead of
computing P(R|A) and so forth, we want to compute P(A|R). Said another way, what is the
probability of bowl A having been chosen if we know that a red marble was drawn?
From the definition of conditional probability, we have:
(
( | )
( )
)
( )
( | )
( )
( | )
( )
( | )
( )
( )
( | )
( )
( | )
(4)
We can compute P(B|R) and P(C|R) in the same fashion, noting that the denominator remains the
same.
P(B|R) = [P(B) * P(R|B)] / P(R)
(5)
and
P(C|R) = [P(C) * P(R|C)] / P(R)
(6)
Notice that when the given probabilities are plugged into equations (4), (5), and (6), the resulting
probabilities agree with our intuition, namely, that if a red marble is observed, the probability
that it came from bowl C is higher, since a greater fraction of the marbles in bowl C are red
than in either bowl A or B:
P(A|R) = 2/8
P(B|R) = 1/8
P(C|R) = 5/8
P(A), P(B), and P(C) are usually called the prior probabilities. P(A|R), P(B|R), and P(C|R) are
called the posterior probabilities.
It is easy to generalize the foregoing. Since choosing a bowl is an event, that is, a set of
trials, the sample space S = A ∪ B ∪ C. We can think of events A, B, and C as mutually exclusive
and exhaustive partitions of the sample space. So, for a sample space, S, with m partitions, we
have:
S = B1 ∪ B2 ∪ … ∪ Bm
where each Bi is a mutually exclusive partition. Suppose the prior probabilities of the partitions
are positive. That is, P(Bi) > 0, i = 1, …, m. In bowl world, this simply means that S represents
the choice of bowl 1, the choice of bowl 2, and so on.
Suppose A is an event that can occur with each of the mutually exclusive partitions:
A = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ Bm)
Then,
P(A) = Σ P(A ∩ Bi) = Σ P(Bi) * P(A|Bi)
(7)
Notice that equation (7) is just equation (3) generalized to m bowls. If P(A) > 0, the definition of
conditional probability tells us that:
P(Bi|A) = P(Bi ∩ A) / P(A)
(8)
Using the multiplication rule in the numerator and replacing P(A) in (8) with (7), we get the
general form of Bayes’ Theorem:
P(Bi|A) = [P(Bi) * P(A|Bi)] / Σ P(Bj) * P(A|Bj)
With just two events, X and Y, the simple form is:
P(X|Y) = [P(Y|X) * P(X)] / P(Y)
The following example illustrates the use of Bayes’ theorem for two events.
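The general form can be wrapped in a small function; a sketch, checked against bowl world (the function name bayes and the list layout are mine, not the text's):

```python
from fractions import Fraction

def bayes(priors, likelihoods, i):
    """P(Bi|A) for mutually exclusive, exhaustive partitions B1..Bm.

    priors[j] = P(Bj); likelihoods[j] = P(A|Bj).
    """
    total = sum(p * l for p, l in zip(priors, likelihoods))  # equation (7)
    return priors[i] * likelihoods[i] / total                # Bayes' theorem

# Sanity check against the bowl example: posteriors are 1/4, 1/8, 5/8.
priors = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)]
likelihoods = [Fraction(2, 6), Fraction(1, 3), Fraction(5, 9)]
print([bayes(priors, likelihoods, i) for i in range(3)])
```
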
Example A.9
A part obtained from supplier X has a 2% failure rate. The same part obtained from supplier Y
has a 3% failure rate. A manufacturer purchases 40% of its parts from X and 60% from Y. What is
the probability that a part selected at random from the mixed parts will fail?
We know:
P(F|X) = .02
P(F|Y) = .03
P(X) = .4
P(Y) = .6
The event that a part will fail can be expressed like this:
F = (F ∩ X) ∪ (F ∩ Y)
So,
P(F) = P(F ∩ X) + P(F ∩ Y)
= P(X) * P(F|X) + P(Y) * P(F|Y)
= .4 * .02 + .6 * .03 = .026
Now, given that a part failed, what is the probability that it was purchased from supplier X?
Here’s where Bayes’ theorem comes into play:
P(X|F) = [P(F|X) * P(X)] / P(F) = (.02 * .4) / .026 ≈ .31
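The two-supplier computation translates directly; a sketch:

```python
# Failure rates and purchase shares for the two suppliers.
p_fail = {"X": 0.02, "Y": 0.03}
share = {"X": 0.4, "Y": 0.6}

P_F = sum(share[s] * p_fail[s] for s in share)   # total probability, near .026
P_X_given_F = share["X"] * p_fail["X"] / P_F     # Bayes' theorem, near .31
print(P_F, P_X_given_F)
```
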
Example A.10
The example concerning three bowls and marbles of various colors that opens this section
illustrates the use of Bayes’ Theorem with more than two events.
References
De Palma, P. (2010). Syllables and Concepts in Large Vocabulary Speech Recognition.
Dissertation, Department of Linguistics, University of New Mexico.
http://hdl.handle.net/1928/10865
Hogg, R., & Tanis, E. (1988). Probability and Statistical Inference. New York: Macmillan.