MA/CS 109
The Art and Science of Quantitative Reasoning
Probability
Our primary topic for the three-week statistics module in this class will be inferential
statistics. Inferential statistics is concerned with making precise quantitative statements
about a population, given data on that population, while acknowledging the inherent uncertainty in the data. This uncertainty may be an unavoidable aspect of our data collection
or, alternatively, may be purposely incorporated into the design of a study. In either case,
in order to succeed in making precise quantitative statements we require a language and associated mathematical machinery for quantifying the uncertainty we face. For this purpose
we employ probability.
We will spend the first week of our module laying a modest foundation in elementary
probability. Although the topic is both broad and deep, even at a foundational level, our
treatment will necessarily be minimal. Our goal here will be simply to gain some basic
understanding of what probabilities are, how they behave, and how to interpret them. An
understanding of these issues will, in turn, greatly facilitate our discussion of select topics
in inferential statistics in the second and third weeks of our module. Probability will also
be seen, in a quick detour back to computer science, to be central to certain types of cryptography, where its usage is purposeful and – to any of us relying on password encryption –
beneficial.
1 Background

1.1 Randomness
Recall the fundamental diagram that has guided much of our inquiry so far in this class.
Figure 1: Fundamental quantitative reasoning diagram. [Diagram linking Real-World, Modeling/Abstraction, and Analysis/Solution.]
What most distinguishes the material we will cover in the statistics module of the course from that covered in the previous two modules, on mathematics and on computer science, is (i) the introduction of the concept of randomness as an abstraction for our uncertainty in problems, and (ii) the development of tools from probability and statistics for analysis/solution in problems involving uncertainty.
In this class, when we refer to a phenomenon as being random we will mean that it
is (seemingly) unsystematic – that is, without order, direction, or regularity, and therefore
unpredictable. The qualification ‘seemingly’ is included in this working definition because
the nature of the information we have on a phenomenon often directly impacts the extent
to which we can make effective predictions concerning its behavior.
For example, mathematical models of growth and change are used to predict the weather
with great accuracy short-term (e.g., on the order of hours to days) but still have little
predictive power longer term (e.g., on the order of weeks and beyond). Essentially, there are
too many factors, either unknown to us or whose effects are known only to limited accuracy, that impact the weather relatively little over shorter timescales but quite substantially over longer
timescales. Alternatively, consider the type of encryption that is employed in our everyday
usage of the Internet. For a third party, perhaps seeking to act in a malicious manner
towards you or your computer (e.g., steal credit card information, hijack your laptop, etc.),
lack of knowledge of the underlying encryption key is sufficient (hopefully!) to render the
bits encoding your content essentially random from his/her perspective. But, obviously, with
knowledge of this information, such security would be sacrificed.
Thus the notion of randomness is a conceptualization that we will use for capturing
the non-systematic part of what we are facing in a problem. Probability is an attempt to
formalize the language associated with and, in particular, the mathematical treatment of
randomness.
1.2 Some Basic Terminology
Given a random phenomenon, an actual realization of that phenomenon is called an outcome. The set of all possible outcomes is called the sample space. The canonical example of a random phenomenon, and the one we will focus on in lecture, is that of a random coin flip (or a sequence of flips).
Since a coin, when flipped, can come up either heads (H) or tails (T), those are the only two
possible outcomes, and the sample space is simply the set {H, T }. Alternatively, suppose
that the phenomenon of interest to us is the number of times your favorite celebrity ‘twitters’
tomorrow. Then the outcome is the actual number of twitters (whatever that ends up being),
and the sample space is the set {0, 1, 2, 3, . . .}.
As a final example, and one which we will revisit throughout these notes, consider the
action of rolling a pair of dice.
Example 1. (Roll the Dice!) The rolling of a die as a mechanism for inducing uncertainty
goes back thousands of years. The mathematical study of the chances of outcomes associated
with die rolls goes back centuries, and in fact is the context within which some of the earliest
work on what is now called ‘probability’ was done. Although the rolling itself, and hence its
outcome, clearly must be governed by the laws of physics, the end result of a die well-rolled
on a flat surface remains beyond our ability to predict with any real accuracy.
Suppose that you have a pair of dice and that you plan to roll them and to report the
value that comes up on each. Then the outcome is the pair of numbers that you report. The
sample space is the set of all possible pairs of numbers.
It will be useful for our thinking and calculations to organize the sample space in the form
of a table, as shown in Figure 2. Here the first number (shown in blue), which matches the
row number, corresponds to the number reported for the first die, while the second number
(shown in red), which matches the column number, corresponds to the number reported for
the second die.

Figure 2: Table representing the sample space associated with the throwing of two dice.
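For readers who like to experiment, this sample space is also easy to enumerate by computer. The short Python sketch below (the variable names are simply illustrative choices, not part of the example) lists all 36 ordered pairs:

    from itertools import product

    # Each die shows one of the faces 1 through 6.
    faces = range(1, 7)

    # The sample space: all ordered pairs (first die, second die).
    sample_space = list(product(faces, faces))

    print(len(sample_space))    # 36 possible outcomes
    print(sample_space[:6])     # (1,1), (1,2), ..., (1,6)

Listing the pairs in this order reproduces the table of Figure 2 one row at a time.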
An outcome is the actual realization of a random phenomenon. But in practice our interest often is not necessarily in a specific outcome itself, but rather in a more non-trivial characterization of that outcome. For example, in a sequence of 10 coin flips we may be interested in whether or not there were more heads than tails. Alternatively, in monitoring the number of twitters by your favorite celebrity tomorrow, we may only be interested in whether there were more than, say, 25. Formally, these are examples where we are interested not in whether a single outcome occurred or not, but instead in whether or not the outcome observed was in a certain subset of the sample space. Such subsets are called events. To say that an event occurred is to say that at least one of the outcomes defining that event occurred.
Example 2. (Roll the Dice! (cont)) Consider again our example of rolling a pair of dice,
introduced in Example 1. Outcomes were the values that come up on the two dice after they
are rolled. Events are therefore any statement regarding this pair of values. Examples of
such statements include
• The first die is less than the second die;
• Both dice are odd;
• The sum of the two dice is equal to 6;
• The first die is equal to three.
The subsets of the sample space corresponding to these four events are illustrated in Figure 3.
Figure 3: Illustration of four possible events when rolling a pair of dice. Top left: the first
die is less than the second die. Top right: both dice are odd. Bottom left: the sum of the
two dice is equal to 6. Bottom right: the first die is equal to three. Subsets of outcomes
defining these events are shaded in light blue.
In general, discussion of sample spaces, outcomes, and events is simplified by the use of
some minimal mathematical notation. The symbol S is typically used to denote the sample
space, while capital letters like A, B, etc. are used to denote events. Note that outcomes
themselves can be thought of as events, in that outcomes are subsets of the sample space
consisting of just one element.
2 Probabilities
Two of the main points regarding probability that we need to understand for our discussion
of inferential statistics in the second and third weeks of our statistics module are (i) what
probabilities are and (ii) how they behave.
2.1 What Are Probabilities?
From a mathematical perspective, a probability is a function that assigns numerical values
to events. An event that has a numerical value greater (less) than another is then said to be
more (less) likely than the other. The assignment of numerical values, however, is not allowed
to be completely arbitrary, but rather is subject to certain rules. These rules derive from a
combination of (i) certain basic definitions and assumptions, and (ii) additional results that
follow logically from those.1 In terms of notation, we denote the probability of an event A
by P (A).
A natural question to ask about probabilities is, “From whence do they come?” The
answer is that they typically are obtained in one of three manners (or some combination
thereof): (i) by mathematical calculations, (ii) by empirical observations, and (iii) through
subjective assessment. We will see examples this week of how probabilities may be derived
from mathematical calculations, while later we will focus on the inference of probabilities
from data. Note that in both cases the modeling inherent in our fundamental diagram, in
Figure 1, plays a key role.
Also important is the question of how to interpret probabilities. There are two standard
interpretations of probability. The first treats probabilities as relative frequencies, while the
second treats them as degrees of subjective belief. Both treatments are used in practice,
but the former arguably is used substantially more than the latter. The interpretation of
probabilities as relative frequencies is known as the long-range frequency interpretation
of probability. Given a random phenomenon we intend to observe, and an event A of interest
to us, this interpretation says that the probability P (A) is the frequency with which A would
be observed to occur over the long run, were the action of our random phenomenon generating
an outcome to be replicated a large number of times. That is, P (A) is the frequency with
which we expect to observe A over the ‘long-run’. See the appendix for a more detailed
discussion of this interpretation.
2.2 A Simple Model: Equally Likely Outcomes
The simplest type of probability model, which is nevertheless used with some frequency in
practice, is one which assumes that all outcomes in a sample space are equally likely. That
is, the model assumes that the probability of any outcome is the same as that of any other.
This model has the advantage that it greatly facilitates many of the types of calculations
we’d like to do in probability, with many expressions being derivable in closed form. And in
some situations, the model is in fact quite reasonable.
For example, in modeling the tossing of a coin, if we specify that this be a fair coin,
then we are implicitly saying that the outcomes H and T are equally likely. Similarly, when
surveys are conducted, the common assumption that the survey is based on a so-called
simple random sample of, say, n people from a population of size N (an assumption that will
underlie most of our development when studying statistical inference and opinion polls) is
1. Formally, a probability is a mathematical function that assigns a numerical value to each subset of a sample space. The assumptions defining the rules to which probabilities are subject are called axioms. These axioms are generally attributed to the great Russian mathematician A.N. Kolmogorov, and are stated in one of a handful of equivalent forms, involving either 3 or 4 axioms.
equivalent to assuming that any subset of n people (i.e., the outcome of the survey sampling)
is equally likely.
More often, however, this type of model is too simple, and outcomes are not equally
likely. Then, either more complex models must be specified upon which to base our probability calculations or, alternatively, we must turn to statistical inference to estimate these
probabilities. Nevertheless, the model of equally likely outcomes is extremely valuable for
building intuition and, often, as at least a rough approximation to reality.
We will use this model below, and in lecture, for illustration. And we will focus on
examples where the sample space S is a finite set. This latter constraint is convenient in
that it helps reduce most of our probability calculations to counting arguments. In particular,
in this context we then have that,2 for any event A, the probability of A is simply

P(A) = (# outcomes in A) / (# possible outcomes in S).     (1)
That is, for an equally likely probability model on a finite sample space, the probability of
any event is simply the fraction of outcomes in the sample space that define the event.
Example 3. (Rolling Dice Under an Equally Likely Model) Let us return to the context of
Example 1, in which an outcome is defined as the pairs of numbers facing up after rolling
two dice. The sample space S in this setting contains a total of 36 outcomes, as shown in
Figure 2. If we assume that all 36 outcomes are equally likely, then the probability of a given
event A is obtained, by formula (1), as the number of outcomes in A divided by 36.
So, for example, consulting Figure 3, we find that

P( The first die is less than the second die ) = 15/36
P( Both dice are odd ) = 9/36
P( The sum of the two dice is equal to 6 ) = 5/36
P( The first die is equal to three ) = 6/36 .
With the ability to make such fundamental calculations, we are then able to make precise
quantitative statements not only regarding the individual events themselves but also regarding,
say, comparisons of different events. For example, we see that it is one and a half times as
likely that both dice are odd as that the first die is equal to three, while it is three times as
likely that the first die is less than the second die as that the sum of the two dice is equal to
6.
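Calculations of this sort are also easy to check by computer. The Python sketch below (the function prob and the event definitions are our own shorthand, not part of the example) counts outcomes directly under the equally likely model and reproduces the four probabilities above:

    from fractions import Fraction
    from itertools import product

    sample_space = list(product(range(1, 7), repeat=2))   # the 36 equally likely outcomes

    def prob(event):
        # P(A) = (# outcomes in A) / (# outcomes in S), as in formula (1).
        favorable = sum(1 for outcome in sample_space if event(outcome))
        return Fraction(favorable, len(sample_space))

    print(prob(lambda d: d[0] < d[1]))                      # 15/36 = 5/12
    print(prob(lambda d: d[0] % 2 == 1 and d[1] % 2 == 1))  #  9/36 = 1/4
    print(prob(lambda d: d[0] + d[1] == 6))                 #  5/36
    print(prob(lambda d: d[0] == 3))                        #  6/36 = 1/6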
It can be shown, using certain rules for probabilities in the spirit of those below (but
slightly beyond the scope of this class), that the ‘equally likely’ model assumed here holds if
(i) the outcome of each die does not influence that of the other in any way, and (ii) each die
is a fair die.

2. This result can be shown to follow from the axioms of probability. Prior to the introduction of the axioms, however, this expression for P(A) was traditionally taken as the definition of probability. As a result it is sometimes referred to as the 'classical definition of probability'.
2.3 How Do Probabilities Behave?
We list below some of the key rules that probabilities P must follow. The first three rules
are essentially the assumptions that underlie the system of modern probability, while the
last two rules may be shown to follow from those assumptions.
Rule 1. All probabilities must be between zero and one.
That is, for any event A, we must have 0 ≤ P (A) ≤ 1. This rule simply sets limits to the
range of values that may be assigned as probabilities of events. These limits are analogous to
those implicit in saying, for example, that you can give no less than 0% and no more than 100% in athletics (despite athletes consistently saying in interviews that they are prepared to give, say, 110% in tomorrow's game!). One simple use of this rule is as a sanity check that
a probability calculation is correct – if your calculation shows that an event has probability
greater than 1, for example, then your calculation is wrong!
Rule 2. It is certain that something happens.
That is, the probability of the entire sample space is one, i.e., P(S) = 1. Recall that for an event to occur means that at least one of the outcomes defining that event must occur. Therefore, to say that S occurred simply means that the realization of a given random process
was one of its possible outcomes! For example, if we flip a coin, then the probability that it
comes up one of heads or tails is 1. This rule, which would seem to be nearly a tautology, is
perhaps most useful as a constraint when proving certain of the other rules.
Rule 3. If two events do not share any outcomes, then the probability of an outcome being
in either event is the sum of the probabilities of the individual events.
More formally, if we have two events A and B that share no outcomes (i.e., A and B are disjoint sets), then we write P(A or B) = P(A) + P(B). Here 'or' denotes a
logical OR operation. In writing A or B we’re defining a new, larger event from the original
events A and B. This event contains all outcomes that are either in A, in B, or in both A
and B. A standard application of this rule, when joined with the previous rule, is to provide
a formal proof of formula (1). Rule 3 also is fundamental to showing that the following two
rules must hold for probabilities.
Rule 4. The probability of an event happening is equal to one minus the probability of the
event not happening.
For a given event A, the event ‘not A’ is formally called the complement of A, and
usually written A^c. This rule says that P(A) = 1 − P(A^c).
Rule 5. If one event occuring implies that another event occured as well, then the probability
of the second event is at least as large as the probability of the first event.
Denoting the first and second events by A and B, respectively, the stated condition is
the same as saying that A is a subset of B, and the rule itself says that it must then follow
that P (A) ≤ P (B).
Example 4. (Rolling Dice Under an Equally Likely Model (cont)) Under a finite sample
space S and the equally likely model introduced in Section 2.2, it is straightforward to illustrate
these various rules. Again we focus on the rolling of a pair of dice.
Consider Rules 1 and 2. Since an event A is, by definition, a subset of the sample space
S, and therefore can contain no more outcomes than S itself (and, certainly, it can contain
no fewer than zero outcomes), by formula (1) we see that P (A) must be between 0 and 1.
Similarly, by formula (1), we see that P (S) = 36/36 = 1.
In order to illustrate the usage of Rule 3, consider the event that the sum of the two dice
is equal to 6. Call this event A and recall that P (A) = 5/36. Now let B be the event that
the sum of the two dice is equal to 4. This event consists of the outcomes (3, 1), (2, 2), and (1, 3), and therefore has probability 3/36. That A and B are disjoint can both be seen by
inspection of Figure 2 and argued from first principles (since the sum of the two dice cannot
simultaneously be both 6 and 4). Therefore, it follows by Rule 3 that the probability of the
sum of the two dice equaling either 4 or 6 is equal to P (A or B) = 5/36 + 3/36 = 8/36,
which can of course be confirmed through direct counting.
Using Rule 4 we can immediately conclude, for example, that the probability of the first
die being greater than or equal to the second die is equal to 1−15/36 = 21/36. The conclusion
follows because this event is the complement of the event that the first die is less than the
second die, which we saw has probability 15/36. Similarly, application of Rule 5 tells us that the probability that the second die is at least two greater than the first die must be no more than (in fact, it is strictly less than) 15/36. We are allowed to conclude this fact because every outcome defining this event is an instance of the event that the first die is less than the second die, and hence the former event is a subset of the latter event.
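For those who would like to verify Rules 3 through 5 numerically, a small Python sketch in the same spirit as before (the event names are again our own illustrative shorthand) confirms the calculations of this example:

    from fractions import Fraction
    from itertools import product

    sample_space = list(product(range(1, 7), repeat=2))

    def prob(event):
        return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

    sum_is_6 = lambda d: d[0] + d[1] == 6
    sum_is_4 = lambda d: d[0] + d[1] == 4
    first_lt_second = lambda d: d[0] < d[1]

    # Rule 3: for disjoint events, P(A or B) = P(A) + P(B).
    assert prob(lambda d: sum_is_6(d) or sum_is_4(d)) == prob(sum_is_6) + prob(sum_is_4)   # 8/36

    # Rule 4: P(first die >= second die) = 1 - P(first die < second die).
    assert prob(lambda d: not first_lt_second(d)) == 1 - prob(first_lt_second)             # 21/36

    # Rule 5: 'second die at least two greater than first' is a subset of 'first < second'.
    assert prob(lambda d: d[1] >= d[0] + 2) <= prob(first_lt_second)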
2.4 Independence
A very important notion in probability is that of independence. What does the word
‘independent’ mean to you when you hear it in every day speech? Presumably it evokes a
sense of freedom from external control, a lack of influence, etc. Probabilists have a formal
definition for the word ‘independent.’ The use of independence as a modeling assumption is
fundamental to much of the probability and statistics used in practice.
Let A and B be two events. We say that A and B are independent if
P (A and B) = P (A)P (B) .
If two events are not independent, we say that they are dependent. Here ‘and’ denotes a
logical AND operation. In writing ‘A and B’ we are defining a new, smaller event from the
original events A and B. This event constitutes those outcomes that satisfy the definitions
of both A and B.
Independence as defined here says that in order to quantify the chance with which both
A and B will occur all that we need is knowledge of the chance with which A and B will
occur individually.3 We illustrate with some examples.
3. Note that the notion of independence is a definition. In particular, it is not a rule that follows from the axioms of probability. There are, in fact, other ways to define independence. We will see another – equivalent – definition of independent events in the third week of our statistics unit when we introduce the idea of conditional frequencies and conditional probability.
Example 5. (Rolling Dice Under an Equally Likely Model (cont)) Return again to the
rolling of two dice, as introduced in Example 1, and recall the equally-likely model described
in Example 3. Suppose we define our events A and B as
A = { The first die is equal to 3 }
and
B = { The second die is equal to 6 } .
Under the equally-likely model, P (A and B) = 1/36. In addition, P (A) = P (B) = 1/6.
This can be shown using Rule 3. For example,
P (A) = P (Observe a pair (3, 1) or (3, 2) or (3, 3) or (3, 4) or (3, 5) or (3, 6))
= 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36
= 1/6 .
As a result, since P (A and B) = 1/36 and P (A)P (B) = (1/6)2 = 1/36, we have that
P (A and B) = P (A)P (B). Therefore, it follows from the equally-likely model that the events
that the first die is 3 and the second die is 6 are independent.
In fact, we can show similarly that the events specifying any pair of values for the first and second dice are independent. Hence, the equally-likely model actually implies a relationship between the two dice analogous to what we would picture in reality, i.e., two dice being rolled in such
a way that neither influences the other.
Note, however, that the fact that the dice are independent under this model does not
mean that all events are independent. For example, let A be the event that the first die
is equal to three, and B, the event that both dice are odd. Then we saw in Example 3 that
P (A) = 6/36 = 1/6 and P (B) = 9/36 = 1/4. So we see that P (A)P (B) = 1/6×1/4 = 1/24.
On the other hand, we can see that the event that both the first die is three and both dice are odd consists of three outcomes, i.e., (3,1), (3,3), and (3,5). As a result, P(A and B) = 3/36 = 1/12. Since 1/24 ≠ 1/12, we have that P(A and B) ≠ P(A)P(B), and so the events
A and B are dependent. This conclusion, of course, matches our intuition, since it makes
sense that whether or not both dice are odd is influenced by whether the first die is three,
which is an odd number.
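Both of the independence checks in this example reduce to counting, and can be sketched in Python as follows (the helper function independent is our own; it simply tests the defining equation P(A and B) = P(A)P(B) by enumeration):

    from fractions import Fraction
    from itertools import product

    sample_space = list(product(range(1, 7), repeat=2))

    def prob(event):
        return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

    def independent(a, b):
        # Definition of independence: P(A and B) = P(A) P(B).
        return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

    first_is_3 = lambda d: d[0] == 3
    second_is_6 = lambda d: d[1] == 6
    both_odd = lambda d: d[0] % 2 == 1 and d[1] % 2 == 1

    print(independent(first_is_3, second_is_6))   # True:  1/36 equals (1/6)(1/6)
    print(independent(first_is_3, both_odd))      # False: 1/12 does not equal 1/24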
In Example 5 we found that independence of what comes up on the face of each of
the dice followed from the original equally-likely model assumption. Alternatively, it is not
uncommon to begin with an assumption of independent die rolls. If we assume that the
outcome of each die roll is equally likely to be any of the numbers 1,2,3,4,5, or 6, and we
assume that the two outcomes are independent, then we find that the probability of any pair
of outcomes is 1/6 × 1/6 = 1/36, just as before. So the two models are equivalent. But the
latter model is arguably more convenient when, for example, we want to model the rolling
of many dice, and not just two. The notion of independence extends accordingly and we obtain, for example, that any particular N-tuple of outcomes from rolling N independent dice, each equally likely to come up 1, 2, 3, 4, 5, or 6, has probability (1/6)^N.
3 Probabilistic Modeling in Action: Genetics
Gregor Mendel (1822-1884) is considered the ‘father of genetics’, having done a variety of
experiments studying inheritance in pea plants, and formulating certain ‘laws’ to summarize
the behavior that he observed. Experience has not supported all of his laws, but it has
supported what is known as Mendel’s First Law. This law is now incorporated into basic
probability models for how characteristics are passed on from parents to children. Here we
look at two examples.
Example 6. (Modeling Flower Color in Mendel’s Peas) The color of flowers in the pea
plant Pisum sativum is governed by a single gene, which can take either of two forms, called
alleles. Represent these alleles with the variables A and a. Each child pea plant gets two
copies of this gene – one from the ‘father’ plant and one from the ‘mother’ plant. The father
and mother in turn each have two alleles. A result of Mendel’s First Law (his so-called ‘law
of segregation’) is that any combination of alleles in which one comes from the father and
one comes from the mother is equally likely.4
To model genetic inheritance from a probabilistic standpoint, think of the assignment of
alleles to a child pea plant as a random phenomenon. Suppose that the father and mother
both have one A allele and one a allele. Then the corresponding sample space, consisting of
all possible ways for a child to receive a pair of alleles from its parents is {AA, Aa, aA, aa}.
Here, for concreteness, we arbitrarily let the first in a pair correspond to the allele received
from the mother, and the second, to that received from the father.
The sample space is just the set of all possible allele pairs. That all four such pairs are
equally likely means that
P ( Receive AA) = P ( Receive Aa) = P ( Receive aA) = P ( Receive aa) = 1/4 .
Now, the color of the flowers in Mendel’s pea plants can be either purple or white. It is
purple if the plant has at least one A allele. What is the probability of the event that the child
of our two parents has purple flowers? Since this event occurs for each of the three allele
pairs AA, Aa, and aA, we have that
P (Child Has Purple Flowers ) = 3/4.
Put another way, recalling the long-range frequency interpretation of probability, if we bred many child plants from our two parents, in the long run we would expect children with purple flowers to occur three times as frequently as children with white flowers.
That the allele A controls the color if at least one copy is present leads to it being called
dominant, and the allele a, recessive. The so-called genotype is a reference to allele pairs
without concern for order. So, for example, a child will have the ‘Aa′ genotype if it received
either the Aa or aA allele pair. The genotype ‘Aa′ is called heterozygous, while the genotypes
‘AA′ and ‘aa′ are called homozygous. The probability of the event that the child genotype is
heterozygous is 1/2, as is the probability of the event that it is homozygous.
These various genetic states and their associated probabilities are typically represented in
a diagram like that shown in Figure 4.

4. Equivalently, we can model reality here by assuming that (i) each parent gives an allele to the child independently of the other parent, and (ii) a parent is equally likely to give either of her/his two alleles.
Figure 4: Schematic illustration of the inheritance of flower petal color in Mendel’s experiments.
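As a computational aside, the equally likely model of Example 6 can also be enumerated directly. The Python sketch below (the allele labels and helper names are illustrative choices) reproduces the probabilities 3/4 and 1/2 computed above:

    from fractions import Fraction
    from itertools import product

    # Both parents are Aa; the child receives one allele from the mother and one
    # from the father, with all four ordered pairs equally likely (Mendel's First Law).
    mother, father = "Aa", "Aa"
    sample_space = list(product(mother, father))   # ('A','A'), ('A','a'), ('a','A'), ('a','a')

    def prob(event):
        return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

    purple = lambda pair: "A" in pair                   # at least one dominant A allele
    heterozygous = lambda pair: set(pair) == {"A", "a"}

    print(prob(purple))         # 3/4: child has purple flowers
    print(prob(heterozygous))   # 1/2: child has genotype Aa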
The basic principles here can be extended to the modeling of two characteristics simultaneously.
Example 7. (Modeling Pea Color and Texture Simultaneously) The inheritance of the characteristics of pea color and texture also follows Mendel's First Law. Furthermore, a result of Mendel's First and Second Laws together is that all combinations of the parent alleles for pea color and texture, obtained by taking one allele for each of color and texture from each parent, are equally likely.5 More specifically, represent the alleles for pea color, occurring as either yellow or green, as Y and y, respectively. Similarly, represent the alleles for pea texture, occurring as either round or wrinkled, as R and r, respectively.
Then Figure 5 shows all possible quadruples of alleles, as they can be received by a child from
its parents, for these two characteristics.
Considering the joint inheritance of pea color and texture as a random phenomenon, the
figure shows that the corresponding sample space has 16 outcomes. The outcome RRYy, for
example, in the first row and second column, means that the child received the R allele from
both parents, the Y allele from the mother, and the y allele from the father. Yellow color
is dominant over green color, and round texture is dominant over wrinkled. We see then,
for example, that the probability of the event that the child produces peas that are round and
yellow is 9/16.

5. Mendel's Second Law states that which allele a parent gives for one trait will be independent of which allele the same parent gives for another trait. Empirical evidence has shown that this 'law' in fact does not always hold. In some cases, depending on their locations upon the chromosomes, genes can be linked and the transfer of their respective alleles becomes dependent. The modeling of genes under so-called linkage is a substantially more complicated task than that which Mendel faced, and a great deal of work in modern statistical genetics has focused on this problem.

Figure 5: Schematic representation of inheritance of color and texture in Mendel's peas. (Note: Entries of the table are written in standard genotypic (i.e., unordered) notation, but the ordering is implicit from the corresponding marginal row/column entries.)
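The 16-outcome sample space of Example 7 can be enumerated in the same spirit as before. The sketch below treats the allele pair received for texture and the allele pair received for color as independent, in line with Mendel's Second Law, and recovers the probability 9/16 that a child produces round, yellow peas:

    from fractions import Fraction
    from itertools import product

    texture_pairs = list(product("Rr", "Rr"))    # allele pair received for texture
    color_pairs = list(product("Yy", "Yy"))      # allele pair received for color

    # All 16 combinations of (texture pair, color pair) are equally likely.
    sample_space = list(product(texture_pairs, color_pairs))

    def prob(event):
        return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

    # Round and yellow are dominant, so each trait is expressed whenever at least
    # one dominant allele is present.
    round_and_yellow = lambda o: "R" in o[0] and "Y" in o[1]

    print(len(sample_space))        # 16
    print(prob(round_and_yellow))   # 9/16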
4 What to Take Away with You

The key elements of this handout that you should take away with you are as follows.
1. Concepts
• ‘Randomness’ – in the sense of (seemingly) unsystematic – as a conceptualization
of our uncertainty in quantitative reasoning problems.
• The notion of probabilities as functions that assign numbers to events, quantifying their relative uncertainty, subject to certain rules.
• The concept of an ‘equally likely outcomes’ model and the implications that this
model has on the calculation of probabilities of events.
• The concept of ‘independent’ events.
2. Skills
• Identify and distinguish between outcomes, events, and sample spaces when faced
with a given random phenomenon.
• Calculate probabilities of events for simple examples, under the ‘equally likely
outcomes model’.
• Utilize Rules 1 through 5 to make basic probabilistic statements and comparisons.
• Determine in simple models whether or not two events are dependent or, conversely, calculate the probability of two independent events both occurring.
5 Appendix: Interpreting Probabilities
Ultimately, our goal here is not to become experts in probability calculations. Rather, our
goal is to become more savvy consumers of probabilistic information, particularly as it relates
to statistical inferences. We have discussed what probabilities are and how they behave.
The last main point regarding probability that we need to understand for our discussion
of inferential statistics, in the second and third weeks of our statistics module, is how to
interpret probabilities.
There are two standard interpretations of probability. The first treats probabilities as
relative frequencies, while the second treats them as degrees of subjective belief. Both treatments are used in practice, but the former arguably is used substantially more than the latter.
That is not to say, however, that the latter is not used! That portion of statistics devoted to
inference using the subjective interpretation of probability is called Bayesian statistics,
and its uses are many and varied. For example, Microsoft Word's automatic help tool (the
‘Paperclip’), rests on certain principles and tools from this area of statistics. However, to
delve further in this direction is beyond the scope of our class. Here and throughout the rest
of the statistics module we will focus on the relative frequency interpretation of probability.
Let us return to the general context within which we began our discussion of probability.
We have a random phenomenon we intend to observe, and we have an event of interest to us.
Call this event A. The long-range frequency interpretation of probability says that
the probability P (A) is the frequency with which A would be observed to occur over the long
run, were the action of our random phenomenon generating an outcome to be replicated a
large number of times.
For example, suppose our random phenomenon is the flipping of a fair coin, and our event
A is the event that the coin comes up H. Then P (A) = P ( Coin comes up H ) = 0.5 is
interpreted as the frequency with which we would see heads over the course of a great many
flips.
A natural question is whether this ‘interpretation’ is simply a convenient concept, or
whether it has some grounding in reality. The following example helps shed some light on
the answer.
Example 8. (A Computer Simulation) While we do not have the luxury of waiting until an
(effectively) infinite number of realizations of a random phenomenon are observed, we can
easily simulate many types of random phenomena using a computer.
Consider the action of rolling a pair of dice, as introduced in Example 1. Recall that the
event A that both dice come up odd has probability 9/36. That is, P (A) = 0.25. Using a
standard laptop computer, we can simulate the separate rolling of two fair dice. In fact, we
can do so repeatedly – up to thousands of times in just seconds. So the computer can be used
as a type of laboratory in which to generate empirical evidence with which to investigate the
validity of the long-run frequency interpretation of probability.
Let the number of times we roll the two dice (using the computer) be called n. For a
given value of n, we can take the outcomes of our n rolls, and we can compute the proportion
of times that our event occured. We can then vary the value of n, from small to large. If
the frequency interpretation of probability is accurate in this setting, we should find that the
proportions get closer and closer to P (A) = 0.25 as n gets larger.
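One way such a simulation might be programmed is sketched below in Python (the example does not say what software was actually used, so the details here are illustrative):

    import random

    def running_proportion_both_odd(n, seed=None):
        # Roll two fair dice n times and return the running proportion of rolls
        # in which both dice come up odd (the event A, with P(A) = 0.25).
        rng = random.Random(seed)
        proportions = []
        count = 0
        for i in range(1, n + 1):
            first, second = rng.randint(1, 6), rng.randint(1, 6)
            if first % 2 == 1 and second % 2 == 1:
                count += 1
            proportions.append(count / i)
        return proportions

    props = running_proportion_both_odd(1000)
    print(props[9], props[99], props[999])   # proportions after 10, 100, and 1000 rolls

Plotting these running proportions against the number of rolls produces a trace like the one in Figure 6.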
Figure 6 shows the results of having done this experiment, for values of n ranging from
1 to 1000. The proportion of times A occurs, for each choice of n, is plotted as a function
of n. We see that these proportions start off rather wildly. But after a while, as n grows,
moving from left to right in the plot, the proportions begin to settle down, fluctuating less
and less, until they appear to remain quite firmly in the neighborhood of 0.25.

Figure 6: Relative frequency with which, when rolling a pair of dice, both dice come up odd, as a function of the number of times the dice were rolled. [Plot: the proportion of times both dice are odd, on a vertical axis from 0.0 to 0.4, versus the number of rolls, from 0 to 1000.]

Our example provides some empirical support for the long-range frequency interpretation of probability. Clearly, however, we cannot generate such support for all events and all
random phenomena. That this type of empirical behavior will occur quite generally was
proven mathematically by Jacob Bernoulli, in 1689. His result, which states that empirical
frequencies of events settle down to the event probabilities in the limit of a large number of
repetitions of a random phenomenon, is known as the law of large numbers.
Interestingly, we can see in Example 8 that even before n becomes particularly large the
observed proportions begin to get relatively close to the underlying probability to which
they are tending. Importantly, from a statistical point of view, this observation suggests
that we might have some reasonable hope of estimating the values of such probabilities using
observed frequencies from finite amounts of data. However, there is the uncertainty to be
dealt with that derives from not having yet reached the ‘long-run’.
Figure 7 shows three traces like that in Figure 6. Here we have simply repeated the
computer experiment described in Example 8 three times. Note that in all three traces,
as expected, the proportions settle down to 0.25. But each trace is nevertheless different –
increasingly so for smaller values of n.
We require a formal manner for quantifying the uncertainty that derives from the limitations of finite data and incorporating it into our reporting of any estimates of this type.
That is the realm of inferential statistics, the next topic in our statistics module.
Figure 7: Results from three different runs of the computer simulation experiment described in Example 8. [Plot: the proportion of times both dice are odd, on a vertical axis from 0.0 to 1.0, versus the number of rolls, from 0 to 1000, with one trace per run.]