Lecture 4 Probability
The Chapter 1 notes addressed numbers, ordered n-tuples of numbers, sets of numbers, and their subsets. Chapter 2
addressed the actions that, when performed, yield numbers. To date, we have addressed these concepts loosely, and in
context. We have depended on the student’s intuition. For example, it is intuitive to recognize that the value of any
probability cannot be negative, nor can it be greater than 1.0. The key points of those chapters can be summarized as
follows:
Given an ordered pair of numbers, call it $(x, y)$, the action that this ordered pair resulted from is the 2-D random variable $(X, Y)$. The set of all possible numbers that might result from $(X, Y)$ is called the sample space for $(X, Y)$. It is denoted as $S_{(X,Y)}$.
We now give a formal definition of probability.
Definition 1.1 Let X be a random variable with sample space $S_X$. Let $\mathcal{X}$ be the field of events associated with $S_X$ (i.e. the collection of all the measurable subsets of the set $S_X$). The probability of any event $A \in \mathcal{X}$ will be denoted as $\Pr(A)$. Hence, the operation $\Pr(\cdot)$ is an operation applied to a set. This operation has the following attributes:
(A1): $\Pr(\emptyset) = 0$ and $\Pr(S_X) = 1$;
(A2): For any events $A, B \in \mathcal{X}$, $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$.
[Note: Sets A and B are said to be mutually exclusive if $A \cap B = \emptyset$.]
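As a quick numerical check of (A2), consider rolling a fair die once, so that $S_X = \{1,2,3,4,5,6\}$ and each outcome has probability 1/6. With $E = \{2,4,6\}$ and $F = \{4,5,6\}$, we get $\Pr(E \cup F) = \Pr(\{2,4,5,6\}) = 4/6$, and indeed $\Pr(E) + \Pr(F) - \Pr(E \cap F) = 3/6 + 3/6 - 2/6 = 4/6$.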
Definition 1.2. An event is simply a subset of the sample space.
The concepts of sets, subsets, their union and intersection, and a random variable as simply an action, have all been covered in the first two chapters. We also discussed the notion of probability. Definition 1.1 places this notion on firm ground. Specifically, probability measures the 'size' of a set. Axiom (A1) is self-evident. Axiom (A2) is best acknowledged by the Venn diagram in Figure 1.1. The yellow rectangle corresponds to the entire sample space, $S_X$. The "size" (i.e. probability) of this set equals one. The blue and red circles are clearly subsets of $S_X$. The probability of A is the area in blue. The probability of B is the area in red. The black area where A and B intersect is equal to $\Pr(A \cap B)$.
Figure 1.1 Venn diagram of probabilities.
Definition 1.3 Two events A and B are said to be (statistically, or mutually) independent if
$\Pr(A \cap B) = \Pr(A)\Pr(B)$.    (1.1)
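As a concrete instance of (1.1), roll a fair die once and let $E = \{2,4,6\}$ (an even number) and $G = \{5,6\}$. Then $\Pr(E \cap G) = \Pr(\{6\}) = 1/6 = (1/2)(1/3) = \Pr(E)\Pr(G)$, so these two events are independent.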
QUESTION 1: Are the events A and B mutually independent events?
ANSWER 1: ____________________________________________________
If events A, B, and C are mutually independent, then $\Pr(A \cap B \cap C) = \Pr(A)\Pr(B)\Pr(C)$.
If someone says, for example, that the event that it rains in New York City today is independent of whether or not it is
sunny in Delhi, most people take that to mean that the one event in no way influences the other. We can state this example
in other words: Given the condition that it rains in New York City today, the probability that it will be sunny in Delhi is
unaffected. With this in mind, we will offer an alternative definition of independence based on conditional probability.
The concept of joint and conditional events was discussed in my Chapter 1 & 2 notes. We will repeat the gist here
in relation to Figure 1.1.
The joint event $A \cap B$ is exactly the same subset of the rectangle as the conditional event $A \mid B$ (which, in words, is the event A, given that we restrict the sample space to the event B). The difference between the events $A \cap B$ and $A \mid B$ is that the event $A \cap B$ is a subset of the entire sample space, whereas the event $A \mid B$ is a subset of the restricted sample space B. In view of (A1), if B is now the sample space of concern, then it must now have probability equal to 1.0. Clearly, as a subset of the big sample space, $\Pr(B) \le 1$. But it should be equally clear that $\Pr(B \mid B) = 1$. This can be achieved by dividing $\Pr(B)$ by $\Pr(B)$. In the same way, we have $\Pr(A \mid B) = \Pr(A \cap B)/\Pr(B)$. In words, once we restrict our sample space to the set B, then the probabilities of all sets inside of B have to be rescaled by dividing by $\Pr(B)$.
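Returning to the fair-die events $E = \{2,4,6\}$ and $G = \{5,6\}$ above, $\Pr(E \mid G) = \Pr(E \cap G)/\Pr(G) = (1/6)/(1/3) = 1/2 = \Pr(E)$: restricting the sample space to G leaves the probability of E unchanged. This is exactly the idea behind the following restatement.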
Definition 1.3’ Two events A and B are said to be (statistically, or mutually) independent if
$\Pr(A \mid B) = \Pr(A)$.    (1.1’)
This author feels that (1.1’) has much more intuitive appeal than (1.1). Granted, it requires one to have a prior understanding of conditional probability. Some readers might ask how it is possible to have two different definitions of what it means for events to be independent. To those readers, we offer the following answer:
Claim The equality $\Pr(A \cap B) = \Pr(A)\Pr(B)$ holds if and only if the equality $\Pr(A \mid B) = \Pr(A)$ holds.
Proof: First, suppose that the equality $\Pr(A \cap B) = \Pr(A)\Pr(B)$ holds. Then the relation $\Pr(A \mid B) = \Pr(A \cap B)/\Pr(B)$ becomes $\Pr(A \mid B) = \Pr(A)$. Now suppose, instead, that the equality $\Pr(A \mid B) = \Pr(A)$ holds. Then, again from that relation, we obtain $\Pr(A \cap B) = \Pr(A)\Pr(B)$. □
To illustrate the practical value of Definition 1.3, we consider the following example.
Example 1.1 In this example we will investigate the probability of a random sample of bacon being good or bad. Formally, let X = the act of recording whether or not a randomly selected piece of bacon is bad. Define the event that it is good as $[X = 0]$, and that it is bad as $[X = 1]$. Hence, the sample space for X is $S_X = \{0, 1\}$. Define $p = \Pr[X = 1]$. Then clearly $\Pr[X = 0] = 1 - p$. It should be clear that we can write $S_X = \{0,1\} = \{0\} \cup \{1\} = [X = 0 \cup X = 1]$. It should also be clear that $\{0\} \cap \{1\} = [X = 0] \cap [X = 1] = \emptyset$. Hence, from Axioms (A1) and (A2) we have
$$1 = \Pr(S_X) = \Pr[X = 0 \cup X = 1] = \Pr[X = 0] + \Pr[X = 1] = \Pr[X = 0] + p,$$
which gives $\Pr[X = 0] = 1 - p$. Many students would claim that this is intuitively obvious. And I would agree. The reason for carrying out the formalities is mainly to get the student more comfortable with notation.
With just a little bit of thought, it should be clear that a random variable X whose sample space is $S_X = \{0, 1\}$ (or any two values) is the simplest possible random variable. For, if it could take on only one value, then it wouldn't be random. Even though it is the simplest among all random variables, it is arguably the most important. Think about it: how many times every day do people address the questions of good/bad, right/wrong, happy/sad, healthy/sick, etc.? Because of its omnipresence, we give the following definition.
Definition 1.4 A random variable X, having sample space $S_X = \{0, 1\}$ with $p = \Pr[X = 1]$, is said to be a Bernoulli(p) random variable.
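Though the notes do not include code at this point, a Bernoulli(p) draw is easy to simulate in Matlab by thresholding a uniform random number (the value of p below is purely illustrative):
>> p = 0.01;              % illustrative value for Pr[X = 1]
>> X = double(rand < p)   % X = 1 ('bad') with probability p, else X = 0 ('good')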
To investigate the probability structure of X is equivalent to simply estimating p, since this parameter completely captures this structure. So, our investigation will begin by testing a random sample of n pieces of bacon. Denote these data collection variables as $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. The reason for choosing the n-D formulation, as opposed to expressing this collection as $\{X_k\}_{k=1}^{n}$, will become clear as we proceed.
Recall that even at this early stage of the course, we agreed that the logical estimator of the unknown parameter p is
$$\hat{p} = \frac{1}{n}\sum_{k=1}^{n} X_k = \bar{X}.$$
This random variable has sample space $S_{\hat{p}} = \{0,\ 1/n,\ 2/n,\ \ldots,\ (n-1)/n,\ 1\}$. First, consider the event $[\hat{p} = 0]$. The only way that this can happen is if all of the $X_k$'s are zero. Hence, the event $[\hat{p} = 0]$ is equivalent to the event
$$\{(0, 0, \ldots, 0)\} = [X_1 = 0 \cap X_2 = 0 \cap \cdots \cap X_n = 0].$$
Hence,
$$\Pr[\hat{p} = 0] = \Pr[X_1 = 0 \cap X_2 = 0 \cap \cdots \cap X_n = 0].$$
Recall that we have randomly selected the pieces of bacon. What this means is that the $X_k$'s are mutually independent.
And so, we have:
$$\Pr[\hat{p} = 0] = \Pr[X_1 = 0]\,\Pr[X_2 = 0] \cdots \Pr[X_n = 0] = (1-p)^n.$$
This highlights the value of the Definition 1.3 version of independence.
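As a numeric sanity check of this result (with illustrative values n = 25 and p = 0.01, the same ones used later in this example), we can compare the closed form against a brute-force simulation:
>> n = 25; p = 0.01;       % illustrative values
>> (1-p)^n                 % closed form for Pr[phat = 0]; about 0.7778
>> X = rand(1e5,n) < p;    % 100,000 simulated samples of n Bernoulli(p) trials
>> mean(sum(X,2) == 0)     % empirical fraction of all-good samples; close to (1-p)^n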
Next, let's consider the event $[\hat{p} = 1/n]$. The only way this can happen is if one and only one of the $X_k$'s equals 1.0. With a little thought it should be clear that this is the event (i.e. the subset of $S_{\mathbf{X}}$) given by
$$A_1 = \{(1,0,\ldots,0),\ (0,1,0,\ldots,0),\ \ldots,\ (0,0,\ldots,0,1)\}.$$
The set $A_1$ includes a total of n ordered n-tuples. Let's first compute
$$\Pr\{(1,0,\ldots,0)\} = \Pr[X_1 = 1 \cap X_2 = 0 \cap \cdots \cap X_n = 0] = \Pr[X_1 = 1]\,\Pr[X_2 = 0] \cdots \Pr[X_n = 0] = p(1-p)^{n-1}.$$
Similarly, we have $\Pr\{(0,1,\ldots,0)\} = p(1-p)^{n-1}$. In fact, each of the n singleton subsets of $A_1$ has probability $p(1-p)^{n-1}$. Since these n singleton subsets are mutually exclusive, it follows that
$$\Pr[\hat{p} = 1/n] = \Pr\left[\sum_{k=1}^{n} X_k = 1\right] = n\,p(1-p)^{n-1}.$$
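This closed form can be spot-checked (values again illustrative) against Matlab's built-in binomial PDF, which we will meet formally at the end of the lecture:
>> n = 25; p = 0.01;
>> n*p*(1-p)^(n-1)    % closed form; about 0.1964
>> binopdf(1,n,p)     % same number from the built-in binomial PDF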
Next, let's address the event $[\hat{p} = 2/n]$. This event is equivalent to the event $\left[\sum_{k=1}^{n} X_k = 2\right]$, which, in turn, is equivalent to the event
$$A_2 = \{(1,1,0,\ldots,0),\ (1,0,1,0,\ldots,0),\ \ldots,\ (0,0,\ldots,1,1)\}.$$
It should be clear that each singleton subset of $A_2$ has probability $p^2(1-p)^{n-2}$. The question then is: How many elements does $A_2$ have? Well, in words, the question is: How many different ways can we place two 1's in n slots? The answer is, in words: in n choose 2 ways. Mathematically, it can be shown that the number of ways is:
$$\binom{n}{2} = \frac{n!}{2!\,(n-2)!}.$$
Hence, we have:
$$\Pr[\hat{p} = 2/n] = \Pr\left[\sum_{k=1}^{n} X_k = 2\right] = \binom{n}{2} p^2 (1-p)^{n-2}.$$
In fact, for any $y \in \{0, 1, \ldots, n\}$, we have:
$$\Pr[\hat{p} = y/n] = \Pr\left[\sum_{k=1}^{n} X_k = y\right] = \binom{n}{y} p^y (1-p)^{n-y}.$$
Let $Y = \sum_{k=1}^{n} X_k$. Then this becomes:
$$\Pr[\hat{p} = y/n] = \Pr[Y = y] = \binom{n}{y} p^y (1-p)^{n-y}.$$
Clearly, the sample space for Y is $S_Y = \{0, 1, \ldots, n\}$. This random variable Y, which is the sum of n independent and identically distributed (iid) Bernoulli(p) random variables, has a name:
Definition 1.5 The random variable $Y = \sum_{k=1}^{n} X_k$, where the $X_k$'s are iid Ber(p) random variables, is called a binomial(n,p) random variable.
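A brief simulation sketch (parameter values illustrative) shows Definition 1.5 in action: summing iid Bernoulli(p) draws reproduces binomial(n,p) probabilities.
>> n = 25; p = 0.01; M = 1e5;         % M simulated samples, each of size n
>> Y = sum(rand(M,n) < p, 2);         % each row of Bernoulli draws summed: Y ~ bino(n,p)
>> [mean(Y==0), mean(Y==1)]           % empirical Pr[Y=0], Pr[Y=1]
>> [binopdf(0,n,p), binopdf(1,n,p)]   % exact values: 0.7778, 0.1964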
What we have shown is that $Y \sim \text{bino}(n, p)$ has the probability model
$$\Pr[Y = y] = \binom{n}{y} p^y (1-p)^{n-y}.$$
Since our estimator of p is $\hat{p} = Y/n$, we know its probability structure. It is simply:
$$\Pr[\hat{p} = y/n] = \Pr[Y = y] = \binom{n}{y} p^y (1-p)^{n-y}.$$
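The agreement between this closed-form model and Matlab's binopdf can be verified over the entire sample space (n and p again illustrative):
>> n = 25; p = 0.01; y = 0:n;
>> f1 = binopdf(y,n,p);                                       % built-in binomial PDF
>> f2 = arrayfun(@(k) nchoosek(n,k), y).*p.^y.*(1-p).^(n-y);  % the formula derived above
>> max(abs(f1 - f2))                                          % essentially zero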
To finish this example, we will assume that the bacon "truth model" is $p = 0.01$ (i.e. in words, there's a 1 in 100 chance that a tested piece of bacon will be bad). Suppose that the health inspector will collect n = 25 random samples of bacon and test them. Then his estimator of p is $\hat{p} = Y/25$, where $Y \sim \text{bino}(n = 25, p = 0.01)$. Hence,
$$\Pr[\hat{p} = y/25] = \Pr[Y = y] = \binom{25}{y} p^y (1-p)^{25-y}.$$
In particular:
$$\Pr[\hat{p} = 0] = \Pr[Y = 0] = \binom{25}{0} p^0 (1-p)^{25} = (1-p)^{25} = 0.99^{25} = 0.7778.$$
In words, there’s a 78% chance that he will not find a single bad piece of bacon.
$$\Pr[\hat{p} = 1/25] = \Pr[Y = 1] = \binom{25}{1} p^1 (1-p)^{24} = 25\,p(1-p)^{24} = 25(0.01)(0.99)^{24} = 0.1964.$$
And so the probability that he will find no more than one bad piece of bacon is $0.7778 + 0.1964 = 0.9742$.
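The same number can be read off in one line using Matlab's cumulative distribution function:
>> binocdf(1,25,0.01)   % Pr[Y <= 1] = 0.7778 + 0.1964 = 0.9742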
Now let's think 'business' versus 'health'. Suppose that you buy the cheapest bacon you can get your hands on, and that distributor's strips of bacon have a 1% chance of being bad. Then for every 25 pieces of bacon you eat, the probability that none of them will be bad is 77.78%, or roughly 80%. In other words, there's roughly a 20% chance that you will have a 'bad bacon experience'. That's a huge probability for a big-time bacon eater! BUT: from the seller's point of view, it's like: "Well, there's only a 1 in 5 chance that the inspector will find a piece of bad bacon. And the inspector only comes by once a year. So, if I can buy cheap bacon at 1/3 the cost of good bacon, it might be worth the risk of getting caught by the health inspector."
To finish up, let’s suppose that p=0.005 and that n=100. Rather than going through ‘painful’ computations, we can use
the Matlab commands:
>> y = 0:100;                  % support of Y
>> fy = binopdf(y,100,.005);   % Pr[Y = y] for each y
>> stem(y,fy)                  % plot the PDF
>> title('PDF for Y~bino(n=100,p=0.005)')
>> xlabel('y')
>> ylabel('Pr[Y=y]')
>> grid
>> fy(1:5)                     % Pr[Y = 0], ..., Pr[Y = 4]
ans = 0.6058 0.3044 0.0757 0.0124 0.0015
To compute $\Pr[Y \le 1]$, we can simply use the above numbers: $\Pr[Y \le 1] = 0.6058 + 0.3044 = 0.9102$. Or we can be 'lazy' and use:
>> Pr = binocdf(1,100,.005)
Pr = 0.9102
Summary
In this lecture we introduced the basic elements of probability in a formal fashion. To feel comfortable with the concepts, it is paramount that the student be comfortable with sets and subsets, as well as their intersections and unions. We focused on the concept of independence, emphasizing the value of assuming that data collection variables are mutually independent and identically distributed (iid). The bacon example was offered to illustrate that, while not very intuitive, Definition 1.3 of independence is extremely valuable.
Footnote: [Best entered by a ‘sole man’ https://www.youtube.com/watch?v=8fS9-Yimdhw ]
I realize that some students may have found these notes to be confusing, at the very least. That is understandable. There is a lot of notation. I would ask that you carry out a self-evaluation: "Is it the notation that is confusing, thereby clouding the concepts? Or am I ok with the notation, but can't wrap my head around the concepts?" Then share your thoughts with me, either in class or in person. I am more than willing to help out.