
REVIEW OF ADDITIONAL NOTIONS FROM PROBABILITY THEORY

WINFRIED JUST, OHIO UNIVERSITY

Abstract. This note covers some additional basic concepts of probability theory. It also includes some practice problems.
1. Permutations, combinations, and the inclusion-exclusion principle
An important example of a finite sample space for which usually all elementary outcomes are assumed equally likely is the space of all permutations (i.e., ordered arrangements) of r objects out of a given set of n objects. The size of this space is given by

(1)    P^n_r = n!/(n − r)! = n(n − 1) · · · (n − r + 1).
Another important example of such spaces is the space of all combinations (i.e., unordered arrangements) of r objects out of a given set of n objects. The size of this space is given by

(2)    C^n_r = n!/((n − r)! r!) = n(n − 1) · · · (n − r + 1)/r!.
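As a quick numerical check of (1) and (2), here is a minimal MATLAB sketch (MATLAB is used in the practice problems below); the values of n and r are purely illustrative:

    % Counting permutations and combinations of r objects out of n.
    n = 10; r = 3;
    nPerms = factorial(n) / factorial(n - r);   % (1): 10*9*8 = 720
    nCombs = nchoosek(n, r);                    % (2): 720/3! = 120
    fprintf('permutations: %d, combinations: %d\n', nPerms, nCombs);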
The probability of the union of two events is given by the formula

(3)    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

If A and B are mutually exclusive, then P(A ∩ B) = 0 and (3) simplifies to

(4)    P(A ∪ B) = P(A) + P(B).
More generally, events A1, . . . , An are pairwise mutually exclusive if Ai ∩ Aj = ∅ for all 1 ≤ i < j ≤ n. If, in addition, we have A1 ∪ · · · ∪ An = Ω, then we call the family {A1, . . . , An} a partition of the sample space. For families of pairwise mutually exclusive events the following generalization of (4) holds:

(5)    P(A1 ∪ · · · ∪ An) = P(A1) + · · · + P(An).
For events A1, . . . , An that are not necessarily pairwise mutually exclusive, the following generalization of (3) holds:

(6)    P(A1 ∪ · · · ∪ An) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An),

where the sums extend over all indices with 1 ≤ i < j < k ≤ n. Equation (6) is called the inclusion-exclusion principle.
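The following minimal MATLAB sketch checks (6) for n = 3 on the sample space of a fair die; the three events are our own illustrative choices:

    % Verify inclusion-exclusion for three events on a fair die.
    Omega = 1:6;
    A = [1 2 3]; B = [2 3 4]; C = [3 4 5];
    P = @(E) numel(E) / numel(Omega);            % uniform probability
    lhs = P(union(union(A, B), C));
    rhs = P(A) + P(B) + P(C) ...
        - P(intersect(A, B)) - P(intersect(A, C)) - P(intersect(B, C)) ...
        + P(intersect(intersect(A, B), C));
    fprintf('P(A u B u C) = %g, inclusion-exclusion sum = %g\n', lhs, rhs);

Both printed values equal 5/6, as (6) predicts.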
2. Bayes Rule
Suppose P(A) > 0. Recall that the conditional probability of B given A is the probability that B occurs if it is already known that A occurred. It is denoted by P(B|A) and given by the formula

(7)    P(B|A) = P(A ∩ B)/P(A).

Note that if P(A) = 0, then P(B|A) is undefined.
It follows from (7) that the probability of the intersection of A and B is given by

(8)    P(A ∩ B) = P(A)P(B|A).
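For concreteness, here is a minimal MATLAB sketch of (7) and (8) on a fair die; the two events are our own illustrative choices:

    % Conditional probability on a fair die: A = "even roll", B = "roll >= 4".
    Omega = 1:6;
    A = [2 4 6]; B = [4 5 6];
    PA  = numel(A) / numel(Omega);                 % P(A) = 1/2
    PAB = numel(intersect(A, B)) / numel(Omega);   % P(A and B) = 1/3
    PBgivenA = PAB / PA;                           % (7): equals 2/3
    fprintf('P(B|A) = %g, P(A)P(B|A) = %g\n', PBgivenA, PA * PBgivenA);  % (8)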
Conditional probability is a very important tool for constructing probability functions on sample spaces of sequences. Suppose our sample space consists of letter sequences ~s = (s1, . . . , sn) of length n. For simplicity, let us write (a1, . . . , ak) for the event s1 = a1 & s2 = a2 & . . . & sk = ak, and ak for the event sk = ak. Then

(9)    P(a1, . . . , ak) = P(a1)P(a2|a1) · · · P(ak|a1, . . . , ak−1).

If the events a1, . . . , ak are independent, then (9) reduces to

(10)   P(a1, . . . , ak) = P(a1)P(a2) · · · P(ak).
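As an illustration of (10), the following MATLAB sketch computes the probability of a short sequence under independence; the letter probabilities are our own illustrative choices (they correspond to a genome with cg-content 60%):

    % Probability of a letter sequence under the independence formula (10).
    p = containers.Map({'a','c','g','t'}, {0.2, 0.3, 0.3, 0.2});
    seq = 'accgt';
    prob = 1;
    for i = 1:length(seq)
        prob = prob * p(seq(i));     % multiply the factors P(a_i) in (10)
    end
    fprintf('P(%s) = %g\n', seq, prob);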
Formula (10) underlies the procedure of calculating probabilities by using so-called decision trees. Sometimes it is easier to calculate P(B|A) and P(B|Ā) than P(B) itself. We can then compute P(B) from the formula

(11)   P(B) = P(B|A)P(A) + P(B|Ā)P(Ā).
More generally, if events A1, . . . , An form a partition of Ω, then the probability of an event B can be calculated as

(12)   P(B) = P(B|A1)P(A1) + · · · + P(B|An)P(An).

Formula (12) and its special case (11) are called the formula for the total probability (of B).
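Here is a minimal MATLAB sketch of (12) for a partition into three events; all numbers are illustrative, not taken from this note:

    % Total probability for a partition A1, A2, A3.
    PA     = [0.3 0.5 0.2];      % P(A1), P(A2), P(A3); they must sum to 1
    PBgivA = [0.1 0.4 0.8];      % P(B|A1), P(B|A2), P(B|A3)
    PB = sum(PBgivA .* PA);      % (12): P(B) = 0.39 here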
Now let us return to equation (8). It implies that

(13)   P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

If we divide by P(B), we obtain

(14)   P(A|B) = P(B|A)P(A)/P(B).
Equation (14) is the most general form of Bayes Theorem or Bayes Rule. It allows us to compute P(A|B), the posterior probability of A after obtaining the information that B occurred, from the prior probability P(A) and the conditional probability P(B|A). In applications of Bayes Theorem the probability of B is usually calculated using one of the formulas for total probability: either (11), in which case Bayes Rule takes the form

(15)   P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|Ā)P(Ā)),
or (12), in which case Bayes Rule takes the form

(16)   P(A1|B) = P(B|A1)P(A1) / (P(B|A1)P(A1) + · · · + P(B|An)P(An)).
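Continuing the numerical sketch given after (12), the posterior probabilities P(Ai|B) follow from (16) in one line of MATLAB:

    % Posteriors for the partition A1, A2, A3 from the sketch above.
    posterior = (PBgivA .* PA) / sum(PBgivA .* PA);   % Bayes Rule (16)
    disp(posterior);    % approximately [0.0769 0.5128 0.4103]; sums to 1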
3. Practice problems for Sections 1 and 2
This section gives some practice problems for the material in Sections 1 and 2. These questions form a logical sequence and should be attempted in the order given. The exceptions are (9)-(13), which depend only on (1), and (14), which depends only on (1)-(4).
(1) Suppose you want to model the space of all sequences of length n of letters
from the alphabet {a, c, g, t}. Describe a suitable sample space Ω and determine how many elements it has. If all such sequences are considered equally
likely, what would be the probability of each individual sequence? What
would be the probability that the sequence contains c at loci i1 , i2 , . . . , ik
and other letters at all other loci?
(2) Is the probability function you defined in Problem (1) adequate if you want
to model a genome with cg-content 60%? If not, how would you define
the probability of an individual sequence if you assume that all loci are
independent? What would be the probability that the sequence contains c
at loci i1 , i2 , . . . , ik and other letters at all other loci?
(3) In the model of Problem (1), find the probability that a sequence contains exactly k occurrences of the letter c. How would you express the probability that the sequence contains at most k occurrences of the letter c? Hint: Consider a two-step approach: First choose the loci i1, i2, . . . , ik at which c occurs, count in how many ways this can be done, and then use the result of the last question of Problem (1).
(4) In the model of Problem (2), find the probability that a sequence contains
exactly k occurrences of the letter c. How would you express the probability
that the sequence contains at most k occurrences of the letter c?
(5) Let E be the event that a sequence of n nucleotides contains kc occurrences of the letter c, kg occurrences of the letter g, ka occurrences of the letter a, and kt occurrences of the letter t, where kc + kg + ka + kt = n. Let C(kc) be the event that c occurs exactly kc times; define C(kg), C(ka), C(kt) analogously. Note that E = C(kc) ∩ C(kg) ∩ C(ka) ∩ C(kt), and show that E = C(kc) ∩ C(kg) ∩ C(ka).
(6) Let E be defined as in the previous problem. Show that
P(E) = P(C(kc))P(C(kg)|C(kc))P(C(ka)|C(kc) ∩ C(kg)).
(7) Find a formula for the probability of event E as in Problem (5) in the model of Problem (1). Hint: You already found a formula for P(C(kc)). Note that P(C(kg)|C(kc)) is the same as the probability P(C(kg)) in a sequence space where there are only n − kc nucleotides from the alphabet {g, a, t}, and use a similar trick for finding P(C(ka)|C(kc) ∩ C(kg)).
(8) Find a formula for the probability of event E as in Problem (5) in the model
of Problem (2).
(9) Consider again the sample space of Problem (1) with all four nucleotides equally likely in locus 1, but now assume that the loci are not independent. Specifically, assume that a c is more likely to be followed by a g, but all other
probabilities are the same, and there are no other dependencies between loci. Let si denote the letter encountered in locus i and assume for all 1 ≤ i < n we have:

P(si+1 = g|si = c) = 0.4,
P(si+1 = a|si = c) = P(si+1 = c|si = c) = P(si+1 = t|si = c) = 0.2,
P(si+1 = x|si = d) = 0.25,

where x stands for any nucleotide and d stands for any nucleotide other than c. Let n = 5, and compute the probability of the sequence ccgca under these assumptions. How does it compare with the probabilities of the same sequence in the models of Problems (1) and (2)? Why do these probabilities differ in the way they do?
(10) In the model of Problem (9), find P(s2 = g) and P(s2 = c). Hint: Use the formula for the total probability.
(11) In the model of Problem (9), find P(s1 = c|s2 = g) and P(s2 = c|s3 = g). Hint: Use Bayes formula.
(12) In the model of Problem (9), find P(s3 = c) and P(s3 = g).
(13) Write a short MatLab code that computes P(si = c) and P(si = g) for i = 1, 2, . . . , n. Run it for n = 20. What pattern do you observe for the sequences of these probabilities? How would you explain these patterns?
(14) Suppose you have a sequence of n nucleotides of a bacterium that was randomly chosen from a culture that contains 10% bacteria C. minimus, 20% bacteria C. medianis, and 70% C. maximus. It is known that the cg-content of the genome of C. minimus is 40%, the cg-content for C. medianis is 50%, and the cg-content for C. maximus is 60%. A quick test tells you that among the n nucleotides in the sequence of unknown origin exactly k of them are c's. Based on this information, how would you calculate the probability that the sequence comes from any of these sources? Use MatLab to find the respective probabilities for each organism if n = 10, k = 2 and n = 50, k = 10. Note that in each case the proportion k/n of c's is the same. Why do you get such different probabilities? Hint: Use Bayes rule. The MatLab code for C^n_k is nchoosek(n,k).
4. Geometric random variables
Imagine that we are repeating an experiment that can result in success or failure infinitely often. Let ξ be the discrete r.v. that returns the number of the first trial for which a success occurs. If the trials are independent and the probability of success in each trial is p, then ξ has a geometric distribution that is given by

(17)   P(ξ = n) = p(1 − p)^{n−1},

where n = 1, 2, . . . . The mean value for this r.v. will be ⟨ξ⟩ = 1/p, and the variance is given by Var(ξ) = (1 − p)/p².
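A minimal MATLAB sketch of (17); p = 0.3 is an illustrative choice, and geornd requires the Statistics Toolbox:

    % Geometric distribution: exact pmf and a simulation check of the mean.
    p = 0.3;
    n = 1:5;
    pmf = p * (1 - p).^(n - 1);       % P(xi = n) as in (17)
    disp(pmf);                        % success on trial 1, 2, ..., 5
    trials = geornd(p, 1, 1e5) + 1;   % geornd counts failures, so add 1
    fprintf('exact mean %.3f, simulated mean %.3f\n', 1/p, mean(trials));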
Exercise 1. Let ξ be a r.v. with the geometric distribution of (17). Find a formula for P(ξ = n1 + n2 | ξ > n1). Why are geometric random variables called “memoryless?”
5. The p-value
An important notion in statistics is the so-called p-value. Its definition is based on a null hypothesis (usually an assumption that a given r.v. X has a particularly simple distribution or that two r.v.s are independent). One performs an experiment and calculates the probability, under the assumption that the null hypothesis is true, that X takes a value at least as extreme as the observed value. This probability is the p-value. If the p-value turns out lower than a previously specified significance level α, one can feel entitled to reject the null hypothesis. In science, one usually works with α = 0.05, but α = 0.01 or α = 0.001 are also sometimes used. The proper interpretation of the phrase “at least as extreme as the observed value” usually depends on the context. Suppose you flip a coin 10,000 times and heads comes up 5,100 times. Your null hypothesis is the assumption that the coin is unbiased, in which case the number X of heads is a binomial variable with parameters n = 10,000 and p = 0.5. The observed number exceeds the mean by 100. What is the probability of obtaining this value or at least as “extreme” ones? This depends. If we are playing in a casino and heads favor the house, the null hypothesis is really: “The coin is not biased in a way that would favor the house,” and the proper interpretation is to consider all values of X that are ≥ 5,100 as “at least as extreme as the observed one.” If, however, we have no prior conception of why the coin might be biased one way or the other, we need to consider all values of X that are ≥ 5,100 together with all values that are ≤ 4,900 as “at least as extreme as the observed one.”
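In MATLAB, such binomial tail probabilities can be computed with binocdf (Statistics Toolbox). The following minimal sketch uses smaller, purely illustrative numbers rather than those of the coin example; the two-sided computation exploits the symmetry of the p = 0.5 case:

    % One- and two-sided p-values for x observed heads in n fair-coin flips.
    n = 100; pNull = 0.5; x = 60;
    pOneSided = 1 - binocdf(x - 1, n, pNull);          % P(X >= x)
    pTwoSided = pOneSided + binocdf(n - x, n, pNull);  % adds P(X <= n - x)
    fprintf('one-sided %.4f, two-sided %.4f\n', pOneSided, pTwoSided);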
Exercise 2. Use the technique of the coin example above to calculate the p-values corresponding to each of the two interpretations. Will the significance level of 0.05 allow us to reject the null hypothesis?
While widely used in science, the p-value may be misleading in some bioinformatics problems. Let us return to a previous exercise and expand on it.
Exercise 3. Suppose you sequence loci s1 , . . . , s3n of a DNA strand. Assume that
the strand is completely random, with all four nucleotides equally likely and no
dependencies between the loci. Let ξ count the number of i’s between 1 and n such
that (s3i−2 , s3i−1 , s3i ) is one of the sequences (tga), (taa), (tag) that represent STOP
codons.
(a) Find a formula for P (ξ = k).
(b) Find a formula for P(ξ = 0) if your sequence represents a fragment of a coding region that is 900 nucleotides long (with the last three positions representing a STOP codon), if you know that s1 represents the first position of a codon in the correct reading frame, but you don’t know which of the 300 first positions of codons it is, with all 300 first positions being equally likely.
(c) Let η be the r.v. that returns the smallest i such that (s3i−2 , s3i−1 , s3i ) is one of
the sequences (tga), (taa), (tag). Then η returns the number of the trial on which
the first “success” occurs. Assuming that we sequence random DNA, what is the
distribution of η? What is the expected value of η?
(d) Assume that η as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)). If your null hypothesis is that the genome sequence is random, to what p-value does this observation correspond? Can we reject the null hypothesis at significance level 0.05?
(e) Consider an idealized bacterium b. idealis by making the following slightly false
assumptions:
• Every nucleotide belongs to a coding region.
• All coding regions are exactly 900 nucleotides long.
• All six reading frames are equally likely.
• A sequence that is read in the wrong reading frame is completely random,
with no dependencies between different loci and all nucleotides equally likely.
Assume again that η as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)). What is the probability that you have been sequencing from a coding region in the correct reading frame? Hint: Let C be the event “correct reading frame,” and let I be the event “incorrect reading frame.” The a priori probabilities are P(C) = 1/6 and P(I) = 5/6 (why?). Let S be the observed occurrence of the first STOP codon at triplet position 65. Use your work in (a) or (c) to calculate P(S|I); use (b) to calculate P(S|C). Then use Bayes Theorem to calculate the desired probability P(C|S).
(f ) Would you draw the same or opposite conclusions from an approach based on
the p-value and from an approach based on Bayes Theorem? In each case, would
you consider the available evidence compelling or rather weak?
(g) Consider an idealized amoebum a. idealis by making the following slightly false
assumptions:
• Only 0.01% of all nucleotides belong to a coding region.
• All coding regions are contiguous (no introns) and exactly 900 nucleotides
long.
• All six reading frames are equally likely for the coding regions.
• A sequence that is read in the wrong reading frame or that belongs to an intergenic region is completely random, with no dependencies between different loci and all nucleotides equally likely.
Assume again that η as in (c) takes the value 65 (that is, the first time you encounter a triplet that looks like a STOP codon is at positions (193, 194, 195)).
What is the probability that you have been sequencing from a coding region and in
the correct reading frame? Hint: The new twist here is that much of the genome
will no longer be coding in any of the six reading frames.
(h) Would you draw the same or opposite conclusions from an approach based on
the p-value and from an approach based on Bayes Theorem? In each case, would
you consider the available evidence compelling or rather weak?