Copyright, 2010, R.E. Kass, S. Behseta, E.N. Brown, and U. Eden
REPRODUCTION OR CIRCULATION REQUIRES PERMISSION
OF THE AUTHORS
Chapter 3
Probability and Random
Variables
Probability is a rich and beautiful subject, a discipline unto itself. Its origins were concerned primarily with games of chance, and many lectures on
elementary probability theory still contain references to dice, playing cards,
and coin flips. These lottery-style scenarios remain useful because they are
evocative and easy to understand. On the other hand, they give an extremely narrow and restrictive view of what probability is about: lotteries
are based on elementary outcomes that are equally likely, but in many situations where quantification of uncertainty is helpful there is no compelling way
to decompose outcomes into equally-likely components. In fact, the focus on
equally-likely events is characteristic of pre-statistical thinking (see Stigler, S.M. (1986) The History of Statistics: Measurement of Uncertainty before 1900, Harvard). The great leap forward toward a more general notion of probability was slow, requiring over 200 years for full development: its beginning is usually traced to a text by Jacob Bernoulli, published posthumously in 1713 (Bernoulli, J. (1713) Ars Conjectandi, Basil: Thurnisiorum), and its modern endpoint was reached only in 1933, with the publication of a text by Kolmogorov (Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung, Berlin: Springer-Verlag; English translation: 1950, Foundations of the Theory of Probability, New York: Chelsea). This long, difficult transition involved
a deep conceptual shift. In modern texts equally-likely outcomes are used
to illustrate elementary ideas, but they are relegated to special cases. It is
sometimes possible to compute the probability of an event by counting the
outcomes within that event, and dividing by the total number of outcomes;
but in many situations such reasoning is at best a loose analogy. To quantify uncertainty via statistical models a more general and abstract notion of
probability must be introduced.
This chapter begins with the axioms and elementary laws of probability, and
then discusses the way probability is used to describe variability. Quantities that are measured but uncertain are formalized in probability theory as
random variables. More specifically, we set up a theoretical framework for understanding variation based on probability distributions of random variables;
the variation of random variables is supposed to be similar to real-world variation observed in data. Many families of probability distributions are used
throughout the book. The most common ones are discussed in this chapter.
One quick note on terminology: the word stochastic connotes variation describable by probability. Within statistical theory it is often used in specialized contexts, but it is almost always simply a synonym for “probabilistic.”
We occasionally use this word ourselves.
3.1 The Calculus of Probability

3.1.1 Probabilities are defined on sets of uncertain events.
The calculus of probability is defined for sets, which in this context are called
events. That is, we speak of “the probability of the event A” and we will write
this as P (A). Events are considered to be composed of outcomes from some
experiment or observational process. The collection of all possible outcomes
(and, therefore, the union of all possible events) is called the sample space
and will be denoted by S. Thus, an event A is a subset of S. Recall the
definitions of union and intersection: for events A and B the union A ∪ B
consists of all outcomes that are either in A or in B or in both A and B; the
intersection A ∩ B consists of all outcomes that are in both A and B. The
complement Ac of A consists of all outcomes that are not in A. We say two
events are mutually exclusive or disjoint if they have empty intersection.
Example 3.1.1 Two neurons from primary visual cortex In an experiment on response properties of cells in primary visual cortex, Ryan Kelly
and colleagues recorded approximately 100 neurons simultaneously from an
anesthetized macaque monkey while the animal's visual system was stimulated by highly irregular random visual input (Kelly, R.C., Smith, M.A., Samonds, J.M., Kohn, A., Bonds, A.B., Movshon, J.A.,
and Lee, T.-S. (2007) Comparison of recordings from microelectrode arrays
and single electrodes in the visual cortex, J. Neurosci., 27: 261–264). The
stimulus they used is known as white noise, which will be defined in Chapter ??. Kelly examined the response of two neurons during 100 milliseconds
of the stimulus. Let A be the event that the first neuron fires at least once
of the stimulus. Let A be the event that the first neuron fires at least once
within the 100 millisecond time interval and B the event that the second
neuron fires at least once during the same time interval. Here, A ∪ B is the
event that at least one of the two neurons fires at least once, while A ∩ B is
the event that both neurons fire at least once. Because it is possible that
both neurons will fire during the time interval, the events A and B are not
mutually exclusive.
□
We now state the axioms of probability.
Axioms of probability:
1. For all events A, P (A) ≥ 0.
2. P (S) = 1.
3. If A1 , A2 , . . . , An are mutually exclusive events, then P (A1 ∪ A2 ∪ · · · ∪
An ) = P (A1 ) + P (A2 ) + · · · + P (An ).
A technical point is that in advanced texts, Axiom 3 would instead involve
infinitely many events, and an infinite sum:
3′. If A1, A2, . . . are mutually exclusive events (possibly infinitely many
events), then P(∪i Ai) = Σi P(Ai),

where the notation means that the union and summation extend across all
events.
Regardless of whether one worries about the possibility of infinitely many
events, it is easy to deduce from the axioms the elementary properties we
need.
In a more general form, where mutual exclusivity is not assumed, one
can show for events A1, . . . , An (see the exercises):

P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_{i=1}^{n} P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · ± P(A1 ∩ A2 ∩ · · · ∩ An).

The above is often referred to as the inclusion-exclusion relationship.
Theorem: Five Properties of Probability
For any events A and B we have
(i) P (∅) = 0.
(ii) P (Ac ) = 1 − P (A), where Ac is the complement of A.
(iii) If A and B are mutually exclusive, P (A ∩ B) = 0.
(iv) P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
(v) If A ⊆ B, then P (A) ≤ P (B).
Proof: To prove (i), let A1, . . . , An be n empty sets. Clearly, the Ai are disjoint.
Therefore, P(∪_{i=1}^{n} Ai) = P(∅) = Σ_{i=1}^{n} P(Ai) = Σ_{i=1}^{n} P(∅), which is only possible
if P(∅) = 0. To prove (ii), we simply note that S = A ∪ Ac. From axiom (2)
we then have P (A ∪ Ac ) = 1 and because A and Ac are mutually exclusive
axiom (3) gives P (A) + P (Ac ) = 1, which is the same as (ii). It is similarly
easy to prove (iii) and (iv). Finally, to prove (v), since A ⊆ B, we can
write B = A ∪ (Ac ∩ B), where A and Ac ∩ B are disjoint. Thus,
P(B) = P(A) + P(Ac ∩ B) ≥ P(A), because P(Ac ∩ B) ≥ 0, which completes the proof.
□
These facts are often illustrated by analyzing games of chance, which is the
context in which many of the basic methods of probability were first worked
out. For instance, in picking at random a playing card from a standard
52-card deck, we may compute the probability of drawing a spade or a face card,
meaning either a spade that is not a face card, or a face card that is not a
spade, or a face card that is also a spade. We take A to be the event that we
draw a spade and B to be the event that we draw a face card. Then, because
there are 3 face cards that are spades we have P(A ∩ B) = 3/52, and, applying
the last formula above, we get P(A ∪ B) = 1/4 + 3/13 − 3/52 = 11/26. This matches
a simple enumeration argument: there are 13 spades and 9 non-spade face
cards, for a total of 22 cards that are either a spade or a face card, i.e.,
P(A ∪ B) = 22/52 = 11/26. The main virtue of such formulas is that they also
apply to contexts where probabilities are determined without reference to a
decomposition into equally-likely sub-components.
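As a quick check, the counting argument can be reproduced by brute-force enumeration. The following sketch (ours, not from the text) builds the deck explicitly and verifies that inclusion-exclusion and direct counting agree:

```python
from fractions import Fraction

# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(r, s) for r in ranks for s in suits]

A = {c for c in deck if c[1] == "spades"}          # event: card is a spade
B = {c for c in deck if c[0] in {"J", "Q", "K"}}   # event: card is a face card

def prob(event):
    return Fraction(len(event), len(deck))

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # both 11/26
```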
Example 3.1.1 (Continued) From 1200 replications of the 100 millisecond
stimulus Kelly calculated that the probability the first neuron would fire at
least once was P(A) = .13, the probability the second neuron would fire at
least once was P(B) = .22, and the probability both would fire at least once
was P(A ∩ B) = .063. Applying the formula for the union (property (iv)
above), the probability that at least one neuron will fire is
P(A ∪ B) = .13 + .22 − .063 = .287.
□
3.1.2 When the sample space is finite and outcomes are equally likely, the probability of an event is determined by counting its outcomes.
When the sample space is finite and all outcomes are equally likely, to calculate
the probability of an event we count the number of outcomes in that event
along with the number of outcomes in the sample space. We denote the size
of an event A by |A|, and the size of the sample space by |S|. Then we define:

P(A) = |A| / |S|.     (3.1)
For example, suppose that we have an urn with four balls, numbered from 1
to 4. The experiment consists of drawing two balls in succession with replacement.
We are interested in the event that the sum of the two drawn balls is a
multiple of 3. Note that S = {(1, 1), (1, 2), (1, 3), (1, 4), . . . , (4, 1), (4, 2), (4, 3), (4, 4)}.
Therefore, |S| = 16. The event A can be written as
A = {(1, 2), (2, 1), (2, 4), (4, 2), (3, 3)}, so |A| = 5. We can now calculate the
probability of A as P(A) = 5/16. Often, counting becomes a tedious task.
We discuss some strategies of counting in a supplementary chapter at the
end of the book.
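The counting recipe of Equation (3.1) can be verified by enumerating the sample space directly; the sketch below (ours) lists the outcomes whose sum is a multiple of 3:

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered draws of two balls, with replacement, from balls 1-4.
S = list(product(range(1, 5), repeat=2))        # 16 equally likely outcomes
A = [pair for pair in S if sum(pair) % 3 == 0]  # sum is a multiple of 3

print(A)                         # [(1, 2), (2, 1), (2, 4), (3, 3), (4, 2)]
print(Fraction(len(A), len(S)))  # 5/16
```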
3.1.3 The conditional probability P(A|B) is the probability that A occurs given that B occurs.
We often have to compute probabilities under an assumption that some event
has occurred. For instance, one may be interested in the probability that a
patient will come down with severe pneumonia given that he is HIV-positive.
Or, it might be of interest to calculate the chances of experiencing
bacterial pneumonia given that the individual is a smoker. If we let A be the
event we are interested in and B the event that is assumed to have occurred,
then we write P(A|B) for the conditional probability of A given B. From a
Venn diagram (see Figure 3.1) it is easy to visualize the calculation required:
we limit the universe to B and ask for the relative probability assigned to the
part of A that is contained in B. Algebraically, the formula is the following:
Figure 3.1: Venn diagram showing the intersection of A and B. The events A
and B are depicted as open and filled-in circles, respectively, while A∩B, the
portion of B that is also in A, is shown with diagonal lines. The conditional
probability of A given B is the relative amount of probability assigned to
A within the probability assigned to B, i.e., the probability assigned to the
region having diagonal lines divided by the probability assigned to the whole
of B.
Definition: Conditional Probability Assume P(B) > 0. The conditional
probability of A given B is P(A|B) = P(A ∩ B)/P(B).
Again, using draws from a deck of cards, the probability of drawing a Jack
given that we draw a face card is P(A|B) = (4/52)/(12/52) = 1/3.
A rewriting of the definition of conditional probability is also sufficiently
useful to have a name:
Multiplication rule If P (B) > 0 we have P (A ∩ B) = P (A|B) · P (B).
The multiplication rule can be extended to more than two events. As shown
in the exercise section, the multiplication rule is useful in calculating the
probability of sequential sampling without replacement.
Illustration: Committee selection Suppose that we want to choose a
committee of size 4 from a group of 15 graduate and 13 undergraduate students,
and we are interested in a particular order GUGU, representing selection of a
graduate student, then an undergraduate, then a graduate, then an undergraduate.
Here the multiplication rule may be used:

P(GUGU) = P(G) P(U|G) P(G|G ∩ U) P(U|G ∩ U ∩ G) = (15/28)(13/27)(14/26)(12/25) ≈ 0.067.
□
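The sequential calculation can be checked numerically; the sketch below (ours) applies the multiplication rule with exact fractions and cross-checks the answer against a full enumeration of ordered selections:

```python
from fractions import Fraction
from itertools import permutations

# Sequential multiplication rule for the order G, U, G, U
# from 15 graduate (G) and 13 undergraduate (U) students.
p = Fraction(15, 28) * Fraction(13, 27) * Fraction(14, 26) * Fraction(12, 25)
print(p, float(p))  # 1/15, about 0.0667

# Cross-check by enumerating all ordered selections of 4 distinct students.
students = ["G"] * 15 + ["U"] * 13
count = sum(1 for s in permutations(range(28), 4)
            if tuple(students[i] for i in s) == ("G", "U", "G", "U"))
total = 28 * 27 * 26 * 25
print(Fraction(count, total))  # 1/15
```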
Although conditional probability calculations are pretty straightforward, problems involving conditioning can be confusing. The trick to keeping things
straight is to be clear about the event to be conditioned upon. Here is one
standard example.
Illustration: The boy next door Suppose a family moves in next door
to you and you know they have two children, but you do not know whether
the children are boys or girls. Let us assume the probability that either
particular child is a boy is 1/2. We might label them Child 1 and Child 2
(e.g., Child 1 could be the older of the two). Thus, P(Child 1 is a boy) =
P(Child 2 is a boy) = 1/2. Now suppose you find out that one of the children
is a boy. What is the probability that the other child is also a boy?
It may seem that the answer is 1/2 but, if we assume that “you find out one
of the children is a boy” means at least one of the children is a boy, then the
correct answer is 1/3. Here is the argument. When you find out that one of
the children is a boy you don’t know whether Child 1 is a boy, nor whether
Child 2 is a boy, but you do know that one of them is a boy—and possibly
both are boys. This information amounts to telling you it is impossible that
both are girls. Let A be the event that both children are boys and B the
event that at least one child is a boy. We want P(A|B). Note that there are
four equally-likely possibilities:
P(Child 1 is a boy and Child 2 is a boy)
= P(Child 1 is a boy and Child 2 is a girl)
= P(Child 1 is a girl and Child 2 is a boy)
= P(Child 1 is a girl and Child 2 is a girl).

Thus, we compute P(A ∩ B) = P(A) = 1/4 and P(B) = 3/4. Plugging these
numbers into the formula for conditional probability we get P(A|B) = (1/4)/(3/4) = 1/3. □
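The conditional-probability argument can be confirmed by enumerating the four equally likely families directly (a sketch of ours):

```python
from fractions import Fraction
from itertools import product

# Four equally likely families, listed as (Child 1, Child 2).
S = list(product(["boy", "girl"], repeat=2))

B = [f for f in S if "boy" in f]           # at least one child is a boy
A = [f for f in S if f == ("boy", "boy")]  # both children are boys

# P(A | B) = P(A ∩ B) / P(B); here A ⊆ B, so A ∩ B = A.
p = Fraction(len(A), len(S)) / Fraction(len(B), len(S))
print(p)  # 1/3
```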
3.1.4 Probabilities multiply when the associated events are independent.
Intuitively, two events are independent when the occurrence of one event does
not change the probability of the other event. This intuition is captured by
conditional probability: the events A and B are independent when knowing
that B occurs does not affect the probability of A, i.e., P (A|B) = P (A). This
statement of independence is symmetrical: A and B are also independent if
P (B|A) = P (B). However, these statements are not usually taken as the
definition of independence because they require the events to have nonzero
probabilities (otherwise, conditional probability is not defined). Instead, the
following is used as a definition.
Definition: Independence Two events A and B are independent if and
only if P (A ∩ B) = P (A) · P (B).
Note that from this definition, when A and B are independent and P (B) > 0
we have, as a consequence, P (A|B) = P (A∩B)/P (B) = P (A)P (B)/P (B) =
P (A).
Multiplication of probabilities should be very familiar. If a coin has probability .5 of coming up heads when flipped, then we usually say the probability
of getting two heads is .25 = .5 × .5, because we usually assume that the two
flips are independent.
Example 3.1.1 (Continued) For the probabilities P(A) = .13 and P(B) = .22
given earlier we have P(A)P(B) = .029, while the probability of the intersection
was reported to be P(A ∩ B) = .063. The latter is more than double the
product P(A)P(B). We conclude that the two neurons are not independent.
Their tendency to fire much more often together than they would if they were
independent could be due to their being connected, to their having similar
response properties, or to their both being driven by network fluctuations
(see also Kelly et al., 2010). (Kelly, R.C., Smith, M.A., Kass, R.E., and Lee, T.S.
(2010) Local field potentials indicate network state and account for neuronal
response variability, J. Computational Neuroscience, to appear.)
□
This definition of independence extends immediately to more than two events.
It is extremely useful in that calculations are greatly simplified when events
can be assumed independent. On the other hand, it is treacherous because
the calculations can be way off if the assumption is wrong. The assumption of independence therefore plays a crucial role (even if hidden) in many
statistical models and methods. Several examples are given in the sections
below.
A simple but important detail is that independence of more than two events
(often called mutual independence) requires more than the factorization of
the overall intersection. For three events A, B, and C, mutual independence
means that every subset factorizes:

P(A ∩ B ∩ C) = P(A) · P(B) · P(C),
P(A ∩ B) = P(A) · P(B),
P(A ∩ C) = P(A) · P(C),
P(B ∩ C) = P(B) · P(C).

In particular, mutually independent events are always pairwise independent.
The converse fails: the three pairwise relations alone do not guarantee that
P(A ∩ B ∩ C) = P(A) · P(B) · P(C), so pairwise independence does not
imply mutual independence.
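A classic two-coin construction makes the distinction concrete: the three events below are pairwise independent, yet their triple intersection does not factor into the product of the three probabilities. (The construction is standard; the code is our sketch.)

```python
from fractions import Fraction
from itertools import product

# Two independent fair coin flips; 4 equally likely outcomes.
S = list(product("HT", repeat=2))

A = {s for s in S if s[0] == "H"}   # first flip is heads
B = {s for s in S if s[1] == "H"}   # second flip is heads
C = {s for s in S if s[0] == s[1]}  # the two flips agree

def prob(E):
    return Fraction(len(E), len(S))

# Each pair multiplies ...
assert prob(A & B) == prob(A) * prob(B)
assert prob(A & C) == prob(A) * prob(C)
assert prob(B & C) == prob(B) * prob(C)
# ... but the triple does not: A ∩ B forces the flips to agree, so C occurs.
print(prob(A & B & C), prob(A) * prob(B) * prob(C))  # 1/4 vs 1/8
```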
3.1.5 Bayes’ Theorem for events gives the conditional probability P(A|B) in terms of the conditional probability P(B|A).
Bayes’ Theorem is a very simple identity, which we derive easily below. Yet,
it has profound consequences. We can state its purpose formally, without
regard to its applications: Bayes’ Theorem allows us to compute P (A|B)
from the reverse conditional probability P (B|A), if we also know P (A). As
we will see below, and in Chapter 5, there are more complicated versions of
the theorem, and it is especially these that produce the wide range of applications. But the power of the result becomes apparent immediately when
we take B to be some data and A to be a scientific hypothesis. In this case,
we can use the statistical model P (data|hypothesis) to obtain the scientific
inference P (hypothesis|data). In the words used in Chapter 1, Bayes’ Theorem provides a vehicle for obtaining epistemic probabilities from descriptive
probabilities. The inverting of conditional probability statements, together
with the recognition that a different notion of probability was involved, led
to the name “inverse probability” during the early 1800s. This has been
replaced by the name “Bayes” in the theorem, and the adjective “Bayesian”
to describe many of its applications.[3] To appreciate the machinery behind
Bayes’ theorem, one should have a clear understanding of the law of total
probability. We say the sample space is partitioned by two events A and B
if A ∪ B = S and A ∩ B = ∅. Following that definition, for any event A, the
events A and Ac partition the sample space. Now, let B be any other event.
Theorem: Law of Total Probability For events A and B we have
P (B) = P (B ∩ A) + P (B ∩ Ac ) = P (B|A)P (A) + P (B|Ac )P (Ac ).
Proof: The events B ∩ A and B ∩ Ac are disjoint and their union is B, so
axiom (3) gives P(B) = P(B ∩ A) + P(B ∩ Ac); the multiplication rule then
yields the second equality. □

Note that we can partition the sample space by more than two events, so
the law of total probability may be generalized as follows. Let events
A1, · · · , An partition the sample space; in other words, for every pair of
events Ai and Aj with i ≠ j, Ai ∩ Aj = ∅, and additionally ∪_{i=1}^{n} Ai = S.
Then, for any other event B,

P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + · · · + P(An)P(B|An).     (3.2)
Now we are in a position to give a formal statement of Bayes’ theorem.
Bayes’ Theorem in the Simplest Case: If P(B) > 0 then

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)].     (3.3)

Proof: We begin with the definition of the conditional probability:
P(A|B) = P(A ∩ B)/P(B). Using the multiplication rule, we restate the
numerator as P(A ∩ B) = P(A)P(B|A). For the denominator, we employ the
law of total probability for two events, P(B) = P(B ∩ A) + P(B ∩ Ac).
Replacing the numerator and the denominator with their equivalent values
from the multiplication rule and the law of total probability, respectively,
we obtain Equation (3.3). □

[3] For historical comments see Stigler (1986) and Fienberg (2006). (Fienberg, S.E. (2006)
When did Bayesian inference become “Bayesian”? Bayesian Analysis, 1:1–40.)

Figure 3.2: Law of total probability when the sample space is partitioned by
five events (A1 to A5). To calculate the probability of B, we add the joint
probabilities P(Ai ∩ B), for i = 1, · · · , 5.
The “simplest case” modifier here refers to the statement of the theorem using only A and Ac as conditioning events. One interesting class of problems
where this simple case is useful is in the interpretation of clinical diagnostic
screening tests. These tests are used to indicate that a patient may have a
particular disease, but they are not definitive. Bayes’ Theorem serves as a
quantitative reminder that when a disease is rare, screening tests are preliminary, and other information will be needed to provide a diagnosis. A famous
example involves screening for prostate cancer based on the radioimmunoassay prostatic acid phosphatase (PSA). Even though the test is reasonably
accurate, the disease remains sufficiently rare that a random male who tests
as positive will still have a low probability of actually having prostate cancer.
An application of Bayes’ Theorem (with A being the event that a randomly
chosen man will have the disease and B the event that he tests positive) to
data from Watson and Tang (1980), places the probability of disease given a
positive test at about 1/250. The intuition comes from recognizing that the
disease has a prevalence of about 1/3000. Suppose we were to examine 3,000
men, 1 of whom actually had the disease. If the test were 90% accurate, a
10% false positive rate would mean that about 300 men would test positive.
In other words, about 1/300 of the positively-tested men would actually have
the disease. Bayes’ Theorem refines this very crude calculation. Here is an
example drawn from neurology.
Example 3.1.2 Diagnostic test for vascular dementia Vascular dementia (VD) is the second leading cause of dementia. It is important that it be
distinguished from Alzheimer’s disease because the prognosis and treatments
are different. In order to study the effectiveness of clinical tests for vascular
dementia, Gold et al. (1997) examined 113 brains of dementia patients post
mortem. (Gold G; Giannakopoulos P; Montes-Paixao Junior C; Herrmann
FR; Mulligan R; Michel JP; Bouras C. (1997) Sensitivity and specificity of
newly proposed clinical criteria for possible vascular dementia. Neurology,
49:690–4.) One of the clinical tests these authors considered was proposed
by the National Institute for Neurological Disorders (NINDS, an institute of
NIH). Gold et al. found that the proportion of patients with VD who were
correctly identified by the NINDS test was .58, while the proportion of patients
who did not have VD who were correctly so identified by the NINDS
test was .80. These proportions are usually called the sensitivity and specificity
of the test. Using these results, let us consider an elderly patient who
is identified as having VD by the NINDS test, and compute the probability
that this person will actually have the disease. Let A be the event that the
person has the disease and B the event that the NINDS test is positive. We
want P(A|B), and we are given P(B|A) = .58 and P(Bc|Ac) = .8. To
apply Bayes’ Theorem we need P(A). Let us take this probability to be
P(A) = .03 (which seems a reasonable value based on Hebert and Brayne,
1995; (Hebert, R. and Brayne, C. (1995) Epidemiology of vascular dementia,
Neuroepidemiology, 14:240–57)). We then also have P(Ac) = .97 and, in addition,
P(B|Ac) = 1 − P(Bc|Ac) = .2. Plugging these numbers into the formula
gives us

P(A|B) = (.58)(.03) / [(.58)(.03) + (.2)(.97)] = .082
or, approximately, 1/12. Thus, based on the Gold et al. study, because VD is
a relatively rare disease, without additional evidence, even when the NINDS
test is positive it remains unlikely that the patient has VD.
□
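The calculation in Example 3.1.2 is easy to package as a small function; the sketch below (our code, using the sensitivity, specificity, and prevalence from the example) reproduces the value .082:

```python
# Bayes' Theorem in the simplest case, applied to the NINDS screening
# numbers from Example 3.1.2.
def bayes(p_B_given_A, p_B_given_Ac, p_A):
    """Return P(A|B) from P(B|A), P(B|A^c), and the prior P(A)."""
    p_Ac = 1 - p_A
    return (p_B_given_A * p_A) / (p_B_given_A * p_A + p_B_given_Ac * p_Ac)

sensitivity = 0.58  # P(B | A)
specificity = 0.80  # P(B^c | A^c), so P(B | A^c) = 0.20
prevalence = 0.03   # P(A)

p = bayes(sensitivity, 1 - specificity, prevalence)
print(round(p, 3))  # 0.082
```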
As in Example 3.1.2, this form of Bayes’ Theorem requires probabilities
P (B|A), P (B|Ac ) and P (A) which must come from some background information. All applications of Bayes’ Theorem are analogous in needing
background information as inputs in order to get the desired conditional
probability as output.
To generalize Bayes’ Theorem from the simplest case we need the law of total
probability, which gives a formula for P (B) in terms of a decomposition of
S: Given mutually exclusive events A1 , A2 , . . . , An that are exhaustive in the
sense that S = A1 ∪ A2 ∪ · · · ∪ An , we have
B = (B ∩ A1 ) ∪ (B ∩ A2 ) ∪ · · · ∪ (B ∩ An )
with the sets B ∩ Ai being mutually exclusive. We then have
P(B) = P(B ∩ A1) + P(B ∩ A2) + · · · + P(B ∩ An)
     = P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|An)P(An)
     = Σ_{i=1}^{n} P(B|Ai)P(Ai).
From this we reach the more general form of the theorem.
Bayes’ Theorem Suppose A1, A2, . . . , An are mutually exclusive with
P(Ai) > 0 for all i, and A1 ∪ A2 ∪ · · · ∪ An = S. If P(B) > 0 then

P(Ak|B) = P(B|Ak)P(Ak) / [P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|An)P(An)].
Example 3.1.3 Pulmonary Disease Rosner (2000) gives an example of
a 60 year old nonsmoker with consistent chronic coughing and occasional
breathlessness. These are possible symptoms of lung cancer as well as a
common lung disease known as sarcoidosis. A lung biopsy is ordered for this
patient, revealing results consistent with both lung cancer and sarcoidosis.
Rosner formalizes the following notation:
Symptoms: A = {cough, results of lung biopsy}.
Disease: B1 =normal, B2 =lung cancer, B3 =sarcoidosis.
Suppose that P(A|B1) = 0.001, P(A|B2) = 0.90, P(A|B3) = 0.90, suggesting
that consistent chronic coughing as well as the results of the biopsy may
well indicate a disease. Also, suppose that for a typical 60 year old nonsmoker,
P(B1) = 0.99, P(B2) = 0.001, P(B3) = 0.009. According to Rosner,
the latter probabilities would have to be obtained from age-sex-smoking specific
prevalence studies for the two possible diseases. Here is how the generalized
Bayes’ theorem above can be used to calculate the conditional
probability of each of the disease states given the symptoms in event A.
P(B1|A) = P(B1)P(A|B1) / Σ_{i=1}^{3} P(Bi)P(A|Bi)
        = (0.99)(0.001)/[(0.99)(0.001) + (0.001)(0.90) + (0.009)(0.90)]
        = 0.00099/0.00999 = 0.099.

P(B2|A) = P(B2)P(A|B2) / Σ_{i=1}^{3} P(Bi)P(A|Bi) = (0.001)(0.90)/0.00999 = 0.090.

P(B3|A) = P(B3)P(A|B3) / Σ_{i=1}^{3} P(Bi)P(A|Bi) = (0.009)(0.90)/0.00999 = 0.811.
This suggests that the conditional probability of sarcoidosis given the chronic
coughing and the result of the biopsy is much higher than the conditional
probability of lung cancer given the same symptoms.
□
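The generalized theorem amounts to normalizing the products P(Bk)P(A|Bk); here is a sketch (ours) reproducing Rosner's three posteriors:

```python
# General form of Bayes' Theorem for a partition B1, ..., Bn,
# applied to the disease states of Example 3.1.3.
def posterior(priors, likelihoods):
    """Return P(Bk | A) for each k, given P(Bk) and P(A | Bk)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # law of total probability: P(A)
    return [j / total for j in joint]

priors = [0.99, 0.001, 0.009]      # normal, lung cancer, sarcoidosis
likelihoods = [0.001, 0.90, 0.90]  # P(symptoms | disease state)

post = posterior(priors, likelihoods)
print([round(p, 3) for p in post])  # [0.099, 0.09, 0.811]
```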
Example 3.1.4 Decoding of saccade direction from SEF spike counts
Bayes’ Theorem is frequently used to study the ability of relatively small
networks of neurons to identify a stimulus or determine a behavior. As an
example, Olson et al. (2000, J. Neurophysiol.) reported results from a study
of supplementary eye field neurons during a delayed-saccade task. In this
study, which we described previously in Chapter 1, there were four possible
saccade directions: up, right, down, and left. For each direction, and for each
neuron, spike counts in fixed pre-saccade time intervals were recorded across
multiple trials. From a combination of data analysis, and assumptions, the
probability distribution of various spike counts could be determined for each
of the four directions. If we consider a single neuron, we may then let B be
the event that a particular spike count occurs, and the events A1 , A2 , A3 ,
and A4 be the saccade directions up, right, down, left. Assuming the four
directions are equally likely, from the probabilities P (B|Ak ) together with
Bayes’ Theorem, we may determine from the spike count B the probability
that the saccade will be in each of the four directions. In Bayesian decoding,
the signals from many neurons are combined, and the direction Ak having
the largest posterior probability P(Ak|B) is considered the “predicted” direction. In
unpublished work, Kass and Ventura found that from 55 neurons (many of
which contributed relatively little information), Bayesian decoding was able
to predict the correct direction more than 95% of the time.
□
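To convey the flavor of Bayesian decoding for a single neuron, here is a hypothetical sketch: the Poisson assumption and the direction-dependent mean rates below are invented for illustration, and are not the Olson et al. or Kass and Ventura numbers.

```python
import math

# Hypothetical tuning: assume the spike count in the pre-saccade window is
# Poisson with a direction-dependent mean rate (rates invented for illustration).
rates = {"up": 12.0, "right": 5.0, "down": 3.0, "left": 5.0}
prior = 0.25  # four equally likely directions

def poisson_pmf(count, mean):
    return math.exp(-mean) * mean**count / math.factorial(count)

def decode(count):
    """Posterior P(direction | spike count) via Bayes' Theorem."""
    joint = {d: prior * poisson_pmf(count, r) for d, r in rates.items()}
    total = sum(joint.values())  # law of total probability
    return {d: j / total for d, j in joint.items()}

post = decode(10)  # observe 10 spikes in the window
best = max(post, key=post.get)
print(best)  # with these invented rates, "up" has the largest posterior
```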
3.2 Random Variables
In Chapter 1 we said it was important to distinguish data distributions from
probability distributions: probability distributions live in the “theoretical
world” while data live in the “real world.” But if probability distributions
are not, strictly speaking, about data, what are the theoretical things that
do obey the distributions? We need a name for these mathematical objects.
They are called random variables. In this section we develop some of their
basic attributes and properties, and we will use probability distributions to
describe the way they vary.
3.2.1 Random variables take on values determined by events.
A random variable maps the outcomes of the sample space of a random
experiment to real numbers. Formally, if s ∈ S represents one or more
outcomes in the sample space S, then X(s) ∈ R, where X is a random
variable. The most trivial example would be tossing a coin once. Suppose
we are interested in the number of heads occurring as an outcome of this
random experiment. Then X(H) = 1 and X(T) = 0. We can extend this
line of thinking to the experiment of rolling two fair dice. The sum of the
two rolls is then a random variable X, where, for example, P(X = 12) = 1/36
and P(X = 8) = 5/36, corresponding to the outcomes (2,6), (6,2), (3,5), (5,3),
and (4,4). Note that in the dice problem, Σ_{i=2}^{12} P(X = i) = 1. A probability
distribution of the random variable X tabulates each value X = i along with
P(X = i), its corresponding probability.
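The distribution of the dice sum can be tabulated by counting outcomes, as in this sketch of ours:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Distribution of X = sum of two fair dice, obtained by counting outcomes.
outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs
counts = Counter(sum(pair) for pair in outcomes)

pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
print(pmf[12], pmf[8])    # 1/36 5/36
print(sum(pmf.values()))  # 1
```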
Let us continue by considering the Hardy-Weinberg distribution, which is fundamental to population genetics because it describes the relative frequency
of genotypes in equilibrium. In the simplest case, we assume each parent
contributes one allele to an offspring. If we write this in the form
(allele from parent 1, allele from parent 2) → offspring pair of alleles
then the possibilities are denoted by
(A, A) → AA
(A, a) → Aa
(a, A) → Aa
(a, a) → aa.
For instance, according to the classic Mendelian story, a garden pea might
be wrinkled if aa but smooth otherwise. We assume that, in the population,
the probability of allele A being inherited by the offspring is P (A), and we
will also write this number as p, so that p = P (A). That is, if we were to
select some offspring at random from the population, the probability that
the offspring would have allele A is p. We also assume the inherited alleles
are independent, meaning that the allele inherited from parent 1 does not
affect the probability of inheriting allele A from parent 2. As we have already
seen, the assumption of independence implies that the probabilities may be
multiplied:
P (AA) = p2 ,
P (Aa) = p(1 − p) + (1 − p)p = 2p(1 − p),
P (aa) = (1 − p)2 .
Now take X to be the number of A alleles. Then
P (X = 2) = p2 ,
P (X = 1) = 2p(1 − p),
P (X = 0) = (1 − p)2 .
In this situation X is a random variable and it has a Binomial distribution.
Because of its importance, we will revisit the binomial distribution shortly.
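The Hardy-Weinberg probabilities can be tabulated for any particular p; in the sketch below (ours) we pick p = 1/4 purely for illustration and check that the pmf sums to 1:

```python
from fractions import Fraction

# Hardy-Weinberg probabilities for X = number of A alleles,
# with p = P(A) chosen here as 1/4 for illustration.
p = Fraction(1, 4)
pmf = {2: p**2, 1: 2 * p * (1 - p), 0: (1 - p)**2}

print(pmf[2], pmf[1], pmf[0])  # 1/16 3/8 9/16
print(sum(pmf.values()))       # 1
```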
In Chapter 1 we discussed the distinction between continuous and discrete
data. We may similarly distinguish continuous and discrete random variables:
a random variable is continuous if it can take on all values in some interval
(A, B), where it is possible that either A = −∞ or B = ∞ (or both).
The mathematical distinctions between discrete and continuous distributions
are that (i) discrete distributions take on only certain specific values (such
as non-negative integers) that can be separated from each other and (ii)
wherever summation signs appear for discrete distributions, integrals replace
them for continuous distributions.
3.2.2
Distributions of random variables are defined using cumulative distribution functions and probability density functions, from which theoretical
means and variances may be computed.
There are several definitions we need, which will apply to other probability
distributions besides the Binomial. In the Hardy-Weinberg example, the
values P (X = 0), P (X = 1), and P (X = 2) form the probability mass
function (pmf). For convenience, as indicated in Section 3.2.3, we would
typically write P (X = x), with x taking the values 0, 1, 2. The shorthand
p(x) = P (X = x) is also often used. The function F (x) = P (X ≤ x)
is called the cumulative distribution function (cdf). Thus, in the Hardy-Weinberg example we have F(1) = P(X ≤ 1) = P(X = 0) + P(X = 1).
From the pmf we can obtain the cdf, and vice-versa.
We can now formally define the cdf for a discrete random variable: Let X
be a random variable with the pmf f (x). The cdf is defined as
F(x) = P(X ≤ x) = Σ_{t ≤ x} f(t).
It is useful to consider some fundamental properties of the cumulative distribution function. First, F is a nondecreasing function. Second, F(x) tends to 1 as x → ∞ and to 0 as x → −∞. Finally, F is right-continuous. We prove these properties later in the chapter.
For discrete random variables, one can use the cdf to make a whole range of probabilistic calculations. For example, for any real value a,

Pr(X > a) = 1 − F(a).

Similarly, for real a ≤ b, we have

Pr(a ≤ X ≤ b) = F(b) − F(a−),

where F(a−) represents the limit of F as x → a from the left. Most importantly,

Pr(X = a) = F(a) − F(a−).

Using left limits in cdf computations allows us to express other probabilities such as

P(a ≤ X < b) = F(b−) − F(a−),
P(a < X < b) = F(b−) − F(a).
Illustration: Litter sizes of mice Suppose that 50 female mice were maintained in a facility, that each gave birth to a litter, and that the litter sizes may be summarized in the following table:

size    3   4   5   6   7   8
count   3   7  12  14  10   4
Let us consider choosing a mouse at random from among the 50 that gave
birth, and let X be the litter size for that mouse. By dividing each count in
the table above by 50 we get the following table for the probability distribution of X:
x      3    4    5    6    7    8
p(x)  .06  .14  .24  .28  .20  .08

Thus, p(3) = 3/50 = .06 signifies the probability that a randomly drawn mouse will have litter size 3.
□
Notice that a plot of the counts (against x) would be a histogram of the 50
litter sizes. Aside from the divisor of 50 used in getting each probability from
the corresponding count, a plot of p(x) against x would look the same as the
histogram of the counts; this would, instead, be a plot of the relative frequencies.4 More generally, a plot of a probability distribution looks something
like a histogram, except that the total amount of probability must equal 1.
4 In this context terminology is inconsistent: “frequency” can mean either “count” or “relative frequency.”
One way to understand any specification of probabilities p(x) is to consider
them to represent relative frequencies among a population of individuals.
However, in many cases the idea of a random drawing from a population is
an abstraction, and may be rather unrealistic. This is actually an important
philosophical point that has been argued about a great deal, but we will not
go into it.
In experimental settings, it is quite artificial to imagine that the
repeated measurements (trials) of an experiment are being drawn
at random from some population of such things. Similarly, when
there is a single unique event, such as the outcome of a football
game, or the flip of a fair coin, we can be comfortable speaking
about the probability of the outcome without any need for a
population. In the case of the coin, if we let X = 1 if it comes
up heads and X = 0 if it comes up tails, we would take p(1) =
P (X = 1) = .5 and p(0) = P (X = 0) = .5. We could, if we
wished, imagine some very large population of fair coins, just like
the one we are going to flip, among which, if flipped in just the
same way, half would come up heads and half would come up
tails. But we don’t really need this imaginary device: thinking
only about one single coin it remains easy enough to understand
the idea that it is “fair” precisely when p(1) = .5 and p(0) = .5.
That is, the notion that it is equally likely to be heads and tails
does not require further elaboration. If we wished to have an
operational meaning to “fair” we could take it to mean that we
are willing to accept a fair bet, i.e., one in which we would win
the same amount if heads as we would lose if tails.
For our purposes, what is important is that relative frequencies sometimes
define probabilities, and more generally provide a useful analogy for thinking
about probability.
Uniform distribution assigns equal probability to the
values of a random variable.
The uniform distribution assigns equal probabilities to the values of a discrete random variable X. That is,

f(x) = 1/k for x = 1, ..., k, and f(x) = 0 otherwise.
Therefore, for y = 1, ..., k, the cdf of X is

F(y) = Σ_{x=1}^{y} 1/k = y/k.
Binomial distribution is used to calculate the probability of independent repetitions of Bernoulli trials.
We demonstrated the idea behind the Binomial distribution in the genetics example of the previous section. We can now formalize the idea using the concept of Bernoulli trials. We referred to independent repetitions of random experiments having two possible outcomes as Bernoulli trials. A simple way of identifying Bernoulli trials is to represent their outcomes as 1s and 0s. If p is the probability of the occurrence of a 1, then q = 1 − p will be the probability of the occurrence of a 0; p is also known as the probability of success. We can then calculate the probability of success or failure via

P(X = x) = p^x q^{1−x},

for x = 0, 1. Clearly, the above is a pmf because P(X = 0) + P(X = 1) = 1. This is called the Bernoulli distribution.
The Bernoulli distribution can be generalized to calculate the probability of obtaining x successes (or 1s) in n independent Bernoulli trials via

P(X = x) = C(n, x) p^x q^{n−x},

for x = 0, ..., n, where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient. The relationship above is clearly a pmf because

Σ_{x=0}^{n} C(n, x) p^x q^{n−x} = (p + q)^n = 1,

via the Binomial expansion. This is called the Binomial distribution.
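A quick numeric check of these formulas (a Python sketch; R's `dbinom` serves the same purpose):

```python
import math

# Binomial pmf: P(X = x) = C(n, x) p^x q^(n-x), with q = 1 - p.
def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# The pmf sums to 1 over x = 0, ..., n (the binomial expansion of (p + q)^n).
n, p = 10, 0.3
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
print(total)  # ≈ 1.0

# The Hardy-Weinberg probabilities are the n = 2 special case.
print(binom_pmf(1, 2, p))  # ≈ 2p(1-p) = 0.42
```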
Poisson distribution gives the probability of the number of occurrences of a random phenomenon in a fixed
period of time.
The French mathematician Poisson studied the behavior of Bernoulli trials when n is large and p is relatively small, so that the product λ = np is of moderate value. In his book published in 1837, Poisson noted that by letting λ = np, the Binomial distribution can be looked at from the following alternative perspective:

P(X = x) = C(n, x) p^x q^{n−x}
= [n(n − 1)···(n − x + 1)(n − x)! / (x!(n − x)!)] (λ/n)^x (1 − λ/n)^{n−x}
= [n(n − 1)···(n − x + 1) / x!] (λ/n)^x (1 − λ/n)^n (1 − λ/n)^{−x}
= (λ^x / x!) [(n/n)((n − 1)/n) ··· ((n − x + 1)/n)] (1 − λ/n)^n (1 − λ/n)^{−x}.
n
Now let n go to infinity to take the limit of the above expression. Note that

lim_{n→∞} (1 − λ/n)^n = e^{−λ},

lim_{n→∞} (n/n)((n − 1)/n) ··· ((n − x + 1)/n) = 1,

and

lim_{n→∞} (1 − λ/n)^{−x} = 1.

Therefore,

lim_{n→∞} P(X = x) = e^{−λ} λ^x / x!,

for x = 0, 1, ..., which is called the Poisson distribution. This is a pmf because

Σ_{x=0}^{∞} e^{−λ} λ^x / x! = e^{−λ} e^{λ} = 1.
As described in Feller (1950), the Poisson distribution allows us to calculate the probability of k occurrences of 1s in Bernoulli trials within a fixed period of time. This can be explained alternatively by breaking the unit time interval into n segments, each of length 1/n. Each segment then contains either a 0 or one or more 1s. In the Binomial setting, the probability of success is taken to be the same in all subintervals. Thus, we are really dealing with Bernoulli trials in non-overlapping intervals. We may further restrict the situation by imposing the additional assumption that the probability of two or more 1s in each subinterval is negligible. Subsequently, the probability of the occurrence of exactly k 1s in the unit time interval is given by the limit argument presented above. This conceptualization of the Poisson random variable via Bernoulli trials will be of considerable importance to us as we attempt to model random phenomena in real life.
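The limit argument above can be checked numerically. The following Python sketch compares the Binomial(n, λ/n) pmf to the Poisson(λ) pmf for increasing n; the choice λ = 2 is arbitrary:

```python
import math

# Sketch: the Binomial(n, lambda/n) pmf approaches the Poisson(lambda) pmf
# as n grows, per the limit argument in the text.
def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 2.0
errs = []
for n in (10, 100, 1000):
    p = lam / n
    errs.append(max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam))
                    for x in range(11)))
print(errs)  # the maximum discrepancy shrinks as n increases
```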
The mean of a distribution is its balancing point while
the variance is an informative representation of its uncertainty.
Let us go on to the concepts of mean and variance. For the 50 litter sizes in the table on page 20 we would compute the mean as

mean = [3(3) + 7(4) + 12(5) + 14(6) + 10(7) + 4(8)] / 50 = 5.66.
Alternatively, we could write

mean = 3(3/50) + 4(7/50) + 5(12/50) + 6(14/50) + 7(10/50) + 8(4/50) = 5.66,

which, from the table on page 20, is the same as

mean = 3 · p(3) + 4 · p(4) + 5 · p(5) + 6 · p(6) + 7 · p(7) + 8 · p(8) = 5.66.
This latter form may be interpreted as the litter size we would expect to
see (“on average”) for a randomly drawn mouse, and it is an instance of
the general expression for the mean or expected value or expectation of the
random variable X:
μ_X = E(X) = Σ_x x · p(x).    (3.4)
Correspondingly, the variance of X is

σ_X^2 = Var(X) = Σ_x (x − μ_X)^2 · p(x)    (3.5)

and the standard deviation is σ_X = √(σ_X^2). The subscript X is often dropped,
leaving simply µ and σ. The standard deviation summarizes the magnitude
of the deviations from the mean; roughly speaking, it may be considered an
average amount of deviation from the mean. It is thus a measure of the
spread, or variability, of the distribution. There are alternative measures (such as Σ_x |x − μ| p(x)), and these are used in special circumstances, but
the standard deviation is the easiest to work with mathematically. It is,
therefore, the most common measure of spread.
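Equations (3.4) and (3.5) are straightforward to evaluate for the litter-size distribution above; here is a minimal Python sketch:

```python
# Mean and standard deviation of the litter-size distribution,
# computed directly from equations (3.4) and (3.5).
pmf = {3: .06, 4: .14, 5: .24, 6: .28, 7: .20, 8: .08}

mu = sum(x * p for x, p in pmf.items())               # equation (3.4)
var = sum((x - mu)**2 * p for x, p in pmf.items())    # equation (3.5)
sd = var ** 0.5

print(mu)        # ≈ 5.66, matching the text
print(var, sd)   # variance and standard deviation
```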
Note that µX and σX are theoretical quantities defined for distributions, and
are analogous to the mean and standard deviation defined for data. In fact,
if there are n values of x and we plug into (3.4) the special case p(x) = 1/n (which states that all n values of x are equally likely) we get back5 μ_X = x̄.
Because data are often called samples, the data-based mean and standard
deviation are often called the sample mean and the sample standard deviation
5 We also get σ_X = √((1/n) Σ_{i=1}^{n} (x_i − μ_X)^2), which is not quite the same thing as the sample standard deviation; the latter requires a change from n to n − 1 as the divisor for certain theoretical reasons, including that the sample variance then becomes an unbiased estimator of σ_X^2. See Chapter 3.
to differentiate them from µX and σX , which are often called the population
mean and standard deviation. This terminology distinguishes samples from
“populations,” rather than distributions, with the word “sample” connoting a batch of observations randomly selected from some large population.
Sometimes there is a measurement process that corresponds to such random
selection. However, as we have already mentioned, probability is much more
general than the population/sample terminology might lead one to expect;
specifically, we do not need to have a well-defined population from which we
are randomly sampling in order to speak of a probability distribution. So, at
least in principle, we might rather avoid calling µX a population mean. On
the other hand, the “sample” terminology is useful for emphasizing that we
are dealing with the observations, as opposed to the theoretical distribution,
and it is deeply embedded in statistical jargon. Similarly, the “population”
identifier is frequently used rather than “theoretical.” The crucial point is
that one must be careful to distinguish between a theoretical distribution
and the actual distribution of some sample of data. Many analyses assume
that data follow some particular theoretical distribution, and in doing so hope
that the match between theory and reality is pretty good. We will look at
ways of assessing this match in Section ??.
The following properties are often useful.
Theorem Let a be a real constant. Then:

E(a) = a    (3.6)

and

Var(a) = 0.    (3.7)

Proof: From (3.4), E(a) = Σ_x a p(x) = a Σ_x p(x) = a. From (3.5), Var(a) = Σ_x (a − a)^2 p(x) = 0. □
Theorem

E(a · X + b) = a · E(X) + b    (3.8)

and

σ_{aX+b} = |a| · σ_X.    (3.9)

Proof: For the expected value,

E(aX + b) = Σ_x (ax + b) p(x) = a (Σ_x x p(x)) + b Σ_x p(x) = a E(X) + b. □

The proof of the variance formula is similar.
Theorem Let X be a discrete random variable. Suppose that g is a real function. Then for real values a and b,

E[a · g(X)] = a · E[g(X)]    (3.10)

and

Var[a · g(X) + b] = a^2 · Var[g(X)].    (3.11)

Proof: To prove the first result, we note that for any real function U,

E[U(X)] = Σ_x U(x) p(x).

Let U(X) = a · g(X). Then

E[a · g(X)] = Σ_x a g(x) p(x) = a Σ_x g(x) p(x) = a · E[g(X)].

To show the second result, we note that for any real function U,

Var[U(X)] = E[U^2(X)] − E^2[U(X)].

Let U(X) = a · g(X). Then

Var[a · g(X)] = E[a^2 g^2(X)] − E^2[a g(X)] = a^2 E[g^2(X)] − a^2 E^2[g(X)] = a^2 (E[g^2(X)] − E^2[g(X)]) = a^2 · Var[g(X)].

Finally, adding the constant b shifts a · g(X) without changing its deviations from its mean, so Var[a · g(X) + b] = Var[a · g(X)], which completes the proof. □
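These linearity and scaling properties can be spot-checked numerically. A sketch, using a small made-up pmf:

```python
# Numeric spot-check of (3.8) and (3.9) on an arbitrary three-point pmf.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
a, b = 3.0, -1.0

E = lambda g: sum(g(x) * p for x, p in pmf.items())
mu = E(lambda x: x)
var = E(lambda x: (x - mu)**2)

mu_ab = E(lambda x: a * x + b)                  # E(aX + b) computed directly
var_ab = E(lambda x: (a * x + b - mu_ab)**2)    # Var(aX + b) computed directly

print(abs(mu_ab - (a * mu + b)))   # ~0: E(aX + b) = aE(X) + b
print(abs(var_ab - a**2 * var))    # ~0: Var(aX + b) = a^2 Var(X)
```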
3.2.3
Continuous random variables are similar to discrete random variables.
Suppose X is a continuous random variable on an interval (A, B), with A =
−∞ and B = ∞ both being possible. The probability density function (pdf)
of X will be written as f (x) where now
P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx

and

∫_{A}^{B} f(x) dx = 1.
Note that in this continuous case there is no distinction between P (a ≤ X)
and P (a < X) (we have P (X = a) = 0). We may think of f (x) as the
probability per unit of x; f (x)dx is the probability that X will lie in an
infinitesimal interval about x: f (x)dx = P (x ≤ X ≤ x + dx). In some
contexts there are various random variables being considered and we write
the pdf of X as fX (x).
A technical point is that when either A > −∞ or B < ∞, by convention,
the pdf f (x) is extended to (−∞, ∞) by setting f (x) = 0 outside (A, B).
Typically, when we say that X is a continuous random variable on an interval
(A, B) we will mean that f (x) > 0 on (A, B). We next give several examples
of continuous distributions.
Illustration: Uniform distribution Perhaps the simplest example is the
uniform distribution. For instance, if the time of day at which births occurred
followed a uniform distribution, then the probability of a birth in any given
30 minute period would be the same as that for any other 30 minute period
throughout the day. In this case the pdf f (x) would be constant over the
interval from 0 to 24 hours. Because it must integrate to 1, we must have
f(x) = 1/24, and the probability of a birth in any given 30 minute interval starting at a hours is ∫_a^{a+.5} f(x) dx = 1/48. When a random variable X has a uniform distribution on a finite interval (A, B) we write this as X ∼ U(A, B), and the pdf is f(x) = 1/(B − A). □
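The birth-times calculation reduces to integrating a constant pdf, which a few lines of Python make explicit (a sketch; the interval arithmetic is all that is involved):

```python
# X ~ U(0, 24): constant pdf 1/24, so any 30-minute (0.5-hour) interval
# has probability 0.5/24 = 1/48.
A, B = 0.0, 24.0

def prob_interval(a, b):
    # integral of the constant pdf f(x) = 1/(B - A) over (a, b), clipped to (A, B)
    lo, hi = max(a, A), min(b, B)
    return max(hi - lo, 0.0) / (B - A)

print(prob_interval(6.0, 6.5))    # 1/48 ≈ 0.0208
print(prob_interval(0.0, 24.0))   # 1.0: the whole day
```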
Figure 3.3: Plots of pdfs for four continuous distributions. Top left: Exponential. Top right: Gamma, with shape parameter 2. Bottom left: Beta. Bottom right: Normal.
Figure 3.3 displays pdfs for four common distributions. For the two in the top
panels, Exponential and Gamma distributions, X may take on all positive
values, i.e., values in (0, ∞). The lower left panel shows a Beta distribution,
which is confined to the interval (0,1). A Normal distribution, which ranges
over the whole real line, is shown in the bottom right panel.
Illustration: Normal distribution The normal distribution (also called
the Gaussian distribution) is the most important distribution in Statistics.
The reason for this, however, has little to do with its ability to describe data:
Example ?? presents one of the few examples we know in which the data
really appear normally distributed to a high degree of accuracy; it is rare
for a batch of data not to be detectably non-normal. Instead, in statistical
inference, the normal distribution is used to describe the variability in quantities derived from the data as functions of a sample mean. As we discuss
in Chapter ??, according to the Central Limit Theorem, sample means are
approximately normally distributed and in Chapter ??, we will also see that
functions of a sample mean are approximately normally distributed.
The normal distribution is characterized by two parameters: the mean and
the standard deviation (or, equivalently, its square, the variance). When a
random variable X is normally distributed we write X ∼ N(μ, σ^2). Both in most software and in most applications, one speaks of the parameters μ and σ rather than μ and σ^2. The pdf for the normal distribution with mean μ and standard deviation σ is

f(x) = (1 / (√(2π) σ)) exp( −(1/2) ((x − μ)/σ)^2 ).
This pdf can be hard to use for analytic calculations because its integral cannot be obtained in explicit form. Thus, probabilities for normal distributions
are almost always obtained numerically. Because of its shape the normal
pdf is often called “the bell-shaped curve.” We exemplify this in the next
example.
□
In fact, the general bell shape of the distribution is not unique to the normal distribution. On the other hand, the normal is very special among
bell-shaped distributions. The most important aspect of its being very special is its role in the Central Limit Theorem, which we’ll come back to in
Chapter ??. We also describe additional important properties of normal
distributions in Chapter ??.
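Although the normal integral has no closed form, the fact that the pdf integrates to 1 is easy to verify numerically. A Riemann-sum sketch in Python (the grid width and the ±10σ truncation are arbitrary choices):

```python
import math

# Numeric check: the N(mu, sigma^2) pdf integrates to approximately 1.
def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma)**2) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma = 1.0, 2.0
dx = 0.001
# Riemann sum over mu +/- 10 sigma; the tails beyond contribute negligibly.
lo, hi = mu - 10 * sigma, mu + 10 * sigma
n = int((hi - lo) / dx)
total = sum(normal_pdf(lo + i * dx, mu, sigma) * dx for i in range(n))
print(total)  # ≈ 1.0
```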
We can show that f(x) is a pdf by letting y = (x − μ)/σ, so that dx = σ dy and

∫_{−∞}^{∞} f(x) dx = (1/√(2π)) ∫_{−∞}^{∞} e^{−y^2/2} dy.

It therefore suffices to show that I = ∫_{−∞}^{∞} e^{−y^2/2} dy equals √(2π). We have

I^2 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(y^2 + z^2)/2} dy dz.

Now consider using polar coordinates. That is, let z = r sin θ and y = r cos θ, so that dy dz = r dθ dr and y^2 + z^2 = r^2, for r > 0 and θ ∈ (0, 2π), leading to

I^2 = ∫_0^{∞} ∫_0^{2π} e^{−r^2/2} r dθ dr = ( ∫_0^{∞} r e^{−r^2/2} dr ) ( ∫_0^{2π} dθ ) = 2π,

or equivalently, I = √(2π), which completes the argument for f(x) being a pdf. □

3.2.4
Cumulative distribution functions of continuous
random variables are obtained by integrating their
pdf
The cumulative distribution function (cdf ), or simply distribution function, is
written again as F (x) and is defined as in the discrete case: F (x) = P (X ≤
x). If A = −∞ and B = ∞ this becomes
F(x) = ∫_{−∞}^{x} f(t) dt.
If A is a number, i.e., −∞ < A, then F (x) = 0 when x < A and
F(x) = ∫_{A}^{x} f(t) dt,
while if B is a number (B < ∞) then F(x) = 1 when x > B. By differentiation (the Fundamental Theorem of Calculus) we also have F′(x) = f(x).
The three important properties for the cdf also hold for continuous random
variables. This time, we prove those properties through a theorem.
Theorem. Let X be a random variable with probability density function f and cumulative distribution function F. Then:

(1) F is a nondecreasing function. In other words, if x1 < x2, then F(x1) ≤ F(x2).

(2) F takes values between 0 and 1, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

(3) F is continuous from the right. That is, F(x) = lim_{u→x, u>x} F(u).
Proof:

(1) To show that F(x) is nondecreasing, suppose that x1 < x2 are two real values. Now, note that {X ≤ x1} ⊂ {X ≤ x2}. Therefore, P({X ≤ x1}) ≤ P({X ≤ x2}), or F(x1) ≤ F(x2). This means F is nondecreasing.

(2) The proof of this property is more involved. First (see the problems), we need to show that for any increasing or decreasing sequence of events {A_n, n ≥ 1},

lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n).    (3.12)

Now, suppose that y1 < y2 < ... < yn < ... is an increasing sequence of real numbers tending to ∞. The events {X ≤ yn} form an increasing sequence because {yn} gets larger as n increases, and lim_{n→∞} {X ≤ yn} = {X < ∞}. From (3.12), lim_{n→∞} F(yn) = lim_{n→∞} P({X ≤ yn}) = P({X < ∞}) = 1. In other words, lim_{x→∞} F(x) = 1. The proof of lim_{x→−∞} F(x) = 0 is similar, using a decreasing sequence.

(3) Suppose that y1 > y2 > ... is a decreasing sequence of real values that converges to x. Then, because of (3.12), F(x) = P(X ≤ x) = lim_{n→∞} P(X ≤ yn) = lim_{n→∞} F(yn). This completes the proof, because the last expression is the limit of F from the right at x. □
Note that the cdf accumulates probability up to and including a certain value,
say x. For continuous random variables, this translates into the associated
area under the probability density curve up to x. Sometimes, the area under
the curve is particularly important. For example, F(M) = P(X ≤ M) = 1/2 defines the median M. Mathematically,

∫_{−∞}^{M} f(x) dx = 1/2.
In general, Q is called a quantile when F(Q) = P(X ≤ Q) = p_Q, for some probability p_Q. In addition to the median, one can think of other important quantiles such as the first and third quartiles, for which p_Q = 0.25 and p_Q = 0.75, respectively. Quantiles are referred to as percentiles when the areas under the curve are expressed in percentages. Let M_p be the p-th percentile; then

F(M_p) = ∫_{−∞}^{M_p} f(x) dx = p,

which in turn gives

M_p = F^{−1}(p).

For some distributions, this inverse calculation is quite important and has many applications. Fortunately, as shown in the appendix, R comes with library functions for the calculation of p and M_p for a wide range of distributions.
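When no library routine is at hand, M_p = F^{−1}(p) can be computed numerically by bisection, since F is nondecreasing. A Python sketch, illustrated with the standard normal cdf (the search interval and tolerance are arbitrary choices):

```python
import math

# Compute M_p = F^{-1}(p) by bisection for any continuous, nondecreasing cdf F.
# Illustrated with the standard normal cdf, F(x) = (1 + erf(x / sqrt(2))) / 2.
def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile(F, p, lo=-50.0, hi=50.0, tol=1e-10):
    # F is nondecreasing, so bisection on F(mid) - p converges.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(quantile(norm_cdf, 0.50))   # median of N(0,1): ≈ 0
print(quantile(norm_cdf, 0.975))  # ≈ 1.96
```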
Illustration. The Median of a Uniform Distribution Suppose that X ∼ U(a, b) for real values a and b with a < b. In other words,

f(x) = 1/(b − a) for a < x < b, and f(x) = 0 otherwise.

We have ∫_{a}^{M} dx/(b − a) = 1/2, leading to M = (a + b)/2.
In the continuous case, the expected value of X is

μ_X = E(X) = ∫_{A}^{B} x f(x) dx

and the standard deviation of X is σ_X = √(V(X)), where

V(X) = ∫_{A}^{B} (x − μ_X)^2 f(x) dx

is the variance of X. Note that in each of these formulas we have simply replaced sums by integrals in the analogous definitions for discrete random
variables. Note, too, that pdf and cdf values for certain continuous distributions may be computed with statistical software.6 We again have
μ_{a·X+b} = a · μ_X + b    (3.13)

σ_{a·X+b} = |a| · σ_X.    (3.14)
These formulas are just as easy to prove as (3.8) and (3.9).
The quantiles or percentiles are often used in working with continuous distributions: for p a number between 0 and 1 (such as .25), the pth quantile
or 100p-th percentile (e.g., the .25 quantile or the 25th percentile) of a distribution having cdf F (x) is the value η such that p = F (η). Thus, we may
write η = F −1 (p), where F −1 is the inverse cdf.
Illustration: Exponential distribution Let us illustrate these ideas in
the case of the Exponential distribution, which is special because it is easy
to handle and also because of its importance in applications. An interesting
application to ion channel opening durations appears in Section ??.
A random variable X is said to have an Exponential distribution with parameter λ when its pdf is
f(x) = λ e^{−λx}    (3.15)
for x > 0, and is 0 for x ≤ 0. We will then say that X has an Exp(λ)
distribution and we will write X ∼ Exp(λ) where the sign ∼ means “is
distributed as”. The pdf of X when X ∼ Exp(1) is shown in Figure 3.4.
Also illustrated in that figure is computation of probabilities as areas under
the pdf. For example,
P(X > 2) = ∫_{2}^{∞} f(x) dx,
which means we compute the area under the curve to the right of x = 2.
For the exponential distribution this value is easy to compute using calculus.
6 The definitions of expectation and variance assume that the integrals are finite; there are, in fact, some important probability distributions that do not have expectations or variances because the integrals are infinite.
Figure 3.4: The pdf of a random variable X having an Exponential distribution with λ = 1. The shaded area under the pdf gives P (X > 2).
The cdf of an Exponential distribution is

F(x) = ∫_{0}^{x} λ e^{−λt} dt = [−e^{−λt}]_{0}^{x} = 1 − e^{−λx}.
Thus, when X ∼ Exp(λ) we also have
P(X > x) = e^{−λx},

and if λ = 1,

P(X > 2) = 1 − F(2) = e^{−2}.    (3.16)
The quantiles are also easily obtained. For example, if X ∼ Exp(λ) the .95 quantile of X is the value η.95 such that P(X ≤ η.95) = F(η.95) = .95. We have

.95 = F(η.95) = 1 − e^{−λ η.95},

and solving for η.95 gives η.95 = − log_e(.05)/λ.

If X ∼ Exp(λ) then, by calculations we omit here, we obtain

E(X) = 1/λ
V(X) = 1/λ^2
σ_X = 1/λ.
We discuss the Exponential distribution further in Section ??.
□
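The closed-form cdf makes these exponential calculations a one-liner each. A Python sketch checking the formulas just derived:

```python
import math

# Checking the Exp(lambda) formulas derived in the text, with lambda = 1.
lam = 1.0
F = lambda x: 1.0 - math.exp(-lam * x)   # cdf: 1 - e^{-lambda x}

print(1.0 - F(2.0))                      # P(X > 2) = e^{-2} ≈ 0.1353

eta95 = -math.log(0.05) / lam            # the .95 quantile
print(F(eta95))                          # ≈ 0.95, as it should be
```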
The formulas and concepts that apply to random variables are usually stated
with the notation of integrals rather than sums. This is partly because it is
cumbersome to repeat everything for both continuous and discrete random
variables, when the results are in essence the same. In fact, there is an
elegant theory of integration7 that, among other things, treats continuous
and discrete random variables together, with summations becoming special
cases of integrals. Throughout our presentation we will, for the most part,
discuss the continuous case with the understanding that the analogous results
follow for discrete random variables. For example, the terminology pdf will
be used for both continuous and discrete random variables, where for the
latter it will really refer to a probability mass function.
For many purposes we don’t actually need formulas such as those derived for
the Exponential distribution. Most statistical software contains routines to
generate random observations artificially8 from standard distributions, such
as those presented below, and the software will typically also provide pdf
values, probabilities, and quantiles.
7 Lebesgue integration is a standard topic in mathematical analysis; see for example, Billingsley, P. (1995) Probability and Measure, Third Edition.

8 The numbers generated by the computer are really pseudo-random numbers because they are created by algorithms that are actually deterministic, so that in very long sequences they repeat and their non-random nature becomes apparent. However, good computer simulation programs use good random number generators, which take an extremely long time to repeat, so this is rarely a practical concern.
3.2.5
The hazard function of a random variable X at x
is its conditional probability density, given that
X ≥ x.
Another useful characterization of a probability distribution arises in some
specialized contexts, including the analysis of spike train data. The hazard
function of a continuous random variable X is
λ(x) = f(x) / (1 − F(x)).
Note that if P (B) > 0 then
P(A|B) = P(A ∩ B) / P(B).
Applying this to a continuous random variable X we have
P(X ∈ (x, x + h) | X > x) = (F(x + h) − F(x)) / (1 − F(x)).
Therefore,

lim_{h→0} P(X ∈ (x, x + h) | X > x) / h = f(x) / (1 − F(x)) = λ(x),
which provides the fundamental interpretation of λ(x)dx: it is the probability that X ∈ (x, x + dx) given X > x. In the context of lifetime analysis, where X is the lifetime of a unit (a lightbulb; a person), its values are in units of
time t and λ(t)dt is the probability of failure (death) in the interval (t, t + dt)
given that failure has not yet occurred. If X is the elapsed time that an
ion channel is open then λ(t)dt becomes the probability the ion channel will
close in the interval (t, t + dt), given that it has remained open up to time t.
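For the Exponential distribution the hazard works out to a constant, which is the "memoryless" feature that makes it natural for ion-channel durations. A Python sketch (λ = 0.5 is an arbitrary choice):

```python
import math

# The hazard of an Exponential distribution is constant:
# lambda(x) = f(x) / (1 - F(x)) = lam for all x > 0.
lam = 0.5
f = lambda x: lam * math.exp(-lam * x)       # pdf
F = lambda x: 1.0 - math.exp(-lam * x)       # cdf
hazard = lambda x: f(x) / (1.0 - F(x))

print(hazard(0.1), hazard(3.0))  # both ≈ 0.5: no "aging"
```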
3.2.6
Sometimes it is desirable to derive the pdf of a
function of a random variable
Let X be a random variable whose p.d.f. and c.d.f. are known and are written as f(x) and F(x), respectively. Consider a strictly increasing function of X, say U(X). Suppose that we are interested in obtaining the p.d.f. of Y = U(X). Note that Y is a new random variable whose c.d.f. can be calculated:

G(y) = Pr[Y ≤ y] = Pr[U(X) ≤ y] = Pr[X ≤ U^{−1}(y)] = F[U^{−1}(y)].

To obtain the p.d.f. of Y, we simply differentiate G with respect to y:

g(y) = dG(y)/dy = dF[U^{−1}(y)]/dy = f[U^{−1}(y)] dU^{−1}(y)/dy.

Note that the same approach applies when U is a strictly decreasing function of X. Nonetheless, as shown below, one has to be careful about the details:

G(y) = Pr[Y ≤ y] = Pr[U(X) ≤ y] = Pr[X ≥ U^{−1}(y)] = 1 − F[U^{−1}(y)].

Therefore,

g(y) = dG(y)/dy = d{1 − F[U^{−1}(y)]}/dy = −f[U^{−1}(y)] dU^{−1}(y)/dy.

But since U is strictly decreasing, the derivative in the last term is negative. So altogether, we come up with this general relationship:

g(y) = f[U^{−1}(y)] |dU^{−1}(y)/dy|.    (3.17)
Illustration: A simple example for the change of variable. Suppose that random variable X has the pdf

f(x) = 4x^3 for 0 < x < 1, and f(x) = 0 otherwise.

We would like to derive the pdf of Y = 1 − 3X^2. Since x = √((1 − y)/3) (taking the positive root, because 0 < x < 1), we have

|dx/dy| = 1 / (2√(3(1 − y))).

Therefore, from (3.17),

g(y) = (2/9)(1 − y) for −2 < y < 1, and g(y) = 0 otherwise.

3.2.7
Moment Generating Functions are useful in identifying probability distributions
Let X be a random variable, and k be a positive integer. We call E(X^k) the k-th moment of X. The mean is the first moment, because μ = E(X). The variance can be calculated using the first two moments, because σ^2 = E(X^2) − [E(X)]^2. We define the k-th central moment as E(X − μ)^k. The variance then is the second central moment since σ^2 = E(X − μ)^2. Note that the first central moment is always zero: E(X − μ) = E(X) − μ = μ − μ = 0.
An interesting result about moments is that if moments of a higher order exist, the moments of all lower orders should exist as well. That is, if k > j are two integers, and if E(X^k) < ∞ then E(X^j) < ∞ (see problem 52).
Why should one care about moments? Isn’t it sufficient to derive the mean
and the variance of random variables? Most of the time the answer is yes.
However, we can learn more about random variables and their distributions
by studying their moment generating functions, as explained below.
Suppose that X is a random variable, and t is a real number. Then

M(t) = E(e^{tX})    (3.18)

is called the moment generating function or the m.g.f. of X. Obviously, M(0) = E(1) = 1. Also, M′(0) = [d/dt M(t)]_{t=0} = E(X). In other words, the first derivative of the m.g.f. evaluated at 0 generates the mean. Moreover,

M^{(k)}(0) = E(X^k).    (3.19)

That is, the k-th derivative of the m.g.f. evaluated at 0 generates the k-th moment.
40
CHAPTER 3. PROBABILITY AND RANDOM VARIABLES
The proof of 3.19 is built around the assumption that one can change the
order of differentiation and integration. This is possible here because of a
property called the Leibniz rule. We provide the details in an appendix at
the end of the book.
Illustration: Moment generating function of an exponential random variable. Suppose that X ∼ Exp(λ). In other words, f(x) = λe^{−λx}, for x > 0 and λ > 0. To obtain the m.g.f., we write

E[e^{tX}] = ∫_{0}^{∞} e^{tx} λ e^{−λx} dx = λ ∫_{0}^{∞} e^{(t−λ)x} dx.

Note that in order for this integral to converge, we must have t < λ. Therefore,

M(t) = E[e^{tX}] = λ / (λ − t).
Two immediate consequences of the above are

E(X) = M'(0) = \frac{1}{\lambda},

and

Var(X) = M''(0) - [M'(0)]^2 = \frac{1}{\lambda^2}.
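As a quick numerical sanity check of these two consequences (a sketch, not part of the text; the value \lambda = 2 is an arbitrary choice), one can approximate the derivatives of M(t) = \lambda/(\lambda - t) at t = 0 with finite differences and compare them against 1/\lambda and 2/\lambda^2:

```python
# Finite-difference check of the m.g.f. results above: for an exponential(lam)
# variable, M'(0) should equal the mean 1/lam and M''(0) the second moment 2/lam^2.

def mgf(t, lam=2.0):
    # m.g.f. of exponential(lam), valid for t < lam
    return lam / (lam - t)

def d1(f, t, h=1e-5):
    # central difference for the first derivative
    return (f(t + h) - f(t - h)) / (2 * h)

def d2(f, t, h=1e-4):
    # central difference for the second derivative
    return (f(t + h) - 2 * f(t) + f(t - h)) / h**2

mean = d1(mgf, 0.0)            # should be close to 1/lam = 0.5
second_moment = d2(mgf, 0.0)   # should be close to 2/lam**2 = 0.5
variance = second_moment - mean**2
print(mean, second_moment, variance)  # variance should be close to 1/lam**2 = 0.25
```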
Theorem: Moment generating function of the sum of independent random variables. Suppose that the random variables X_1, ..., X_n are independent, and let M_{X_i}(t) denote the m.g.f. of X_i. Let Y = X_1 + \cdots + X_n. The m.g.f. of Y is as follows:

M_Y(t) = \prod_{i=1}^{n} M_{X_i}(t).                              (3.20)

Proof: We can prove 3.20 by writing the definition of the m.g.f. for Y:

M_Y(t) = E[e^{tY}] = E[e^{t(X_1 + \cdots + X_n)}] = E\left[\prod_{i=1}^{n} e^{tX_i}\right].
But X_1, ..., X_n are independent. Therefore,

E\left[\prod_{i=1}^{n} e^{tX_i}\right] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\prod_{i=1}^{n} e^{tx_i}\right) f(x_1, ..., x_n) \, dx_1 \cdots dx_n = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\prod_{i=1}^{n} e^{tx_i}\right)\left(\prod_{i=1}^{n} f(x_i)\right) dx_1 \cdots dx_n = \prod_{i=1}^{n} E[e^{tX_i}].
Illustration: Moment generating function of sums of Poisson random variables. Suppose that X_1, ..., X_n are independent Poisson random variables with intensity rates \lambda_i. In other words,

f(x_i) = \frac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!},

for x_i = 0, 1, 2, .... It is left as an exercise for the reader to show that the m.g.f. for this random variable is of the form

M_{X_i}(t) = e^{\lambda_i(e^t - 1)}.

Suppose that Y = X_1 + \cdots + X_n. Then, according to the previous theorem,

M_Y(t) = \prod_{i=1}^{n} e^{\lambda_i(e^t - 1)} = e^{\left(\sum_{i=1}^{n} \lambda_i\right)(e^t - 1)},

which is the m.g.f. of a Poisson distribution with intensity rate \sum_{i=1}^{n} \lambda_i.
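This closure of the Poisson family under independent sums can also be seen empirically. The sketch below is illustrative only: the rates 1.5, 2.0, and 0.5 are arbitrary choices, and Poisson draws are generated with Knuth's uniform-product algorithm because the Python standard library has no Poisson sampler. The mean and variance of the simulated sum should both be close to \sum \lambda_i = 4.

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's algorithm: multiply uniforms until the product drops below e^{-lam};
    # the number of multiplications minus one is a Poisson(lam) draw.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)
rates = [1.5, 2.0, 0.5]          # arbitrary lambda_i; total intensity 4.0
n_draws = 100_000
totals = [sum(poisson_sample(lam, rng) for lam in rates) for _ in range(n_draws)]

mean = sum(totals) / n_draws
var = sum((t - mean) ** 2 for t in totals) / n_draws
print(mean, var)  # both should be close to 4.0
```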
Problems
Basics
(1) Show that if A ⊂ B then Pr(A) ≤ Pr(B).

(2) Show that Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

(3) Challenge. Show that:

Pr(A ∪ B ∪ C) = Pr(A) + Pr(B) + Pr(C) − Pr(A ∩ B) − Pr(A ∩ C) − Pr(B ∩ C) + Pr(A ∩ B ∩ C)
(4) Challenge. Consider the events A_1, ..., A_n. Prove that:

Pr(A_1 ∪ A_2 ∪ ... ∪ A_n) = \sum_{i=1}^{n} Pr(A_i) − \sum_{i<j} Pr(A_i ∩ A_j) + \sum_{i<j<k} Pr(A_i ∩ A_j ∩ A_k) − ... ± Pr(A_1 ∩ A_2 ∩ ... ∩ A_n)
(5) Give an example in which

Pr(A ∪ B) < Pr(A) + Pr(B)
(6) Let Pr(A∆B) denote the probability that exactly one of the two events A or B occurs. Show that:

|Pr(A) − Pr(B)| ≤ Pr(A∆B)
(7) Verify the inequality

Pr(A∆C) ≤ Pr(A∆B) + Pr(B∆C)

(8) Prove that:

(a) If Pr(A) = Pr(B) = 0, then Pr(A ∪ B) = 0.
(b) If Pr(A) = Pr(B) = 1, then Pr(A ∩ B) = 1.
(9) Three coins are tossed: two nickels and a dime. We pay 10 cents to
enter the game and receive those coins which fall heads up. What is
the probability of winning some money?
(hint: write the sample space!)
(10) Suppose that all of the freshmen of the engineering college at UCLA took calculus and discrete mathematics last semester. Suppose that
70% of the students passed calculus, 55% passed discrete math, and
45% passed both. If a randomly selected freshman is found to have
passed calculus last semester, what is the probability that he/she also
passed discrete math last semester?
(11) Suppose that a bus arrives at a station every day at a random time between 1:00 P.M. and 1:30 P.M. Suppose that at 1:10 the bus has not yet arrived, and we want to calculate the probability that it will arrive within the next 10 minutes.
(12) An English class consists of 15 Koreans, 10 Iranians, and 20 Germans.
A paper is found belonging to one of the students of this class. If the
name on the paper is not Korean, what is the probability that it is
Iranian? (Assume the names completely identify ethnic groups.)
(13) Suppose that two dice were rolled and it was observed that the sum T
of the numbers was odd. What is the probability that T was less than
8?
Independence of Two Events
(14) Two machines 1 and 2 in a factory are operated independently of each
other. Let A be the event that machine 1 will become inoperative
during a given 8-hour period; let B be the event that machine 2 will
become inoperative during the same period. Suppose that P(A) = 1/3 and P(B) = 1/4. Find the probability that at least one of the machines will become inoperative during the given period.
(15) A balanced die is rolled. Let A be the event that an even number is
obtained, and let B be the event that one of the numbers 1,2,3, or 4 is
obtained. Show that the events A and B are independent.
(16) Assume that two events A and B are independent. Prove that events A and B^c are also independent.
Law of Total Probability
(17) Let S = {s_1, s_2, s_3, s_4}, and suppose that the probability of each outcome is 1/4. Let A, B, and C be defined as follows:

A = {s_1, s_2},  B = {s_1, s_3},  and  C = {s_1, s_4}.

Remember:

P(A) = \sum_{j=1}^{k} P(B_j) P(A|B_j)                             (3.21)

P(A|C) = \sum_{j=1}^{k} P(B_j|C) P(A|B_j ∩ C)                     (3.22)
(18) Two boxes contain long bolts and short bolts. Suppose that one box
contains 60 long bolts and 40 short bolts, and the other box contains
10 long bolts and 20 short bolts. Suppose also that one box is selected
at random and a bolt is then selected at random from that box. What
is the probability that this bolt is long?
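Equation 3.21 applies directly to this setup by conditioning on which box was selected. The snippet below is a sketch of the computation (it gives the exercise's answer away, so skip it if you want to work the problem first):

```python
# Law of total probability (eq. 3.21) applied to problem 18: each box is
# selected with probability 1/2, then a bolt is drawn uniformly from that box.
boxes = [
    {"long": 60, "short": 40},
    {"long": 10, "short": 20},
]
p_box = 1 / len(boxes)  # each box equally likely to be selected

p_long = sum(
    p_box * b["long"] / (b["long"] + b["short"]) for b in boxes
)
print(p_long)  # 7/15, about 0.4667
```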
Bayes' Theorem
(19) 35% of CSUB’s undergraduate students are males and 65% are females.
Suppose that 3% of males and 4% of females are Business majors. Find the probability that a randomly chosen Business major is male.
(20) I have two dice: one is a four-sider and the other is a 20-sider. The sides of the first are labeled 1, 2, 3, 4, and those of the second are labeled 1, 2, 3, ..., 20. Considering each die separately, you regard its sides as equally likely. I pick one of the two dice and get either the four-sider, event F, or the 20-sider, event T = F^c, with Pr(F) = Pr(F^c) = 1/2. I roll the die I picked. If I tell you the result is 13, you do not need Bayes' rule to know that I picked the 20-sider, because Pr(F | rolled 13) = 0. However, suppose that I tell you the result was 3. Which die did I pick? We know two things:

Pr(3|F) = 1/4
Pr(3|T) = 1/20

I am interested in finding P(F|3).
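A direct application of Bayes' rule settles the question numerically. This sketch computes the posterior (again, skip it if you want to work the exercise yourself first):

```python
# Bayes' rule for problem 20: posterior probability of the four-sider
# given that a 3 was rolled.
prior = {"F": 0.5, "T": 0.5}            # which die was picked
likelihood = {"F": 1 / 4, "T": 1 / 20}  # Pr(roll = 3 | die)

evidence = sum(prior[d] * likelihood[d] for d in prior)  # Pr(roll = 3)
posterior_F = prior["F"] * likelihood["F"] / evidence
print(posterior_F)  # 5/6, about 0.833
```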
(21) There are five urns, and they are numbered 1 to 5. Each urn contains
10 balls. Urn i has i defective balls and 10 − i non-defective balls, for
i = 1, ..., 5. For example, urn 3 has 3 defective balls and 7 non-defective
balls. Consider the following random experiment:
First, an urn is selected at random, and then a ball is selected at random
from the selected urn. (The experimenter does not know which urn is
selected.) We are interested in two questions:
(1) What is the probability that a defective ball is selected?
(2) If we have already selected a ball and noted that it is defective,
what is the probability that it came from urn 5?
(22) Three machines M1 , M2 , and M3 were used for producing a large batch
of similar manufactured items. Suppose that 20 percent of the items
were produced by machine M1 , 30 percent of the items were produced
by machine M2 , and 50 percent by machine M3 . Suppose further that
1 percent of the items produced by machine M1 are defective, that 2
percent produced by machine M2 are defective, and 3 percent produced
by machine M3 are defective. Finally, suppose that one item is selected
at random from the entire batch, and it is found to be defective. Find
the probability that this item was produced by machine M2 .
Discrete Random Variables
(23) Flip a fair coin three times and let X be the number of heads. Obtain
the Probability Mass Function or pmf for this random variable.
(24) Three balls are to be randomly selected without replacement from an urn containing 20 balls numbered 1 through 20. First, obtain the probability distribution of the random variable X, the largest number selected. Next, if we bet that at least one of the drawn balls has a number at least as large as 17, what is the probability that we win the bet?
(25) Suppose that three cards are drawn from an ordinary deck of 52 cards,
one by one, at random and with replacement. Let X be the number of
spades drawn; then X is a random variable. If an outcome of spades is
denoted by s, and other outcomes are represented by t,
(a) Write the sample space in terms of s and t.
(b) Obtain the pmf (or pf ) of X.
(26) If 10% of the balls in a certain box are red, and if 20 balls are selected
from the box at random, with replacement, what is the probability that
more than three red balls will be obtained?
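Because the draws are with replacement, the number of red balls in problem 26 is Binomial(n = 20, p = 0.1). As a sketch, the requested probability is the complement of "at most three red balls":

```python
from math import comb

# Problem 26 as a binomial calculation: with replacement, the number of red
# balls in 20 draws is Binomial(n=20, p=0.1).
n, p = 20, 0.1
p_at_most_3 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(4))
p_more_than_3 = 1 - p_at_most_3
print(round(p_more_than_3, 4))  # about 0.133
```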
pdf, pmf, and cdf
(27) Consider rolling a fair six-sided die. Note that P(s) = 1/6 for all members s of S = {1, 2, 3, 4, 5, 6}. What is the cdf, F(x) = P(X ≤ x), for this problem? Draw the function.
(28) Let's assume that the cdf of a random variable X is as follows:

F(x) = \begin{cases} 0 & x < 2 \\ (x - 2)^4/16 & 2 ≤ x < 4 \\ 1 & \text{otherwise} \end{cases}

(a) Graph F(x).
(b) Calculate the following probabilities:
P(X ≤ 3), P(X < 3), P(X > 2.5), P(1.5 < X ≤ 3.4).
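Reading the cdf in problem 28 as F(x) = (x − 2)^4/16 on [2, 4) (the exponent 4 is the reading that makes F continuous, since F then reaches 1 at x = 4), part (b) reduces to evaluating F. A sketch:

```python
def F(x):
    # cdf from problem 28: 0 below 2, (x - 2)^4 / 16 on [2, 4), and 1 from 4 on
    if x < 2:
        return 0.0
    if x < 4:
        return (x - 2) ** 4 / 16
    return 1.0

print(F(3))              # P(X <= 3) = 1/16; equals P(X < 3) since F is continuous
print(1 - F(2.5))        # P(X > 2.5)
print(F(3.4) - F(1.5))   # P(1.5 < X <= 3.4)
```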
(29) Suppose that a bus arrives at the station every day between 10:00 A.M. and 10:30 A.M., at random. Let X be the arrival time; find the cdf of X and sketch its graph.
(30) While walking in a certain park, the time X, in minutes, between seeing two people smoking has a probability density function (pdf) of the following form:

f(x) = \begin{cases} \lambda x e^{-x} & x > 0 \\ 0 & \text{otherwise} \end{cases}

(a) Calculate the value of \lambda.
(b) Find the cdf of X.
(c) What is the probability that Sam, who has just seen a person smoking, will see another person smoking in 2 to 5 minutes? In at least 7 minutes?

This is a special case of the gamma distribution. We will revisit the gamma distribution in the near future!
(31) Consider the following joint distribution:

f(x, y) = \begin{cases} cx^2 y & x^2 ≤ y ≤ 1 \\ 0 & \text{otherwise} \end{cases}

(a) Calculate the value of c.
(b) Calculate the cdf of X.
(c) Find P(X ≥ Y).
Mean and Variance
(32) Let X follow a Uniform (a, b).
(a) Calculate the expected value or E(X) of X.
(b) Calculate V ar(X).
(33) The random variable X has the following pdf (exponential):

f(x) = \begin{cases} \lambda e^{-\lambda x} & x > 0 \; (\lambda > 0) \\ 0 & \text{otherwise} \end{cases}

Find the expected value of X.
(34) Let X have the following pmf (Poisson):

f(x) = \begin{cases} \frac{e^{-\lambda} \lambda^x}{x!} & x = 0, 1, 2, ... \\ 0 & \text{otherwise} \end{cases}

(a) Find E(X).
(b)* Find Var(X).
(35) Let X have the following pdf (Cauchy):

f(x) = \frac{c}{1 + x^2}, \quad -\infty < x < \infty

(a) Find c.
(b) Show that E(X) does not exist.
(36) Prove that the following relation holds for any continuous random variable X:

E(X) = \int_0^{\infty} [1 - F(t)] \, dt - \int_0^{\infty} F(-t) \, dt.

Note that a direct consequence of the above relation is the following:

E(X) = \int_0^{\infty} P(X > t) \, dt - \int_0^{\infty} P(X ≤ -t) \, dt.
(37) The time elapsed (in minutes) between the placement of an order of pizza and its delivery is random with density function:

f(x) = \begin{cases} \frac{1}{15} & 25 < x < 40 \\ 0 & \text{otherwise} \end{cases}
(a) Obtain the mean and the standard deviation of X.
(b) Suppose that it takes 12 minutes for the pizza shop to bake pizza.
Find the mean and the standard deviation of the time it takes for
the delivery person to deliver pizza.
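Since the density in problem 37 is Uniform(25, 40), both parts follow from the standard uniform moment formulas. The sketch below evaluates them; part (b) assumes the intended reading that the delivery leg is X minus the constant 12-minute baking time, so the mean shifts while the standard deviation is unchanged:

```python
import math

# Problem 37: X ~ Uniform(25, 40), so the standard uniform formulas apply.
a, b = 25, 40
mean = (a + b) / 2            # 32.5 minutes
sd = (b - a) / math.sqrt(12)  # about 4.33 minutes

# (b) Assuming baking takes a constant 12 minutes, the delivery leg is X - 12:
# subtracting a constant shifts the mean and leaves the standard deviation alone.
delivery_mean = mean - 12     # 20.5 minutes
delivery_sd = sd
print(mean, round(sd, 2), delivery_mean)
```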
(38) Jensen's inequality.

(a) Prove the following inequality: |E(X)| ≤ E(|X|).
(b) Let Φ be a convex real function. Show that Φ[E(X)] ≤ E[Φ(X)].
Functions of Random Variables
(39) Let X be a random variable with the density function:

f(x) = \begin{cases} 2e^{-2x} & x > 0 \\ 0 & \text{otherwise} \end{cases}

(a) Find the pdf of Y = \sqrt{X}.
(b) Find the mean and the standard deviation of X.
(40) Let X follow a Normal distribution with mean 0 and standard deviation 1 (see below). Find the pdf of Y = 2X + 5.

f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \quad -\infty < x < \infty
(41) Let X follow a uniform distribution on the interval [0, 1]. Find the pdf
of Y = ln(1/X).
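Problem 41 is the classic inverse-transform fact that ln(1/X) is exponential with rate 1 when X is Uniform[0, 1]. A Monte Carlo sketch (sample size and seed are arbitrary choices) checks the mean and the cdf at one point against the exponential(1) values:

```python
import math
import random

# Problem 41 as a simulation check: if X ~ Uniform(0, 1), then Y = ln(1/X)
# should be exponential(1), i.e. mean 1 and P(Y <= 1) = 1 - e^{-1} (about 0.632).
rng = random.Random(42)
n = 100_000
ys = [math.log(1 / rng.random()) for _ in range(n)]

mean = sum(ys) / n
p_le_1 = sum(y <= 1 for y in ys) / n
print(mean, p_le_1)  # should be near 1.0 and 0.63
```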
(51) Suppose that X_1, X_2, X_3 are independent random variables having the same probability density function:

f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-x^2}{2}\right),

for -\infty < x < \infty. Let Y_1 = X_1, Y_2 = \frac{X_1 + X_2}{2}, and Y_3 = \frac{X_1 + X_2 + X_3}{3}. Find the joint distribution of Y_1, Y_2, Y_3.
Moment Generating Functions
(42) Prove that if E(|X|^k) < \infty for some positive integer k, then E(|X|^j) < \infty for every positive integer j such that j < k.