1: PROBABILITY REVIEW
Marek Rutkowski
School of Mathematics and Statistics
University of Sydney
Semester 2, 2016
Outline
We will review the following notions:
1. Discrete Random Variables
2. Expectation of a Random Variable
3. Equivalence of Probability Measures
4. Variance of a Random Variable
5. Continuous Random Variables
6. Examples of Probability Distributions
7. Correlation of Random Variables
8. Conditional Distributions and Expectations
PART 1
DISCRETE RANDOM VARIABLES
Sample Space
We collect the possible states of the world and denote the set by Ω.
The states are called samples or elementary events.
The sample space Ω is either countable or uncountable.
A toss of a coin: Ω = {Head, Tail} = {H, T}.
The number of successes in a sequence of n identical and independent
trials: Ω = {0, 1, . . . , n}.
The moment of occurrence of the first success in an infinite sequence
of identical and independent trials: Ω = {1, 2, . . . }.
The lifetime of a light bulb: Ω = {x ∈ R | x ≥ 0}.
The choice of a sample space is arbitrary and thus any set can be
taken as a sample space. However, practical considerations justify the
choice of the most convenient sample space for the problem at hand.
Discrete (finite or infinite, but countable) sample spaces are easier to
handle than general sample spaces.
Discrete Random Variables
Examples of random variables:
Prices of listed stocks.
Fluctuations of currency exchange rates.
Payoffs corresponding to portfolios of traded assets.
Definition (Discrete Random Variable)
A real-valued map X : Ω → R on a discrete sample space Ω = (ω_k)_{k∈I}, where the set I is countable, is called a discrete random variable.
Definition (Probability)
A map P : Ω → [0, 1] is called a probability on a discrete sample space Ω if the following conditions are satisfied:
P.1. P(ω_k) ≥ 0 for all k ∈ I,
P.2. $\sum_{k \in I} P(\omega_k) = 1$.
Probability Measure
Let F = 2^Ω stand for the class of all subsets of the sample space Ω.
The set 2^Ω is called the power set of Ω.
Note that the empty set ∅ also belongs to any power set.
Any set A ∈ F is called an event (or a random event).
Definition (Probability Measure)
A map P : F → [0, 1] is called a probability measure on (Ω, F) if
M.1. for any sequence A_i ∈ F, i = 1, 2, . . . , of events such that A_i ∩ A_j = ∅ for all i ≠ j we have
\[ P\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} P(A_i), \]
M.2. P(Ω) = 1.
For any A ∈ F the equality P(Ω \ A) = 1 − P(A) holds.
Probability Measure on a Discrete Sample Space
Note that a probability P : Ω → [0, 1] on a discrete sample space Ω uniquely specifies the probabilities of all events A_k = {ω_k}.
It is common to write P({ω_k}) = P(ω_k) = p_k.
The following result shows that any probability on a discrete sample
space Ω generates a unique probability measure on (Ω, F).
Theorem
Let P : Ω → [0, 1] be a probability on a discrete sample space Ω. Then the unique probability measure on (Ω, F) generated by P satisfies, for all A ∈ F,
\[ P(A) = \sum_{\omega_k \in A} P(\omega_k). \]
The proof of the theorem is easy since Ω is assumed to be discrete.
Example: Coin Flipping
Example (1.2)
Let X be the number of “heads” appearing when a fair coin is tossed
twice. We choose the sample space Ω to be
Ω = {0, 1, 2}
where a number k ∈ Ω represents the number of times “head” has
occurred.
The probability measure P on Ω is defined as
\[ P(k) = \begin{cases} 0.25, & \text{if } k = 0, 2, \\ 0.5, & \text{otherwise.} \end{cases} \]
We recognise here the binomial distribution with n = 2 and p = 0.5.
Note that a single flip of a coin is a Bernoulli trial.
Example: Coin Flipping
Example (1.2 Continued)
We now suppose that the coin is not a fair one.
Let the probability of “head” be p for some p ≠ 1/2.
Then the probability measure P is given by
\[ P(k) = \begin{cases} q^2, & \text{if } k = 0, \\ 2pq, & \text{if } k = 1, \\ p^2, & \text{if } k = 2, \end{cases} \]
where q := 1 − p is the probability of “tail” appearing.
We deal here with the binomial distribution with n = 2
and 0 < p < 1.
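As a quick illustration (not part of the original slides), the two-toss distribution can be tabulated for any 0 < p < 1; the probabilities q², 2pq, p² are just binomial coefficients times powers of p and q. The value p = 0.3 below is an arbitrary choice.
```python
from math import comb

def two_toss_pmf(p):
    """Binomial(2, p) pmf: probability of k heads in two independent tosses."""
    q = 1 - p
    return {k: comb(2, k) * p**k * q**(2 - k) for k in range(3)}

pmf = two_toss_pmf(0.3)      # an arbitrary biased coin
print(pmf)                   # approximately {0: 0.49, 1: 0.42, 2: 0.09}, i.e. q^2, 2pq, p^2
print(sum(pmf.values()))     # 1.0 (condition P.2)
```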
PART 2
EXPECTATION OF A RANDOM VARIABLE
Expectation of a Random Variable
Definition (Expectation)
Let X be a random variable on a discrete sample space Ω endowed with a probability measure P. The expectation (expected value or mean value) of X is given by
\[ E_P(X) = \mu := \sum_{k \in I} X(\omega_k)\, P(\omega_k) = \sum_{k \in I} x_k p_k. \]
E_P(·) is called the expectation operator under the probability measure P.
Note that the expectation of a random variable can be seen as a weighted average of its values, the weights being their probabilities.
Since the exact future outcome cannot be known in advance, the expectation is a useful tool for making decisions when the probabilities of future outcomes are known.
Expectation Operator
Any random variable defined on a finite set Ω admits a (finite) expectation.
When the set Ω is countable but infinite, we say that X is P-integrable whenever
\[ E_P(|X|) = \sum_{\omega \in \Omega} |X(\omega)|\, P(\omega) < \infty. \]
In that case the expectation E_P(X) is well defined (and finite).
Theorem (1.1)
Let X and Y be two P-integrable random variables and P be a probability
measure on a discrete sample space Ω. Then for all α, β ∈ R
EP (αX + βY ) = α EP (X) + β EP (Y ) .
Hence EP (·) : X → R is a linear operator on the space X of P-integrable
random variables.
Expectation Operator
Proof of Theorem 1.1.
We note that
\[ E_P(|\alpha X + \beta Y|) \le |\alpha|\, E_P(|X|) + |\beta|\, E_P(|Y|) < \infty \]
so that the random variable αX + βY belongs to X. The linearity of expectation can be easily deduced from its definition:
\[
E_P(\alpha X + \beta Y) = \sum_{\omega \in \Omega} \big( \alpha X(\omega) + \beta Y(\omega) \big) P(\omega)
= \sum_{\omega \in \Omega} \alpha X(\omega) P(\omega) + \sum_{\omega \in \Omega} \beta Y(\omega) P(\omega)
= \alpha \sum_{\omega \in \Omega} X(\omega) P(\omega) + \beta \sum_{\omega \in \Omega} Y(\omega) P(\omega)
= \alpha E_P(X) + \beta E_P(Y).
\]
Expectation: Coin Flipping
Example (1.3)
A fair coin is tossed three times. The player receives one dollar each
time “head” appears and loses one dollar when “tail” occurs.
Let the random variable X represent the player’s payoff.
The sample space Ω is defined as Ω = {0, 1, 2, 3} where k ∈ Ω
represents the number of times “head” occurs.
The probability measure is given by
\[ P(k) = \begin{cases} \tfrac{1}{8}, & \text{if } k = 0, 3, \\ \tfrac{3}{8}, & \text{if } k = 1, 2. \end{cases} \]
This is the binomial distribution with n = 3 and p = 1/2.
Expectation: Coin Flipping
Example (1.3 Continued)
The random variable X is defined as
\[ X(k) = \begin{cases} -3, & \text{if } k = 0, \\ -1, & \text{if } k = 1, \\ 1, & \text{if } k = 2, \\ 3, & \text{if } k = 3. \end{cases} \]
Hence the player’s expected payoff equals
\[ E_P(X) = \sum_{k=0}^{3} X(k)\, P(k) = -\tfrac{3}{8} - \tfrac{3}{8} + \tfrac{3}{8} + \tfrac{3}{8} = 0. \]
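The same computation can be done mechanically; the sketch below (not part of the original slides) recomputes the expected payoff of Example 1.3 with exact fractions, using the payoff map X(k) = 2k − 3.
```python
from fractions import Fraction
from math import comb

P = {k: Fraction(comb(3, k), 8) for k in range(4)}   # binomial(3, 1/2) probabilities
X = {k: 2 * k - 3 for k in range(4)}                 # payoffs: -3, -1, 1, 3

expected_payoff = sum(X[k] * P[k] for k in range(4))
print(expected_payoff)                               # 0
```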
Expectation of a Function of a Random Variable
Function of a random variable.
Let X be a random variable and P be a probability measure on a
discrete sample space Ω.
We define Y = f (X) where f : R → R is an arbitrary function.
Then Y is also a random variable on the sample space Ω endowed
with the same probability measure P. Moreover,
\[ E_P(Y) = E_P(f(X)) = \sum_{\omega \in \Omega} f(X(\omega))\, P(\omega). \]
If a random variable X is deterministic then EP (X) = X and
EP (f (X)) = f (X).
PART 3
EQUIVALENCE OF PROBABILITY MEASURES
Equivalence of Probability Measures
Let P and Q be two probability measures on a discrete sample space Ω.
Definition (Equivalence of Probability Measures)
We say that the probability measures P and Q are equivalent and we
denote P ∼ Q whenever for all ω ∈ Ω
P(ω) > 0 ⇔ Q(ω) > 0.
The random variable L : Ω → R_+ given as
\[ L(\omega) = \frac{Q(\omega)}{P(\omega)} \]
is called the Radon-Nikodym density of Q with respect to P. Note that
\[ E_Q(X) = \sum_{\omega \in \Omega} X(\omega)\, Q(\omega) = \sum_{\omega \in \Omega} X(\omega) L(\omega)\, P(\omega) = E_P(LX). \]
Example: Equivalent Probability Measures
Example (Equivalent Probability Measures)
The sample space Ω is defined as Ω = {1, 2, 3, 4}.
Let the two probability measures P and Q be given by
\[ \Big( \tfrac{1}{8}, \tfrac{3}{8}, \tfrac{2}{8}, \tfrac{2}{8} \Big) \quad \text{and} \quad \Big( \tfrac{4}{8}, \tfrac{1}{8}, \tfrac{2}{8}, \tfrac{1}{8} \Big). \]
It is clear that P and Q are equivalent, that is, P ∼ Q.
Moreover, the Radon-Nikodym density L of Q with respect to P can be represented as follows
\[ L = \Big( 4, \tfrac{1}{3}, 1, \tfrac{1}{2} \Big). \]
Check that for any random variable X: EQ (X) = EP (LX) .
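As a sketch of that check (not in the original slides), the snippet below takes the two measures above together with an arbitrarily chosen random variable X and confirms that E_Q(X) = E_P(LX) exactly.
```python
from fractions import Fraction as F

omega = [1, 2, 3, 4]
P = {1: F(1, 8), 2: F(3, 8), 3: F(2, 8), 4: F(2, 8)}
Q = {1: F(4, 8), 2: F(1, 8), 3: F(2, 8), 4: F(1, 8)}
L = {w: Q[w] / P[w] for w in omega}           # Radon-Nikodym density: 4, 1/3, 1, 1/2

X = {1: 10, 2: -3, 3: 7, 4: 2}                # an arbitrary random variable on Omega

E_Q  = sum(X[w] * Q[w] for w in omega)
E_PL = sum(X[w] * L[w] * P[w] for w in omega)
print(E_Q, E_PL, E_Q == E_PL)                 # the two expectations coincide
```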
PART 4
VARIANCE OF A RANDOM VARIABLE
Risky Investments
When deciding whether to invest in a given portfolio, an agent may
be concerned with the “risk” of his investment.
Example (1.4)
Consider an investor who is given an opportunity to choose between the
following two options:
1. The investor either receives or loses 1,000 dollars with equal probabilities. This random payoff is denoted by X1.
2. The investor either receives or loses 1,000,000 dollars with equal probabilities. We denote this random payoff as X2.
Hence in both scenarios the expected value of the payoff equals 0:
\[ E_P(X_1) = E_P(X_2) = 0. \]
The following question arises: which option is preferred?
Variance of a Random Variable
Definition (Variance)
The variance of a random variable X on a discrete sample space Ω is defined as
\[ Var(X) = \sigma^2 := E_P\big\{ (X - \mu)^2 \big\}, \]
where P is a probability measure on Ω.
Variance is a measure of the spread of a random variable about its
mean and also a measure of uncertainty.
In financial applications, it is common to identify variance
of the price of a security with its degree of “risk”.
Note that the variance is non-negative: V ar (X) = σ 2 ≥ 0.
The variance is equal to 0 if and only if X is deterministic.
Variance of a Random Variable
Example (1.4 Continued)
The variance of option 1 equals
\[ Var(X_1) = \frac{(1000 - 0)^2}{2} + \frac{(-1000 - 0)^2}{2} = 10^6. \]
The variance of option 2 equals
\[ Var(X_2) = \frac{(10^6 - 0)^2}{2} + \frac{(-10^6 - 0)^2}{2} = 10^{12}. \]
Therefore, the option represented by X2 is riskier than the option
represented by X1 .
A risk-averse agent would prefer the first option over the second.
A risk-loving agent would prefer the second option over the first.
Variance of a Random Variable
Theorem (1.2)
Let X be a random variable and P be a probability measure on a discrete
sample space Ω. Then the following equality holds
\[ Var(X) = E_P(X^2) - \mu^2. \]
Proof of Theorem 1.2.
\[ Var(X) = E_P\big\{ (X - \mu)^2 \big\} = E_P(X^2 - 2\mu X + \mu^2) = E_P(X^2) - 2\mu\, E_P(X) + \mu^2 = E_P(X^2) - \mu^2, \]
where the third equality uses the linearity of expectation.
Independence of Random Variables
Definition (Independence)
Two discrete random variables X and Y are independent if for all
x, y ∈ R
P (X = x, Y = y) = P (X = x) P (Y = y)
where P (X = x) is the probability of the event {X = x}.
A useful property of independent random variables X and Y is
EP (XY ) = EP (X) EP (Y ) .
Theorem (1.3)
For independent discrete random variables X, Y and arbitrary α, β ∈ R
\[ Var(\alpha X + \beta Y) = \alpha^2\, Var(X) + \beta^2\, Var(Y). \]
Independence of Random Variables
Proof of Theorem 1.3.
Let E_P(X) = µ_X and E_P(Y) = µ_Y. Theorem 1.1 yields E_P(αX + βY) = αµ_X + βµ_Y. Using Theorem 1.2, we obtain
\[
Var(\alpha X + \beta Y) = E_P\big\{ (\alpha X + \beta Y)^2 \big\} - (\alpha \mu_X + \beta \mu_Y)^2
= \alpha^2 E_P(X^2) + 2\alpha\beta\, E_P(XY) + \beta^2 E_P(Y^2) - (\alpha \mu_X + \beta \mu_Y)^2
= \alpha^2 \big( E_P(X^2) - \mu_X^2 \big) + \beta^2 \big( E_P(Y^2) - \mu_Y^2 \big) + 2\alpha\beta \big( E_P(X)\, E_P(Y) - \mu_X \mu_Y \big)
= \alpha^2\, Var(X) + \beta^2\, Var(Y).
\]
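A quick Monte Carlo sketch (not part of the original slides) of Theorem 1.3: for independent draws the sample variance of αX + βY should be close to α² Var(X) + β² Var(Y). The distributions, sample size and coefficients below are arbitrary choices.
```python
import random

random.seed(0)
N = 200_000
alpha, beta = 2.0, -3.0

X = [random.choice([-1, 1]) for _ in range(N)]      # independent of Y
Y = [random.randint(1, 6) for _ in range(N)]        # a fair die
Z = [alpha * x + beta * y for x, y in zip(X, Y)]

def var(sample):
    m = sum(sample) / len(sample)
    return sum((s - m) ** 2 for s in sample) / len(sample)

print(var(Z))                                       # sample variance of alpha*X + beta*Y
print(alpha**2 * var(X) + beta**2 * var(Y))         # close to the value predicted by Theorem 1.3
```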
PART 5
CONTINUOUS RANDOM VARIABLES
Continuous Random Variables
Definition (Continuous Random Variable)
A random variable X on the sample space Ω is said to have a continuous
distribution if there exists a real-valued function f such that
\[ f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1, \]
and for all real numbers a < b
\[ P(a \le X \le b) = \int_{a}^{b} f(x)\, dx. \]
Then f : R → R+ is called the probability density function (pdf) of a
continuous random variable X.
Continuous Random Variables
Assume that X is a continuous random variable.
Note that a probability density function need not satisfy the
constraint f (x) ≤ 1.
A function F (x) is called a cumulative distribution function (cdf)
of a continuous random variable X if for all x ∈ R
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\, dy. \]
The relationship between the pdf f and the cdf F is
\[ F(x) = \int_{-\infty}^{x} f(y)\, dy \quad \Leftrightarrow \quad f(x) = \frac{d}{dx} F(x). \]
Continuous Random Variables
The expectation and variance of a continuous random variable X are
defined as follows:
\[ E_P(X) = \mu := \int_{-\infty}^{\infty} x\, f(x)\, dx, \]
\[ Var(X) = \sigma^2 := \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx, \]
or, equivalently,
\[ Var(X) = \int_{-\infty}^{\infty} x^2 f(x)\, dx - \mu^2 = E_P(X^2) - (E_P(X))^2. \]
The properties of expectations of discrete random variables carry over
to continuous random variables, with probability measures being
replaced by pdfs and sums by integrals.
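As a small numerical sketch (not in the original slides), the two integrals can be evaluated with a generic quadrature routine; this assumes SciPy is available and uses the uniform pdf on (0, 1) purely as a test case.
```python
from scipy.integrate import quad

# pdf of the uniform distribution on (0, 1), used here only as a simple test case
def f(x):
    return 1.0 if 0.0 < x < 1.0 else 0.0

mu, _  = quad(lambda x: x * f(x), 0.0, 1.0)
var, _ = quad(lambda x: (x - mu) ** 2 * f(x), 0.0, 1.0)

print(mu, var)   # approximately 0.5 and 1/12
```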
PART 6
EXAMPLES OF PROBABILITY DISTRIBUTIONS
Discrete Probability Distributions
Example (Binomial Distribution)
Let Ω = {0, 1, 2, . . . , n} be the sample space and let X be
the number of successes in n independent trials where p is
the probability of success in a single Bernoulli trial.
The probability measure P is called the binomial distribution if
\[ P(k) = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, \dots, n, \]
where
\[ \binom{n}{k} = \frac{n!}{k!\,(n - k)!}. \]
Then
\[ E_P(X) = np \quad \text{and} \quad Var(X) = np(1 - p). \]
Discrete Probability Distributions
Example (Poisson Distribution)
Let the sample space be Ω = {0, 1, 2, . . . }.
We take an arbitrary number λ > 0.
The probability measure P is called the Poisson distribution if
\[ P(k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, 2, \dots. \]
Then
\[ E_P(X) = \lambda = Var(X). \]
It is known that the Poisson distribution can be obtained as the limit of the binomial distribution with parameters n and p_n when n tends to infinity and the sequence np_n tends to λ > 0.
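A numerical sketch of this limit (not part of the original slides): with λ = 3 and n = 10,000 the binomial probabilities with p = λ/n are already very close to the Poisson ones.
```python
from math import comb, exp, factorial

lam, n = 3.0, 10_000
p = lam / n                                   # chosen so that n * p = lambda

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

for k in range(6):
    print(k, binom_pmf(k), poisson_pmf(k))    # the two columns nearly agree
```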
Discrete Probability Distributions
Example (Geometric Distribution)
Let Ω = {1, 2, 3, . . . } be the sample space and X be the number of
independent trials to achieve the first success.
Let p stand for the probability of a success in a single trial.
The probability measure P is called the geometric distribution if
\[ P(k) = (1 - p)^{k-1} p \quad \text{for } k = 1, 2, 3, \dots. \]
Then
\[ E_P(X) = \frac{1}{p} \quad \text{and} \quad Var(X) = \frac{1 - p}{p^2}. \]
Continuous Probability Distributions
Example (Uniform Distribution)
We say that X has the uniform distribution on an interval (a, b) if
its pdf equals
\[ f(x) = \begin{cases} \frac{1}{b - a}, & \text{if } x \in (a, b), \\ 0, & \text{otherwise.} \end{cases} \]
It is clear that
\[ \int_{-\infty}^{\infty} f(x)\, dx = \int_{a}^{b} \frac{1}{b - a}\, dx = 1. \]
We have
\[ E_P(X) = \frac{a + b}{2} \quad \text{and} \quad Var(X) = \frac{(b - a)^2}{12}. \]
Continuous Probability Distributions
Example (Exponential Distribution)
We say that X has the exponential distribution on (0, ∞) with the
parameter λ > 0 if its pdf equals
\[ f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x > 0, \\ 0, & \text{otherwise.} \end{cases} \]
It is easy to check that
\[ \int_{-\infty}^{\infty} f(x)\, dx = \int_{0}^{\infty} \lambda e^{-\lambda x}\, dx = 1. \]
We have
\[ E_P(X) = \frac{1}{\lambda} \quad \text{and} \quad Var(X) = \frac{1}{\lambda^2}. \]
Continuous Probability Distributions
Example (Gaussian Distribution)
We say that X has the Gaussian (normal) distribution with mean
µ ∈ R and variance σ 2 > 0 if its pdf equals
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \quad \text{for } x \in R. \]
We write X ∼ N(µ, σ²).
One can show that
\[ \int_{-\infty}^{\infty} f(x)\, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dx = 1. \]
We have
EP (X) = µ and V ar (X) = σ 2 .
Continuous Probability Distributions
Example (Standard Normal Distribution)
If we set µ = 0 and σ² = 1 then we obtain the standard normal distribution N(0, 1) with the following pdf
\[ n(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \quad \text{for } x \in R. \]
The cdf of the probability distribution N(0, 1) equals
\[ N(x) = \int_{-\infty}^{x} n(u)\, du = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du \quad \text{for } x \in R. \]
The values of N (x) can be found in the cumulative standard
normal table (also known as the Z table).
If X ∼ N(µ, σ²) then Z := (X − µ)/σ ∼ N(0, 1).
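As a sketch (not in the original slides), N(x) can be computed from the error function in the Python standard library via N(x) = (1 + erf(x/√2))/2, so no table is strictly needed; the final line uses an arbitrary N(100, 15²) example together with the standardisation above.
```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Cumulative distribution function N(x) of the standard normal N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the standardisation Z = (X - mu) / sigma."""
    return std_normal_cdf((x - mu) / sigma)

print(std_normal_cdf(1.96))       # about 0.975, as in the Z table
print(normal_cdf(120, 100, 15))   # P(X <= 120) for X ~ N(100, 225)
```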
LLN and CLT
Theorem (Law of Large Numbers)
Assume that X1 , X2 , . . . are independent and identically distributed
random variables with mean µ. Then with probability one,
\[ \frac{X_1 + \cdots + X_n}{n} \to \mu \quad \text{as } n \to \infty. \]
Theorem (Central Limit Theorem)
Assume that X1 , X2 , . . . are independent and identically distributed
random variables with mean µ and variance σ 2 > 0. Then for all real x
\[ \lim_{n \to \infty} P\left( \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \le x \right) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du = N(x). \]
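The following simulation sketch (not part of the original slides) illustrates both theorems with uniform (0, 1) summands, whose mean is 1/2 and variance 1/12; the sample sizes are arbitrary.
```python
import random
from math import sqrt

random.seed(1)
mu, sigma = 0.5, sqrt(1.0 / 12.0)      # mean and standard deviation of the uniform (0, 1) law

# Law of Large Numbers: a long-run sample mean is close to mu
print(sum(random.random() for _ in range(100_000)) / 100_000)

# Central Limit Theorem: standardised sums of n uniforms look standard normal
n, trials = 1_000, 5_000
z = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * sqrt(n))
     for _ in range(trials)]
print(sum(1 for v in z if v <= 1.96) / trials)   # close to N(1.96), i.e. about 0.975
```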
PART 7
CORRELATION OF RANDOM VARIABLES
Covariance of Random Variables
In the real world, agents can invest in several securities and typically
would like to benefit from diversification.
The price of one security may affect the prices of other assets. For example, a fall in a stock index may lead to a rise in the price of gold.
To quantify this effect, it is convenient to introduce the notion of
covariance.
Definition (Covariance)
The covariance of two random variables X1 and X2 is defined as
Cov (X1 , X2 ) = σ12 := EP {(X1 − µ1 ) (X2 − µ2 )}
where µi = EP (Xi ) for i = 1, 2.
Correlated and Uncorrelated Random Variables
The covariance of two random variables is a measure of the degree of
variation of one variable with respect to the other.
Unlike the variance of a random variable, covariance of two random
variables may take negative values.
σ_{12} > 0: An increase in one variable tends to coincide with an increase in the other.
σ_{12} < 0: An increase in one variable tends to coincide with a decrease in the other.
σ_{12} = 0: Then the random variables X1 and X2 are said to be uncorrelated. If σ_{12} ≠ 0 then X1 and X2 are correlated.
Definition (Uncorrelated Random Variables)
We say that the random variables X1 and X2 are uncorrelated whenever
Cov (X1 , X2 ) = σ12 = 0.
Properties of the Covariance
Theorem (1.4)
The following properties are valid:
Cov (X, X) = V ar (X) = σ 2 .
Cov (X1 , X2 ) = EP (X1 X2 ) − µ1 µ2 .
EP (X1 X2 ) = EP (X1 ) EP (X2 ) if and only if X1 and X2 are
uncorrelated, that is, Cov (X1 , X2 ) = 0.
Var(a_1 X_1 + a_2 X_2) = a_1² σ_1² + a_2² σ_2² + 2 a_1 a_2 σ_{12}.
V ar (X1 + X2 ) = V ar (X1 ) + V ar (X2 ) if and only if X1 and
X2 are uncorrelated.
If X1 and X2 are independent then they are uncorrelated.
The converse is not true: it may happen that X1 and X2 are
uncorrelated, but they are not independent.
Correlation Coefficient
We can normalise the covariance measure. We obtain in this way the
so-called correlation coefficient.
Note that the correlation coefficient is only defined when σ_1 > 0 and σ_2 > 0, that is, neither of the two random variables is deterministic.
Definition (Correlation Coefficient)
Let X1 and X2 be two random variables with variances σ_1² and σ_2² and covariance σ_{12}. Then the correlation coefficient
ρ = ρ(X1, X2) = corr(X1, X2)
is defined by
\[ \rho := \frac{\sigma_{12}}{\sigma_1 \sigma_2} = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\, Var(X_2)}}. \]
Properties of the Correlation Coefficient
Theorem (1.5)
The correlation coefficient satisfies −1 ≤ ρ ≤ 1. The random variables X1
and X2 are uncorrelated whenever ρ = 0.
Proof.
The result follows from an application of the Cauchy-Schwarz inequality:
\[ \Big( \sum_{k=1}^{n} a_k b_k \Big)^{2} \le \Big( \sum_{k=1}^{n} a_k^2 \Big) \Big( \sum_{k=1}^{n} b_k^2 \Big). \]
ρ = 1: If one variable increases, the other will also increase.
0 < ρ < 1: If one increases, the other will tend to increase.
ρ = 0: The two random variables are uncorrelated.
−1 < ρ < 0: If one increases, the other will tend to decrease.
ρ = −1: If one increases, the other will decrease.
Linear Regression in Basic Statistics
If we observe the time series of prices of two securities the following
question arises: how are they related to each other?
Consider two random variables X and Y on the sample space
Ω = {ω1 , ω2 , . . . , ωn }. Denote X(ωi ) = xi and Y (ωi ) = yi .
A linear fit is a relation of the following form, for α, β ∈ R,
\[ y_i = \alpha + \beta x_i + \varepsilon_i. \]
The number ε_i is called the residual of the ith observation.
In Statistics, the best linear fit to observed data is obtained through the method of least squares: minimise $\sum_{i=1}^{n} \varepsilon_i^2$. The line y = α + βx is called the least-squares linear regression.
We denote by ε the random variable such that ε(ω_i) = ε_i. This means that we set ε := Y − α − βX.
Linear Regression in Probability Theory
In Probability Theory, if α and β are chosen to minimise
\[ E_P\Big\{ \big( Y - \alpha - \beta X \big)^2 \Big\} = E_P(\varepsilon^2) = \sum_{i=1}^{n} \varepsilon^2(\omega_i)\, P(\omega_i) \]
then the random variable ℓ_{Y|X} := α + βX is called the linear regression of Y on X.
The next result is valid for both discrete and continuous random
variables.
Theorem (1.6)
The minimizers α and β are
\[ \alpha = \mu_Y - \beta \mu_X \quad \text{and} \quad \beta = \frac{\sigma_{XY}}{\sigma_X^2}. \]
Hence the linear regression of Y on X is given by
\[ \ell_{Y|X} = \frac{\sigma_{XY}}{\sigma_X^2} (X - \mu_X) + \mu_Y. \]
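A quick numerical sketch (not part of the original slides): for simulated data, the formulas of Theorem 1.6 should agree with a generic least-squares fit. This assumes NumPy is available; the simulated relation and seed are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=1_000)     # a noisy linear relation

# Theorem 1.6: beta = sigma_XY / sigma_X^2,  alpha = mu_Y - beta * mu_X
beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()

# compare with a least-squares fit of y = alpha + beta * x
beta_ls, alpha_ls = np.polyfit(x, y, 1)
print(alpha, beta)
print(alpha_ls, beta_ls)                                   # essentially the same numbers
```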
Covariance Matrix
Recall that the covariance of random variables Xi and Xj equals
Cov (Xi , Xj ) = σij := EP {(Xi − µi ) (Xj − µj )}
where µi = EP (Xi ) and µj = EP (Xj ).
Definition (Covariance Matrix)
We consider random variables Xi for i = 1, 2, . . . , n with variances
σ_i² = σ_{ii} and covariances σ_{ij}. The matrix S given by
\[ S := \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix} \]
is called the covariance matrix of (X1 , . . . , Xn ).
Correlation Matrix
Definition (Correlation Matrix)
We consider random variables Xi for i = 1, 2, . . . , n with variances
σ_i² = σ_{ii} and covariances σ_{ij}. The matrix P given by
\[ P := \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{pmatrix} \]
is called the correlation matrix of (X1 , . . . , Xn ).
S and P are symmetric and positive-definite matrices.
S = DPD where D² is a diagonal matrix whose main diagonal coincides with the main diagonal of S, that is, D = diag(σ_1, . . . , σ_n).
If the X_i are independent (or merely uncorrelated) random variables then S is a diagonal matrix and P = I is the identity matrix.
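A small sketch (not in the original slides) of the identity S = DPD on simulated data, assuming NumPy is available; the sample covariance and correlation matrices play the roles of S and P.
```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(3, 10_000))       # three random variables, one row each
data[2] += 0.8 * data[0]                  # introduce some correlation

S = np.cov(data, bias=True)               # covariance matrix
P = np.corrcoef(data)                     # correlation matrix
D = np.diag(np.sqrt(np.diag(S)))          # D^2 carries the variances on its diagonal

print(np.allclose(S, D @ P @ D))          # True: S = D P D
```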
Linear Combinations of Random Variables
Theorem (1.7)
Assume that the random variables Y1 and Y2 are some linear combinations of the Xi, that is, there exist vectors a_j ∈ R^n for j = 1, 2 such that
\[ Y_j = \sum_{i=1}^{n} a_{ji} X_i = \begin{pmatrix} a_{j1} \\ a_{j2} \\ \vdots \\ a_{jn} \end{pmatrix}^{\!T} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} = \langle a_j, X \rangle \]
where ⟨·, ·⟩ denotes the inner product in R^n. Then
\[ Cov(Y_1, Y_2) = a_1^T S\, a_2 = a_1^T D P D\, a_2. \]
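A sketch of Theorem 1.7 on simulated data (not part of the original slides), again assuming NumPy; both sides of the identity are computed from the same sample covariance matrix, so they agree up to rounding.
```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(3, 50_000))               # components X_1, X_2, X_3 (one per row)
X[1] += 0.5 * X[0]                             # make them correlated
S = np.cov(X, bias=True)                       # covariance matrix of (X_1, X_2, X_3)

a1 = np.array([1.0, -2.0, 0.5])                # arbitrary coefficient vectors
a2 = np.array([0.0, 1.0, 3.0])
Y1, Y2 = a1 @ X, a2 @ X                        # the linear combinations of Theorem 1.7

print(np.cov(Y1, Y2, bias=True)[0, 1])         # sample Cov(Y1, Y2)
print(a1 @ S @ a2)                             # a1^T S a2 -- the same number
```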
PART 8
CONDITIONAL DISTRIBUTIONS AND EXPECTATIONS
Conditional Distributions and Expectations
Definition (Conditional Probability)
For two random variables X1 and X2 and arbitrary sets A1 and A2 such that P(X2 ∈ A2) ≠ 0, we define the conditional probability
\[ P(X_1 \in A_1 \,|\, X_2 \in A_2) := \frac{P(X_1 \in A_1,\, X_2 \in A_2)}{P(X_2 \in A_2)} \]
and, for any set A with P(X2 ∈ A) ≠ 0, the conditional expectation
\[ E_P(X_1 \,|\, X_2 \in A) := \frac{E_P\big( X_1 \mathbf{1}_{\{X_2 \in A\}} \big)}{P(X_2 \in A)} \]
where 1_{X2 ∈ A} : Ω → {0, 1} is the indicator function of {X2 ∈ A}, that is,
\[ \mathbf{1}_{\{X_2 \in A\}}(\omega) = \begin{cases} 1, & \text{if } X_2(\omega) \in A, \\ 0, & \text{otherwise.} \end{cases} \]
Discrete Case
Assume that X and Y are discrete random variables with
\[ p_i = P(X = x_i) > 0 \ \text{ for } i = 1, 2, \dots \quad \text{and} \quad \sum_{i=1}^{\infty} p_i = 1, \]
\[ \hat{p}_j = P(Y = y_j) > 0 \ \text{ for } j = 1, 2, \dots \quad \text{and} \quad \sum_{j=1}^{\infty} \hat{p}_j = 1. \]
Definition (Conditional Distribution and Expectation)
Then the conditional distribution equals
\[ p_{X|Y}(x_i \,|\, y_j) = P(X = x_i \,|\, Y = y_j) := \frac{P(X = x_i,\, Y = y_j)}{P(Y = y_j)} = \frac{p_{i,j}}{\hat{p}_j} \]
and the conditional expectation E_P(X | Y) is given by
\[ E_P(X \,|\, Y = y_j) := \sum_{i=1}^{\infty} x_i\, P(X = x_i \,|\, Y = y_j) = \sum_{i=1}^{\infty} x_i\, \frac{p_{i,j}}{\hat{p}_j}. \]
Discrete Case
It is easy to check that
\[ p_i = P(X = x_i) = \sum_{j=1}^{\infty} P(X = x_i \,|\, Y = y_j)\, P(Y = y_j) = \sum_{j=1}^{\infty} p_{X|Y}(x_i \,|\, y_j)\, \hat{p}_j. \]
The expected value E_P(X) satisfies
\[ E_P(X) = \sum_{j=1}^{\infty} E_P(X \,|\, Y = y_j)\, P(Y = y_j). \]
Definition (Conditional cdf)
The conditional cdf F_{X|Y}(· | y_j) of X given Y is defined for all y_j such that P(Y = y_j) > 0 by
\[ F_{X|Y}(x \,|\, y_j) := P(X \le x \,|\, Y = y_j) = \sum_{x_i \le x} p_{X|Y}(x_i \,|\, y_j). \]
Hence EP (X | Y = yj ) is the mean of the conditional distribution.
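The sketch below (not part of the original slides) checks the two identities above on a small, arbitrarily chosen joint pmf p_{i,j}, using exact fractions.
```python
from fractions import Fraction as F

# an arbitrary joint pmf p_{i,j} = P(X = x_i, Y = y_j)
p = {(1, 1): F(1, 8), (1, 2): F(1, 4),
     (2, 1): F(1, 4), (2, 2): F(3, 8)}
x = {1: -1, 2: 5}                                              # the values x_i of X

p_hat = {j: sum(p[i, j] for i in (1, 2)) for j in (1, 2)}      # marginal pmf of Y
cond_exp = {j: sum(x[i] * p[i, j] / p_hat[j] for i in (1, 2))  # E_P(X | Y = y_j)
            for j in (1, 2)}

E_direct = sum(x[i] * p[i, j] for (i, j) in p)                 # E_P(X) computed directly
E_tower  = sum(cond_exp[j] * p_hat[j] for j in (1, 2))         # conditioning on Y first
print(E_direct, E_tower)                                       # identical values
```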
Continuous Case
Assume that the continuous random variables X and Y have the joint
pdf fX,Y (x, y).
Definition (Conditional pdf and cdf)
The conditional pdf of Y given X is defined for all x such that
fX (x) > 0 and equals
\[ f_{Y|X}(y \,|\, x) := \frac{f_{X,Y}(x, y)}{f_X(x)} \quad \text{for } y \in R. \]
The conditional cdf of Y given X equals
\[ F_{Y|X}(y \,|\, x) := P(Y \le y \,|\, X = x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, u)}{f_X(x)}\, du. \]
Continuous Case
Definition (Conditional Expectation)
The conditional expectation of Y given X is defined for all x such that
fX (x) > 0 by
\[ E_P(Y \,|\, X = x) := \int_{-\infty}^{\infty} y\, dF_{Y|X}(y \,|\, x) = \int_{-\infty}^{\infty} y\, \frac{f_{X,Y}(x, y)}{f_X(x)}\, dy. \]
An important property of conditional expectation is that
\[ E_P(Y) = E_P\big( E_P(Y \,|\, X) \big) = \int_{-\infty}^{\infty} E_P(Y \,|\, X = x)\, f_X(x)\, dx. \]
Hence the expected value EP (Y ) can be determined by first
conditioning on X (in order to compute EP (Y |X)) and then
integrating with respect to the pdf of X.
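As a closing sketch (not in the original slides), the tower property can be checked numerically for the joint pdf f_{X,Y}(x, y) = x + y on the unit square, a standard textbook example; this assumes SciPy is available.
```python
from scipy.integrate import quad, dblquad

# joint pdf f_{X,Y}(x, y) = x + y on the unit square
def f_xy(x, y):
    return x + y if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

def f_x(x):                                   # marginal pdf of X
    return quad(lambda y: f_xy(x, y), 0.0, 1.0)[0]

def cond_exp(x):                              # E_P(Y | X = x)
    return quad(lambda y: y * f_xy(x, y), 0.0, 1.0)[0] / f_x(x)

direct  = dblquad(lambda y, x: y * f_xy(x, y), 0.0, 1.0, 0.0, 1.0)[0]   # E_P(Y) directly
towered = quad(lambda x: cond_exp(x) * f_x(x), 0.0, 1.0)[0]             # conditioning on X first

print(direct, towered)                        # both approximately 7/12
```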