Multivariate probability distributions
and linear regression
UNIVERSITY OF HELSINKI
Dept. of Computer Science
Patrik Hoyer
Contents:
• Random variable, probability distribution
• Joint distribution
• Marginal distribution
• Conditional distribution
• Independence, conditional independence
• Generating data
• Expectation, variance, covariance, correlation
• Multivariate Gaussian distribution
• Multivariate linear regression
• Estimating a distribution from sample data
• Random variable:
  - sample space (set of possible elementary outcomes)
  - probability distribution over the sample space

• Examples:
  - The throw of a die:

      x      1     2     3     4     5     6
      P(x)   1/6   1/6   1/6   1/6   1/6   1/6

  - The sum of two dice:

      x      2     3     4     5     6     7     8     9     10    11    12
      P(x)   1/36  1/18  1/12  1/9   5/36  1/6   5/36  1/9   1/12  1/18  1/36

  - Two separate dice (red, blue):

      x      (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)  (2,1)  (2,2)  (2,3)  ...  (6,6)
      P(x)   1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   ...  1/36
• Discrete variables:
  - Finite number of states (e.g. the dice examples)
  - Infinite number of states (e.g. how many heads before the first tail in a sequence of coin tosses?)

• Continuous variables:
  - Each particular state has probability zero, so we need the concept of a probability density:

      P(X ≤ x) = ∫_{−∞}^{x} p(t) dt

  - (e.g. how long until the next bus arrives? what will the price of oil be a year from now?)
    (A small numerical sketch of this density/CDF relationship follows below.)
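As a small illustration of the density/CDF relationship above, here is a minimal Python sketch (my own, not from the slides): it approximates P(X ≤ x) by numerically integrating an assumed exponential waiting-time density p(t) = λ e^(−λt).

```python
import numpy as np

# Assumed example density: exponential waiting time with rate lam (illustration only).
lam = 0.5                          # e.g. buses arrive on average every 1/lam = 2 time units
p = lambda t: lam * np.exp(-lam * t)

def cdf(x, n=100_000):
    """Approximate P(X <= x) as the integral of p(t) from 0 to x (p is zero for t < 0)."""
    t = np.linspace(0.0, x, n)
    dt = t[1] - t[0]
    return float(np.sum(p(t)) * dt)   # simple Riemann sum

print(cdf(2.0))                   # numerical approximation
print(1 - np.exp(-lam * 2.0))     # closed form 1 - exp(-lam*x), for comparison
```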
• A probability distribution satisfies:

  1. Probabilities are non-negative:

       P(X = x) = P_X(x) = P(x) ≥ 0

  2. Probabilities sum (or integrate) to one:

       Σ_x P(x) = 1        (discrete)
       ∫ p(x) dx = 1       (continuous)

  [Note that in the discrete case this means that there exists no value of x such that P(x) > 1. However, this does not in general hold for a continuous density p(x)!]
• The joint distribution of two random variables:

  - Let X and Y be random variables. Their joint distribution is

      P(x, y) = P(X = x and Y = y)

  - Example: Two coin tosses, X denotes the first throw, Y the second (note: independence!)

      P(x, y):           Y
                      H      T
        X     H     0.25   0.25
              T     0.25   0.25

  - Example: X: Rain today?  Y: Rain tomorrow?

      P(x, y):           Y
                      Y      N
        X     Y     0.5    0.2
              N     0.1    0.2
• Marginal distribution:

  - 'Interested in or observing only one of the two variables.'

  - The marginal distribution is obtained by summing (or integrating) over the other variable:

      P(x) = Σ_y P(x, y)           (discrete)
      p(x) = ∫ p(x, y) dy          (continuous)

  - Example (continued): What is the probability of rain tomorrow? That is, what is P(y)?

                         Y
                      Y      N
        X     Y     0.5    0.2
              N     0.1    0.2
      P(y):         0.6    0.4

    In the same fashion, we can calculate that the chance of rain today is 0.7.
• Conditional distribution:

  - 'If we observe X = x, how does that affect our belief about the value of Y?'

  - Obtained by selecting the appropriate row/column of the joint distribution and renormalizing it to sum to one:

      P(y | X = x) = P(y|x) = P(x, y) / P(x)
      p(y|x) = p(x, y) / p(x)

  - Example (continued): What is the probability of rain tomorrow, given that it does not rain today? That is, what is P(y | X = 'no rain')?

                         Y
                      Y      N
        X     Y     0.5    0.2
              N     0.1    0.2

      P(y | X = 'no rain') = (0.1, 0.2) / (0.1 + 0.2) ≈ (0.33, 0.67) over Y = {Y, N}

    (A small code sketch of these marginal and conditional computations follows below.)
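A minimal Python sketch (my own illustration, using the rain table above) showing how the marginal P(y) and the conditional P(y | X = 'no rain') are obtained from the joint distribution:

```python
import numpy as np

# Joint distribution P(x, y) from the rain example:
# rows = X (rain today: yes, no), columns = Y (rain tomorrow: yes, no).
P_xy = np.array([[0.5, 0.2],
                 [0.1, 0.2]])

# Marginals: sum out the other variable.
P_x = P_xy.sum(axis=1)   # P(x): rain today      -> [0.7, 0.3]
P_y = P_xy.sum(axis=0)   # P(y): rain tomorrow   -> [0.6, 0.4]

# Conditional P(y | X = 'no rain'): take the X = 'no' row and renormalize it.
P_y_given_no_rain = P_xy[1] / P_xy[1].sum()      # -> [0.333..., 0.666...]

print(P_x, P_y, P_y_given_no_rain)
```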
• Chain rule:

    P(x, y) = P(x)P(y|x) = P(y)P(x|y)
    p(x, y) = p(x)p(y|x) = p(y)p(x|y)

• So the joint distribution can be specified directly, or via a marginal and a conditional distribution (one can even choose 'which way around' to specify it).
• Independence:

  Two random variables are independent if and only if knowing the value of one does not change our belief about the other:

    ∀x: P(y|x) = P(y)   ⇔   ∀y: P(x|y) = P(x)

  This is equivalent to being able to write the joint distribution as the product of the marginals:

    P(x, y) = P(x)P(y)

  We write this as X ⊥⊥ Y, or, if we want to explicitly specify the distribution, (X ⊥⊥ Y)_P.

• Example: Two coin tosses... (a small numerical check follows below)
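As a quick check of the factorization criterion (a sketch of mine, not part of the slides), the two-coin joint table factorizes into its marginals, while the rain table does not:

```python
import numpy as np

def is_independent(P_xy, tol=1e-12):
    """True if P(x, y) = P(x) P(y) for all entries (up to numerical tolerance)."""
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    return np.allclose(P_xy, P_x * P_y, atol=tol)

coins = np.array([[0.25, 0.25], [0.25, 0.25]])
rain  = np.array([[0.5, 0.2], [0.1, 0.2]])
print(is_independent(coins))   # True
print(is_independent(rain))    # False
```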
• Three or more variables:

  - joint distribution: P(v, w, x, y, z, ...)  ('multidimensional array/function')

  - marginal distributions, e.g.:

      P(x) = Σ_{v,w,y,z,...} P(v, w, x, y, z, ...)
      P(x, y) = Σ_{v,w,z,...} P(v, w, x, y, z, ...)

  - conditional distributions, e.g.:

      P(x | v, w, y, z, ...) = P(v, w, x, y, z, ...) / P(v, w, y, z, ...)
      P(x, y | v, w, z, ...) = P(v, w, x, y, z, ...) / P(v, w, z, ...)
      P(v, w, y, z, ... | x) = P(v, w, x, y, z, ...) / P(x)

      P(x | y) = Σ_{v,w,z,...} P(v, w, x, z, ... | y)      ← marginal and conditional
  - Chain rule:

      P(v, w, x, y, z, ...) = P(v) P(w|v) P(x|v, w) P(y|v, w, x) P(z|v, w, x, y) P(...|v, w, x, y, z)

  - Complete independence between all variables if and only if:

      P(v, w, x, y, z, ...) = P(v) P(w) P(x) P(y) P(z) P(...)

  - Conditional independence (e.g.: if we know the value of z then x does not give any additional information about y):

      P(x, y | z) = P(x|z) P(y|z)

    This is also written X ⊥⊥ Y | Z, or, explicitly noting the distribution, (X ⊥⊥ Y | Z)_P.
  - In general we can say that marginal distributions are conditional on not knowing the value of other variables:

      P(x) = P(x | ∅)

    and (marginal) independence is independence conditional on not observing other variables:

      P(x, y | ∅) = P(x | ∅) P(y | ∅)

  - Example of conditional independence: drownings and ice-cream sales. These are mutually dependent (both happen during warm weather) but are, at least approximately, conditionally independent given the weather.
Example: conditional dependence. Two coin tosses and a bell that rings whenever they get the same result. The coins are marginally independent but conditionally dependent given the bell!

    X: First coin toss
    Y: Second coin toss
    Z: Bell

    P(x, y) =                 Y
                           H      T
        X       H        0.25   0.25        (independent)
                T        0.25   0.25

    P(x, y | Z = 'bell rang') =
                              Y
                           H      T
        X       H        0.5    0            (dependent!)
                T        0      0.5
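The bell example can be checked numerically; the sketch below (my own, not from the slides) enumerates the joint distribution P(x, y, z) and compares P(x, y) with P(x, y | Z = 'bell rang'):

```python
from itertools import product

# Joint distribution of (X, Y, Z): two fair coins and a bell that
# rings (z = 1) exactly when the two coins agree.
joint = {}
for x, y in product(['H', 'T'], repeat=2):
    z = 1 if x == y else 0
    joint[(x, y, z)] = 0.25                 # each coin pair has probability 1/4

# Marginal P(x, y): sum over z.
P_xy = {(x, y): sum(p for (a, b, z), p in joint.items() if (a, b) == (x, y))
        for x, y in product(['H', 'T'], repeat=2)}
print(P_xy)                                 # all 0.25: X and Y are (marginally) independent

# Conditional P(x, y | Z = 1): keep only z = 1 and renormalize.
pz1 = sum(p for (x, y, z), p in joint.items() if z == 1)
P_xy_given_bell = {(x, y): p / pz1 for (x, y, z), p in joint.items() if z == 1}
print(P_xy_given_bell)                      # {(H,H): 0.5, (T,T): 0.5}: now dependent!
```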
• Data generation, sampling:

  - Given some P(x), how can we draw samples (generate data) from that distribution?

  - Answer: Divide the unit interval [0, 1] into parts whose lengths correspond to the probabilities P(x1), P(x2), ..., draw a uniformly distributed number in the interval, and select the state into whose part we fell. For instance, a draw of 0.30245... landing in the part belonging to P(x2) gives X := x2.

    [Figure: the unit interval from 0 to 1 partitioned into segments of lengths P(x1), ..., P(x6); the draw 0.30245... falls in the segment of x2.]
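A minimal sketch of this sampling scheme in Python (my own illustration; the numeric probabilities are made up, since the slide gives none): the cumulative sums partition [0, 1], and a uniform draw selects the corresponding state.

```python
import numpy as np

def sample_discrete(probs, rng):
    """Draw one state index from a discrete distribution by partitioning [0, 1]."""
    u = rng.uniform()                             # uniform draw in [0, 1)
    cumulative = np.cumsum(probs)                 # boundaries of the subintervals
    return int(np.searchsorted(cumulative, u))    # index of the interval u falls into

rng = np.random.default_rng(0)
P = [0.30, 0.15, 0.20, 0.10, 0.15, 0.10]          # some example P(x1), ..., P(x6)
samples = [sample_discrete(P, rng) for _ in range(10_000)]
print(np.bincount(samples) / len(samples))        # empirical frequencies ≈ P
```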
• Given a joint distribution P(x, y, z), how can we draw samples (generate data)?

  - We could list all joint states and then proceed as above, or...

  - Draw the data sequentially from conditional distributions:
    1. First draw x from P(x)
    2. Next draw y from P(y|x)
    3. Finally draw z from P(z|x, y)

    Note: We can freely choose any ordering of the variables!
Example (continued): Two coin tosses and a bell that rings if and only if the two tosses give the same result.

  - can draw all the variables simultaneously by listing all the joint states, calculating their probabilities, placing them on the unit interval, and then drawing the joint state
  - can first independently generate the coin tosses, then assign the bell
  - can first draw one coin toss and the bell, and then assign the second coin toss
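For instance, a sketch (mine, under the assumptions of the example) of the second ordering: first draw the two coins independently, then set the bell deterministically.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_bell_example():
    # 1. Draw x from P(x) and y from P(y): fair, independent coin tosses.
    x = rng.choice(['H', 'T'])
    y = rng.choice(['H', 'T'])
    # 2. Draw z from P(z | x, y): here the bell is deterministic given the coins.
    z = 1 if x == y else 0
    return x, y, z

draws = [sample_bell_example() for _ in range(10_000)]
print(sum(z for _, _, z in draws) / len(draws))   # the bell rings about half the time
```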
• Numerical random variables:

  - Expectation:

      E{X} = Σ_x x P(x)            (discrete)
      E{X} = ∫ x p(x) dx           (continuous)

  - Variance:     Var(X) = σ²_X = σ_XX = E{(X − E{X})²}

  - Covariance:   Cov(X, Y) = σ_XY = E{(X − E{X})(Y − E{Y})}

  - Correlation coefficient:

      ρ_XY = σ_XY / √(σ²_X σ²_Y)
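These quantities can be estimated from sample data; a minimal numpy sketch (my own, with arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(scale=0.5, size=5000)        # y is correlated with x

Ex, Ey = x.mean(), y.mean()                            # expectations E{X}, E{Y}
var_x  = ((x - Ex) ** 2).mean()                        # Var(X) = E{(X - E{X})^2}
cov_xy = ((x - Ex) * (y - Ey)).mean()                  # Cov(X, Y)
rho_xy = cov_xy / np.sqrt(var_x * ((y - Ey) ** 2).mean())   # correlation coefficient

print(var_x, cov_xy, rho_xy)
```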
  - Multivariate numerical random variables (random vectors):

    Expectation:

      E{V} = ( E{V1}, E{V2}, ..., E{VN} )ᵀ

    Covariance matrix ('variance-covariance matrix'):

      C_V = Σ_V = E{(V − E{V})(V − E{V})ᵀ}

          = [ Var(V1)       Cov(V1, V2)  ...  Cov(V1, VN) ]
            [ Cov(V2, V1)   Var(V2)      ...  ...         ]
            [ ...           ...          ...  ...         ]
            [ Cov(VN, V1)   ...          ...  Var(VN)     ]

    i.e. the (i, j) entry is σ_ViVj, with the variances σ_ViVi on the diagonal.
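For a random vector, the mean vector and covariance matrix can be estimated from data directly; a short sketch of mine with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
# 10000 observations of a 3-dimensional random vector V (rows = observations).
V = rng.multivariate_normal(mean=[0.0, 1.0, -1.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.5]],
                            size=10_000)

E_V     = V.mean(axis=0)            # estimate of E{V}
Sigma_V = np.cov(V, rowvar=False)   # estimate of the covariance matrix Sigma_V
print(E_V)
print(Sigma_V)                      # diagonal ≈ variances, off-diagonal ≈ covariances
```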
• Conditional expectation, variance, covariance, correlation:

  - Conditional expectation (note: a function of y!):

      E{X | Y = y} = Σ_x x P(x|y)         (discrete)
      E{X | Y = y} = ∫ x p(x|y) dx        (continuous)

  - Conditional variance (note: a function of y!):

      Var(X | Y = y) = σ²_X|y = σ_XX|y = E{(X − E{X})²}_{P(X | Y = y)}

  - Conditional covariance (note: a function of z!):

      Cov(X, Y | z) = σ_XY|z = E{(X − E{X})(Y − E{Y})}_{P(X, Y | Z = z)}

  - Conditional correlation coefficient (note: a function of z!):

      ρ_XY|z = σ_XY|z / √(σ²_X|z σ²_Y|z)
• Multivariate Gaussian ('normal') density:

  A d-dimensional multivariate Gaussian (normal) density for x is

      p(x) = N(µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

  Its only parameters are the mean vector µ and the covariance matrix Σ, where Σ is a symmetric positive semi-definite matrix.

  It has entropy

      S = (1/2) log₂( (2πe)^d |Σ| ) − const    bits,

  where the (unfortunate) constant is the log of the units in which x is measured over the "natural units".

• Linear functions of a normal vector: no matter how x is distributed,

      E[Ax + y] = A (E[x]) + y
      Covar[Ax + y] = A (Covar[x]) Aᵀ

  In particular, for a normally distributed x, the linear transform Ax + y is again Gaussian, with the mean and covariance given above.

  [Figure: contour plot of a two-dimensional Gaussian density over (x1, x2).]

  (Adapted from Sam Roweis' notes on Gaussian identities, revised July 1999.)
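A minimal sketch (mine) evaluating the density formula above for an assumed µ and Σ:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at the point x."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Made-up 2-dimensional example parameters.
mu    = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```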
• All marginal and conditional distributions of a multivariate Gaussian are also Gaussian, and the conditional (co)variances do not depend on the values of the conditioning variables:

  Let x and y be random vectors whose dimensions are n and m. If they are joined together into one random vector z = (xᵀ, yᵀ)ᵀ, with dimension n + m, then its mean m_z and covariance matrix C_z are

      m_z = [ m_x ]          C_z = [ C_x    C_xy ]
            [ m_y ]                [ C_yx   C_y  ]

  where m_x and m_y are the means of x and y, C_x and C_y are the covariance matrices of x and y respectively, and C_xy contains the cross-covariances.

  If z is multivariate Gaussian then x and y are also Gaussian. Additionally, the conditional distributions p(x|y) and p(y|x) are Gaussian. The latter's mean and covariance matrix are

      m_y|x = m_y + C_yx C_x⁻¹ (x − m_x)
      C_y|x = C_y − C_yx C_x⁻¹ C_xy

  Let v be a Gaussian random vector over three variables (v1, v2, v3)ᵀ whose mean is m_v = E{v} = 0, and covariance matrix ...
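The conditional mean and covariance formulas above translate directly into code; a sketch of mine with an assumed partitioned covariance matrix:

```python
import numpy as np

# Joint Gaussian over z = (x, y): here x is 1-dimensional and y is 2-dimensional
# (made-up numbers for illustration).
m_x, m_y = np.array([0.0]), np.array([1.0, -1.0])
C_x  = np.array([[2.0]])
C_y  = np.array([[1.0, 0.3],
                 [0.3, 1.5]])
C_yx = np.array([[0.8],
                 [0.2]])              # cross-covariances Cov(y_i, x)

def condition_on_x(x_obs):
    """Mean and covariance of p(y | x = x_obs) for a jointly Gaussian (x, y)."""
    m_y_given_x = m_y + C_yx @ np.linalg.solve(C_x, x_obs - m_x)
    C_y_given_x = C_y - C_yx @ np.linalg.solve(C_x, C_yx.T)
    return m_y_given_x, C_y_given_x

mean, cov = condition_on_x(np.array([1.5]))
print(mean)   # shifts with the observed x
print(cov)    # does not depend on the observed value of x
```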
• The conditional variance, conditional covariance, and conditional correlation coefficient, for the Gaussian distribution, are known as the partial variance σ²_X·Z, partial covariance σ_XY·Z, and partial correlation coefficient ρ_XY·Z (respectively).

• These can of course always be computed directly from the covariance matrix (regardless of whether the distribution actually is Gaussian!)...

  ...but they can only be safely interpreted as conditional variance, conditional covariance, and conditional correlation coefficient (respectively) for the Gaussian distribution.
• For the Gaussian:

    zero (partial) covariance  ⇔  zero (conditional) covariance  ⇔  (conditional) independence

  i.e.  (σ_XY·Z = 0)  ⇔  (∀z: σ_XY|z = 0)  ⇔  (X ⊥⊥ Y | Z)

• In general we only have a one-way implication:

    zero (conditional) covariance  ⇐  (conditional) independence

  i.e.  (∀z: σ_XY|z = 0)  ⇐  (X ⊥⊥ Y | Z)

  Note, however, that conditional independence does not imply zero partial covariance in the completely general case!
• Linear regression:

    ŷ = r_yx x + ε_y

  [Figure: scatter plot of y against x with the fitted regression line.]

• Fit a line through the data, explaining how y varies with x.

• Minimize the sum of squared errors between ŷ and y:

    r_yx = σ_XY / σ²_X

• Probabilistic interpretation: ŷ ≈ E{Y | X = x}
  (note that this is true only for roughly linear relationships)
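A minimal sketch (mine, with simulated data) showing that the least-squares slope equals σ_XY / σ²_X:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = 1.5 * x + rng.normal(scale=1.0, size=5000)

# Regression coefficient r_yx = Cov(X, Y) / Var(X).
r_yx = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - r_yx * x.mean()
print(r_yx, intercept)               # slope ≈ 1.5

# Same slope from a generic least-squares fit, for comparison.
slope, c = np.polyfit(x, y, deg=1)
print(slope, c)
```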
• Note the symmetry: we could equally well regress x on y!

    x̂ = r_xy y + ε_x

  [Figure: the same scatter plot with the regression line of x on y.]
• Multivariate linear regression:

    ẑ = a x + b y + ε_z,    where    a = r_zx·y = σ_ZX·Y / σ²_X·Y

  [Figure: z plotted against x and y.]

• Note that the partial regression coefficient r_zx·y is NOT, in general, the same as the coefficient r_zx one gets from regressing z on x while ignoring y.

• Note also that r_zx·y is derived from the partial (co)variances. This holds regardless of the form of the underlying distribution.
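A sketch (my own, with simulated data) showing that the partial regression coefficient r_zx·y from a multiple regression of z on (x, y) differs from the simple coefficient r_zx, and matches σ_ZX·Y / σ²_X·Y computed from the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=20_000)
y = 0.7 * x + rng.normal(scale=1.0, size=20_000)    # x and y are correlated
z = 2.0 * x - 1.0 * y + rng.normal(scale=1.0, size=20_000)

# Simple regression of z on x (ignoring y).
r_zx = np.cov(x, z)[0, 1] / np.var(x, ddof=1)

# Multiple regression of z on x and y: least squares gives the partial coefficients.
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

# Same coefficient from partial (co)variances: r_zx.y = sigma_ZX.Y / sigma^2_X.Y.
S = np.cov(np.vstack([x, y, z]))                    # covariance matrix, order (X, Y, Z)
sxx_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]            # partial variance of X given Y
szx_y = S[2, 0] - S[2, 1] * S[0, 1] / S[1, 1]       # partial covariance of Z, X given Y
print(r_zx)            # not 2.0: absorbs the effect of y
print(a, b)            # ≈ 2.0 and -1.0
print(szx_y / sxx_y)   # ≈ a
```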