Background

Chapters 1 and 2: Background
CSCI B554, Spring 2015
Reminders
• Assignment #1 is released, due January 28
• Readings, with Thought Questions due January 21
  • [Optional refresher] Chapter 1 on probabilities and Chapter 2 on graph basics
  • Chapter 3: Sections 3.1, 3.3
  • Chapter 5: intro to Section 5.1 and Section 5.1.1
  • Chapter 8: Suggest you read the entire chapter. Section 8.2 is a basic refresher on densities for continuous random variables, so is optional.
Background: random variables
• Event space: set of possible values a random variable can take, e.g. {smokes, not smokes} or any real number
• Probabilities are over the set of events
• X is a random variable: essentially a set of possible events that could occur according to some probability mass or density, e.g. p(x) = (1/√(2π)) exp(−x²/2)
• For E = {smokes, not smokes}, P(X = smokes) = 0.1; x = smokes is an instance or sample of that RV
[Figure: plot of the standard normal density p(x) = (1/√(2π)) exp(−x²/2) for x ∈ [−3, 3]]
Introduction to Probability
David Barber
University College London
These slides accompany the book Bayesian Reasoning and Machine Learning. The book and demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml. Feedback and corrections are also available on the site. Feel free to adapt these slides for your own purposes, but please include a link to the above website.
Rules of probability
p(x = x): the probability of variable x being in state x.

p(x = x) = 1 means we are certain x is in state x; p(x = x) = 0 means we are certain x is not in state x. Values between 0 and 1 represent the degree of certainty of state occupancy.

domain
dom(x) denotes the states x can take. For example, dom(c) = {heads, tails}. When summing over a variable, ∑_x f(x), the interpretation is that all states of x are included, i.e. ∑_x f(x) ≡ ∑_{s ∈ dom(x)} f(x = s).

distribution
Given a variable x, its domain dom(x) and a full specification of the probability values for each of the variable states, p(x), we have a distribution for x.

normalisation
The summation of the probability over all the states is 1:
∑_{x ∈ dom(x)} p(x = x) = 1
We will usually more conveniently write ∑_x p(x) = 1.
Operations
AND
Use the shorthand p(x, y) ≡ p(x ∩ y) for p(x and y). Note that p(y, x) = p(x, y).

marginalisation
Given a joint distribution p(x, y), the marginal distribution of x is defined by
p(x) = ∑_y p(x, y)
More generally,
p(x₁, ..., x_{i−1}, x_{i+1}, ..., xₙ) = ∑_{xᵢ} p(x₁, ..., xₙ)
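Marginalisation is easy to check numerically: it is just summing out an axis of the joint table. A minimal numpy sketch, using a made-up 2 × 3 joint table (the numbers are illustrative, not from the slides):

```python
import numpy as np

# A made-up joint distribution p(x, y) over 2 states of x and 3 states of y.
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])
assert np.isclose(p_xy.sum(), 1.0)  # normalisation of the joint

# Marginalising out y: p(x) = sum_y p(x, y)
p_x = p_xy.sum(axis=1)
# Marginalising out x: p(y) = sum_x p(x, y)
p_y = p_xy.sum(axis=0)
```

Each marginal is itself a normalised distribution, since the joint sums to 1.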
Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or more shortly, the probability of x given y) is defined as
p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y)    (Bayes' rule)

Throwing darts
p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19

Interpretation
p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.
Probability tables
The a priori probability that a randomly selected Great British person would live in England, Scotland or Wales is 0.88, 0.08 and 0.04 respectively.
We can write this as a vector (or probability table):
(p(Cnt = E), p(Cnt = S), p(Cnt = W))ᵀ = (0.88, 0.08, 0.04)ᵀ
whose component values sum to 1.
The ordering of the components in this vector is arbitrary, as long as it is consistently applied.
Probability tables
We assume that only three Mother Tongue languages exist: English (Eng), Scottish (Scot) and Welsh (Wel), with conditional probabilities given the country of residence, England (E), Scotland (S) and Wales (W). Using the state ordering
MT = [Eng, Scot, Wel];    Cnt = [E, S, W]
we write a (fictitious) conditional probability table, with rows indexed by MT and columns by Cnt:

p(MT|Cnt) = [0.95  0.7  0.6]
            [0.04  0.3  0.0]
            [0.01  0.0  0.4]
Probability tables
The distribution p(Cnt, MT) = p(MT|Cnt)p(Cnt) can be written as a 3 × 3 matrix with (say) rows indexed by Mother Tongue and columns indexed by country:

[0.95 × 0.88   0.7 × 0.08   0.6 × 0.04]   [0.836    0.056   0.024]
[0.04 × 0.88   0.3 × 0.08   0.0 × 0.04] = [0.0352   0.024   0    ]
[0.01 × 0.88   0.0 × 0.08   0.4 × 0.04]   [0.0088   0       0.016]

By summing each column, we have the marginal
p(Cnt) = (0.88, 0.08, 0.04)ᵀ
Summing each row gives the marginal
p(MT) = (0.916, 0.0592, 0.0248)ᵀ
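These table manipulations can be verified in a few lines of numpy. A sketch, using the p(Cnt) and p(MT|Cnt) tables from the slides (the array layout, rows = MT and columns = Cnt, matches the joint matrix shown):

```python
import numpy as np

p_cnt = np.array([0.88, 0.08, 0.04])            # p(Cnt) over [E, S, W]
p_mt_given_cnt = np.array([[0.95, 0.7, 0.6],    # rows: MT = [Eng, Scot, Wel]
                           [0.04, 0.3, 0.0],    # cols: Cnt = [E, S, W]
                           [0.01, 0.0, 0.4]])

# Joint p(MT, Cnt) = p(MT|Cnt) p(Cnt): scale each column by p(Cnt)
p_joint = p_mt_given_cnt * p_cnt

p_cnt_marginal = p_joint.sum(axis=0)  # summing each column recovers p(Cnt)
p_mt_marginal = p_joint.sum(axis=1)   # summing each row gives p(MT)
```

The broadcasting `p_mt_given_cnt * p_cnt` multiplies column j by p_cnt[j], which is exactly the element-wise product shown in the matrix above.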
Probability tables
Large numbers of variables
For joint distributions over a larger number of variables, xᵢ, i = 1, ..., D, with each variable xᵢ taking Kᵢ states, the table describing the joint distribution is an array with ∏_{i=1}^{D} Kᵢ entries.
Explicitly storing tables therefore requires space exponential in the number of variables, which rapidly becomes impractical for a large number of variables.

Indexing
A probability distribution assigns a value to each of the joint states of the variables. For this reason, p(T, J, R, S) is considered equivalent to p(J, S, R, T) (or any such reordering of the variables), since in each case the joint setting of the variables is simply a different index to the same probability.
One should be careful not to confuse the use of this indexing-type notation with functions f(x, y), which are in general dependent on the variable order.
Inspector Clouseau
Inspector Clouseau arrives at the scene of a crime. The Butler (B) and Maid (M) are his main suspects. The inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These probabilities are independent in the sense that p(B, M) = p(B)p(M). (It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:

dom(B) = dom(M) = {murderer, not murderer}
dom(K) = {knife used, knife not used}
p(B = murderer) = 0.6,    p(M = murderer) = 0.2
p(knife used|B = not murderer, M = not murderer) = 0.3
p(knife used|B = not murderer, M = murderer) = 0.2
p(knife used|B = murderer, M = not murderer) = 0.6
p(knife used|B = murderer, M = murderer) = 0.1

The victim lies dead in the room and the inspector quickly finds the murder weapon, a Knife (K). What is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.)
Inspector Clouseau
Using b for the two states of B and m for the two states of M,

p(B|K) = ∑_m p(B, m|K) = ∑_m p(B, m, K)/p(K) = p(B) ∑_m p(K|B, m)p(m) / [∑_b p(b) ∑_m p(K|b, m)p(m)]

Plugging in the values we have

p(B = murderer|knife used)
  = (6/10)(2/10 × 1/10 + 8/10 × 6/10) / [(6/10)(2/10 × 1/10 + 8/10 × 6/10) + (4/10)(2/10 × 2/10 + 8/10 × 3/10)]
  = 300/412 ≈ 0.73

Hence knowing that the knife was the murder weapon strengthens our belief that the butler did it.

Exercise: compute the probability that the Butler and not the Maid is the murderer.
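The posterior above can be checked with a few lines of Python. A sketch (the dictionary encoding is mine; the probabilities are from the slide):

```python
# Priors: True means "is the murderer"
p_b = {True: 0.6, False: 0.4}   # Butler
p_m = {True: 0.2, False: 0.8}   # Maid

# p(knife used | B = b, M = m)
p_k = {(False, False): 0.3, (False, True): 0.2,
       (True, False): 0.6, (True, True): 0.1}

# p(K) = sum_b p(b) sum_m p(K|b, m) p(m)
p_knife = sum(p_b[b] * p_k[(b, m)] * p_m[m]
              for b in (True, False) for m in (True, False))

# p(B = murderer | K) = p(B) sum_m p(K|B, m) p(m) / p(K)
numer = p_b[True] * sum(p_k[(True, m)] * p_m[m] for m in (True, False))
p_butler_given_knife = numer / p_knife
```

This reproduces both the 300/412 ≈ 0.73 posterior and the prior probability of the knife being used that the next slide discusses.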
Inspector Clouseau
The role of p(knife used) in the Inspector Clouseau example can cause some confusion. In the above,
p(knife used) = ∑_b p(b) ∑_m p(knife used|b, m)p(m)
is computed to be 0.412. But surely p(knife used) = 1, since this is given in the question!
Note that the quantity p(knife used) relates to the prior probability the model assigns to the knife being used (in the absence of any other information). If we know that the knife is used, then the posterior is
p(knife used|knife used) = p(knife used, knife used)/p(knife used) = p(knife used)/p(knife used) = 1
which, naturally, must be the case.
Independence
Variables x and y are independent if knowing one event gives no extra information about the other event. Mathematically, this is expressed by
p(x, y) = p(x)p(y)
Independence of x and y is equivalent to
p(x|y) = p(x) ⇔ p(y|x) = p(y)
If p(x|y) = p(x) for all states of x and y, then the variables x and y are said to be independent. We then write x ⊥⊥ y.

interpretation
Note that x ⊥⊥ y doesn't mean that, given y, we have no information about x. It means the only information we have about x is contained in p(x).

factorisation
If
p(x, y) = k f(x) g(y)
for some constant k, and positive functions f(·) and g(·), then x and y are independent.
Conditional Independence
X ⊥⊥ Y | Z
denotes that the two sets of variables X and Y are independent of each other given the state of the set of variables Z. This means that
p(X, Y|Z) = p(X|Z)p(Y|Z) and p(X|Y, Z) = p(X|Z)
for all states of X, Y, Z. In case the conditioning set is empty we may also write X ⊥⊥ Y for X ⊥⊥ Y | ∅, in which case X is (unconditionally) independent of Y.

Conditional independence does not imply marginal independence
p(x, y) = ∑_z p(x|z)p(y|z)p(z)    (using cond. indep.)
which in general is not equal to
[∑_z p(x|z)p(z)] [∑_z p(y|z)p(z)] = p(x)p(y)

Conditional dependence
If X and Y are not conditionally independent, they are conditionally dependent. This is written
X ⊤⊤ Y | Z
Similarly X ⊤⊤ Y | ∅ can be written as X ⊤⊤ Y.
Conditional Independence example
Based on a survey of households in which the husband and wife each own a car, it is found that:
wife's car type ⊥⊥ husband's car type | family income
There are 4 car types, the first two being 'cheap' and the last two being 'expensive'. Using w for the wife's car type and h for the husband's:

p(w|inc = low) = (0.7, 0.3, 0, 0)ᵀ,    p(w|inc = high) = (0.2, 0.1, 0.4, 0.3)ᵀ
p(h|inc = low) = (0.2, 0.8, 0, 0)ᵀ,    p(h|inc = high) = (0, 0, 0.3, 0.7)ᵀ
p(inc = low) = 0.9
Conditional Independence example
Then the marginal distribution p(w, h) is
p(w, h) = ∑_inc p(w|inc)p(h|inc)p(inc)
giving

p(w, h) = [0.126  0.504  0.006  0.014]
          [0.054  0.216  0.003  0.007]
          [0      0      0.012  0.028]
          [0      0      0.009  0.021]

From this we can find the marginals and calculate

p(w)p(h) = [0.117   0.468   0.0195  0.0455]
           [0.0504  0.2016  0.0084  0.0196]
           [0.0072  0.0288  0.0012  0.0028]
           [0.0054  0.0216  0.0009  0.0021]

This shows that whilst w ⊥⊥ h | inc, it is not true that w ⊥⊥ h. For example, even if we don't know the family income, if we know that the husband has a cheap car then his wife must also have a cheap car: these variables are therefore dependent.
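A numpy sketch reproducing these tables and checking both claims: the joint factorises given income, yet the marginal joint is not the product of its marginals.

```python
import numpy as np

p_w = {'low':  np.array([0.7, 0.3, 0.0, 0.0]),
       'high': np.array([0.2, 0.1, 0.4, 0.3])}
p_h = {'low':  np.array([0.2, 0.8, 0.0, 0.0]),
       'high': np.array([0.0, 0.0, 0.3, 0.7])}
p_inc = {'low': 0.9, 'high': 0.1}

# p(w, h) = sum_inc p(w|inc) p(h|inc) p(inc): conditional independence given income
p_wh = sum(p_inc[i] * np.outer(p_w[i], p_h[i]) for i in ('low', 'high'))

# Marginals and their product
p_w_marg = p_wh.sum(axis=1)
p_h_marg = p_wh.sum(axis=0)
product = np.outer(p_w_marg, p_h_marg)

# w and h are NOT marginally independent: the joint differs from the product
marginally_independent = np.allclose(p_wh, product)
```

The entry p_wh[0, 0] = 0.9 × 0.7 × 0.2 + 0.1 × 0.2 × 0 = 0.126 matches the table above, while the product of marginals gives 0.117 in that cell.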
Scientific Inference
Much of science deals with problems of the form: tell me something about the variable θ given that I have observed data D and have some knowledge of the underlying data generating mechanism. Our interest is then the quantity
p(θ|D) = p(D|θ)p(θ)/p(D) = p(D|θ)p(θ) / ∫_θ p(D|θ)p(θ) dθ
This shows how, from a forward or generative model p(D|θ) of the dataset, coupled with a prior belief p(θ) about which variable values are appropriate, we can infer the posterior distribution p(θ|D) of the variable in light of the observed data.

Generative models in science
This use of a generative model sits well with physical models of the world, which typically postulate how to generate observed phenomena, assuming we know the model. For example, one might postulate how to generate a time-series of displacements for a swinging pendulum but with unknown mass, length and damping constant. Using this generative model, and given only the displacements, we could infer the unknown physical properties of the pendulum, such as its mass, length and friction damping constant.
Prior, Likelihood and Posterior
For data D and variable θ, Bayes' rule tells us how to update our prior beliefs about the variable θ in light of the data to a posterior belief:
p(θ|D) = p(D|θ) p(θ) / p(D)
where p(θ|D) is the posterior, p(D|θ) the likelihood, p(θ) the prior, and p(D) the evidence.
The evidence is also called the marginal likelihood.
The term likelihood is used for the probability that a model generates observed data.
Prior, Likelihood and Posterior
More fully, if we condition on the model M, we have
p(θ|D, M) = p(D|θ, M)p(θ|M) / p(D|M)
where we see the role of the likelihood p(D|θ, M) and marginal likelihood p(D|M).
The marginal likelihood is also called the model likelihood.

The MAP assignment
The Most probable A Posteriori (MAP) setting is that which maximises the posterior,
θ* = argmax_θ p(θ|D, M) = argmax_θ p(θ, D|M)

The Max Likelihood assignment
When p(θ|M) = const.,
θ* = argmax_θ p(θ, D|M) = argmax_θ p(D|θ, M)
Video Break
https://www.youtube.com/watch?v=lLULRlmXkKo
https://www.youtube.com/watch?v=CkTftoNFeGY
Time 2:50 to 4:10
Vector algebra
Vectors
Let x denote the n-dimensional column vector with components
x = (x₁, x₂, ..., xₙ)ᵀ
A vector can be considered an n × 1 matrix.

Addition
x + y = (x₁ + y₁, x₂ + y₂, ..., xₙ + yₙ)ᵀ
Scalar product
w · x = ∑_{i=1}^{n} wᵢxᵢ = wᵀx
The length of a vector is denoted |x|; the squared length is given by
|x|² = xᵀx = x² = x₁² + x₂² + ··· + xₙ²
A unit vector x has xᵀx = 1. The scalar product has a natural geometric interpretation as:
w · x = |w| |x| cos(θ)
where θ is the angle between the two vectors. Thus if the lengths of two vectors are fixed, their inner product is largest when θ = 0, whereupon one vector is a constant multiple of the other. If the scalar product xᵀy = 0, then x and y are orthogonal.
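A quick numeric illustration of the geometric interpretation (numpy sketch; the vectors are arbitrary examples, not from the slides):

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])
x = np.array([2.0, 0.0, 0.0])

dot = w @ x                                            # scalar product w . x
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
# |w| |x| cos(theta) recovers the scalar product
recovered = np.linalg.norm(w) * np.linalg.norm(x) * cos_theta

# Orthogonality: x^T y = 0
y = np.array([0.0, 1.0, -1.0])
orthogonal = np.isclose(x @ y, 0.0)
```

Here |w| = 3 and |x| = 2, so cos θ = 2/(3 × 2) = 1/3.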
Linear dependence
A set of vectors x₁, ..., xₙ is linearly dependent if there exists a vector xⱼ that can be expressed as a linear combination of the other vectors. If the only solution to
∑_{i=1}^{n} αᵢxᵢ = 0
is αᵢ = 0 for all i = 1, ..., n, the vectors x₁, ..., xₙ are linearly independent.
Matrices
An m × n matrix A is a collection of scalar values arranged in a rectangle of m rows and n columns:

[a₁₁  a₁₂  ···  a₁ₙ]
[a₂₁  a₂₂  ···  a₂ₙ]
[ ⋮    ⋮    ⋱    ⋮ ]
[aₘ₁  aₘ₂  ···  aₘₙ]

The i, j element of matrix A can be written Aᵢⱼ or, more conventionally, aᵢⱼ. Where more clarity is required, one may write [A]ᵢⱼ (for example [A⁻¹]ᵢⱼ).

Matrix addition
For two matrices A and B of the same size,
[A + B]ᵢⱼ = [A]ᵢⱼ + [B]ᵢⱼ
Solving Linear Systems
Problem
Given a square N × N matrix A and vector b, find the vector x that satisfies
Ax = b
Solution
Algebraically, we have the inverse:
x = A⁻¹b
In practice, we solve for x numerically using Gaussian Elimination, which is more stable numerically.
Complexity
Solving a linear system is O(N³), which can be very expensive for large N. Approximate methods include conjugate gradient and related approaches.
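In numpy this advice corresponds to preferring `np.linalg.solve` (an LU-factorisation solver, i.e. Gaussian elimination with pivoting) over explicitly forming the inverse. A sketch with a random well-conditioned system:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
# Adding N*I keeps the random matrix well-conditioned (and invertible)
A = rng.standard_normal((N, N)) + N * np.eye(N)
b = rng.standard_normal(N)

x = np.linalg.solve(A, b)      # preferred: solves Ax = b without forming A^-1
x_inv = np.linalg.inv(A) @ b   # algebraically equivalent, less stable/efficient

residual = np.linalg.norm(A @ x - b)
```

Both give the same answer here, but `solve` avoids the extra cost and round-off of computing the full inverse.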
Single-variate calculus
For a function f defined on a scalar x, the derivative is
df/dx (x) = lim_{h→0} [f(x + h) − f(x)] / h
At any point x, df/dx (x) gives the slope of the tangent to the function at f(x).
[GIF from Wikipedia: Tangent]
Multivariate Calculus
Partial derivative
Consider a function of n variables, f(x₁, x₂, ..., xₙ) or f(x). The partial derivative of f w.r.t. xᵢ is defined as the following limit (when it exists):
∂f/∂xᵢ = lim_{h→0} [f(x₁, ..., x_{i−1}, xᵢ + h, x_{i+1}, ..., xₙ) − f(x)] / h

Gradient vector
For function f the gradient is denoted ∇f or g:
∇f(x) ≡ g(x) ≡ (∂f/∂x₁, ..., ∂f/∂xₙ)ᵀ
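The limit definition suggests a practical sanity check: approximate each partial derivative by a finite difference with small h and compare against the analytic gradient. A sketch for the (arbitrary) function f(x) = x₁² + 3x₁x₂:

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    # Analytic gradient: (2x1 + 3x2, 3x1)
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numeric_grad(func, x, h=1e-6):
    # Finite-difference approximation of the limit definition, one axis at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(x + e) - func(x)) / h
    return g

x0 = np.array([1.0, 2.0])
```

At x0 = (1, 2) the analytic gradient is (8, 3), and the finite-difference estimate agrees to within O(h).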
Interpreting the gradient vector
Consider a function f(x) that depends on a vector x. We are interested in how the function changes when the vector x changes by a small amount: x → x + δ, where δ is a vector whose length is very small:
f(x + δ) = f(x) + ∑ᵢ δᵢ ∂f/∂xᵢ + O(|δ|²)
We can interpret the summation above as the scalar product between the vector ∇f, with components [∇f]ᵢ = ∂f/∂xᵢ, and δ:
f(x + δ) = f(x) + (∇f)ᵀδ + O(|δ|²)
Interpreting the Gradient
[Figure: Interpreting the gradient. The ellipses are contours of constant function value, f = const, in the (x₁, x₂) plane. At any point x, the gradient vector ∇f(x) points along the direction of maximal increase of the function.]
The gradient points along the direction in which the function increases most rapidly. Why? Consider a direction p̂ (a unit length vector). Then a displacement of δ units along this direction changes the function value to
f(x + δp̂) ≈ f(x) + δ ∇f(x) · p̂
The direction p̂ for which the function has the largest change is that which maximises the overlap
∇f(x) · p̂ = |∇f(x)| |p̂| cos θ = |∇f(x)| cos θ
where θ is the angle between p̂ and ∇f(x). The overlap is maximised when θ = 0, giving p̂ = ∇f(x)/|∇f(x)|. Hence, the direction along which the function changes most rapidly is along ∇f(x).
Higher derivatives
The 'second derivative' of an n-variable function is defined by
∂/∂xᵢ (∂f/∂xⱼ),    i = 1, ..., n;  j = 1, ..., n
which is usually written
∂²f/∂xᵢ∂xⱼ for i ≠ j,    and    ∂²f/∂xᵢ² for i = j
If the partial derivatives ∂f/∂xᵢ, ∂f/∂xⱼ and ∂²f/∂xᵢ∂xⱼ are continuous, then ∂²f/∂xⱼ∂xᵢ exists and
∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ
This is also denoted by ∇∇f. These n² second partial derivatives are represented by a square, symmetric matrix called the Hessian matrix of f(x):

H_f(x) = [∂²f/∂x₁²     ···  ∂²f/∂x₁∂xₙ]
         [    ⋮         ⋱       ⋮     ]
         [∂²f/∂x₁∂xₙ  ···  ∂²f/∂xₙ²   ]
Chain rule
Let each xⱼ be parameterized by u₁, ..., uₘ, i.e. xⱼ = xⱼ(u₁, ..., uₘ). Then
∂f/∂u_α = ∑_{j=1}^{n} (∂f/∂xⱼ)(∂xⱼ/∂u_α)
or in vector notation
∂/∂u_α f(x(u)) = ∇f(x(u))ᵀ ∂x(u)/∂u_α
Resources for more advanced information on matrices and densities
• Matrix cookbook in resources section in OnCourse: https://resources.oncourse.iu.edu/access/content/group/SP15-BL-CSCI-B554-32906/matrixcookbook.pdf
• Chapter 8 in the textbook gives a good overview of common distributions
• Rule of thumb for matrix multiplication: the inner dimensions must match. AB is (m × n) times (n × k), and so is an m × k matrix.
Revisiting supervised learning and regression
• Given data x = age, want distribution over y = weight
• Typical strategy is to parametrize the distribution as N(µ = xw, σ² = 1):
  p_w(y|x) = (1/√(2π)) exp(−(y − xw)²/2)
• Why? Believe y is a function of x, i.e., f(x) = y
• So, y = xw + ε, where the uncertainty in y is captured by the random variable ε ∼ N(0, 1)
• For each pair of instances (x, y) of random variables X, Y, we can predict y with f(x), but there is uncertainty in the estimate due to noise, measurement errors, lack of reliability in the data, etc.
Single-variate linear regression

argmax_w p_w(y|x) = argmax_w log p_w(y|x)    ▷ log is monotonic
argmax_w log p_w(y|x) = argmin_w −log p_w(y|x)    ▷ minimizing the negative of a function is equal to maximizing the function

Samples (x₁, y₁), ..., (x_N, y_N):

p_w(y₁, ..., y_N|x₁, ..., x_N) = ∏_{i=1}^{N} p(yᵢ|x₁, ..., x_N) = ∏_{i=1}^{N} p(yᵢ|xᵢ)

−log p_w(y₁, ..., y_N|x₁, ..., x_N) = −∑_{i=1}^{N} [log (1/√(2π)) + log exp(−(yᵢ − xᵢw)²/2)]

min_w −log p_w(y₁, ..., y_N|x₁, ..., x_N) = min_w ∑_{i=1}^{N} ½ (yᵢ − xᵢw)²

Setting the derivative d/dw to zero:
∑_{i=1}^{N} (yᵢ − xᵢw)xᵢ = ∑_{i=1}^{N} yᵢxᵢ − ∑_{i=1}^{N} xᵢ²w = 0  ⟹  w = ∑_{i=1}^{N} yᵢxᵢ / ∑_{i=1}^{N} xᵢ²
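The closed-form solution w = ∑ᵢ yᵢxᵢ / ∑ᵢ xᵢ² is a one-liner to verify on synthetic data (numpy sketch; the true slope 2.5 and the data ranges are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)           # e.g. ages
true_w = 2.5
y = true_w * x + rng.standard_normal(100)  # y = xw + eps, eps ~ N(0, 1)

# Maximum-likelihood / least-squares solution: w = sum(y_i x_i) / sum(x_i^2)
w = np.sum(y * x) / np.sum(x * x)
```

With 100 samples the estimate lands close to the true slope; the residual noise only perturbs it slightly.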
Multi-variate linear regression
Samples (x₁, y₁), ..., (x_N, y_N):

−log p_w(yᵢ|xᵢ) = −log (1/√(2π)) − log exp(−(yᵢ − xᵢᵀw)²/2)

min_w −log p_w(y₁, ..., y_N|x₁, ..., x_N) = min_w ∑_{i=1}^{N} ½ (yᵢ − xᵢᵀw)²

Setting the derivative ∂/∂w to zero:
∑_{i=1}^{N} (yᵢ − xᵢᵀw)xᵢ = ∑_{i=1}^{N} yᵢxᵢ − ∑_{i=1}^{N} xᵢxᵢᵀw = 0

⟹ w = (∑_{i=1}^{N} xᵢxᵢᵀ)⁻¹ ∑_{i=1}^{N} yᵢxᵢ

where A = ∑_{i=1}^{N} xᵢxᵢᵀ ∈ ℝ^{d×d}, b = ∑_{i=1}^{N} xᵢyᵢ ∈ ℝ^d, and Aw = b.
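In the multivariate case the same maximum-likelihood solution is found by solving Aw = b rather than forming the inverse (numpy sketch with synthetic data; the true weight vector is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 3
X = rng.standard_normal((N, d))          # rows are samples x_i^T
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.standard_normal(N)  # y_i = x_i^T w + N(0, 1) noise

A = X.T @ X                  # A = sum_i x_i x_i^T, a d x d matrix
b = X.T @ y                  # b = sum_i y_i x_i, a d-vector
w = np.linalg.solve(A, b)    # solve Aw = b (preferred over inv(A) @ b)
```

Note that `X.T @ X` accumulates exactly the sum of outer products ∑ᵢ xᵢxᵢᵀ described on the next slide.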
Outer-product and matrix product

w = (∑_{i=1}^{N} xᵢxᵢᵀ)⁻¹ ∑_{i=1}^{N} yᵢxᵢ

where A = ∑_{i=1}^{N} xᵢxᵢᵀ ∈ ℝ^{d×d}, b = ∑_{i=1}^{N} xᵢyᵢ ∈ ℝ^d, and Aw = b. Each outer product xᵢxᵢᵀ is the d × d matrix

xᵢxᵢᵀ = [x_{i1}²        x_{i1}x_{i2}  ···  x_{i1}x_{id}]
        [x_{i2}x_{i1}   x_{i2}²       ···  x_{i2}x_{id}]
        [    ⋮              ⋮          ⋱        ⋮      ]
        [x_{id}x_{i1}   x_{id}x_{i2}  ···  x_{id}²     ]
If things ever get confusing…
• This course is about obtaining probabilities over random variables
• Either we are trying to do
  • Inference: use some known (conditional) probabilities to infer other probabilities, using a combination of Bayes' rule, marginalization, and setting random variables to instances from the given evidence
  • Learning: learn the parameters θ (or a distribution over θ)
Exercise: computational differences from conditional independence
• Say you have 4 binary random variables, T, J, R and S
• p(T, J, R, S) = p(T|J, R, S) p(J|R, S) p(R|S) p(S)
• How many entries do we need to specify all these conditional probabilities? Hint: p(R|S) requires p(R = 1|S = 0) and p(R = 1|S = 1) to be fully specified.
• What if p(T|J, R, S) = p(T|R, S)? Now how many?
• What if p(J|R, S) = p(J|S)?