Chapter 1 and 2: Background
CSCI B554, Spring 2015

Reminders
• Assignment #1 is released, due January 28
• Readings, with Thought Questions due January 21
  • [Optional refresher] Chapter 1 on probabilities and Chapter 2 on graph basics
  • Chapter 3: Sections 3.1, 3.3
  • Chapter 5: intro to Section 5.1 and Section 5.1.1
  • Chapter 8: suggest you read the entire chapter. Section 8.2 is a basic refresher on densities for continuous random variables, so it is optional.

Background: random variables
• X is a random variable, essentially a set of possible events that could occur according to some probability mass or density
• Event space: the set of possible values a random variable can take, e.g. {smokes, not smokes}, or any real number
• Probabilities are defined over the set of events. For E = {smokes, not smokes}, P(X = smokes) = 0.1; x = smokes is an instance or sample of that random variable
[Figure: the standard normal density p(x) = (1/√(2π)) exp(−x²/2), plotted on [−3, 3]]

Introduction to Probability¹
David Barber, University College London

¹ These slides accompany the book Bayesian Reasoning and Machine Learning. The book and demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml. Feedback and corrections are also available on the site. Feel free to adapt these slides for your own purposes, but please include a link to the above website.

Rules of probability
p(x = x): the probability of variable x being in state x.

    p(x = x) = 1  means we are certain x is in state x
    p(x = x) = 0  means we are certain x is not in state x

Values between 0 and 1 represent the degree of certainty of state occupancy.

domain  dom(x) denotes the states x can take. For example, dom(c) = {heads, tails}. When summing over a variable, Σ_x f(x), the interpretation is that all states of x are included, i.e. Σ_x f(x) ≡ Σ_{s ∈ dom(x)} f(x = s).

distribution  Given a variable x, its domain dom(x), and a full specification of the probability values for each of the variable states, p(x), we have a distribution for x.
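As a concrete sketch of these definitions (the dictionary representation and variable names are illustrative, not from the slides), a distribution over a finite domain is just a normalised table, and the density in the figure is the standard normal:

```python
import math

# A distribution over dom(x) = {"smokes", "not smokes"}: one non-negative
# value per state, summing to 1 (the normalisation condition).
p = {"smokes": 0.1, "not smokes": 0.9}

assert all(0.0 <= v <= 1.0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# The density from the figure: p(x) = exp(-x^2/2) / sqrt(2*pi).
def std_normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(round(std_normal_pdf(0.0), 4))  # peak value: 0.3989
```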
normalisation  The summation of the probability over all the states is 1:

    Σ_{x ∈ dom(x)} p(x = x) = 1

We will usually more conveniently write Σ_x p(x) = 1.

Operations
AND  Use the shorthand p(x, y) ≡ p(x ∩ y) for p(x and y). Note that p(y, x) = p(x, y).

marginalisation  Given a joint distribution p(x, y), the marginal distribution of x is defined by

    p(x) = Σ_y p(x, y)

More generally,

    p(x₁, …, x_{i−1}, x_{i+1}, …, x_n) = Σ_{x_i} p(x₁, …, x_n)

Conditional Probability and Bayes' Rule
The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

    p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y)    (Bayes' rule)

Throwing darts

    p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20)
                                = p(region 5) / p(not region 20)
                                = (1/20) / (19/20) = 1/19

Interpretation  p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. The correct interpretation is 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

Probability tables
The a priori probability that a randomly selected Great British person lives in England, Scotland or Wales is 0.88, 0.08 and 0.04 respectively. We can write this as a vector (or probability table):

    ( p(Cnt = E) )   ( 0.88 )
    ( p(Cnt = S) ) = ( 0.08 )
    ( p(Cnt = W) )   ( 0.04 )

whose component values sum to 1. The ordering of the components in this vector is arbitrary, as long as it is consistently applied.

Probability tables
We assume that only three Mother Tongue languages exist: English (Eng), Scottish (Scot) and Welsh (Wel), with conditional probabilities given the country of residence, England (E), Scotland (S) and Wales (W).
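The darts calculation can be checked directly. This sketch uses exact fractions (the `Fraction`-based representation is just one convenient choice, not something the slides prescribe):

```python
from fractions import Fraction

# Uniform distribution over the 20 dartboard regions.
p = {region: Fraction(1, 20) for region in range(1, 21)}

# Conditioning: p(region 5 | not region 20) = p(region 5) / p(not region 20),
# since region 5 lies entirely inside the event "not region 20".
p_not_20 = sum(prob for region, prob in p.items() if region != 20)
p_5_given_not_20 = p[5] / p_not_20

print(p_5_given_not_20)  # 1/19
```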
Using the state orderings MT = [Eng, Scot, Wel] and Cnt = [E, S, W], we write a (fictitious) conditional probability table

                 ( 0.95  0.7  0.6 )
    p(MT|Cnt) =  ( 0.04  0.3  0.0 )
                 ( 0.01  0.0  0.4 )

Probability tables
The distribution p(Cnt, MT) = p(MT|Cnt)p(Cnt) can be written as a 3 × 3 matrix with (say) rows indexed by Mother Tongue and columns indexed by country:

    ( 0.95 × 0.88  0.7 × 0.08  0.6 × 0.04 )   ( 0.836   0.056  0.024 )
    ( 0.04 × 0.88  0.3 × 0.08  0.0 × 0.04 ) = ( 0.0352  0.024  0     )
    ( 0.01 × 0.88  0.0 × 0.08  0.4 × 0.04 )   ( 0.0088  0      0.016 )

By summing each column, we recover the marginal

    p(Cnt) = (0.88, 0.08, 0.04)ᵀ

Summing each row gives the marginal

    p(MT) = (0.916, 0.0592, 0.0248)ᵀ

Probability tables: large numbers of variables
For joint distributions over a larger number of variables x_i, i = 1, …, D, with each variable x_i taking K_i states, the table describing the joint distribution is an array with Π_{i=1}^D K_i entries. Explicitly storing tables therefore requires space exponential in the number of variables, which rapidly becomes impractical for a large number of variables.

Indexing  A probability distribution assigns a value to each of the joint states of the variables. For this reason, p(T, J, R, S) is considered equivalent to p(J, S, R, T) (or any such reordering of the variables), since in each case the joint setting of the variables is simply a different index to the same probability. One should be careful not to confuse this indexing-type notation with functions f(x, y), which are in general dependent on the variable order.

Inspector Clouseau
Inspector Clouseau arrives at the scene of a crime. The Butler (B) and Maid (M) are his main suspects. The inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These probabilities are independent in the sense that p(B, M) = p(B)p(M).
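The table manipulations above can be reproduced with a few lines of NumPy. The array layout (rows = Mother Tongue, columns = country) follows the tables on these slides; the variable names are illustrative:

```python
import numpy as np

# Conditional table p(MT|Cnt): rows index MT (Eng, Scot, Wel),
# columns index Cnt (E, S, W). Each column sums to 1.
p_mt_given_cnt = np.array([[0.95, 0.7, 0.6],
                           [0.04, 0.3, 0.0],
                           [0.01, 0.0, 0.4]])
p_cnt = np.array([0.88, 0.08, 0.04])

# Joint p(Cnt, MT): scale each column by the corresponding prior p(Cnt).
joint = p_mt_given_cnt * p_cnt  # broadcasting across rows

# Summing columns recovers p(Cnt); summing rows gives p(MT).
print(joint.sum(axis=0))  # [0.88, 0.08, 0.04]
print(joint.sum(axis=1))  # [0.916, 0.0592, 0.0248]
```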
(It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:

    dom(B) = dom(M) = {murderer, not murderer},  dom(K) = {knife used, knife not used}
    p(B = murderer) = 0.6,  p(M = murderer) = 0.2

    p(knife used | B = not murderer, M = not murderer) = 0.3
    p(knife used | B = not murderer, M = murderer)     = 0.2
    p(knife used | B = murderer,     M = not murderer) = 0.6
    p(knife used | B = murderer,     M = murderer)     = 0.1

The victim lies dead in the room and the inspector quickly finds the murder weapon, a Knife (K). What is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.)

Inspector Clouseau
Using b for the two states of B and m for the two states of M,

    p(B|K) = Σ_m p(B, m|K) = Σ_m p(B, m, K)/p(K)
           = p(B) Σ_m p(K|B, m)p(m) / [ Σ_b p(b) Σ_m p(K|b, m)p(m) ]

Plugging in the values, we have

    p(B = murderer | knife used)
      = (6/10)[(2/10)(1/10) + (8/10)(6/10)]
        / { (6/10)[(2/10)(1/10) + (8/10)(6/10)] + (4/10)[(2/10)(2/10) + (8/10)(3/10)] }
      = 300/412 ≈ 0.73

Hence knowing that the knife was the murder weapon strengthens our belief that the butler did it.

Exercise: compute the probability that the Butler and not the Maid is the murderer.

Inspector Clouseau
The role of p(knife used) in the Inspector Clouseau example can cause some confusion. In the above,

    p(knife used) = Σ_b p(b) Σ_m p(knife used|b, m)p(m)

is computed to be 0.412. But surely p(knife used) = 1, since this is given in the question! Note that the quantity p(knife used) relates to the prior probability the model assigns to the knife being used (in the absence of any other information). If we know that the knife is used, then the posterior is

    p(knife used | knife used) = p(knife used, knife used)/p(knife used)
                               = p(knife used)/p(knife used) = 1

which, naturally, must be the case.
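The Clouseau numbers can be checked mechanically. This is a sketch; the dictionary encoding (1 = murderer, 0 = not murderer) is an illustrative choice, not from the slides:

```python
# Priors and the conditional table p(K = knife used | B, M) from the example.
p_b = {1: 0.6, 0: 0.4}
p_m = {1: 0.2, 0: 0.8}
p_k_given_bm = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.1}

# Prior predictive p(K) = sum_b p(b) sum_m p(K|b,m) p(m).
p_k = sum(p_b[b] * p_m[m] * p_k_given_bm[(b, m)]
          for b in (0, 1) for m in (0, 1))

# Posterior p(B = murderer | K) by Bayes' rule, marginalising over M.
numerator = sum(p_b[1] * p_m[m] * p_k_given_bm[(1, m)] for m in (0, 1))
posterior = numerator / p_k

print(round(p_k, 3))       # 0.412
print(round(posterior, 3)) # 0.728, i.e. 300/412
```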
Independence
Variables x and y are independent if knowing one event gives no extra information about the other event. Mathematically, this is expressed by

    p(x, y) = p(x)p(y)

Independence of x and y is equivalent to

    p(x|y) = p(x)  ⇔  p(y|x) = p(y)

If p(x|y) = p(x) for all states of x and y, then the variables x and y are said to be independent. We then write x ⊥⊥ y.

interpretation  Note that x ⊥⊥ y doesn't mean that, given y, we have no information about x. It means that the only information we have about x is contained in p(x).

factorisation  If p(x, y) = k f(x)g(y) for some constant k and positive functions f(·) and g(·), then x and y are independent.

Conditional Independence
X ⊥⊥ Y | Z denotes that the two sets of variables X and Y are independent of each other given the state of the set of variables Z. This means that

    p(X, Y|Z) = p(X|Z)p(Y|Z)  and  p(X|Y, Z) = p(X|Z)

for all states of X, Y, Z. In case the conditioning set is empty we may also write X ⊥⊥ Y for X ⊥⊥ Y | ∅, in which case X is (unconditionally) independent of Y.

Conditional independence does not imply marginal independence:

    p(x, y) = Σ_z p(x|z)p(y|z)p(z)  ≠  [Σ_z p(x|z)p(z)] [Σ_z p(y|z)p(z)] = p(x)p(y)   (in general)

where the left-hand side uses conditional independence given z.

Conditional dependence  If X and Y are not conditionally independent, they are conditionally dependent. This is written X ⊤⊤ Y | Z. Similarly, X ⊤⊤ Y | ∅ can be written as X ⊤⊤ Y.

Conditional Independence example
Based on a survey of households in which the husband and wife each own a car, it is found that:

    wife's car type ⊥⊥ husband's car type | family income

There are 4 car types, the first two being 'cheap' and the last two being 'expensive'.
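The factorisation characterisation suggests a simple numerical test for independence of two discrete variables: a joint table factorises iff it equals the outer product of its marginals. The helper `independent` below is a hypothetical name, not from the slides:

```python
import numpy as np

def independent(joint, tol=1e-12):
    """True iff the joint table equals the outer product of its marginals."""
    px = joint.sum(axis=1)  # marginal of x (rows)
    py = joint.sum(axis=0)  # marginal of y (columns)
    return np.allclose(joint, np.outer(px, py), atol=tol)

# Built as p(x)p(y), so independent by construction.
indep = np.outer([0.3, 0.7], [0.4, 0.6])
# Perfectly correlated: x = y with probability 1, so dependent.
dep = np.array([[0.5, 0.0],
                [0.0, 0.5]])

print(independent(indep), independent(dep))  # True False
```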
Using w for the wife's car type and h for the husband's:

    p(w|inc = low)  = (0.7, 0.3, 0,   0  )ᵀ,   p(w|inc = high) = (0.2, 0.1, 0.4, 0.3)ᵀ
    p(h|inc = low)  = (0.2, 0.8, 0,   0  )ᵀ,   p(h|inc = high) = (0,   0,   0.3, 0.7)ᵀ
    p(inc = low) = 0.9

Conditional Independence example
Then the marginal distribution p(w, h) is

    p(w, h) = Σ_inc p(w|inc)p(h|inc)p(inc)

giving

              ( 0.126  0.504  0.006  0.014 )
    p(w, h) = ( 0.054  0.216  0.003  0.007 )
              ( 0      0      0.012  0.028 )
              ( 0      0      0.009  0.021 )

From this we can find the marginals and calculate

                ( 0.117   0.468   0.0195  0.0455 )
    p(w)p(h) =  ( 0.0504  0.2016  0.0084  0.0196 )
                ( 0.0072  0.0288  0.0012  0.0028 )
                ( 0.0054  0.0216  0.0009  0.0021 )

This shows that whilst w ⊥⊥ h | inc, it is not true that w ⊥⊥ h. For example, even if we don't know the family income, if we know that the husband has a cheap car then his wife must also have a cheap car; these variables are therefore dependent.

Scientific Inference
Much of science deals with problems of the form: tell me something about the variable θ given that I have observed data D and have some knowledge of the underlying data-generating mechanism. Our interest is then the quantity

    p(θ|D) = p(D|θ)p(θ)/p(D) = p(D|θ)p(θ) / ∫_θ p(D|θ)p(θ)

This shows how, from a forward or generative model p(D|θ) of the dataset, coupled with a prior belief p(θ) about which variable values are appropriate, we can infer the posterior distribution p(θ|D) of the variable in light of the observed data.

Generative models in science
This use of a generative model sits well with physical models of the world, which typically postulate how to generate observed phenomena, assuming we know the model. For example, one might postulate how to generate a time-series of displacements for a swinging pendulum but with unknown mass, length and damping constant.
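The car-type tables can be verified numerically. The mixture-of-outer-products line below is a direct transcription of p(w, h) = Σ_inc p(w|inc)p(h|inc)p(inc); the variable names are illustrative:

```python
import numpy as np

p_w = {"low": np.array([0.7, 0.3, 0.0, 0.0]),
       "high": np.array([0.2, 0.1, 0.4, 0.3])}
p_h = {"low": np.array([0.2, 0.8, 0.0, 0.0]),
       "high": np.array([0.0, 0.0, 0.3, 0.7])}
p_inc = {"low": 0.9, "high": 0.1}

# p(w, h) = sum_inc p(w|inc) p(h|inc) p(inc): a mixture of outer products,
# using the conditional independence w ⊥⊥ h | inc.
joint = sum(p_inc[i] * np.outer(p_w[i], p_h[i]) for i in p_inc)

p_w_marg = joint.sum(axis=1)
p_h_marg = joint.sum(axis=0)

# Conditionally independent given income, yet marginally dependent:
print(np.allclose(joint, np.outer(p_w_marg, p_h_marg)))  # False
```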
Using this generative model, and given only the displacements, we could infer the unknown physical properties of the pendulum, such as its mass, length and friction damping constant.

Prior, Likelihood and Posterior
For data D and variable θ, Bayes' rule tells us how to update our prior beliefs about the variable θ, in light of the data, to a posterior belief:

    p(θ|D) = p(D|θ) p(θ) / p(D)
    posterior = likelihood × prior / evidence

The evidence is also called the marginal likelihood. The term likelihood is used for the probability that a model generates observed data.

Prior, Likelihood and Posterior
More fully, if we condition on the model M, we have

    p(θ|D, M) = p(D|θ, M) p(θ|M) / p(D|M)

where we see the role of the likelihood p(D|θ, M) and the marginal likelihood p(D|M). The marginal likelihood is also called the model likelihood.

The MAP assignment  The Most probable A Posteriori (MAP) setting is that which maximises the posterior:

    θ* = argmax_θ p(θ|D, M) = argmax_θ p(θ, D|M)

The Max Likelihood assignment  When p(θ|M) = const.,

    θ* = argmax_θ p(θ, D|M) = argmax_θ p(D|θ, M)

Video Break
https://www.youtube.com/watch?v=lLULRlmXkKo
https://www.youtube.com/watch?v=CkTftoNFeGY (time 2:50 to 4:10)

Vector algebra
Vectors  Let x denote the n-dimensional column vector with components (x₁, x₂, …, x_n)ᵀ. A vector can be considered an n × 1 matrix.

Addition

    x + y = (x₁ + y₁, x₂ + y₂, …, x_n + y_n)ᵀ

Scalar product

    w · x = Σ_{i=1}^n w_i x_i = wᵀx

The length of a vector is denoted |x|; the squared length is given by

    |x|² = xᵀx = x₁² + x₂² + ··· + x_n²

A unit vector x has xᵀx = 1. The scalar product has a natural geometric interpretation:

    w · x = |w| |x| cos(θ)

where θ is the angle between the two vectors.
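A quick check of the geometric interpretation of the scalar product, using two hypothetical vectors (any small helper names here are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# w lies along the first axis; x points at 45 degrees to it.
w, x = [3.0, 0.0], [1.0, 1.0]

# Recover the angle from w . x = |w| |x| cos(theta).
cos_theta = dot(w, x) / (norm(w) * norm(x))
theta = math.acos(cos_theta)

print(round(math.degrees(theta)))  # 45
```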
Thus if the lengths of two vectors are fixed, their inner product is largest when θ = 0, whereupon one vector is a constant multiple of the other. If the scalar product xᵀy = 0, then x and y are orthogonal.

Linear dependence
A set of vectors x₁, …, x_n is linearly dependent if there exists a vector x_j that can be expressed as a linear combination of the other vectors. If the only solution to

    Σ_{i=1}^n α_i x_i = 0

is α_i = 0 for all i = 1, …, n, then the vectors x₁, …, x_n are linearly independent.

Matrices
An m × n matrix A is a collection of scalar values arranged in a rectangle of m rows and n columns:

    ( a_11  a_12  ···  a_1n )
    ( a_21  a_22  ···  a_2n )
    (  ⋮     ⋮    ⋱    ⋮    )
    ( a_m1  a_m2  ···  a_mn )

The i, j element of matrix A can be written A_ij or, more conventionally, a_ij. Where more clarity is required, one may write [A]_ij (for example, [A⁻¹]_ij).

Matrix addition  For two matrices A and B of the same size,

    [A + B]_ij = [A]_ij + [B]_ij

Solving Linear Systems
Problem  Given a square N × N matrix A and a vector b, find the vector x that satisfies

    Ax = b

Solution  Algebraically, we have the inverse: x = A⁻¹b. In practice, we solve for x numerically using Gaussian elimination, which is more stable numerically.

Complexity  Solving a linear system is O(N³), which can be very expensive for large N. Approximate methods include conjugate gradient and related approaches.

Single-variate calculus
For a function f defined on a scalar x, the derivative is

    df/dx (x) = lim_{h→0} [f(x + h) − f(x)] / h

At any point x, df/dx (x) gives the slope of the tangent to the function at f(x).
[Figure: animated tangent line, from Wikipedia: Tangent]

Multivariate Calculus
Partial derivative  Consider a function of n variables, f(x₁, x₂, …, x_n), or f(x). The partial derivative of f with respect to x_i is defined as the following limit (when it exists):

    ∂f/∂x_i = lim_{h→0} [f(x₁, …, x_{i−1}, x_i + h, x_{i+1}, …, x_n) − f(x)] / h

Gradient vector  For a function f, the gradient is denoted ∇f or g:

    ∇f(x) ≡ g(x) ≡ (∂f/∂x₁, …, ∂f/∂x_n)ᵀ

Interpreting the gradient vector
Consider a function f(x) that depends on a vector x. We are interested in how the function changes when x changes by a small amount: x → x + δ, where δ is a vector whose length is very small:

    f(x + δ) = f(x) + Σ_i δ_i ∂f/∂x_i + O(|δ|²)

We can interpret the summation above as the scalar product between δ and the vector ∇f with components [∇f]_i = ∂f/∂x_i:

    f(x + δ) = f(x) + (∇f)ᵀδ + O(|δ|²)

Interpreting the Gradient
[Figure: ellipses are contours of constant function value, f = const. At any point x, the gradient vector ∇f(x) points along the direction of maximal increase of the function.]

The gradient points along the direction in which the function increases most rapidly. Why? Consider a direction p̂ (a unit-length vector). Then a displacement of δ units along this direction changes the function value to

    f(x + δp̂) ≈ f(x) + δ ∇f(x) · p̂

The direction p̂ for which the function has the largest change is that which maximises the overlap

    ∇f(x) · p̂ = |∇f(x)| |p̂| cos θ = |∇f(x)| cos θ

where θ is the angle between p̂ and ∇f(x). The overlap is maximised when θ = 0, giving p̂ = ∇f(x)/|∇f(x)|. Hence, the direction along which the function changes most rapidly is along ∇f(x).

Higher derivatives
The 'second derivative' of an n-variable function is defined by

    ∂/∂x_i (∂f/∂x_j),   i = 1, …, n;  j = 1, …, n

which is usually written

    ∂²f/∂x_i∂x_j  for i ≠ j,   and   ∂²f/∂x_i²  for i = j

If the partial derivatives ∂f/∂x_i, ∂f/∂x_j and ∂²f/∂x_i∂x_j are continuous, then ∂²f/∂x_i∂x_j exists and ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i. This is also denoted ∇∇f. These n² second partial derivatives are represented by a square, symmetric matrix called the Hessian matrix of f(x):

             ( ∂²f/∂x₁²      ···  ∂²f/∂x₁∂x_n )
    H_f(x) = (  ⋮             ⋱    ⋮          )
             ( ∂²f/∂x_n∂x₁   ···  ∂²f/∂x_n²   )

Chain rule
Let each x_j be parameterized by u₁, …, u_m, i.e. x_j = x_j(u₁, …, u_m). Then

    ∂f/∂u_α = Σ_{j=1}^n (∂f/∂x_j)(∂x_j/∂u_α)

or, in vector notation,

    ∂/∂u_α f(x(u)) = ∇f(x(u))ᵀ ∂x(u)/∂u_α

Resources for more advanced information on matrices and densities
• Matrix cookbook in the resources section on OnCourse: https://resources.oncourse.iu.edu/access/content/group/SP15-BL-CSCI-B554-32906/matrixcookbook.pdf
• Chapter 8 in the textbook gives a good overview of common distributions
• Rule of thumb for matrix multiplication: the inner dimensions must match. If A is m × n and B is n × k, then the product AB is defined and is an m × k matrix.

Revisiting supervised learning and regression
• Given data x = age, we want a distribution over y = weight
• Typical strategy is to parametrize the distribution as a Gaussian with unit variance:

    p_w(y|x) = N(µ = xw, σ² = 1) = (1/√(2π)) exp(−(y − xw)²/2)

• Why? We believe y is a function of x, i.e., f(x) = y
• So y = xw + ε, where the uncertainty in y is captured by the random variable ε ~ N(0, 1)
• For each pair of instances (x, y) of random variables X, Y, we can predict y with f(x), but there is uncertainty in the estimate due to noise, measurement errors, lack of reliability in the data, etc.
[Figure: scatter plot of (x, y) pairs with a fitted line]

Single-variate linear regression

    argmax_w p_w(y|x) = argmax_w log p_w(y|x)          (log is monotonic)
    argmax_w log p_w(y|x) = argmin_w −log p_w(y|x)     (minimizing the negative of a function is equal to maximizing the function)

Given samples (x₁, y₁), …, (x_N, y_N),

    p_w(y₁, …, y_N | x₁, …, x_N) = Π_{i=1}^N p(y_i | x₁, …, x_N) = Π_{i=1}^N p(y_i | x_i)

    −log p_w(y₁, …, y_N | x₁, …, x_N) = −Σ_{i=1}^N [ log(1/√(2π)) + log exp(−(y_i − x_i w)²/2) ]

so, dropping the additive constant,

    min_w −log p_w(y₁, …, y_N | x₁, …, x_N) = min_w Σ_{i=1}^N ½(y_i − x_i w)²

Setting the derivative with respect to w to zero,

    Σ_{i=1}^N (y_i − x_i w) x_i = Σ_{i=1}^N y_i x_i − Σ_{i=1}^N x_i² w = 0
        ⟹ w = Σ_{i=1}^N y_i x_i / Σ_{i=1}^N x_i²

Multi-variate linear regression
Given samples (x₁, y₁), …, (x_N, y_N) with x_i ∈ R^d,

    −log p_w(y_i|x_i) = −log(1/√(2π)) − log exp(−(y_i − x_iᵀw)²/2)

    min_w −log p_w(y₁, …, y_N | x₁, …, x_N) = min_w Σ_{i=1}^N ½(y_i − x_iᵀw)²

Setting the gradient with respect to w to zero,

    Σ_{i=1}^N (y_i − x_iᵀw) x_i = Σ_{i=1}^N y_i x_i − Σ_{i=1}^N x_i x_iᵀ w = 0

    ⟹ w = (Σ_{i=1}^N x_i x_iᵀ)⁻¹ Σ_{i=1}^N y_i x_i

where A = Σ_{i=1}^N x_i x_iᵀ ∈ R^{d×d} and b = Σ_{i=1}^N x_i y_i ∈ R^d, i.e. w solves Aw = b.

Outer-product and matrix product

    w = (Σ_{i=1}^N x_i x_iᵀ)⁻¹ (Σ_{i=1}^N y_i x_i)

where A = Σ_i x_i x_iᵀ ∈ R^{d×d}, b = Σ_i x_i y_i ∈ R^d, and Aw = b. Each outer product x_i x_iᵀ is the d × d matrix

               ( x_i1²       x_i1 x_i2  ···  x_i1 x_id )
    x_i x_iᵀ = ( x_i2 x_i1   x_i2²      ···  x_i2 x_id )
               (  ⋮           ⋮          ⋱    ⋮        )
               ( x_id x_i1   x_id x_i2  ···  x_id²     )

If things ever get confusing…
• This course is about obtaining probabilities over random variables
• Either we are trying to do
  • Inference: use some known (conditional) probabilities to infer other probabilities, using a combination of Bayes' rule, marginalization, and setting random variables to instances from the given evidence
  • Learning: learn the parameters θ (or a distribution over θ)

Exercise: computational differences from conditional independence
• Say you have 4 binary random variables: T, J, R and S
• p(T, J, R, S) = p(T | J, R, S) p(J | R, S) p(R | S) p(S)
• How many entries do we need to specify all these conditional probabilities? Hint: p(R|S) requires p(R=1|S=0) and p(R=1|S=1) to be fully specified.
• What if p(T | J, R, S) = p(T | R, S)? Now how many?
• What if p(J | R, S) = p(J | S)?
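The multivariate maximum-likelihood solution Aw = b can be sketched end-to-end on synthetic data. Everything here (names, dimensions, the synthetic-data setup) is an illustrative choice; note that `np.linalg.solve` factorises A via Gaussian elimination rather than forming an explicit inverse, in line with the numerical-stability advice in the Solving Linear Systems slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model: y = x^T w_true + eps, eps ~ N(0, 1).
N, d = 200, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, d))   # row i is x_i^T
y = X @ w_true + rng.normal(size=N)

# A = sum_i x_i x_i^T and b = sum_i y_i x_i, written as matrix products.
A = X.T @ X
b = X.T @ y

# Solve A w = b (Gaussian elimination, not an explicit inverse).
w = np.linalg.solve(A, b)
print(np.round(w, 1))
```

With enough samples the estimate concentrates around w_true; the residual noise keeps it from matching exactly.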