Cost Functions in Machine Learning
Kevin Swingler, 10/01/2017

Motivation
• Given some data that reflects measurements from the environment
• We want to build a model that captures certain statistics of that data
• Something as simple as calculating the mean
• Or as complex as a multi-valued, non-linear regression model

Cost Function
• One class of approach is to define a cost function that compares the output of the model to the observed data
• The task of designing the model then becomes the task of minimising the cost associated with the model
• Why "cost"? Why not "error", for example?
• Cost is more generic, as we shall see …

Simple Example – Calculate the Mean
• Ever wonder how the equation $\bar{x} = \frac{1}{n}\sum_i x_i$ came about?
• First, let's define the mean in terms of a cost function. We want to calculate a mean $\mu$ such that $\sum_i (\mu - x_i)^2$ is minimised
• That is to say, we want to find a value $\mu$ that minimises the squared error between $\mu$ and the data

Find $\mu$
• So we write $\mu = \arg\min_m \sum_i (m - x_i)^2$
• Which means: the value of $\mu$ that minimises the summed squared differences between $\mu$ and each number in our sample

How To Minimise the Error?
• $\sum_i (\mu - x_i)^2$ is a quadratic function in $\mu$
• Its minimum is where the slope is zero

Solve Analytically
• Set the derivative to zero: $\frac{d}{d\mu}\sum_i (\mu - x_i)^2 = 2\sum_i (\mu - x_i) = 0$, which gives $\mu = \frac{1}{n}\sum_i x_i$ – the familiar formula for the mean

What if We Can't Solve Analytically?
[Figure: a cost curve whose minimum is marked "This is the point we need to find".]

Gradient Descent
• We have seen that the gradient of the squared-error cost function is $2\sum_i (\mu - x_i)$
• So we can pick a starting point and follow the gradient down to the bottom
[Figure: steps down the cost curve; the true mean is zero in this example.]

Gradient Descent
A simple version:
1. Pick one data point at a time
2. Move the mean estimate a little way down the error curve
3. Repeat
• Pick the first data point, let's say $x_1 = 5$, so we start off with $\mu = 5$

Gradient Descent
• Then pick the next point, let's say $x = 3$
• For a single point the gradient is $2(\mu - x) = 2(5 - 3) = 4$
• We only want to take small steps, so we use a learning rate $\eta = 0.1$
• The update rule is $\mu \leftarrow \mu - \eta \cdot 2(\mu - x)$

Gradient Descent
• So $\mu = 5 - 0.1 \times 4 = 4.6$

Gradient Descent
• Then, for the next data point, perhaps $\mu = 4.6 - 0.1 \times 6 = 4.0$
• And so on …

Gradient Descent
• And so on, until the estimate hovers around the true mean
• To get really precise once we are close, we might need to take smaller steps – perhaps let $\eta = 0.05$, then $\eta = 0.01$
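The walkthrough above can be written in a few lines of code. Below is a minimal Python sketch of stochastic gradient descent on the squared-error cost for the mean; the sample values, learning rate and number of passes are illustrative assumptions rather than anything specified in these notes.

```python
import random

def sgd_mean(data, learning_rate=0.1, passes=50):
    """Estimate the mean of `data` by stochastic gradient descent on the
    squared-error cost sum_i (mu - x_i)^2."""
    data = list(data)            # work on a copy so the caller's list is untouched
    mu = data[0]                 # start the estimate at the first data point, as in the slides
    for _ in range(passes):      # several passes, using each point more than once
        random.shuffle(data)     # visit the points in a random order each pass
        for x in data:
            grad = 2 * (mu - x)            # gradient of (mu - x)^2 with respect to mu
            mu -= learning_rate * grad     # step a little way down the gradient
    return mu

# Illustrative usage: a small made-up sample whose true mean is 2.0.
sample = [5.0, 3.0, 1.6, -2.0, 0.4, 4.0]
print(sgd_mean(sample, learning_rate=0.1))   # hovers around 2.0
```

With a fixed learning rate the estimate never settles exactly, which is why the notes suggest shrinking $\eta$ (e.g. to 0.05 and then 0.01) once the estimate is close.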
Batch or Stochastic Descent
• In the last example, we updated the estimate once for every data point, one at a time
• This is known as stochastic gradient descent (SGD)
• The process might need to be repeated several times, using each point more than once
• An alternative is a batch approach, where the estimate is updated once per complete pass through the data

Batch Gradient Descent
• Calculate the average cost gradient across the whole data sample
• Make one change to the estimate based on that average cost gradient
• Repeat until some criterion is met
• Batch descent is smoother than SGD
• But it can be slower, and it does not work if the data is streamed one point at a time

Mini-Batches
• A good compromise is to use mini-batches
• Smooths out some of the variation that SGD produces
• Not as inefficient as a full batch update

Stopping Criteria
• Each data point causes only a small move in the estimate, so when do we stop?
• We can choose:
 – A fixed number of iterations
 – A target error
 – A fixed number of iterations during which the average improvement is smaller than a threshold

Pros and Cons
• Gradient descent is useful when local gradients are available but the global minimum cannot be found analytically
• Gradient methods can suffer from a problem known as local minima …

Local Minima
• If we are unlucky and start here …
[Figure: a cost curve with a small dip away from the global minimum; a starting point near the dip.]
• … we will end up with our estimate here, in a local minimum rather than the global one

Some Solutions
• Several re-starts from different initial values
• Momentum, to jump over small dips

Isn't That All a Bit Pointless?
• For calculating a mean, yes it is – there is no need to use gradient descent for that
• But there are other problems where you need to
• We will meet neural networks soon, which make use of gradient descent during learning

Another Cost Function: Likelihood
• What if we want to estimate the parameters of a probability distribution – what is a good cost function?
• The problem is to take a sample $x_1, \ldots, x_n$ and estimate the probability distribution $P(x)$, usually in some parametrised form
• Squared error cannot be used, as we never know the true value of any $P(x_i)$

Calculating Likelihood
• The likelihood associated with a model and a given data set is calculated as the product of the probability estimates made by the model across the examples from the data set:
 $L = \prod_i \hat{P}(x_i)$
• We use $\hat{P}(x_i)$ to mean the estimate made by the model

Log Likelihood
• Probabilities can be small, and multiplying many of them together produces very small numbers
• So the log likelihood is often used instead:
 $\ell = \sum_i \log \hat{P}(x_i)$

Simple Example
• Let's say we toss a coin 100 times and get 75 heads and 25 tails
• We now want to model that coin with a discrete function $\hat{P}(x)$ that takes $x = \text{heads}$ or $x = \text{tails}$ as input and outputs the associated probability (0.75 or 0.25, in this case)
• Again, the example is trivial and we know the answer is $\hat{P}(x = \text{heads}) = 75/100$ and $\hat{P}(x = \text{tails}) = 25/100$ (Bayesians look away now)

Simple Example
• But let's say we don't know that, or need a method that can cope with more complex situations where that shortcut can't be used

Maximise Likelihood
[Figure: likelihood of the observed data as a function of the model's probability of heads.]

Negative Log Likelihood
[Figure: negative log likelihood of the observed data as a function of the model's probability of heads.]

Or Gradient Descent
• Similarly, we could use an iterative approach and find the parameter with the largest likelihood by repeatedly moving the estimate along the likelihood gradient (a short code sketch of this appears at the end of these notes):

 $p$:                  0.5   0.6    0.7     0.72    0.75   0.76
 Likelihood gradient:  100   62.5   23.81   14.88   0      -5.48

Other Optimisations
• There are many other methods for taking a cost function and trying to find its global minimum
• Some follow gradients; others use different algorithms or heuristics
• We will see more of them during the course

Summary
• Many machine learning methods involve optimising some form of cost function
• Sometimes it is possible to optimise the cost analytically – multiple linear regression does so, for example
• Other times you need an iterative approach such as gradient descent – for example, when training a neural network
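To close, here is a companion sketch for the coin example referenced in the "Or Gradient Descent" slide: gradient ascent on the log likelihood $\ell(p) = 75\log p + 25\log(1-p)$. The learning rate, iteration count and clipping of $p$ are assumptions made for this illustration, not part of the original notes.

```python
def fit_coin(heads, tails, learning_rate=0.001, steps=2000):
    """Find p = P(x = heads) by gradient ascent on the log likelihood
    l(p) = heads * log(p) + tails * log(1 - p)."""
    p = 0.5                                    # initial guess
    for _ in range(steps):
        grad = heads / p - tails / (1 - p)     # dl/dp, e.g. 100 at p = 0.5
        p += learning_rate * grad              # move up the likelihood gradient
        p = min(max(p, 1e-6), 1 - 1e-6)        # keep p strictly inside (0, 1)
    return p

# 75 heads and 25 tails, as in the notes; the estimate settles at about 0.75.
print(fit_coin(75, 25))
```

With a learning rate of 0.001, the first step moves $p$ from 0.5 to 0.6 (gradient 100, as in the table above), the next from 0.6 to about 0.66, and the estimate then settles close to 0.75, where the gradient is zero.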