
10/01/2017
Cost Functions in Machine Learning
Kevin Swingler
Motivation
• Given some data that reflects measurements
from the environment
• We want to build a model that reflects certain
statistics about that data
• Something as simple as calculating the mean
• Or as complex as a multi-valued, non-linear
regression model
Cost function
• One class of approach is to define a cost
function that compares the output of the
model to the observed data
• The task of designing the model becomes the
task of minimising the cost associated with the
model
• Why cost? Why not error, for example?
• Cost is more generic, as we shall see …
Simple Example – Calculate the Mean
• Ever wonder how the equation $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ came about?
• First, let's define the mean in terms of a cost function. We want to calculate a mean $\mu$ such that $\sum_i (x_i - \mu)^2$ is minimised
• That is to say, we want to find a value of $\mu$ that minimises the squared error between $\mu$ and the data
Find $\mu$
• So, we write
$$\mu = \operatorname*{arg\,min}_{\mu} \sum_i (x_i - \mu)^2$$
• Which means, the value of $\mu$ that minimises the summed squared differences between $\mu$ and each number in our sample
How To Minimise the Error?
• $\sum_i (x_i - \mu)^2$ is a quadratic function in $\mu$
• Its minimum is where the slope is zero
Solve Analytically
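A sketch of the analytic route, using the standard calculus argument: differentiate the cost with respect to $\mu$ and set the slope to zero:
$$\frac{d}{d\mu}\sum_{i=1}^{n}(x_i-\mu)^2 = -2\sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i = n\mu \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Solving for $\mu$ recovers the familiar formula for the mean.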
What if We Can’t Solve Analytically?
[Figure: a cost curve whose minimum is the point we need to find]
Gradient Descent
We have seen that the gradient of the squared error cost function is
$$\frac{d}{d\mu}\sum_i (x_i - \mu)^2 = 2\sum_i(\mu - x_i)$$
So we can pick a starting point and follow the gradient down to the bottom.
[Figure: the cost curve with a starting point; the true mean is zero in this example]
Gradient Descent
A simple version:
1. Pick one data point at a time
2. Move the mean down the
error curve a little
3. Repeat
Pick the first data point, let's say $x = 5$, so we start off with $\mu = 5$
Gradient Descent
Then pick the next point, let's say $x = 3$
So the step direction (the negative of the gradient) for this single point is
$$2(x - \mu) = 2(3 - 5) = -4$$
Now, we only want to take small steps, so we use a learning rate, $\eta = 0.1$
The update rule is
$$\mu \leftarrow \mu + \eta \cdot 2(x - \mu)$$
Gradient Descent
So
$$\mu = 5 - 0.1 \times 4 = 4.6$$
Gradient Descent
Then perhaps the next point gives
$$\mu = 4.6 - 0.1 \times 6 = 4$$
And so on …
Gradient Descent
And so on …
Gradient Descent
And so on …
Gradient Descent
And so on until it
hovers around the
true mean.
To get really precise when close, we might need to take smaller steps, perhaps letting $\eta = 0.05$, then $\eta = 0.01$
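A minimal sketch of this procedure in Python, using the same update rule; the data values and settings below are illustrative, not from the lecture:

```python
import random

def sgd_mean(data, eta=0.1, epochs=50):
    """Estimate the mean of `data` by stochastic gradient descent
    on the squared-error cost sum((x - mu)**2)."""
    data = list(data)
    mu = data[0]                      # start at the first data point, as in the example
    for _ in range(epochs):
        random.shuffle(data)
        for x in data:
            mu += eta * 2 * (x - mu)  # update rule: mu <- mu + eta * 2(x - mu)
    return mu

sample = [5, 3, 1.6, -2, 0.4, -4, 2, -6]    # illustrative data
print(sgd_mean(sample))                      # hovers near sum(sample) / len(sample)
```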
Batch or Stochastic Descent
• In the last example, we updated the estimate
once for every data point, one at a time
• This is known as stochastic gradient descent
• This process might need to be repeated
several times, using each point more than
once
• An alternative is to use a batch approach,
where the estimate is updated once per
complete pass through the data
Batch Gradient Descent
• Calculate the average cost gradient across the
whole data sample
• Make one change to the estimate based on
that average cost gradient
• Repeat until some criterion is met
• Batch descent is smoother than SGD
• But can be slower and doesn’t work if data is
streamed one point at a time
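A sketch of the batch version for the running mean example, under the same cost and learning-rate conventions as above (values are illustrative):

```python
def batch_mean(data, eta=0.1, n_iters=100):
    """Estimate the mean by full-batch gradient descent:
    one update per complete pass through the data."""
    mu = 0.0
    for _ in range(n_iters):
        # average of the per-point terms 2(x - mu) over the whole sample
        avg_grad = sum(2 * (x - mu) for x in data) / len(data)
        mu += eta * avg_grad
    return mu
```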
Mini Batches
• A good compromise is to use mini-batches
• Smooths out some of the variation that SGD
produces
• Not as inefficient as full batch update
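A mini-batch version of the same loop; the batch size here is an illustrative choice, and setting it equal to the data size recovers the full-batch update:

```python
import random

def minibatch_mean(data, eta=0.1, batch_size=4, epochs=20):
    """Estimate the mean with mini-batch gradient descent:
    one update per small batch of points."""
    data = list(data)
    mu = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            avg_grad = sum(2 * (x - mu) for x in batch) / len(batch)
            mu += eta * avg_grad
    return mu
```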
Stopping Criteria
• Each data point will cause a small move in the
estimate, so when do we stop?
• Can choose:
– Fixed number of iterations
– Target error
– Fixed number of iterations where average
improvement is smaller than a threshold
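A simplified sketch combining two of these criteria for the mean example: a fixed maximum number of iterations, plus a stop when the improvement in cost falls below a threshold (the threshold value is an assumption for illustration):

```python
def descend_until_stable(data, eta=0.1, tol=1e-9, max_iters=10_000):
    """Full-batch descent that stops after a fixed maximum number of
    iterations, or once the improvement in cost drops below a threshold."""
    mu = 0.0
    cost = sum((x - mu) ** 2 for x in data)
    for _ in range(max_iters):
        mu += eta * sum(2 * (x - mu) for x in data) / len(data)
        new_cost = sum((x - mu) ** 2 for x in data)
        if cost - new_cost < tol:    # improvement smaller than the threshold
            break
        cost = new_cost
    return mu
```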
Pros and Cons
• Gradient descent is useful when local
gradients are available, but the global
minimum cannot be found analytically
• It can suffer from a problem known as local minima …
Local Minima
[Figure: cost curve with a local dip; if we are unlucky and start here …]
Local Minima
[Figure: the same curve; we will end up with our estimate here, in the local minimum]
Some Solutions
• Several re-starts
• Momentum to jump over small dips
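A common form of momentum update, sketched for the same one-parameter example; the momentum coefficient 0.9 is a typical but illustrative choice:

```python
import random

def sgd_mean_momentum(data, eta=0.1, beta=0.9, epochs=50):
    """SGD on the squared-error cost with a momentum term that accumulates
    past steps, which can carry the estimate over small dips."""
    data = list(data)
    mu, velocity = data[0], 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x in data:
            velocity = beta * velocity + 2 * (x - mu)  # accumulate the step direction
            mu += eta * velocity
    return mu
```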
Isn't That All a Bit Pointless?
• For calculating a mean, yes it is
• There is no need to use gradient descent for it
• But there are other examples where you need
to
• We will meet neural networks soon, which
make use of gradient descent during learning
Another Cost Function: Likelihood
• What if we want to estimate the parameters
of a probability distribution – what is a good
cost function?
• The problem is to take a sample $x_1, \ldots, x_n$ and estimate the probability distribution $P(x)$, usually in some parametrised form.
• Squared error cannot be used as we do not ever know the true value of any $P(x_i)$
Calculating Likelihood
• The likelihood associated with a model and a given data set is calculated as the product of the probability estimates made by the model across the examples from the data set:
$$L = \prod_i \hat{P}(x_i)$$
• We use $\hat{P}(x_i)$ to mean the estimate made by the model
Log Likelihood
• Probabilities can be small and multiplying
many of them together can make very small
numbers
• So the log likelihood is often used
$$\ell = \sum_i \log \hat{P}(x_i)$$
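A small sketch of both quantities in Python; the model here is simply a function returning the estimated probability of a data point, and the names are illustrative:

```python
import math

def likelihood(p_hat, sample):
    """Product of the model's probability estimates over the sample."""
    result = 1.0
    for x in sample:
        result *= p_hat(x)
    return result

def log_likelihood(p_hat, sample):
    """Sum of log probability estimates -- avoids the tiny numbers
    that the raw product produces."""
    return sum(math.log(p_hat(x)) for x in sample)
```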
Simple Example
• Let's say we toss a coin 100 times and get 75 heads and 25 tails
• We now want to model that coin with a discrete function $\hat{P}(x)$ that takes $x = H$ or $x = T$ as input and outputs the associated probability (0.75 or 0.25, in this case)
• Again, the example is trivial and we know the answer is $\hat{P}(x = H) = 75/100$ and $\hat{P}(x = T) = 25/100$ (Bayesians look away now)
Simple Example
• But let’s say we don’t know that, or need a
method that can cope in more complex
situations where that can’t be used
Maximise Likelihood
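For the coin example, the maximisation can be done analytically; a sketch, writing $p = \hat{P}(x = H)$ and using the counts above:
$$\ell(p) = 75\log p + 25\log(1-p), \qquad \frac{d\ell}{dp} = \frac{75}{p} - \frac{25}{1-p} = 0 \;\Rightarrow\; 75(1-p) = 25p \;\Rightarrow\; p = \frac{75}{100}$$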
Negative Log Likelihood
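In practice, optimisation routines are usually written as minimisers, so the negative log likelihood is minimised instead; it has the same optimum:
$$-\ell(p) = -75\log p - 25\log(1-p), \qquad \operatorname*{arg\,min}_{p}\bigl(-\ell(p)\bigr) = \operatorname*{arg\,max}_{p}\,\ell(p) = 0.75$$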
Or Gradient Descent
• Similarly, we could use an iterative approach
and try to find the parameter with the largest
likelihood by iteratively moving the estimate
along the likelihood gradient
The gradient of the log likelihood at different values of $p = \hat{P}(x = H)$:

$p$      $d\ell/dp$ (gradient of the log likelihood)
0.5      100
0.6      62.5
0.7      23.81
0.72     14.88
0.75     0
0.76     -5.48
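A sketch of that iterative search in Python, ascending the log-likelihood gradient for the coin data; the step size and iteration count are illustrative choices:

```python
def fit_coin_probability(heads=75, tails=25, eta=0.001, n_iters=1000):
    """Estimate P(x = H) by gradient ascent on the log likelihood
    l(p) = heads * log(p) + tails * log(1 - p)."""
    p = 0.5                                 # starting guess
    for _ in range(n_iters):
        grad = heads / p - tails / (1 - p)  # dl/dp, as in the table above
        p += eta * grad                     # move up the likelihood gradient
    return p

print(fit_coin_probability())               # approaches 0.75
```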
Other Optimisations
• There are many other methods for taking a
cost function and trying to find its global
minimum
• Some follow gradients, others use different algorithms or heuristics
• We will see more of them during the course
Summary
• Many machine learning methods involve
optimising some form of cost function
• Sometimes it is possible to optimise the cost analytically; multiple linear regression, for example, does so
• Other times, you need to use an iterative
approach such as gradient descent, for
example when training a neural network