Let’s start with an example: recognizing handwritten digits. The goal is to build a machine that takes a picture and produces the identity of the digit, 0, ..., 9. The machine learning approach is to use a large set of pictures together with their known values (typically inspected individually, i.e., hand-labeled) to train (learn) the parameters of a model. Then, using these parameters, we can predict which digit a new picture represents.
If we take a step back and think about the machine learning approach a bit more abstractly, we can describe it in the following way: we have some data $\{x_i\}_{i=1}^{N}$ and the corresponding results $\{t_i\}_{i=1}^{N}$. We want to build a machine (a program, a function) that maps a new instance of data $x$ to a result $t$.
Through these eyes, we can look at machine learning as in the picture [sine curve and randomly generated points]. We have ten points $\{(x_i, t_i)\}_{i=1}^{N}$, and we are looking for the value at any new given point. I believe all of us thought of interpolation/approximation. A word about the points: they are generated from the function $\sin(2\pi x)$ with added random noise. It is instructive to consider synthetically generated data, because then we know the precise process that generated it, for comparison against any learned model. The random noise in the targets captures a property of many real data sets: individual observations are corrupted by noise, either because we do not measure the variables precisely or because some sources of variability are unobserved.
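As an illustration, here is a minimal sketch of how such a data set can be generated. The seed and the noise level `noise_std` are arbitrary choices of mine, not values given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
N = 10
x = np.linspace(0.0, 1.0, N)             # ten inputs spread over [0, 1]
noise_std = 0.3                          # assumed noise level (not stated in the text)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=N)
```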
Let’s fit the data using a polynomial (a linear function in $a$),
$$p(x, a) = a_0 + a_1 x + a_2 x^2 + \dots + a_M x^M.$$
The values of the coefficients will be determined by fitting the polynomial to the training data. We will use the least squares approach: minimize the error function
$$E(a) = \sum_{n=1}^{N} \bigl(p(x_n, a) - t_n\bigr)^2.$$
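One possible implementation of this fit is sketched below; `fit_polynomial` and `predict` are helper names I introduce here, not from the text. The design matrix has columns $1, x, x^2, \dots, x^M$, and `np.linalg.lstsq` minimizes the squared error.

```python
def fit_polynomial(x, t, M):
    """Minimize sum_n (p(x_n, a) - t_n)^2 over the coefficients a."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^M
    a, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return a

def predict(x, a):
    """Evaluate the polynomial p(x, a) at one or more points x."""
    return np.vander(np.atleast_1d(x), len(a), increasing=True) @ a

a3 = fit_polynomial(x, t, M=3)                   # the degree the text singles out
```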
There remains the problem of choosing the degree $M$ of the polynomial; this is called model comparison or model selection.
We notice that $M = 0$ and $M = 1$ give rather poor results: such polynomials are poor representations of the function $\sin(2\pi x)$. The choice $M = 3$ seems to give the best fit to $\sin(2\pi x)$. When we go to a much higher degree (for example $M = 9$), we obtain an excellent fit to the training data (we interpolate the points), and the error function equals zero. However, the polynomial oscillates wildly and gives a very poor representation of the function $\sin(2\pi x)$. In machine learning we “do not care” whether we get good results on the training data (we already have a good approximation of the training data, since it is given); we want a good approximation of unseen data. This is important. This behavior is so common in machine learning that it has its own name: over-fitting.
If we look at the error on the training data (the given points) and on test data (100 points taken uniformly from the function $\sin(2\pi x)$), computed as
$$\frac{1}{N} \sum_{n=1}^{N} \bigl|p(x_n, a) - t_n\bigr|,$$
we see that both errors behave similarly until $M = 8, 9$; then the test error explodes.
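A small sketch of this comparison, reusing the helpers above; I take the 100 test points as uniformly spaced and noise-free, which is one reading of the text.

```python
# Mean absolute error on the training points and on 100 test points
# from sin(2*pi*x), for every degree M.
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test)      # noise-free test targets
for M in range(10):
    a = fit_polynomial(x, t, M)
    train_err = np.mean(np.abs(predict(x, a) - t))
    test_err = np.mean(np.abs(predict(x_test, a) - t_test))
    print(f"M={M}: train {train_err:.4f}, test {test_err:.4f}")
```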
We ask ourselves: why? Experience tells us that the magnitudes of the polynomial coefficients blow up at degrees 8 and 9. In order to keep the coefficients small, we modify the error function by adding a penalty term:
$$E(a) = \sum_{n=1}^{N} \bigl(p(x_n, a) - t_n\bigr)^2 + \lambda \|a\|^2. \tag{1}$$
We are punishing polynomials with large coefficients. The parameter $\lambda$ tells us how important the minimization of the coefficients is; different values give different results. Using such extra parameters is a common technique in machine learning.
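Equation (1) still has a closed-form minimizer, which the following sketch implements; the name `fit_polynomial_ridge` and the value of `lam` are illustrative choices of mine.

```python
# Closed-form solution of Equation (1): a = (Phi^T Phi + lam*I)^{-1} Phi^T t.
def fit_polynomial_ridge(x, t, M, lam):
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

a9_reg = fit_polynomial_ridge(x, t, M=9, lam=1e-3)   # lambda chosen for illustration
```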
But does Equation (1) have any theoretical meaning? It does, but we must first equip ourselves with probabilistic tools. Recall the Gaussian distribution (it is a bump),
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Bigl(-\frac{1}{2\sigma^2}(x - \mu)^2\Bigr).$$
The Gaussian distribution has two parameters: the mean $\mu$ and the variance $\sigma^2$. We can express our uncertainty over the value of the target variable $t_n$ using probability. We assume that, given the value $x$, the corresponding value of $t$ has a Gaussian distribution whose mean $y(x, a) = p(x, a)$ is given by the polynomial curve. Thus we have $p(t \mid x, a, \beta) = \mathcal{N}(t \mid y(x, a), \beta^{-1})$, where $\beta$ is the precision (inverse variance). The probability of all target values, given the polynomial coefficients, the values $x$, and the precision, is
$$p(\mathbf{t} \mid \mathbf{x}, a, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(x_n, a), \beta^{-1}\bigr).$$
If we insert the probability density function, we get
$$p(\mathbf{t} \mid \mathbf{x}, a, \beta) = \prod_{n=1}^{N} \frac{\beta^{1/2}}{(2\pi)^{1/2}} \exp\Bigl(-\frac{\beta}{2}\bigl(t_n - y(x_n, a)\bigr)^2\Bigr).$$
We want to maximize this probability (so that the observed targets are generated with high probability under the Gaussian model). Because the logarithm is a monotonically increasing function, we can equivalently maximize
$$\log p(\mathbf{t} \mid \mathbf{x}, a, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \bigl(y(x_n, a) - t_n\bigr)^2 + \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi).$$
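As a quick numerical sanity check of this closed form, the sketch below compares it against summing Gaussian log densities directly; the precision value `beta` is arbitrary, and `scipy.stats.norm` parameterizes the Gaussian by its standard deviation $\beta^{-1/2}$.

```python
from scipy.stats import norm

def log_likelihood(a, beta, x, t):
    sq_err = np.sum((predict(x, a) - t) ** 2)
    n = len(t)
    return -0.5 * beta * sq_err + 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi)

beta = 4.0                                # arbitrary precision for the check
direct = np.sum(norm.logpdf(t, loc=predict(x, a3), scale=beta ** -0.5))
assert np.isclose(log_likelihood(a3, beta, x, t), direct)
```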
It is convenient to work with the logarithm of the probability for several reasons: if the distribution belongs to the exponential family, the computation simplifies greatly; the logarithm transforms products into sums; and we avoid numerical underflow (each probability lies between 0 and 1). We want to maximize the log probability with respect to the coefficients: $\max_a \log p(\mathbf{t} \mid \mathbf{x}, a, \beta)$. We see that the last two terms on the right-hand side do not depend on $a$, and that multiplying by an arbitrary positive constant does not change the maximizer. Since maximizing $-f(x)$ is the same as minimizing $f(x)$, we get
$$\max_a \log p(\mathbf{t} \mid \mathbf{x}, a, \beta) \iff \min_a \sum_{n=1}^{N} \bigl(y(x_n, a) - t_n\bigr)^2.$$
This is exactly the least squares technique. (By minimizing we obtain a solution $a_{ML}$.) We can use the same technique to determine the parameter $\beta$:
$$\max_\beta \log p(\mathbf{t} \mid \mathbf{x}, a_{ML}, \beta) = \max_\beta \Bigl[-\frac{\beta}{2} \sum_{n=1}^{N} \bigl(y(x_n, a_{ML}) - t_n\bigr)^2 + \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi)\Bigr].$$
Differentiating with respect to $\beta$ gives $-\frac{1}{2}\sum_{n=1}^{N}(y(x_n, a_{ML}) - t_n)^2 + \frac{N}{2\beta}$; setting this derivative to zero yields
$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \bigl(y(x_n, a_{ML}) - t_n\bigr)^2.$$
Now we can make predictions for a new value of $x$ via
$$p(t \mid x, a_{ML}, \beta_{ML}) = \mathcal{N}\bigl(t \mid y(x, a_{ML}), \beta_{ML}^{-1}\bigr).$$
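Continuing the sketch above, both maximum likelihood quantities and the resulting predictive Gaussian can be computed in a few lines; the new input `x_new` is an arbitrary example point.

```python
a_ml = fit_polynomial(x, t, M=3)                  # maximum likelihood coefficients
inv_beta = np.mean((predict(x, a_ml) - t) ** 2)   # 1/beta_ML from the formula above

x_new = 0.5                                       # an arbitrary new input
mean = predict(x_new, a_ml)[0]                    # predictive mean y(x_new, a_ML)
std = np.sqrt(inv_beta)                           # predictive std beta_ML^{-1/2}
print(f"p(t | x={x_new}) = N(t | {mean:.3f}, {std:.3f}^2)")
```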
Now let us take a step towards a more Bayesian approach. The bread and butter here is Bayes’ theorem,
$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}.$$
This theorem is used so frequently that each term has its own name: the posterior equals the prior times the likelihood divided by the evidence. Without the normalization constant we can state it as: posterior $\propto$ likelihood $\cdot$ prior. This theorem is at the core of machine learning, because it transforms intractable distributions into manageable ones. For example,
$$p(a \mid \mathbf{x}, \mathbf{t}) \propto p(\mathbf{t} \mid a, \mathbf{x}) \cdot p(a \mid \mathbf{x}).$$
If we maximize the log of the posterior, the product on the right-hand side becomes a sum of two terms. We already know how to maximize the first term (the log-likelihood). For the second term, we introduce a prior on the coefficients. The multidimensional Gaussian distribution is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Bigl(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Bigr),$$
so we take as prior on the coefficients
$$p(a \mid \alpha) = \mathcal{N}(a \mid 0, \alpha^{-1} I) = \Bigl(\frac{\alpha}{2\pi}\Bigr)^{(M+1)/2} \exp\Bigl(-\frac{\alpha}{2} a^T a\Bigr).$$
The mean of the distribution is $0$, which means we want the coefficients to be around zero; the covariance matrix is proportional to $I$ (the coefficients are uncorrelated); and the parameter $\alpha^{-1}$ controls how far from $0$ the coefficients can go. If we take the logarithm of the posterior and put everything together (using $\max_a -f(a) = \min_a f(a)$), we find that we must minimize
$$\frac{\beta}{2} \sum_{n=1}^{N} \bigl(y(x_n, a) - t_n\bigr)^2 + \frac{\alpha}{2}\, a^T a.$$
Multiplying by $2/\beta$ and defining $\lambda = \alpha/\beta$, we recover Equation (1).
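A short sketch of this equivalence: the MAP estimate is just the ridge solution from earlier with $\lambda = \alpha/\beta$. The hyperparameter values `alpha` and `beta` are illustrative choices, and the final check verifies that the gradient of the negative log posterior vanishes at the solution.

```python
# MAP estimate: minimizing (beta/2)*sum of squares + (alpha/2)*a^T a is the
# regularized problem of Equation (1) with lam = alpha/beta.
alpha, beta = 5e-3, 11.1                  # illustrative hyperparameter values
M = 9
a_map = fit_polynomial_ridge(x, t, M, lam=alpha / beta)

# Sanity check: gradient beta*Phi^T(Phi a - t) + alpha*a is zero at a_map.
Phi = np.vander(x, M + 1, increasing=True)
grad = beta * Phi.T @ (Phi @ a_map - t) + alpha * a_map
assert np.allclose(grad, 0.0, atol=1e-8)
```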
Acknowledgment: this paper (and the corresponding slides) is based on the book [1].
References
1. Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.