
unit #7/8
Giansalvo EXIN Cirrincione
ERROR FUNCTIONS
part one
Goal for REGRESSION: to model the conditional distribution of the
output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of
class membership, conditioned on the input variables.
ERROR FUNCTIONS
Basic goal for TRAINING: to model the underlying generator of the data, so as to generalize well to new data.
The most general and complete description of the generator of the data is
in terms of the probability density p(x,t) in the joint input-target space.
For a set of training data {x^n, t^n} drawn independently from the same distribution, the likelihood is

L = \prod_n p(x^n, t^n) = \prod_n p(t^n \mid x^n) \, p(x^n)

where p(t \mid x) is modelled by the feed-forward neural network.
ERROR FUNCTIONS
Taking the negative logarithm of the likelihood (and dropping the term in p(x^n), which does not depend on the network weights) determines the error function

E = -\sum_n \ln p(t^n \mid x^n)
Sum-of-squares error (OLS approach)
• c target variables t_k
• the distributions of the target variables are independent
• the distributions of the target variables are Gaussian
• the error \epsilon_k \sim N(0, \sigma^2); \sigma doesn't depend on x or on k
Under these assumptions, the error function

E = -\sum_n \ln p(t^n \mid x^n)

reduces (up to additive constants and an overall scale) to the
Sum-of-squares error

E = \frac{1}{2} \sum_n \sum_{k=1}^{c} \left\{ y_k(x^n; w) - t_k^n \right\}^2
Sum-of-squares error
Minimize E w.r.t. \sigma, with the network outputs computed at w*, where w* minimizes E w.r.t. the weights:

\sigma^2 = \frac{1}{Nc} \sum_n \sum_k \left\{ y_k(x^n; w^*) - t_k^n \right\}^2

The optimum value of \sigma^2 is proportional to the residual value of the sum-of-squares error function at its minimum.
Of course, the use of a sum-of-squares error doesn’t require the target
data to have a Gaussian distribution. However, if we use this error, then
the results cannot distinguish between the true distribution and any other
distribution having the same mean and variance.
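A minimal NumPy sketch of the two quantities above, using hypothetical arrays Y (network outputs at the minimum) and T (targets); the noise-variance estimate is 2E*/(Nc):

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 100, 3                            # N patterns, c target variables (illustrative sizes)
T = rng.normal(size=(N, c))              # targets t_k^n
Y = T + 0.1 * rng.normal(size=(N, c))    # stand-in for the outputs y_k(x^n; w*)

E_min = 0.5 * np.sum((Y - T) ** 2)       # sum-of-squares error at its minimum
sigma2 = 2.0 * E_min / (N * c)           # ML noise variance: proportional to the residual error
print(E_min, sigma2)
```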
Sum-of-squares error
A normalized sum-of-squares error can be evaluated both in training (over all N patterns in the training set) and in validation (over all N' patterns in the test set):
• E = 0 → perfect prediction of the test data
• E = 1 → the network is predicting the test data in the mean
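A sketch of one such normalized error, assuming the usual form in which the residual sum of squares is divided by the sum of squares around the test-set mean target (so that 0 means perfect prediction and 1 means predicting in the mean); array names are ours:

```python
import numpy as np

def normalized_error(Y, T):
    """Y: (N', c) network outputs on the test set, T: (N', c) test targets."""
    T_mean = T.mean(axis=0)                         # mean target over the test set
    return np.sum((Y - T) ** 2) / np.sum((T_mean - T) ** 2)

# A network that always outputs the mean target scores exactly 1.
```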
linear output units
For an MLP or RBF network with linear output units, the outputs are

y_k(x) = \sum_{j=0}^{M} w_{kj} \, \tilde{z}_j(x; \tilde{w}), \qquad \tilde{z}_0 \equiv 1

where the \tilde{z}_j are the final-hidden-layer activations and \tilde{w} collects all the other weights. Defining the matrices

(T)_{nk} = t_k^n, \qquad (W)_{kj} = w_{kj}, \qquad (Z)_{nj} = \tilde{z}_j^n

the sum-of-squares error is quadratic in the final-layer weights, so their optimal values solve the linear least-squares problem Z W^T = T and can be obtained directly by the pseudo-inverse (computed via SVD): W^T = Z^\dagger T.
Training can therefore be split into two stages: the final-layer weights w_L are recomputed by the fast linear (SVD) step, while the remaining weights \tilde{w} are adjusted by the slow nonlinear optimization

\min_{\tilde{w}} E(\tilde{w}, w_L(\tilde{w}))

• reduction of the number of iterations (smaller search space)
• greater cost per iteration
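A sketch of the fast stage under these assumptions: given hidden activations Z (with a bias column) and targets T, the least-squares final-layer weights come from an SVD-based solver; the data here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, c = 200, 6, 2
Z = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])  # z~_j(x^n; w~), bias column z~_0 = 1
T = rng.normal(size=(N, c))                                # targets t_k^n

WT, *_ = np.linalg.lstsq(Z, T, rcond=None)   # SVD-based least squares: W^T = Z^+ T
Y = Z @ WT                                   # outputs with the optimal final layer
```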
linear output units
Suppose the TS target patterns satisfy an exact linear relation:
u_0 + u^T t^n = 0
If the final-layer weights are determined by OLS, then the outputs
will satisfy the same linear constraint for arbitrary input patterns.
Interpretation of network outputs
For a network trained by minimizing
a sum-of-squares error function, the
outputs approximate the conditional
averages of the target data
Consider the limit in which the size N of the TS goes to infinity
Interpretation of network outputs
y_k(x) = \langle t_k \mid x \rangle: the regression of t_k conditioned on x
Interpretation of network outputs
This result doesn't depend on the choice of network architecture, or even on whether a neural network is used at all. However, ANNs provide a framework for approximating arbitrary nonlinear multivariate mappings and can therefore, in principle, approximate the conditional average to arbitrary accuracy.
KEY ASSUMPTIONS
• the TS must be sufficiently large
• the network function must be sufficiently general (enough weights to reach the minimum)
• training must be carried out in such a way as to find the appropriate minimum of the cost
Interpretation of network outputs
(the residual t_k - \langle t_k \mid x \rangle is a zero-mean random variable)
Interpretation of network outputs
example
Interpretation of network outputs
example
t = x + 0.3 \sin(2\pi x) + \epsilon

\epsilon is a RV drawn from a uniform distribution in the range (-0.1, 0.1)
• MLP 1-5-1
• sum-of-squares error
(the data-generating process is sketched below)
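A sketch of this toy example's data set; since the noise has zero mean, a network minimizing sum-of-squares should approximate the conditional mean \langle t \mid x \rangle = x + 0.3 \sin(2\pi x) (the input range (0, 1) is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=300)                   # inputs (range assumed)
eps = rng.uniform(-0.1, 0.1, size=x.shape)            # zero-mean uniform noise
t = x + 0.3 * np.sin(2 * np.pi * x) + eps             # targets
conditional_mean = x + 0.3 * np.sin(2 * np.pi * x)    # what the trained network approximates
```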
Interpretation of network outputs
The sum-of-squares error function cannot
distinguish between the true distribution
and a Gaussian distribution having the
same x-dependent mean and average
variance.
ERROR FUNCTIONS
part two
PCk x
Goal for REGRESSION: to model the conditional distribution of the
output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of
class membership, conditioned on the input variables.
We can exploit a number of results ...
Minimum error-rate decisions
Note that the network outputs need not be
close to 0 or 1 if the class-conditional
density functions are overlapping.
Goal for CLASSIFICATION: to model the posterior probabilities of
class membership, conditioned on the input variables.
Minimum error-rate decisions
We can exploit a number of results ...
Outputs sum to 1
It can be enforced explicitly as part
of the choice of network structure.
The average of each output over all patterns in the TS should
approximate the corresponding prior class probabilities.
These estimated priors can be compared with the sample estimates
of the priors obtained from the fractions of patterns in each class
within the TS. Differences are an indication that the network is not
modelling the posterior probabilities accurately.
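A sketch of this consistency check (array names are ours): compare the average of each output over the TS with the class fractions observed in the TS:

```python
import numpy as np

def compare_priors(Y, labels, c):
    """Y: (N, c) network outputs, labels: (N,) class indices in 0..c-1."""
    estimated_priors = Y.mean(axis=0)                               # average outputs over the TS
    sample_priors = np.bincount(labels, minlength=c) / len(labels)  # class fractions in the TS
    return estimated_priors, sample_priors                          # large gaps: posteriors poorly modelled
```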
Minimum error-rate decisions
Outputs sum to 1
We can exploit a number of results ...
Compensating for different priors
Case: priors expected when the network is in
use differ from those represented by the TS.
P1 Ck x  
px Ck P1 Ck 
p1 x 
P2 Ck x  
px Ck P2 Ck 
p2 x 
1 
• 1 = TS
• 2 = use
P1 Ck x 
P1 Ck 

px Ck 
px Ck 
p2 x 
 1P2 Ck  
P2 Ck   P2 Ck x 
  x P2 Ck x 
p1 x 
p1 x 
Changes in priors can P C x   1  P C 
1 2
k
be accomodated without 2 k
 x 
retraining
normalization factor
p1 x 
sum-of-squares for classification
1-of-c coding: every input vector in the TS is labelled by its class membership, represented by a set of target values t_k^n:

x^n \in C_l \;\Rightarrow\; t_k^n = \delta_{kl}

t_k is then a discrete RV with density

p(t_k \mid x) = \sum_{l=1}^{c} \delta(t_k - \delta_{kl}) \, P(C_l \mid x)

[figure: p(t_k \mid x_0) consists of two delta spikes, of weight 1 - P(C_k \mid x_0) at t_k = 0 and weight P(C_k \mid x_0) at t_k = 1]
sum-of-squares for classification
With 1-of-c coding (x^n \in C_l \Rightarrow t_k^n = \delta_{kl}) and

p(t_k \mid x) = \sum_{l=1}^{c} \delta(t_k - \delta_{kl}) \, P(C_l \mid x)

the outputs of a network trained by sum-of-squares approximate

y_k(x) = \langle t_k \mid x \rangle = \int t_k \, p(t_k \mid x) \, dt_k = P(C_k \mid x)
sum-of-squares for classification
The s-o-s error is not the most appropriate for classification because it is
derived from ML on the assumption of Gaussian distributed target data.
• in the case of a 1-of-c coding scheme, the target values sum to unity for each pattern, and so the outputs will satisfy the same constraint
• for a network with linear output units and s-o-s error, if the target values satisfy a linear constraint, then the outputs will satisfy the same constraint for an arbitrary input
• if the outputs are to represent probabilities y_k(x) = P(C_k \mid x), they should lie in the range (0,1) and should sum to 1
• however, there is no guarantee that the outputs lie in the range (0,1)
sum-of-squares for classification
two-class problem
two output units
1-of-c coding
alternative approach
single output
t^n = 1 if x^n \in C_1, \qquad t^n = 0 if x^n \in C_2

p(t \mid x) = \delta(t - 1) \, P(C_1 \mid x) + \delta(t) \, P(C_2 \mid x)

y(x) = \langle t \mid x \rangle = P(C_1 \mid x), \qquad P(C_2 \mid x) = 1 - y(x)
Interpretation of hidden units
T nk  ~tk n
W kj  wkj
Z nj  ~z jn
WT  Z  T
Total covariance matrix
for the activations at the
output of the final
hidden layer w.r.t. TS
linear output units
Interpretation of hidden units
Between-class
covariance matrix
linear output units
Interpretation of hidden units
Nothing here is specific to the MLP, or indeed to ANNs. The same result is obtained regardless of the functions z_j (of the weights) and applies to any generalized linear discriminant in which the kernels are adaptive.
min E ⇔ max J (linear output units)
Interpretation of hidden units
The weights in the final layer are adjusted to produce an optimum
discrimination of the classes of input vectors by means of a linear
transformation. Minimizing the error of this linear discriminant
requires the input data undergo a nonlinear transformation into the
space spanned by the activations of the hidden units in such a way
as to maximize the discriminant function J.
min E ⇔ max J (linear output units)
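A sketch (our notation) of the covariance matrices of the final-hidden-layer activations that appear in this interpretation; combining them as Tr{S_T^{-1} S_B} is one common discriminant criterion, used here only for illustration (the slide's exact J may differ):

```python
import numpy as np

def hidden_layer_scatter(Z, labels):
    """Z: (N, M) final-hidden-layer activations over the TS, labels: (N,) class indices."""
    z_bar = Z.mean(axis=0)
    S_T = (Z - z_bar).T @ (Z - z_bar)                 # total covariance matrix
    S_B = np.zeros((Z.shape[1], Z.shape[1]))
    for k in np.unique(labels):
        Zk = Z[labels == k]
        dk = Zk.mean(axis=0) - z_bar
        S_B += len(Zk) * np.outer(dk, dk)             # between-class covariance matrix
    J = np.trace(np.linalg.pinv(S_T) @ S_B)           # illustrative discriminant criterion
    return S_T, S_B, J
```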
Interpretation of hidden units
Strong weighting of the feature-extraction criterion in favour of classes with a larger number of patterns
min E ⇔ max J (1-of-c coding, linear output units)
Cross-entropy for two classes
target coding scheme: t^n = 1 if x^n \in C_1, \qquad t^n = 0 if x^n \in C_2

wanted: y = P(C_1 \mid x), \qquad P(C_2 \mid x) = 1 - y
• Hopfield (1987)
• Baum and Wilczek (1988)
• Solla et al. (1988)
• Hinton (1989)
• Hampshire and Pearlmutter (1990)
cross-entropy error function

E = -\sum_n \left[ t^n \ln y^n + (1 - t^n) \ln(1 - y^n) \right]

Cross-entropy for two classes
logistic activation function for the output: y = g(a), with g'(a) = g(a)\,(1 - g(a)) = y\,(1 - y)
Absolute minimum at y^n = t^n \;\forall n

BP: with this pairing the output delta simplifies, \partial E / \partial a^n = y^n - t^n
Natural pairing
• sum-of-squares + linear output units
• cross-entropy + logistic output unit
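A minimal sketch of the two-class pairing: logistic output plus cross-entropy, with the error derivative w.r.t. the output activation reducing to y - t (the clipping is ours, for numerical safety):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_two_class(a, t, eps=1e-12):
    """a: (N,) output activations, t: (N,) 0/1 targets."""
    y = np.clip(logistic(a), eps, 1.0 - eps)
    E = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    dE_da = y - t                        # g'(a) = y (1 - y) cancels against dE/dy
    return E, dE_da
```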
Cross-entropy for two classes
With 0/1 (1-of-c) target coding the minimum value of E is 0. It doesn't vanish when t^n is continuous in the range (0,1), representing the probability of the input x^n belonging to class C_1; subtracting the minimum value gives an error whose absolute minimum is at 0:

E = -\sum_n \left[ t^n \ln \frac{y^n}{t^n} + (1 - t^n) \ln \frac{1 - y^n}{1 - t^n} \right]
Cross-entropy for two classes
example
• MLP
• one input unit
• five hidden units (tanh)
• one output unit (logistic)
• cross-entropy
• BFGS
Class-conditional pdf’s
used to generate the TS
(equal priors)
dashed = Bayes
[figure: sigmoid output of a single-layer network plotted against x, for comparison]
[figure: logistic-sigmoid activation functions arise naturally when the class-conditional densities belong to the exponential family of distributions (e.g. Gaussian, binomial, Bernoulli, Poisson); the same form applies to the hidden-unit and output activations of a multi-layer network]
The network output is given by a logistic sigmoid activation function acting on a weighted linear
combination of the outputs of those hidden units which send connections to the output unit.
Extension to the hidden units: provided such units use logistic sigmoids,
their outputs can be interpreted as probabilities of the presence of
corresponding features conditioned on the inputs to the units.
properties of the cross-entropy error
the cross-entropy error
function performs better
than s-o-s at estimating
small probabilities
the s-o-s error function depends on the absolute errors
(its minimization tends to result in similar absolute errors for each pattern)
the cross-entropy error function depends on the relative errors of the outputs
(its minimization tends to result in similar relative errors on both small and large targets)

y^n = t^n + \epsilon^n
properties of the cross-entropy error
For the 0/1 target coding scheme (t^n = 1 if x^n \in C_1, t^n = 0 if x^n \in C_2), writing y^n = t^n + \epsilon^n with \epsilon^n small, the cross-entropy reduces to the Manhattan error function

E \simeq \sum_n | \epsilon^n |

compared with s-o-s:
• much stronger weight to smaller errors
• better for incorrectly labelled data
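A small numeric check of the two properties above (values are illustrative): for 0/1 targets a small error contributes roughly |ε|, and for targets in (0,1) the excess error grows like ε²/(2 t (1 - t)), i.e. relative rather than absolute errors matter:

```python
import numpy as np

def ce_term(y, t):
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))   # one pattern's cross-entropy

eps = 1e-3
# 0/1 targets: both contributions are ~ |eps|, the Manhattan behaviour
print(ce_term(1 - eps, 1.0), ce_term(eps, 0.0), abs(eps))

# continuous targets: same absolute error, very different cost above the minimum
for t in (0.01, 0.5):
    excess = ce_term(t + eps, t) - ce_term(t, t)
    print(t, 0.5 * eps**2, excess, eps**2 / (2 * t * (1 - t)))
```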
justification of the cross-entropy error
In the infinite-data limit, setting the functional derivative of E w.r.t. y(x) to zero shows that, as for s-o-s, the output of the network approximates the conditional average of the target data for the given input:

y(x) = \langle t \mid x \rangle
justification of the cross-entropy error
With the target coding scheme (t^n = 1 if x^n \in C_1, t^n = 0 if x^n \in C_2), the conditional average is exactly the posterior probability:

y(x) = \langle t \mid x \rangle = P(C_1 \mid x)
Multiple independent attributes
Determine the probabilities of the presence or absence of a
number of attributes (which need not be mutually exclusive).
Assumption: independent attributes

p(t \mid x) = \prod_{k=1}^{c} y_k^{t_k} \, (1 - y_k)^{1 - t_k}

Multiple outputs: y_k represents the probability that the k-th attribute is present.
Taking the negative log-likelihood gives the error function

E = -\sum_n \sum_{k=1}^{c} \left[ t_k^n \ln y_k^n + (1 - t_k^n) \ln(1 - y_k^n) \right]
With this choice of error function, the outputs should
each have a logistic sigmoid activation function
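A sketch of this multiple-attribute error with independent sigmoid outputs (array names and the clipping are ours):

```python
import numpy as np

def multi_attribute_error(A, T, eps=1e-12):
    """A: (N, c) output activations, T: (N, c) 0/1 attribute targets."""
    Y = np.clip(1.0 / (1.0 + np.exp(-A)), eps, 1.0 - eps)    # one logistic sigmoid per attribute
    return -np.sum(T * np.log(Y) + (1.0 - T) * np.log(1.0 - Y))
```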
Multiple independent attributes
HOMEWORK
Show that the entropy measure E, derived for targets tk =0, 1, applies
also in the case where the targets are probabilities with values in (0,1).
Do this by considering an extended data set in which each pattern t_k^n is replaced by a set of M patterns, of which M t_k^n are set to 1 and the remainder are set to 0, and then applying E to this extended TS.
Cross-entropy for multiple classes
mutually exclusive classes
One output yk
for each class
1-of-c coding
x^n \in C_l \;\Rightarrow\; t_k^n = \delta_{kl}

The probability of observing the set of target values t_k^n = \delta_{kl}, given an input vector x^n, is just P(C_l \mid x^n) = y_l. Forming the likelihood over the TS and taking the negative logarithm gives the cross-entropy error

E = -\sum_n \sum_k t_k^n \ln y_k^n

The {y_k} are not independent as a result of the constraint \sum_k y_k = 1.
For target values that are probabilities, the corresponding form E = -\sum_n \sum_k t_k^n \ln (y_k^n / t_k^n) is the discrete Kullback-Leibler distance.
The absolute minimum w.r.t. {y_k^n} occurs when y_k^n = t_k^n \;\forall k, n.
Cross-entropy for multiple classes
If the output values are to be interpreted as probabilities,
they must lie in the range (0,1) and sum to unity.
This is achieved by the normalized exponential (softmax) activation function, a generalization of the logistic sigmoid:

y_k = \frac{\exp(a_k)}{\sum_l \exp(a_l)}
Cross-entropy for multiple classes
As with the logistic sigmoid, we can give a general motivation for the
softmax by considering the posterior probability that a hidden unit
activation z belongs to class Ck .
a_k = w_k^T z + w_{k0}, \qquad w_k = \theta_k, \qquad w_{k0} = A(\theta_k) + \ln P(C_k)
The outputs can be interpreted as probabilities
of class membership, conditioned on the
outputs of the hidden units.
Cross-entropy for multiple classes
BP training: differentiating the error w.r.t. the inputs to all output units gives

\frac{\partial E}{\partial a_k^n} = y_k^n - t_k^n \qquad (\text{using } \sum_{k'} t_{k'}^n = 1)

Natural pairing
• sum-of-squares + linear output units
• 2-class cross-entropy + logistic output unit
• c-class cross-entropy + softmax output units
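A sketch of the c-class pairing under these definitions: softmax outputs with the multi-class cross-entropy, returning the simple delta y - t (the max-subtraction is only for numerical stability):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)         # numerical stability, does not change the result
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def cross_entropy_multiclass(A, T, eps=1e-12):
    """A: (N, c) output activations, T: (N, c) 1-of-c targets."""
    Y = softmax(A)
    E = -np.sum(T * np.log(np.clip(Y, eps, None)))
    dE_dA = Y - T                                # the natural-pairing delta
    return E, dE_dA
```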
Consider the cross-entropy error function for multiple classes,
together with a network whose outputs are given by a softmax
activation function, in the limit of an infinite data set. Show that the
network output functions yk(x) which minimize the error are given
by the conditional averages of the target data, \langle t_k \mid x \rangle.
hint
Since the outputs are not independent, consider the
functional derivative w.r.t. ak(x) instead.