Introduction to learning

Neuro-Computing
Lecture 5
Committee Machine
Suggested Reading
Neural Network Ensembles
- Haykin: Chapter 7
- Kuncheva: Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
Pattern Classification
[Diagram: a pattern-classification pipeline (processing, feature extraction, classification) illustrated with a fork/spoon recognition example]
The power of Parliament
- Many 'experts' arguing about the same problem
- The 'experts' can locally disagree, but overall all of them try to solve the same global problem
- The collective answer is usually more stable to fluctuations in the intelligence/emotions/biases of individual 'experts'
- There exist many types of formalizations of this idea.
Combining neural classifiers
Averaging Results: Mean Error for Each Network
Suppose we have L trained experts with outputs y_i(x) for a regression problem approximating h(x), each with an error e_i. Then we can write:

    y_i(x) = h(x) + e_i

Thus the sum-of-squares error for network y_i is:

    E_i = \mathcal{E}\big[(y_i(x) - h(x))^2\big] = \mathcal{E}\big[e_i^2\big]

where \mathcal{E}[\cdot] denotes the expectation (average or mean value).
Thus the average error for the networks acting individually is:

    E_{AV} = \frac{1}{L}\sum_{i=1}^{L} E_i = \frac{1}{L}\sum_{i=1}^{L} \mathcal{E}\big[e_i^2\big]
Averaging Results: Mean Error for Committee
Suppose instead we form a committee by averaging the outputs y_i to get the committee prediction:

    y_{COM}(x) = \frac{1}{L}\sum_{i=1}^{L} y_i(x)

This estimate will have error:

    E_{COM} = \mathcal{E}\big[(y_{COM}(x) - h(x))^2\big]
            = \mathcal{E}\Big[\Big(\frac{1}{L}\sum_{i=1}^{L} y_i(x) - h(x)\Big)^2\Big]
            = \mathcal{E}\Big[\Big(\frac{1}{L}\sum_{i=1}^{L} e_i\Big)^2\Big]

Thus, by Cauchy's inequality:

    E_{COM} = \mathcal{E}\Big[\Big(\frac{1}{L}\sum_{i=1}^{L} e_i\Big)^2\Big] \le \frac{1}{L}\sum_{i=1}^{L}\mathcal{E}\big[e_i^2\big] = E_{AV}

Indeed, if the errors are uncorrelated (and zero-mean), E_COM = E_AV / L, but this is unlikely in practice, as errors tend to be correlated.
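A minimal numerical sketch of this result (assuming zero-mean Gaussian errors with an illustrative pairwise correlation rho): the committee error lies between E_AV/L and E_AV.

```python
import numpy as np

# Simulate L experts whose errors e_i are correlated zero-mean Gaussian noise,
# and compare the committee error with the average individual error.
# All parameters here are illustrative.
rng = np.random.default_rng(0)
L, n_samples, rho = 10, 100_000, 0.5          # experts, samples, error correlation

# Covariance with unit variance and pairwise correlation rho
cov = rho * np.ones((L, L)) + (1 - rho) * np.eye(L)
errors = rng.multivariate_normal(np.zeros(L), cov, size=n_samples)  # e_i per sample

E_av = np.mean(errors**2)                      # (1/L) sum_i E[e_i^2]
E_com = np.mean(errors.mean(axis=1)**2)        # E[((1/L) sum_i e_i)^2]

print(f"E_AV  = {E_av:.3f}")
print(f"E_COM = {E_com:.3f}")                  # <= E_AV; equals E_AV/L only if rho = 0
```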
Methods for constructing Diverse Neural Classifiers
• Different learning machines
• Different representation of patterns
• Partitioning the training set
• Different labeling in learning
Methods for constructing Diverse Neural Classifiers
• Different learning machines
1. Using different learning algorithms.
For example, we can use a multi-layer perceptron, a radial basis function network and a decision tree in an ensemble to make a composite system (Kittler et al. 1997); see the sketch below.
2. Using a given algorithm with different complexity.
For example, using networks with different numbers of nodes and layers, or nearest-neighbour classifiers with different numbers of prototypes.
3. Using different learning parameters.
For example, in an MLP, different initial weights, numbers of epochs, learning rates and momentum, or cost functions can affect the generalization of the network (Windeatt and Ghaderi 1998).
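A minimal scikit-learn sketch of point 1 (the dataset, the model settings and the RBF-kernel SVM standing in for an RBF network are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; in practice use your own feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three structurally different learning machines
members = [
    ("mlp", MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)),
    ("rbf", SVC(kernel="rbf", probability=True, random_state=0)),   # rough stand-in for an RBF-type classifier
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
]

# Non-trainable combiner: soft (averaged-probability) voting
committee = VotingClassifier(estimators=members, voting="soft")
committee.fit(X_train, y_train)
print("committee test accuracy:", committee.score(X_test, y_test))
```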
Methods for constructing Diverse Neural Classifiers
• Different representation of patterns
1. Using different feature sets or rule sets.
This method is particularly useful when more than one source of information is available (Kittler et al. 1997).
2. Using decimation techniques.
Even when only one feature set exists, we can produce different feature sets (Tumer and Ghosh 1996) to be used in training the different classifiers by removing different parts of this set.
3. Feature set partitioning in a multi-net system.
This method can be useful if patterns include parts which are independent, for example different parts of an identification form or of a human face. In this case, patterns can be divided into sub-patterns, each of which can be used to train a classifier.
One of the theoretical properties of neural networks is that they do not need special feature extraction for classification: the feature set can be the raw real-valued measurements (such as grey levels in image processing). The number of these measurements can be high, so if we feed all of them to a single network the curse of dimensionality will occur. To avoid this problem, they can be divided into independent parts, each used as the input of a sub-network (Rowley 1999, Rowley et al. 1998); see the sketch below.
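A minimal sketch of decimation / feature-set partitioning (subset sizes, members and data are illustrative): each sub-network sees only a subset of the raw measurements, and the committee averages their outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=60, random_state=0)

# Train each sub-network on a different random subset of the features
# (random-subspace style decimation); 3 members, 20 features each, chosen arbitrarily.
subsets = [rng.choice(X.shape[1], size=20, replace=False) for _ in range(3)]
members = []
for idx in subsets:
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    net.fit(X[:, idx], y)
    members.append((idx, net))

# Combine by averaging the members' class probabilities
probs = np.mean([net.predict_proba(X[:, idx]) for idx, net in members], axis=0)
committee_pred = probs.argmax(axis=1)
print("committee training accuracy:", (committee_pred == y).mean())
```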
Methods for constructing Diverse Neural Classifiers
• Partitioning the training set
1. Random partitioning, such as:
• Cross-validation (Friedman 1994) is a statistical method in which the training set is divided into k subsets of approximately equal size. A network is trained k times, each time with one of the subsets "left out", so by changing the "left-out" subset we obtain k different trained networks. If k equals the number of samples, the method is called "leave-one-out".
• Bootstrapping, in which the mechanism is essentially the same as cross-validation, but the subsets (patterns) are chosen randomly with replacement (Jain et al. 1987, Efron 1983). This method has been used in the well-known technique of Bagging (Breiman 1999, Bauer and Kohavi 1999, Maclin and Opitz 1997, Quinlan 1996), which has yielded significant improvement in many real problems; see the bagging sketch after this list.
2. Partitioning based on spatial similarities
Mixtures of Experts is based on the divide-and-conquer strategy (Jacobs 1995, Jacobs et al. 1991), in which the training set is partitioned according to spatial similarity rather than randomly as in the former methods. During training, different classifiers (experts) try to deal with different parts of the input space. A competitive gating mechanism, implemented by a gating network, localizes the experts.
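A minimal bagging sketch for the bootstrapping item above (base learner, ensemble size and data are illustrative): each member is trained on a bootstrap sample drawn with replacement, and the members are combined by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging: each member sees a bootstrap sample (drawn with replacement)
members = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap indices
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    members.append(tree)

# Majority vote over the members' predictions
votes = np.stack([m.predict(X) for m in members])        # shape (25, n_samples)
bagged_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("bagged training accuracy:", (bagged_pred == y).mean())
```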
Methods for constructing Diverse Neural Classifiers
• Partitioning the training set
3. Partitioning based on experts' ability
• Boosting is a general technique for constructing a classifier with good performance and small error from individual classifiers whose performance is only a little better than random guessing (Drucker 1993).
a. Boosting by filtering. This approach involves filtering the training examples by different versions of a weak learning algorithm. It assumes the availability of a large (in theory, infinite) source of examples, with the examples being either discarded or kept during training. An advantage of this approach is that it requires little memory compared to the other two approaches.
b. Boosting by subsampling. This second approach works with a training sample of fixed size. The examples are resampled according to a given probability distribution during training. The error is calculated with respect to the fixed training sample.
c. Boosting by reweighting. This third approach also works with a fixed training sample, but it assumes that the weak learning algorithm can receive "weighted" examples. The error is calculated with respect to the weighted examples.
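A minimal sketch of boosting by reweighting in the AdaBoost style (binary labels, decision stumps as weak learners, illustrative constants): the weak learner receives weighted examples, and misclassified examples are up-weighted after each round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                                  # labels in {-1, +1}

n_rounds = 20
w = np.full(len(X), 1.0 / len(X))                # example weights
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)             # weak learner sees weighted examples
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)    # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w = w * np.exp(-alpha * y * pred)            # increase weight of misclassified examples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the weighted vote of the weak learners
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("boosted training accuracy:", (np.sign(scores) == y).mean())
```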
Methods for constructing Diverse Neural Classifiers
• Different labeling in learning
If we redefine the classes for example by giving the same label to a group
of classes, the classifiers which are made by different defining of classes can
generalize diversely. Methods like Error Correction Output Codes, ECOC,
(Dietterich 1991, Dietterich and Bakiri 1991), binary clustering (Wilson 1996) and
pair wise coupling (HaKtie and Tibsliirani 1996) use this technique.
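A minimal ECOC sketch using scikit-learn's OutputCodeClassifier (base learner, code size and data are illustrative): each binary learner is trained on a relabelled grouping of the classes defined by one column of the code matrix.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each binary learner solves a different relabelling of the multi-class problem.
ecoc = OutputCodeClassifier(
    estimator=LogisticRegression(max_iter=1000),   # binary base learner
    code_size=2.0,                                 # code length = 2 * n_classes
    random_state=0,
)
ecoc.fit(X_train, y_train)
print("ECOC test accuracy:", ecoc.score(X_test, y_test))
```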
Combination strategies
Static combiners
Dynamic combiners
Combination strategies
Static combiners
- Non-trainable: Voting, Sum, Product, Max, Min, Averaging, Borda Counts
- Trainable: Weighted Averaging {DP-DT, Genetic, ...}, Stacked Generalization
Dynamic combiners
- Mixture of Experts
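A minimal sketch of two of the non-trainable combiners listed above, majority voting and Borda counts, applied to hypothetical per-class scores from three classifiers:

```python
import numpy as np

# Hypothetical per-class scores from 3 classifiers over 4 classes (rows = classifiers)
scores = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.25, 0.55, 0.15],
    [0.20, 0.40, 0.30, 0.10],
])

# Majority vote: each classifier votes for its top class
votes = scores.argmax(axis=1)
majority = np.bincount(votes, minlength=scores.shape[1]).argmax()

# Borda count: in each classifier's ranking, a class earns one point
# for every class ranked below it; points are summed over classifiers
ranks = scores.argsort(axis=1).argsort(axis=1)     # 0 = worst, n-1 = best
borda = ranks.sum(axis=0)

print("majority-vote decision:", majority)          # class 1 (two first-place votes)
print("Borda counts:", borda, "-> decision:", borda.argmax())
```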
Stacked Generalization
- In stacked generalization, the output pattern of an ensemble of
trained experts serves as an input to a second-level expert
[Diagram: a feature vector x feeds first-level experts Expert_1^0, ..., Expert_k^0; their outputs y_1, ..., y_k form the input of a second-level expert Expert^1, which produces the output signal T]
Wolpert (1992)
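A minimal stacked-generalization sketch with scikit-learn's StackingClassifier (first-level experts, second-level expert and data are illustrative): the first-level experts' cross-validated outputs become the input of the second-level expert.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# First-level experts
experts = [
    ("mlp", MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
]

# Second-level expert is trained on the cross-validated outputs of the first level
stack = StackingClassifier(estimators=experts,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
print("stacked training accuracy:", stack.score(X, y))
```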
Mixture of Experts

    o_i = \sum_j \theta_{ij} x_j,    i = 1, ..., N     (1)

    o_{g_i} = \sum_j w_{ij} x_j,    i = 1, ..., N     (2)

    g_i = \frac{\exp(o_{g_i})}{\sum_{j=1}^{N} \exp(o_{g_j})},    i = 1, ..., N     (3)

    o_T = \sum_i o_i g_i     (4)

    h_i = \frac{g_i \exp\big(-\frac{1}{2}(y - o_i)^T (y - o_i)\big)}{\sum_{j=1}^{N} g_j \exp\big(-\frac{1}{2}(y - o_j)^T (y - o_j)\big)}     (5)

    \Delta\theta_i = \eta\, h_i (y - o_i) x^T     (6)

    \Delta w_i = \eta\, (h_i - g_i) x^T     (7)
[Diagram: a feature vector x feeds experts Expert_1, ..., Expert_k with outputs o_1, ..., o_k; a gating network produces g_1, ..., g_k, and the gated sum of the expert outputs gives the output signal T]
Jacobs et al. (1991)
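A minimal NumPy sketch of update equations (1)-(7) for a single training pair (x, y), assuming linear experts and an illustrative learning rate and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out, eta = 4, 5, 2, 0.1         # experts, input dim, output dim, learning rate

theta = rng.normal(size=(N, d_out, d_in))  # expert weights theta_i, eq. (1)
w = rng.normal(size=(N, d_in))             # gating weights w_i, eq. (2)

def moe_step(x, y):
    """One online mixture-of-experts update for a single pattern (x, y)."""
    global theta, w
    o = theta @ x                                   # (1) expert outputs o_i
    og = w @ x                                      # (2) gating activations o_gi
    g = np.exp(og - og.max()); g /= g.sum()         # (3) softmax gate g_i
    o_T = (g[:, None] * o).sum(axis=0)              # (4) combined output o_T
    lik = g * np.exp(-0.5 * np.sum((y - o) ** 2, axis=1))
    h = lik / lik.sum()                             # (5) posterior h_i
    theta += eta * h[:, None, None] * (y - o)[:, :, None] * x[None, None, :]  # (6)
    w += eta * (h - g)[:, None] * x[None, :]        # (7)
    return o_T

x = rng.normal(size=d_in)
y = rng.normal(size=d_out)
print("mixture output o_T:", moe_step(x, y))
```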
Combining classifiers for face recognition
Zhao, Huang, Sun (2004) Pattern Recognition Letters
Combining classifiers for face recognition
Jang et al. (2004) LNCS
Combining classifiers for view-independent face recognition
    Q_i = \sum_j C_{ij}(x)     (1)   (sum rule)

    Q_i = \prod_j C_{ij}(x)     (2)   (product rule)

    Q_i = \min_j C_{ij}(x)     (3)   (min rule)

    Q_i = \max_j C_{ij}(x)     (4)   (max rule)

    Q_i = \mathrm{median}_j\, C_{ij}(x)     (5)   (median rule)

    Q_i = \sum_j w_j C_{ij}(x)     (6)   (weighted sum rule)
Kim, Kittler (2006) IEEE Trans. Circuits and Systems for Video Technology
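A minimal sketch of fusion rules (1)-(6) applied to a hypothetical score matrix C_ij(x) (rows = component classifiers j, columns = classes i); the weights in rule (6) are arbitrary:

```python
import numpy as np

# Hypothetical scores C_ij(x): rows = component classifiers j, columns = classes i
C = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.60, 0.10, 0.30],
])
w = np.array([0.5, 0.3, 0.2])                  # arbitrary classifier weights for rule (6)

Q_sum      = C.sum(axis=0)                     # (1) sum rule
Q_product  = C.prod(axis=0)                    # (2) product rule
Q_min      = C.min(axis=0)                     # (3) min rule
Q_max      = C.max(axis=0)                     # (4) max rule
Q_median   = np.median(C, axis=0)              # (5) median rule
Q_weighted = w @ C                             # (6) weighted sum rule

for name, Q in [("sum", Q_sum), ("product", Q_product), ("min", Q_min),
                ("max", Q_max), ("median", Q_median), ("weighted", Q_weighted)]:
    print(f"{name:9s} -> class {Q.argmax()}  (Q = {np.round(Q, 3)})")
```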