
Competent Undemocratic Committees
Włodzisław Duch, Łukasz Itert and Karol Grudziński
Department of Informatics,
Nicholas Copernicus University, Torun, Poland.
http://www.phys.uni.torun.pl/kmk
Motivation
Combining information from different models is known as ensemble learning, mixture of experts, voting classification algorithms, or committees of models.
Important and popular subject in machine learning, with
conferences and special issues of journals.
Useful for solving real problems, such as predicting the
glucose levels of diabetic patients – many classifiers are
available for this problem; 10 heads are wiser than one?
Committees:
1) improve on the accuracy of a single model,
2) decrease the variance, stabilizing the results.
Variability
Committees need different models.
Variability of committee models comes from:
1) Different samples taken from the same data
Crossvalidation training, boosting, bagging, arcing …
Bagging: train on bootstrap samples; randomly draw a fixed number of training vectors, with replacement, from the pool containing all training vectors (see the sketch below).
AdaBoost (Adaptive Boosting): assign weights to training instances,
higher for incorrectly classified.
Arcing: simplified weighting of the training vectors.
2) Bias of models, due to the change of their complexity.
The number of neurons, training parameters, pruning ...
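A minimal Python sketch of how bagging injects variability: each committee member is trained on its own bootstrap sample drawn with replacement from the full training pool. The function names are illustrative, and a scikit-learn-style fit interface is assumed.

```python
import numpy as np

def bootstrap_sample(X, y, n_samples, rng):
    """Draw n_samples training vectors with replacement from the full training pool."""
    idx = rng.integers(0, len(X), size=n_samples)
    return X[idx], y[idx]

def train_bagged_committee(X, y, make_model, m_models=10, seed=0):
    """Train m models, each on its own bootstrap sample, to create variability."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(m_models):
        Xb, yb = bootstrap_sample(X, y, n_samples=len(X), rng=rng)
        model = make_model()      # e.g. a fresh kNN or MLP instance (assumption)
        model.fit(Xb, yb)         # assumes a scikit-learn-style fit/predict interface
        committee.append(model)
    return committee
```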
Voting
Let P(Ci|X;Ml) be the posterior probability estimates given by models l = 1..m for classes i = 1..K.
How to determine the committee decision?
1. Majority voting – go with the crowd.
2. Average results of all models.
3. Select a model that gives the largest probability (highest confidence).
4. Set a threshold to select models with highest confidence and use majority voting for these models.
5. Make a linear combination of results.
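The five schemes above can be sketched as follows, assuming each model returns class-probability estimates P(Ci|X;Ml) collected in a (models × classes) array; the function names and the threshold value are illustrative, not part of the original method.

```python
import numpy as np

# probs: array of shape (m_models, K_classes) holding P(C_i | X; M_l) for one input X.

def majority_vote(probs):
    votes = np.argmax(probs, axis=1)                 # each model votes for its top class
    return np.bincount(votes, minlength=probs.shape[1]).argmax()

def average_vote(probs):
    return probs.mean(axis=0).argmax()               # average the probability estimates

def max_confidence(probs):
    l, i = np.unravel_index(probs.argmax(), probs.shape)
    return i                                         # class chosen by the single most confident model

def thresholded_majority(probs, threshold=0.8):
    confident = probs[probs.max(axis=1) >= threshold]    # keep only confident models
    return majority_vote(confident if len(confident) else probs)

def linear_combination(probs, W):
    # W: (m_models, K_classes) weights of a linear meta-model
    return (W * probs).sum(axis=0).argmax()
```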
More on voting
Each model does not need to be accurate for all data,
but should account well for a different subset of data.
• Krogh and Vedelsby: generalization error is small if highly
accurate classifiers disagreeing with each other are used.
• Xin Yao: diversify models, create negative correlation
between individual models and average results (no GA).
• Jacobs: mixture of experts, neural architecture with a
gating network to select the most competent model.
• Ortega et al.: a "referee" meta-model deciding which model should contribute to the final decision.
Competent Models: idea
Democratic voting: all models always try to contribute.
Undemocratic voting: only experts on local issues should vote.
For each model identify feature space regions where it is
incompetent.
Use a penalty factor to decrease the influence of incompetent models during voting.
Biological inspiration:
• Only a small subset of cortical modules is used by the brain for a given task.
• Incompetent modules are inhibited by rough evaluation of inputs at
the thalamic level.
• Conclusion: weights should be input dependent!
Competent Models: idea
Linear meta-model gives m additional parameters:
$$p(C_i \mid X; M) = \sum_{l=1}^{m} W_{il}\, P(C_i \mid X; M_l)$$
Models that have small weights may still be useful in some areas of the feature space. Use input-dependent weights:
$$p(C_i \mid X; M) = \sum_{l=1}^{m} W_{il}(X)\, P(C_i \mid X; M_l)$$
to inhibit voting of the model Ml for class Ci around specific X.
Similarity-Based Models use reference vectors; determine the areas of the input space where a given model is competent (makes few errors) and where it fails.
Committees of Competent Models
1. Optimize parameters for all models Ml , l = 1...m on the training set
using a cross-validation procedure.
2. For each model l = 1...m:
a) for all training vectors Ri generate predicted classes Cl(Ri);
b) if Cl(Ri) ≠ C(Ri), i.e. model Ml makes an error for vector Ri, determine the area of incompetence of the model, finding the distance di,j to the nearest vector that Ml has correctly classified;
c) set parameters of the incompetence factor F(||X−Ri||; Ml) in such a way that its value decreases significantly for ||X−Ri|| ≤ di,j/2.
3. The incompetence function of the model, F(X; Ml), is a product of the factors F(||X−Ri||; Ml) over all training vectors that have been incorrectly handled.
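A rough sketch of step 2, assuming Euclidean distances and a scikit-learn-style predict method; the helper name and return format are made up for illustration.

```python
import numpy as np

def incompetence_regions(model, X_train, y_train):
    """For each training vector the model misclassifies, return (R_i, d_ij / 2),
    where d_ij is the distance to the nearest correctly classified vector."""
    preds = model.predict(X_train)            # C_l(R_i) for all training vectors
    wrong = np.where(preds != y_train)[0]
    right = np.where(preds == y_train)[0]
    regions = []
    for i in wrong:
        d = np.linalg.norm(X_train[right] - X_train[i], axis=1)
        regions.append((X_train[i], d.min() / 2.0))   # radius where F should drop towards 0
    return regions
```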
CCM
$$F(X; M_l) \approx \begin{cases} 1 & \text{in all areas of correct classification} \\ 0 & \text{in all areas of incorrect classification} \end{cases}$$
Examples of F(||X−Ri||; Ml):
1. Gaussian-based: F(||X−Ri||; Ml) = 1 − G(||X−Ri||a; σi), where the coefficient a is used to flatten the function.
2. F(||X−Ri||; Ml) = 1/(1 + ||X−Ri||^(−a)), similar in shape to the Gaussian.
3. Sum of two logistic functions: σ(||X−Ri|| − di,j/2) + σ(−||X−Ri|| − di,j/2).
Vectors that cannot be correctly classified show up as errors that all models make, but some vectors that are erroneously classified by one model may be correctly handled by another.
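Possible realizations of the first two factor examples, plus the product from step 3; the exact Gaussian parameterization and the role of the flattening coefficient a are assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_incompetence(dist, sigma, a=1.0):
    """Example 1: F = 1 - G(.) -- close to 0 near the misclassified vector R_i,
    approaching 1 far away; `a` flattens the function (parameterization assumed)."""
    return 1.0 - np.exp(-(a * dist) ** 2 / (2.0 * sigma ** 2))

def rational_incompetence(dist, a=2.0, eps=1e-12):
    """Example 2: F = 1 / (1 + ||X - R_i||^(-a)), similar in shape to the Gaussian."""
    return 1.0 / (1.0 + (dist + eps) ** (-a))

def total_incompetence(x, regions, sigma_scale=1.0, a=1.0):
    """Product of factors over all incorrectly handled training vectors R_k
    (regions as returned by incompetence_regions above)."""
    F = 1.0
    for R_k, radius in regions:
        dist = np.linalg.norm(x - R_k)
        F *= gaussian_incompetence(dist, sigma=sigma_scale * radius, a=a)
    return F
```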
CCM voting
Use confidence factors to modify the linear weights.
$$p(C_i \mid X; M) = \sum_{l=1}^{m} W_{il}(X)\, P(C_i \mid X; M_l) = \sum_{l=1}^{m} W_{il}\, F(X; M_l)\, P(C_i \mid X; M_l)$$
Confidence factors are products over local regions:
$$F(X; M_l) = \prod_{k} F(\|X - R_k\|; M_l)$$
where the product runs over the training vectors Rk that model Ml has handled incorrectly.
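A sketch of the full CCM vote, reusing incompetence_regions and total_incompetence from the sketches above; predict_proba and the uniform default weights are assumptions (the slides leave the Wil to a linear meta-model).

```python
import numpy as np

def ccm_predict(x, committee, W=None):
    """Committee of Competent Models decision for a single input x.

    committee: list of (model, regions) pairs, where `regions` comes from
    incompetence_regions(); W is an optional (m, K) weight matrix of the
    linear meta-model (uniform weights if omitted)."""
    probs = np.stack([m.predict_proba(x[None, :])[0] for m, _ in committee])
    F = np.array([total_incompetence(x, regions) for _, regions in committee])
    if W is None:
        W = np.ones_like(probs) / len(committee)
    weighted = W * F[:, None] * probs        # W_il * F(X; M_l) * P(C_i | X; M_l)
    return weighted.sum(axis=0).argmax()
```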
Numerical experiment: data
Dataset: Telugu vowel data,
871 vectors, 3 features (dominant formants),
6 classes (vowels)
[1] Pal, S.K. and Mitra S. (1999)
Neuro-Fuzzy Pattern Recognition. J. Wiley, New York
Models included in the committee (kNN classifiers):
1. k=10, Euclidean (M1)
2. k=13, Manhattan (M2)
3. k=5, Euclidean (M3)
4. k=5, Manhattan (M4)
Accuracy of models
Accuracy of all models for each class, in %:
Class     M1     M2     M3     M4
C1        50.0   45.8   65.3   62.5
C2        88.8   91.0   87.6   89.9
C3        84.3   84.3   84.9   84.7
C4        85.4   84.8   90.1   88.1
C5        91.3   88.4   90.3   90.1
C6        90.6   92.8   90.1   90.4
Average   85.1   84.6   86.1   86.0
Comparison of results
Dataset: Telugu vowel (871 vectors, 3 features)
System             Accuracy        Remarks
CUC committee      88.2% ± 0.6%    2xCV (our calculation)
kNN                86.1% ± 0.6%    k=3, Euclidean, 2xCV (our calculation)
MLP                84.6%           2xCV, 10 neurons [1]
Fuzzy MLP          84.2%           2xCV, 10 neurons [1]
Bayes Classifier   79.2%           2xCV [1]
Fuzzy Kohonen      73.5%           2xCV [1]
Comparison of committee results
Results for Telugu vowel data:
Class     Majority   Confidence   Combination   + Competence
C1        54.2       58.3         62.5          65.3
C2        88.8       88.8         88.8          89.9
C3        84.3       84.9         84.3          84.9
C4        86.8       88.1         88.1          88.1
C5        92.3       92.8         92.3          93.8
C6        90.6       92.2         91.7          93.3
Average   85.9       87.0         87.0          88.2
Conclusions
• Assigning competence factors in various voting
procedures is an attractive idea.
• Learning becomes modular: each model specializes in
different subproblems.
Some ideas:
• Combine DT, kNN, NN models.
• Use CCM with adaptive boosting.
• Use ROC curves to increase the AUC by providing a convex combination of individual ROC curves.
• Diversify models by adding explicit negative correlation.
• Use a constructive approach: add to the committee new models that correctly classify the remaining vectors.