Learning Bayesian Networks through evolution

Rotem Golan
Department of Computer Science
Ben-Gurion University of the Negev, Israel
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
Competition overview
A database of 60 music performers has been prepared for the competition.
The material is divided into six categories: classical music, jazz, blues, pop, rock and heavy metal.
For each of the performers, 15-20 music pieces have been collected.
All music pieces are partitioned into 20 segments and parameterized.
The feature vector consists of 191 parameters.
Competition overview (Cont.)
 Our goal is to estimate the music genre of newly given
fragments of music tracks.
 Input:
 A training set of 12,495 vectors and their genre
 A test set of 10,269 vectors without their genre
 Output: 10,269 labels (Classical, Jazz, Rock, Blues,
Metal or Pop). One for each vector in the test set.
 The metric used for evaluating the solutions is
standard accuracy, i.e. the ratio of the correctly
classified samples to the total number of samples.
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
A long story
You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but it also responds on occasion to minor earthquakes.
You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm.
John always calls when he hears the alarm, but he sometimes confuses the telephone ringing with the alarm and calls then, too.
Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether.
Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.
A short representation
Observations
In our algorithm, all the values of the network are known except the genre value, which we would like to estimate.
The variables in our algorithm are continuous rather than discrete (except the genre variable).
We divide the possible values of each variable into fixed-size intervals.
The number of intervals changes throughout the evolution.
We refer to this process as the discretization of the variable.
We refer to the Conditional Probability Table of each variable (node) as its CPT.
Naïve Bayesian Network
Bayesian Network construction
Once we have determined the chosen variables (their number and identity), their fixed discretization and the structure of the graph, we can easily compute the CPT values for each node in the graph from the training set.
For each vector in the training set, we update all the network's CPTs by increasing the appropriate entry by one.
After this process, we divide each value by the sum of its row (normalization).
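A minimal sketch of this counting-and-normalization step for a single feature node, assuming the feature values have already been discretized into interval indices. Class and method names are illustrative, not the competition code; the real program handles 191 features and six genres.

```java
// Build the CPT of one feature node of a Naive Bayesian Network:
// count training vectors per (genre, interval), then normalize each row.
public class CptBuilder {
    static final int GENRES = 6;

    /** cpt[g][b] = P(feature falls in interval b | genre g), estimated by counting. */
    static double[][] buildCpt(int[] genreLabels, int[] intervalIndices, int numIntervals) {
        double[][] cpt = new double[GENRES][numIntervals];
        for (int i = 0; i < genreLabels.length; i++) {
            cpt[genreLabels[i]][intervalIndices[i]] += 1.0;   // increase the appropriate entry by one
        }
        for (int g = 0; g < GENRES; g++) {                    // normalize each row by its sum
            double rowSum = 0.0;
            for (double c : cpt[g]) rowSum += c;
            for (int b = 0; b < numIntervals; b++) {
                // a zero entry could instead be replaced by cpt_min/10, as mentioned in the results slide
                cpt[g][b] = rowSum > 0 ? cpt[g][b] / rowSum : 0.0;
            }
        }
        return cpt;
    }

    public static void main(String[] args) {
        int[] labels    = {0, 0, 1, 1, 1};      // toy genre labels
        int[] intervals = {0, 1, 1, 2, 2};      // toy discretized feature values
        System.out.println(java.util.Arrays.deepToString(buildCpt(labels, intervals, 3)));
    }
}
```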
Exact Inference in Bayesian
Networks
For each vector in the test set, we compute six different probabilities (multiplying the appropriate entries of all the network's CPTs) and choose the highest one as the genre of this vector.
Each probability corresponds to a different assumption on the value of the genre variable (Rock, Pop, Blues, Jazz, Classical and Metal).
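A minimal sketch of this inference step, assuming CPTs shaped as in the previous sketch (rows = genres, columns = intervals). The uniform genre prior is an assumption of this sketch, not something stated on the slides.

```java
// For one test vector, multiply the matching CPT entries under each genre
// assumption and keep the genre with the highest product.
public class NaiveBayesInference {
    static int predictGenre(double[][][] cpts, int[] vectorIntervals) {
        int best = -1;
        double bestScore = -1.0;
        for (int g = 0; g < 6; g++) {                 // one assumption per genre
            double score = 1.0;
            for (int f = 0; f < cpts.length; f++) {   // multiply over all feature nodes
                score *= cpts[f][g][vectorIntervals[f]];
            }
            if (score > bestScore) { bestScore = score; best = g; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][][] cpts = new double[2][6][2];      // two toy features, 6 genres, 2 intervals
        for (double[][] t : cpts) for (double[] row : t) { row[0] = 0.5; row[1] = 0.5; }
        cpts[0][3][1] = 0.9; cpts[0][3][0] = 0.1;     // genre 3 strongly prefers interval 1 of feature 0
        System.out.println(predictGenre(cpts, new int[]{1, 0}));  // -> 3
    }
}
```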
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
Preprocessing
I divided the training set into two sets.
A training set, used for constructing each Bayesian Network in the population.
A validation set, used for computing the fitness of each network in the population.
These sets have the same number of vectors for each category (Rock vectors, Pop vectors, etc.).
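A minimal sketch of such a stratified split: every genre contributes the same fraction of its vectors to each set. The 80/20 proportion and the fixed random seed are assumptions of this sketch, not figures from the slides.

```java
import java.util.*;

// Split instance indices so that each genre is represented equally in both sets.
public class StratifiedSplit {
    public static void main(String[] args) {
        String[] labels = {"Rock", "Rock", "Pop", "Pop", "Jazz", "Jazz"};
        Map<String, List<Integer>> byGenre = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byGenre.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        List<Integer> train = new ArrayList<>(), validation = new ArrayList<>();
        for (List<Integer> indices : byGenre.values()) {
            Collections.shuffle(indices, new Random(42));
            int cut = (int) (indices.size() * 0.8);          // same fraction taken from each genre
            train.addAll(indices.subList(0, cut));
            validation.addAll(indices.subList(cut, indices.size()));
        }
        System.out.println("train=" + train + " validation=" + validation);
    }
}
```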
The three dimensions of the
evolutionary algorithm
 The three dimensions are:
The number of variables.
The choice of variables.
The fixed discretization of the variables.
 Every network in the population is a Naïve Bayesian
Network, which means that its structure is already
determined.
Fitness function
In order to compute the fitness of a network, we estimate the genre of each vector in the validation set and compare it to its known genre.
The metric used for computing the fitness is standard accuracy, i.e. the ratio of correctly classified vectors to the total number of vectors in the validation set.
Selection
In each generation, we choose at most population_size/2 different networks.
We prefer networks that have the highest fitness and are distinct from each other.
After choosing these networks, we use them to build a full-sized population by mutating each one of them.
We use bitwise mutation to do so.
Notice that we may use a mutated network to generate a new mutated network.
Mutation
Bitwise mutation, illustrated on a six-variable example (BitSet marks the selected variables, Dis holds the number of discretization intervals of each selected variable):

           BitSet         Dis
Parent:    1 1 0 0 0 1    4  9  0  0  0 18
Child:     1 0 1 0 0 1   10  0 15  0  0 18
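A minimal sketch of this mutation, using the parent from the slide. The mutation rate and the allowed interval range are illustrative parameters, not the values used in the experiments.

```java
import java.util.BitSet;
import java.util.Random;

// Bitwise mutation: each feature's "selected" bit may flip, and a selected
// feature may also receive a new interval count from the allowed range.
public class Mutation {
    static void mutate(BitSet selected, int[] intervals, int numFeatures,
                       double rate, int minBins, int maxBins, Random rnd) {
        for (int f = 0; f < numFeatures; f++) {
            if (rnd.nextDouble() < rate) {
                selected.flip(f);                              // add or drop the feature
            }
            if (selected.get(f)) {
                if (intervals[f] == 0 || rnd.nextDouble() < rate) {
                    intervals[f] = minBins + rnd.nextInt(maxBins - minBins + 1);
                }
            } else {
                intervals[f] = 0;                              // unused features carry no intervals
            }
        }
    }

    public static void main(String[] args) {
        BitSet selected = new BitSet(6);
        selected.set(0); selected.set(1); selected.set(5);
        int[] intervals = {4, 9, 0, 0, 0, 18};                 // the parent from the slide
        mutate(selected, intervals, 6, 0.3, 2, 20, new Random());
        System.out.println(selected + " " + java.util.Arrays.toString(intervals));
    }
}
```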
Crossover
Single point crossover, illustrated on the same six-variable representation (here the crossover point is after the second variable):

           BitSet         Dis
Parent 1:  1 1 0 0 0 1    4  9  0  0  0 18
Parent 2:  1 1 1 1 0 1    5  5 10 10  0 15
Child 1:   1 1 1 1 0 1    4  9 10 10  0 15
Child 2:   1 1 0 0 0 1    5  5  0  0  0 18
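A minimal sketch of this crossover, reproducing the table above. The BitSet is simplified to an int array of 0/1 bits to keep the example short.

```java
// Single point crossover: both the bit vector and the discretization array are
// cut at the same point and the two children swap the tails.
public class Crossover {
    static int[][] crossover(int[] bitsA, int[] disA, int[] bitsB, int[] disB, int cut) {
        int n = bitsA.length;
        int[] childBits1 = new int[n], childDis1 = new int[n];
        int[] childBits2 = new int[n], childDis2 = new int[n];
        for (int i = 0; i < n; i++) {
            boolean fromA = i < cut;                // head from one parent, tail from the other
            childBits1[i] = fromA ? bitsA[i] : bitsB[i];
            childDis1[i]  = fromA ? disA[i]  : disB[i];
            childBits2[i] = fromA ? bitsB[i] : bitsA[i];
            childDis2[i]  = fromA ? disB[i]  : disA[i];
        }
        return new int[][]{childBits1, childDis1, childBits2, childDis2};
    }

    public static void main(String[] args) {
        int[] bitsA = {1, 1, 0, 0, 0, 1}, disA = {4, 9, 0, 0, 0, 18};   // Parent 1
        int[] bitsB = {1, 1, 1, 1, 0, 1}, disB = {5, 5, 10, 10, 0, 15}; // Parent 2
        for (int[] part : crossover(bitsA, disA, bitsB, disB, 2)) {     // cut point from the slide
            System.out.println(java.util.Arrays.toString(part));
        }
    }
}
```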
Results (Cont.)
Model - Naive Bayesian
Population size - 40
Generations - 400
Variables - [1,191]
Discretization - [5,15]
"Zeroes" = cpt_min/10
First population score - 0.7878
Best score - 0.8415
Test set score - 0.7323
Website's score:
  Preliminary result - 0.7317
  Final result - 0.73024
[Figure: fitness of the best network vs. generation (0-400), fitness axis 0.77-0.85]
Observation
Notice that there is approximately a 10% difference between my score and the website's score.
We will discuss this issue (overfitting) later on.
Adding the fourth dimension
The fourth dimension is the structure of the Bayesian Network.
Now the population includes different Bayesian Networks, meaning networks with different structures, variable choices, variable counts and discretization arrays.
Evolution operations
The selection process is the same as in the previous algorithm.
The crossover and mutation are similar:
First, we start like the previous algorithm (handling the BitSet and the discretization array).
Then, we add all the edges we can from the parent (mutation) or parents (crossover) to the child's graph.
Finally, we make sure that the child's graph is a connected acyclic graph.
Results
Model - Bayesian Network
Population size - 20
Generations - crashed on generation 104
Variables - [1,191]
Discretization - [2,6]
First population score - 0.4920
Best score - ~0.8559
Website's score - none (the run crashed)
[Figure: fitness of the best network vs. generation (0-104), fitness axis 0.5-0.9]
Memory problems
The program was executed on amdsrv3, with a 4.5 GB memory limit.
Even though the discretization interval is [2,6], the program crashed due to a Java heap space error.
As a result, I decided to decrease the population size from 20 to 10.
Results (Cont.)
Model - Bayesian Network
Population size - 10
Generations - 800
Variables - [1,191]
Discretization - [2,10]
First population score - 0.5463
Best score - 0.8686
Website's preliminary score - 0.7085
[Figure: fitness of the best network vs. generation (0-800), fitness axis 0.5-0.9]
Results (Cont.)
Model - Bayesian Network
Population size - 10
Generations - 800
Variables - [1,191]
Discretization - [2,20]
First population score - 0.5978
Best score - 0.8708
Website's preliminary score - 0.6972
[Figure: fitness of the best network vs. generation (0-800), fitness axis 0.5-0.9]
Overfitting
As we increase the discretization interval, my score increases while the website's score decreases.
One explanation is that enlarging the search space may cause the algorithm to find patterns that are strongly correlated with the specific input data I received, while having no correlation at all with the real-life data.
One possible solution is to use k-fold cross-validation.
Final competition scores
My previous score
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
ECOC
ECOC stands for Error Correcting Output Codes.
ECOC is a technique for using binary classification algorithms to solve multi-class problems.
Each class is assigned a unique code word - a binary string of length n.
A set of n binary classifiers is then trained, one for each bit position in the code word.
New instances are classified by evaluating all n binary classifiers to generate an n-bit string, which is then compared to all the code words using Hamming distance.
ECOC properties
The ECOC codes are generated so that their pairwise Hamming distances are maximized.
In general, a code with minimum pairwise Hamming distance d is able to correct up to ⌊(d − 1)/2⌋ individual bit (classifier) errors.
In our case:
  k = 6
  2^(k−1) − 1 = 31 classifiers
  d = 16
  ⌊(d − 1)/2⌋ = 7
ECOC (Cont.)
The 31-bit code word assigned to each genre (classifier columns 1-31):

Rock:      1111111111111111111111111111111
Pop:       0000000000000000111111111111111
Blues:     0000000011111111000000001111111
Jazz:      0000111100001111000011110000111
Classical: 0011001100110011001100110011001
Metal:     0101010101010101010101010101010
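A minimal sketch of ECOC decoding with the code words above: the 31 binary classifier outputs are compared to every genre's code word, and the genre with the smallest Hamming distance wins. Producing the 31-bit prediction string itself is outside this sketch.

```java
public class EcocDecoder {
    static final String[] GENRES = {"Rock", "Pop", "Blues", "Jazz", "Classical", "Metal"};
    static final String[] CODE_WORDS = {
        "1111111111111111111111111111111",
        "0000000000000000111111111111111",
        "0000000011111111000000001111111",
        "0000111100001111000011110000111",
        "0011001100110011001100110011001",
        "0101010101010101010101010101010"
    };

    // Return the genre whose code word is closest to the predicted bit string.
    static String decode(String prediction) {
        int bestGenre = 0, bestDistance = Integer.MAX_VALUE;
        for (int g = 0; g < CODE_WORDS.length; g++) {
            int distance = 0;
            for (int i = 0; i < prediction.length(); i++) {
                if (prediction.charAt(i) != CODE_WORDS[g].charAt(i)) distance++;
            }
            if (distance < bestDistance) { bestDistance = distance; bestGenre = g; }
        }
        return GENRES[bestGenre];
    }

    public static void main(String[] args) {
        // the Jazz code word with its first 4 bits flipped is still decoded as Jazz
        // (minimum distance 16 corrects up to 7 classifier errors)
        System.out.println(decode("1111111100001111000011110000111"));
    }
}
```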
Entropy
Recursive minimal entropy
partitioning (Fayyad & Irani - 1993)
The goal of this algorithm is to discretize all numeric attributes in the dataset into nominal attributes.
The discretization is performed by selecting a bin boundary that minimizes the entropy of the induced partitions.
The method is then applied recursively to both new partitions until a stopping criterion is reached.
Fayyad and Irani make use of the Minimum Description Length principle to determine the stopping criterion.
RMEP (Cont.)
Given a set of instances S, a feature A, and a partition boundary T, the class information entropy of the partition induced by T is given by:

E(A, T; S) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

Ent(S) = − Σ (j = 1..C) p(S, j) · log2(p(S, j))

For a given feature A, the boundary Tmin which minimizes the entropy function over all possible partition boundaries is selected as a binary discretization boundary.
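A minimal sketch of this boundary search for one feature. Candidate cut points are taken midway between consecutive sorted values, which is an assumption of this sketch; the method and class names are illustrative.

```java
import java.util.*;

// Score every candidate cut T by E(A,T;S) and return the minimizing boundary.
public class EntropySplit {
    static double entropy(List<Integer> labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);
        double ent = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            ent -= p * (Math.log(p) / Math.log(2));
        }
        return ent;
    }

    static double bestBoundary(double[] values, int[] labels) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> values[i]));
        double bestCut = Double.NaN, bestE = Double.MAX_VALUE;
        for (int cut = 1; cut < order.length; cut++) {
            List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
            for (int i = 0; i < order.length; i++) {
                (i < cut ? left : right).add(labels[order[i]]);
            }
            // E(A,T;S) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)
            double e = (left.size() * entropy(left) + right.size() * entropy(right)) / values.length;
            if (e < bestE) {
                bestE = e;
                bestCut = (values[order[cut - 1]] + values[order[cut]]) / 2.0;
            }
        }
        return bestCut;
    }

    public static void main(String[] args) {
        double[] feature = {0.1, 0.2, 0.3, 0.8, 0.9};
        int[] genre      = {0,   0,   0,   1,   1};
        System.out.println(bestBoundary(feature, genre));   // -> 0.55
    }
}
```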
RMEP (Cont.)
The stopping criterion is:

Gain(A, T; S) < log2(N − 1) / N + Δ(A, T; S) / N

Gain(A, T; S) = Ent(S) − E(A, T; S)

Δ(A, T; S) = log2(3^k − 2) − [k · Ent(S) − k1 · Ent(S1) − k2 · Ent(S2)]

where N is the number of instances in S, and k, k1, k2 are the numbers of classes present in S, S1 and S2 respectively.
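A minimal sketch of this MDL stopping test. The entropies and class counts are passed in directly to keep the example short; the helper name `acceptSplit` is illustrative.

```java
// Accept the split at T only if the information gain exceeds the MDL threshold.
public class MdlpStop {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static boolean acceptSplit(int n, double entS, double entS1, double entS2,
                               double eAts, int k, int k1, int k2) {
        double gain  = entS - eAts;                               // Gain(A,T;S) = Ent(S) - E(A,T;S)
        double delta = log2(Math.pow(3, k) - 2) - (k * entS - k1 * entS1 - k2 * entS2);
        return gain > (log2(n - 1) + delta) / n;                  // otherwise recursion stops
    }

    public static void main(String[] args) {
        // toy numbers: a perfectly clean split of 100 instances over 2 classes is accepted
        System.out.println(acceptSplit(100, 1.0, 0.0, 0.0, 0.0, 2, 1, 1));   // -> true
    }
}
```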
Results
These improvements resulted in a standard accuracy score of 75.17% on the test set.
This score is for a population of 40 networks and 40 generations.
[Figure: standard accuracy (%) on the test set vs. number of generations (0-140), accuracy axis 70.5-75.5]
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
Example of a decision tree
C4.5 algorithm
Check for the base cases, which are:
  All instances are from the same class
  gain ratio < some threshold
For each attribute a:
  Find the maximal gain ratio from splitting on a, using all possible thresholds
Let a_best be the attribute with the highest gain ratio
Create a decision node that splits on a_best
Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node
Splitting criteria

Gain(D, T) = Info(D) − Σ (i = 1..k) (|Di| / |D|) · Info(Di)

Info(D) = − Σ (j = 1..C) p(D, j) · log2(p(D, j))

Split(D, T) = − Σ (i = 1..k) (|Di| / |D|) · log2(|Di| / |D|)

gain ratio = information gain / split information

Notice that the split information tends to increase with the number of outcomes of a split.
As a result, splitting on a variable with a maximal number of bin boundaries will get a low gain ratio.
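A minimal sketch of these formulas. Labels are passed as one int array per induced subset; the helper names `info` and `gainRatio` are illustrative.

```java
import java.util.*;

// gain ratio = information gain / split information for a candidate partition of D.
public class GainRatio {
    static double info(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    static double gainRatio(int[] all, int[][] parts) {
        double gain = info(all), split = 0.0;
        for (int[] part : parts) {
            double w = (double) part.length / all.length;
            gain  -= w * info(part);                        // Gain(D,T)
            split -= w * Math.log(w) / Math.log(2);         // Split(D,T)
        }
        return split == 0.0 ? 0.0 : gain / split;
    }

    public static void main(String[] args) {
        int[] all = {0, 0, 0, 1, 1, 1};
        int[][] clean = {{0, 0, 0}, {1, 1, 1}};             // clean two-way split
        int[][] wide  = {{0}, {0}, {0}, {1}, {1}, {1}};     // many outcomes: split info penalizes it
        System.out.println(gainRatio(all, clean) + " vs " + gainRatio(all, wide));
    }
}
```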
Results of c4.5 alone
As noted above, C4.5 uses binary discretization.
Using C4.5 alone (after ECOC) yields a standard accuracy score of 74.15% on the test set.
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
A new prediction model
For each instance v in the test set:
  Let prediction be a 31-bit string
  For i = 0 to 30:
    If trained_bayes_confidence > 0.93:
      prediction[i] = trained_bayes_prediction
    Else:
      prediction[i] = C4.5_prediction
  Let label be the genre code word closest to prediction according to their Hamming distance
  Return label
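A minimal sketch of this combined model. The per-bit predictions and confidences of the two trained classifiers are passed in as plain arrays for illustration; producing them is outside the sketch.

```java
// For each ECOC bit, keep the Bayes answer when its confidence exceeds 0.93,
// otherwise fall back to C4.5, then decode the bit string by Hamming distance.
public class CombinedModel {
    static int decode(int[] bayesBits, double[] bayesConfidence, int[] c45Bits, String[] codeWords) {
        StringBuilder prediction = new StringBuilder();
        for (int i = 0; i < codeWords[0].length(); i++) {
            prediction.append(bayesConfidence[i] > 0.93 ? bayesBits[i] : c45Bits[i]);
        }
        int best = 0, bestDistance = Integer.MAX_VALUE;
        for (int g = 0; g < codeWords.length; g++) {
            int d = 0;
            for (int i = 0; i < prediction.length(); i++) {
                if (prediction.charAt(i) != codeWords[g].charAt(i)) d++;
            }
            if (d < bestDistance) { bestDistance = d; best = g; }
        }
        return best;   // index into the genre list
    }

    public static void main(String[] args) {
        String[] codeWords = {"000", "011", "101"};              // toy 3-bit code for 3 classes
        int[] bayes = {0, 1, 1};
        double[] conf = {0.99, 0.50, 0.99};                      // bit 1 falls back to C4.5
        int[] c45 = {1, 1, 0};
        System.out.println(decode(bayes, conf, c45, codeWords)); // -> 1 ("011")
    }
}
```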
Results
The new prediction model yields a standard accuracy score of 76.85% on the test set, which puts me in approximately 39th place out of 144 active teams.
Registered teams: 293 (with 358 members)
Active teams: 144
Total number of submitted solutions: 8910
This score is for a population of 40 networks and 60 generations.
Results (Cont.)
[Figure: standard accuracy (%) on the test set vs. number of generations (0-200), accuracy axis 75-77]
My new score
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
Boosting (AdaBoost)
I have tried to use boosting as a tool for building an ensemble of Naïve Bayesian networks.
Each of these networks was trained with different training set weights, according to the AdaBoost algorithm.
Intuitively, AdaBoost updates the training set weights according to the performance of the previously trained networks. The algorithm reduces the weights of instances which were correctly predicted by previous networks and increases the weights of instances which were not predicted correctly.
AdaBoost - training
Choose T
Initialization: D1(i) = 1/m  (i = 1..m)
For t = 1 to T:
  Find the classifier ht that minimizes εt (using Dt)
  Compute αt
  Compute Dt+1 (using ht and αt)
  If εt ≥ 0.5:
    break
  Else:
    Add ht to the ensemble classifier with αt as its factor
    t++
AdaBoost - testing
We use H(x) to decide the genre of the test instances:

H(x) = sign( Σ (t = 1..T) αt · ht(x) )

I have also tried combining all the different classifiers in a sequential order with the C4.5 tree as the final classifier (just like the new prediction model described before, but with T networks and one decision tree), but the result was not as good as I expected.
AdaBoost - parameters
t = 1..T

εt = Σ (i = 1..m) Dt(i) · I(yi ≠ ht(xi))   (prerequisite: εt < 0.5)

αt = 0.5 · ln((1 − εt) / εt)

Dt+1(i) = Dt(i) · e^(−αt · yi · ht(xi)) / Zt

Zt = Σ (i = 1..m) Dt(i) · e^(−αt · yi · ht(xi))   (the normalization factor)
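A minimal sketch of one boosting round using these formulas, for the binary case with labels and predictions in {-1, +1}. The method name `update` is illustrative.

```java
// One AdaBoost round: weighted error, classifier weight alpha, and the
// re-normalized distribution D_{t+1}.
public class AdaBoostRound {
    /** Returns alpha_t and overwrites weights with D_{t+1}; returns NaN if eps_t >= 0.5. */
    static double update(double[] weights, int[] labels, int[] predictions) {
        double eps = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (labels[i] != predictions[i]) eps += weights[i];          // eps_t
        }
        if (eps >= 0.5) return Double.NaN;                               // prerequisite violated
        double alpha = 0.5 * Math.log((1 - eps) / eps);                  // alpha_t
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] *= Math.exp(-alpha * labels[i] * predictions[i]); // D_t(i) * e^{-alpha y_i h_t(x_i)}
            z += weights[i];
        }
        for (int i = 0; i < weights.length; i++) weights[i] /= z;        // divide by Z_t
        return alpha;
    }

    public static void main(String[] args) {
        double[] d = {0.25, 0.25, 0.25, 0.25};
        int[] y = {1, 1, -1, -1};
        int[] h = {1, 1, -1, 1};                                         // one mistake on the last instance
        double alpha = update(d, y, h);
        System.out.println("alpha=" + alpha + " D2=" + java.util.Arrays.toString(d));
    }
}
```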
AdaBoost
Note that the equation used to update the distribution Dt is constructed so that:

−αt · yi · ht(xi) < 0 if yi = ht(xi),  and  −αt · yi · ht(xi) > 0 if yi ≠ ht(xi)

This means that AdaBoost is adaptive in the sense that subsequent classifiers are tweaked in favor of those instances misclassified by previous classifiers.
K-fold Cross validation
 Cross-validation is a technique for assessing how the
results of a statistical analysis will generalize to an
independent data set
 In k-fold cross-validation, the original sample is
randomly partitioned into k subsamples.
 Of the k subsamples, a single subsample is retained as
the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data.
 The cross-validation process is then repeated k times
(the folds), with each of the k subsamples used exactly
once as the validation data.
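A minimal sketch of the k-fold index handling described above. Model training and scoring are left as placeholders; the sizes and seed are illustrative.

```java
import java.util.*;

// Shuffle the instances, partition them into k folds, and let each fold serve
// once as the validation set while the remaining k-1 folds form the training set.
public class KFold {
    public static void main(String[] args) {
        int n = 20, k = 5;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(7));

        for (int fold = 0; fold < k; fold++) {
            List<Integer> validation = new ArrayList<>(), train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                (i % k == fold ? validation : train).add(indices.get(i));
            }
            // train the model on `train`, score it on `validation`, then average the k scores
            System.out.println("fold " + fold + ": validation=" + validation);
        }
    }
}
```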
K-fold Cross validation (Cont.)
I expected that using cross-validation would reduce the overfitting I had and, as a result, yield a more generalized prediction model with a higher score on the test set.
Unfortunately, running cross-validation with K = 5, 10 on the genre data set resulted in a score lower than the best one I had obtained so far.
 Competition overview
 What is a Bayesian Network?
 Learning Bayesian Networks through evolution
ECOC and Recursive entropy-based discretization
 Decision trees and C4.5
 A new prediction model
 Boosting and K-fold cross validation
 References
References
Artificial Intelligence: A Modern Approach (2nd edition) - Stuart Russell and Peter Norvig
Contest website: http://tunedit.org/challenge/musicretrieval/genres
Improved Use of Continuous Attributes in C4.5 (1996) - J.R. Quinlan
Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning (1993) - U.M. Fayyad & K.B. Irani
Solving Multiclass Learning Problems via Error-Correcting Output Codes (1995) - T.G. Dietterich & G. Bakiri
Supervised and Unsupervised Discretization of Continuous Features (1995) - J. Dougherty, R. Kohavi & M. Sahami
Boosting and Naive Bayesian Learning (1997) - C. Elkan