Case Study 2: Comparing Machine Learning
Algorithms with the Wilcoxon Signed Rank Test
Noel Lopes
Polytechnic Institute of Guarda
University of Coimbra
Workshop on Statistical Hypothesis Tests
for Engineering Applications
July 18, 2011
Workshop on Statistical Hypothesis Tests for Engineering
Comparing Algorithms with the Wilcoxon Signed Rank Test
Outline
• Problem definition
• Preliminary data analysis
• Selection of the test
• Applying the Wilcoxon test
• Results and Conclusions
Problem definition
• We developed a new learning algorithm for classification.
• Compare the new algorithm with other established and
successful algorithms.
Benchmark selection
• Avoid bias
• Broad spectrum coverage
• Constraints
  • Presence of missing values
  • Number of inputs (features)
  • Number of samples (instances)
• Facilitate comparison with other algorithms
  • How well is the benchmark known in the literature?
Benchmarks
Database               Samples   Inputs   Classes
Breast cancer              569       30         2
Ecoli                      336        7         8
German credit data        1000       59         2
Glass identification       214        9         6
Haberman’s survival        306        3         2
Heart - Statlog            270       20         2
Ionosphere                 351       34         2
Iris                       150        4         3
Pima diabetes              768        8         2
Sonar                      208       60         2
Tic-Tac-Toe                958        9         2
Vehicle                    946       18         4
Wine                       178       13         3
Yeast                     1484        8        10
Experiments conducted
• Each experiment was run 30 times, each time using a different
random 5-fold stratified cross-validation partition of the data.
[Figure: the data split into Partitions 1 to 5]
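The protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiments: the function names, the seeding scheme (one seed per run) and the example labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin, continuing where the
        # previous class stopped so that fold sizes stay balanced.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds

def repeated_partitions(labels, k=5, repeats=30):
    """Yield (run, fold, train_indices, test_indices) for every split."""
    for run in range(repeats):
        folds = stratified_folds(labels, k, seed=run)  # new partition per run
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield run, f, train, test

# Hypothetical labels: 70 samples of class 0 and 30 of class 1.
labels = [0] * 70 + [1] * 30
splits = list(repeated_partitions(labels))
print(len(splits))  # 30 runs x 5 folds = 150 train/test splits
```

Each run reshuffles the samples within every class, so the 30 runs use 30 different partitions while every fold keeps roughly the class proportions of the full dataset.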
Metrics - Confusion Matrix
                       Predicted positive     Predicted negative
Real class positive    True positive (tp)     False negative (fn)
Real class negative    False positive (fp)    True negative (tn)
Metrics
$\text{precision} = \frac{tp}{tp + fp}$
$\text{recall} = \frac{tp}{tp + fn}$
$F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
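As a rough illustration, the three metrics can be computed from the confusion-matrix counts as below. The macro-average is taken here as the unweighted mean of the per-class F-measures, which is an assumption (the slides do not spell out the averaging), as are the function names, the zero-denominator convention and the example labels.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def macro_f_measure(real, predicted):
    """Unweighted mean of the one-vs-rest F-measure of each class."""
    scores = []
    for c in sorted(set(real)):
        tp = sum(r == c and p == c for r, p in zip(real, predicted))
        fp = sum(r != c and p == c for r, p in zip(real, predicted))
        fn = sum(r == c and p != c for r, p in zip(real, predicted))
        scores.append(precision_recall_f(tp, fp, fn)[2])
    return sum(scores) / len(scores)

# Hypothetical three-class example (not data from the experiments).
real      = ["a", "a", "a", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "b", "c"]
print(macro_f_measure(real, predicted))
```

In this example classes "a" and "b" each have an F-measure of 0.8 and class "c" has 1.0, so the macro-average is 2.6 / 3.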
Preliminary data analysis
Classification performance (F-measure (%), macro-average) for
the test datasets of the UCI benchmark experiments
Database               1-NN            Hypersphere’s g=1   Hypersphere’s g=2
Breast cancer          95.15 ± 0.41    96.07 ± 0.30        96.45 ± 0.36
Ecoli                  66.04 ± 0.82    67.51 ± 0.72        68.03 ± 0.78
German credit data     64.38 ± 0.96    63.98 ± 0.95        63.55 ± 0.95
Glass identification   68.77 ± 1.63    70.30 ± 2.20        69.81 ± 2.23
Haberman’s survival    55.53 ± 2.04    55.26 ± 2.35        56.36 ± 1.92
Heart - Statlog        75.30 ± 1.60    75.92 ± 1.28        76.19 ± 1.27
Ionosphere             85.90 ± 0.69    90.98 ± 0.54        92.55 ± 0.47
Iris                   95.70 ± 0.69    95.71 ± 0.61        96.04 ± 0.64
Pima diabetes          66.95 ± 1.06    68.41 ± 1.00        70.09 ± 0.97
Sonar                  85.60 ± 1.76    85.63 ± 1.79        87.03 ± 1.50
Tic-Tac-Toe            49.47 ± 0.47    73.43 ± 0.54        81.21 ± 0.83
Vehicle                69.35 ± 0.76    69.46 ± 0.71        68.78 ± 0.93
Wine                   95.90 ± 0.51    96.80 ± 0.44        96.93 ± 0.64
Yeast                  56.32 ± 1.04    57.73 ± 1.12        58.75 ± 0.86
Preliminary data analysis
Classification performance (F-measure (%), macro-average)
      F-Measure (test)                 F-Measure (overall)
DB    IB3             Hypersphere’s    IB3             Hypersphere’s
BC    93.47 ± 1.02    93.64 ± 0.97     94.35 ± 0.66    94.66 ± 0.70
EC    63.80 ± 3.41    65.30 ± 1.88     60.96 ± 3.28    83.06 ± 1.58
GC    55.91 ± 2.20    56.33 ± 2.05     60.17 ± 1.78    61.49 ± 0.84
GL    35.63 ± 2.19    51.43 ± 3.04     41.50 ± 3.19    62.05 ± 1.68
HA    44.85 ± 2.96    54.10 ± 3.84     46.21 ± 3.62    57.33 ± 2.14
HE    79.53 ± 1.58    76.68 ± 2.39     80.72 ± 0.82    79.60 ± 1.62
IO    75.21 ± 4.48    81.04 ± 2.77     77.30 ± 3.81    82.87 ± 2.59
IR    93.87 ± 1.68    93.60 ± 1.78     94.55 ± 1.10    95.92 ± 1.20
PD    66.60 ± 2.33    64.68 ± 1.81     69.22 ± 1.78    68.01 ± 1.10
SO    48.26 ± 7.37    60.62 ± 4.83     49.71 ± 7.26    62.05 ± 3.39
TT    61.56 ± 3.80    61.99 ± 1.57     63.81 ± 4.00    66.67 ± 0.85
VE    62.26 ± 1.13    60.90 ± 1.74     68.15 ± 0.94    68.20 ± 1.05
WI    94.03 ± 1.28    93.22 ± 1.43     94.93 ± 0.88    95.59 ± 0.84
YE    37.52 ± 3.68    47.21 ± 1.25     45.00 ± 4.51    61.70 ± 0.90
Selection of the test
• Paired samples (dependent samples)
  • Each data point in one sample is matched to a unique data
  point in the other sample
• Parametric
  • Student's t-test
• Non-parametric (does not assume that the data follow any
particular distribution)
  • Sign test
  • Wilcoxon signed rank test
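For comparison, here is a minimal sketch of the simpler non-parametric option, the sign test, which looks only at the direction of each paired difference. It is applied to the mean F-measures of 1-NN and of the Hypersphere g=2 column from the results table; choosing that column and testing on the means alone are assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb

# Mean F-measures (%) per database, copied from the results table.
one_nn   = [95.15, 66.04, 64.38, 68.77, 55.53, 75.30, 85.90,
            95.70, 66.95, 85.60, 49.47, 69.35, 95.90, 56.32]
hyper_g2 = [96.45, 68.03, 63.55, 69.81, 56.36, 76.19, 92.55,
            96.04, 70.09, 87.03, 81.21, 68.78, 96.93, 58.75]

def sign_test(x, y):
    """One-sided sign test of H0: y is not better than x."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # drop exact ties
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    # P(at least `wins` positive signs) when each sign is a fair coin flip.
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, n, p_value

wins, n, p_value = sign_test(one_nn, hyper_g2)
print(wins, n, p_value)
```

With 12 positive differences out of 14, the one-sided p-value is 106 / 16384, about 0.0065. The sign test ignores how large each difference is; the Wilcoxon signed rank test also uses the magnitudes through their ranks.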
Wilcoxon signed rank test
• Uses both the sign of each paired difference and the rank of
its magnitude
• Steps (assuming pairs of observations $(X_i, Y_i)$):
• 1. Calculate $D_i = Y_i - X_i$
• 2. Exclude the observations for which $D_i = 0$
• 3. Sort the values of $|D_i|$ in ascending order and assign each
a rank $R'_i$ according to its position. The observation with the
lowest $|D_i|$ gets rank 1, the next one rank 2, and so on. Tied
$|D_i|$ values are assigned their mean rank.
• 4. Calculate the signed rank $R_i = \frac{D_i}{|D_i|} R'_i$
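The four steps can be followed literally in code. The sketch below is illustrative only: the function name and the small input vectors are hypothetical, and differences are rounded to 10 decimals (an assumption) so that floating-point noise does not hide exact zeros or ties.

```python
def signed_ranks(x, y):
    """Return the signed ranks R_i for paired observations (x_i, y_i)."""
    # Steps 1 and 2: differences D_i = y_i - x_i, dropping zeros.
    diffs = [round(b - a, 10) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    # Step 3: rank |D_i| in ascending order, giving ties their mean rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # mean of positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    # Step 4: attach the sign of D_i to its rank.
    return [r if d > 0 else -r for d, r in zip(diffs, ranks)]

# Hypothetical paired scores: one zero difference and one tie in |D_i|.
x = [10, 12, 9, 15, 11]
y = [12, 11, 9, 18, 12]
print(signed_ranks(x, y))  # [3.0, -1.5, 4.0, 1.5]
```

The third pair is dropped because its difference is zero, and the two differences of magnitude 1 share the mean of ranks 1 and 2.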
Wilcoxon signed rank test
• If ties exist or the number of observations is "big" enough
(10 to 25 depending on the author) then
• $T = \frac{\sum R_i}{\sqrt{\sum R_i^2}}$, which approximately follows $N(0, 1)$
• Otherwise
• Sum the ranks separately for the positive and negative
differences
• The smaller of the two sums is the Wilcoxon T statistic for a
two-sided test
• For a one-sided test the smaller sum must be associated
with the directionality of the null hypothesis
• The critical values of T according to sample size and level
of significance are defined by a table.
• The null hypothesis is rejected if the T statistic is smaller
than the critical value given in the table.
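Both forms of the statistic can be sketched as follows, using the mean F-measures of 1-NN and of the Hypersphere g=2 column from the results table. Which Hypersphere column the test was actually run on is not stated here, so the g=2 choice is an assumption, as are the function name and the rounding of differences to two decimals (the precision of the table).

```python
from math import sqrt
from statistics import NormalDist

# Mean F-measures (%) per database, copied from the results table.
one_nn   = [95.15, 66.04, 64.38, 68.77, 55.53, 75.30, 85.90,
            95.70, 66.95, 85.60, 49.47, 69.35, 95.90, 56.32]
hyper_g2 = [96.45, 68.03, 63.55, 69.81, 56.36, 76.19, 92.55,
            96.04, 70.09, 87.03, 81.21, 68.78, 96.93, 58.75]

def signed_ranks(x, y, decimals=2):
    """Signed ranks of the non-zero differences y_i - x_i (mean rank for ties)."""
    diffs = [round(b - a, decimals) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    mags = [abs(d) for d in diffs]
    ranks = []
    for d, m in zip(diffs, mags):
        less = sum(v < m for v in mags)
        equal = sum(v == m for v in mags)
        mean_rank = less + (equal + 1) / 2
        ranks.append(mean_rank if d > 0 else -mean_rank)
    return ranks

ranks = signed_ranks(one_nn, hyper_g2)

# Table-based form: the smaller of the two rank sums.
w_plus = sum(r for r in ranks if r > 0)
w_minus = -sum(r for r in ranks if r < 0)
t_small = min(w_plus, w_minus)

# Normal approximation: sum of signed ranks over the root of their squares.
z = sum(ranks) / sqrt(sum(r * r for r in ranks))
p_one_sided = 1 - NormalDist().cdf(z)  # H1: the second algorithm is better

print(t_small, round(z, 3), p_one_sided)
```

Under these assumptions the two negative differences (German credit data and Vehicle) give a smaller rank sum of 5.5, the normal statistic is roughly 2.95, and the one-sided p-value falls below 0.005. The critical value for the table-based form still has to be looked up for the given sample size and significance level.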
Applying the Wilcoxon test
• We reject the null hypothesis that 1-NN yields results better
than or equal to those of our algorithm, at a significance level
of 0.5% (0.005).
• We reject the null hypothesis that IB3 yields results better
than or equal to those of our algorithm, for the overall data, at
a significance level of 0.5% (0.005).
Conclusions
• Statistical benchmarking is becoming increasingly
important
• The quality of the samples is a fundamental aspect to
consider
• Minimize bias
• Size
• Selecting and properly applying the appropriate test is
crucial
References
• Noel Lopes and Bernardete Ribeiro, An Incremental Class
Boundary Preserving Hypersphere Classifier, International
Conference on Neural Information Processing, 2011
• Elizabeth Reis, Paulo Melo, Rosa Andrade and Teresa
Calapez, Estatística Aplicada, Vol.2, Edições Sílabo, 1997
• Leonard J. Kazmier, Schaum’s Outline of Business
Statistics, Fourth Edition, McGraw-Hill, 2009
• A. Frank and A. Asuncion, UCI machine learning
repository, http://archive.ics.uci.edu/ml
Thank you for your interest
• Noel Lopes ([email protected])