Case Study 2: Comparing Machine Learning Algorithms with the Wilcoxon Signed Rank Test

Noel Lopes
Polytechnic Institute of Guarda
University of Coimbra

Workshop on Statistical Hypothesis Tests for Engineering Applications
July 18, 2011

Outline
• Problem definition
• Preliminary data analysis
• Selection of the test
• Applying the Wilcoxon test
• Results and Conclusions

Problem definition
• We developed a new learning algorithm for classification.
• Compare the new algorithm with other established and successful algorithms.

Benchmark selection
• Avoid bias
  ◦ Broad spectrum coverage
• Constraints
  ◦ Presence of missing values
  ◦ Number of inputs (features)
  ◦ Number of samples (instances)
• Facilitate comparison with other algorithms
  ◦ How well is the benchmark known in the literature?

Benchmarks

Database               Samples  Inputs  Classes
Breast cancer              569      30        2
Ecoli                      336       7        8
German credit data        1000      59        2
Glass identification       214       9        6
Haberman’s survival        306       3        2
Heart - Statlog            270      20        2
Ionosphere                 351      34        2
Iris                       150       4        3
Pima diabetes              768       8        2
Sonar                      208      60        2
Tic-Tac-Toe                958       9        2
Vehicle                    946      18        4
Wine                       178      13        3
Yeast                     1484       8       10

Experiments conducted
• Each experiment was run 30 times, each run using a different random 5-fold stratified cross-validation split.
[Figure: the data divided into Partition 1 to Partition 5]

Metrics - Confusion Matrix

                       Predicted Positive     Predicted Negative
Real class Positive    True Positive (tp)     False Negative (fn)
Real class Negative    False Positive (fp)    True Negative (tn)

Metrics

$precision = \frac{tp}{tp + fp}$

$recall = \frac{tp}{tp + fn}$

$F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}$

Preliminary data analysis
Classification performance (F-measure (%) macro-average) for the test datasets of the UCI benchmark experiments

Database               1-NN            Hypersphere’s (g=1)   Hypersphere’s (g=2)
Breast cancer          95.15 ± 0.41    96.07 ± 0.30          96.45 ± 0.36
Ecoli                  66.04 ± 0.82    67.51 ± 0.72          68.03 ± 0.78
German credit data     64.38 ± 0.96    63.98 ± 0.95          63.55 ± 0.95
Glass identification   68.77 ± 1.63    70.30 ± 2.20          69.81 ± 2.23
Haberman’s survival    55.53 ± 2.04    55.26 ± 2.35          56.36 ± 1.92
Heart - Statlog        75.30 ± 1.60    75.92 ± 1.28          76.19 ± 1.27
Ionosphere             85.90 ± 0.69    90.98 ± 0.54          92.55 ± 0.47
Iris                   95.70 ± 0.69    95.71 ± 0.61          96.04 ± 0.64
Pima diabetes          66.95 ± 1.06    68.41 ± 1.00          70.09 ± 0.97
Sonar                  85.60 ± 1.76    85.63 ± 1.79          87.03 ± 1.50
Tic-Tac-Toe            49.47 ± 0.47    73.43 ± 0.54          81.21 ± 0.83
Vehicle                69.35 ± 0.76    69.46 ± 0.71          68.78 ± 0.93
Wine                   95.90 ± 0.51    96.80 ± 0.44          96.93 ± 0.64
Yeast                  56.32 ± 1.04    57.73 ± 1.12          58.75 ± 0.86

Preliminary data analysis
Classification performance (F-measure (%) macro-average)

      F-Measure (test)                 F-Measure (overall)
DB    IB3             Hypersphere’s    IB3             Hypersphere’s
BC    93.47 ± 1.02    93.64 ± 0.97     94.35 ± 0.66    94.66 ± 0.70
EC    63.80 ± 3.41    65.30 ± 1.88     60.96 ± 3.28    83.06 ± 1.58
GC    55.91 ± 2.20    56.33 ± 2.05     60.17 ± 1.78    61.49 ± 0.84
GL    35.63 ± 2.19    51.43 ± 3.04     41.50 ± 3.19    62.05 ± 1.68
HA    44.85 ± 2.96    54.10 ± 3.84     46.21 ± 3.62    57.33 ± 2.14
HE    79.53 ± 1.58    76.68 ± 2.39     80.72 ± 0.82    79.60 ± 1.62
IO    75.21 ± 4.48    81.04 ± 2.77     77.30 ± 3.81    82.87 ± 2.59
IR    93.87 ± 1.68    93.60 ± 1.78     94.55 ± 1.10    95.92 ± 1.20
PD    66.60 ± 2.33    64.68 ± 1.81     69.22 ± 1.78    68.01 ± 1.10
SO    48.26 ± 7.37    60.62 ± 4.83     49.71 ± 7.26    62.05 ± 3.39
TT    61.56 ± 3.80    61.99 ± 1.57     63.81 ± 4.00    66.67 ± 0.85
VE    62.26 ± 1.13    60.90 ± 1.74     68.15 ± 0.94    68.20 ± 1.05
WI    94.03 ± 1.28    93.22 ± 1.43     94.93 ± 0.88    95.59 ± 0.84
YE    37.52 ± 3.68    47.21 ± 1.25     45.00 ± 4.51    61.70 ± 0.90

Selection of the test
• Paired samples (dependent samples)
  ◦ Each data point in one sample is matched to a unique data point in the other sample
• Parametric
  ◦ Student's t-test
• Non-parametric (does not rely on the assumption that the data follow any particular distribution)
  ◦ Sign test
  ◦ Wilcoxon signed rank test

Wilcoxon signed rank test
• Uses both the magnitude and the sign of the ranks of the paired differences
• Steps (assuming pairs of observations $(X_i, Y_i)$):
  1. Calculate $D_i = Y_i - X_i$
  2. Exclude the observations for which $D_i = 0$
  3. Sort the values of $|D_i|$ in ascending order and assign each a rank $R'_i$ according to its position. The observation with the lowest $|D_i|$ has rank 1, the next one rank 2, and so on. Tied $|D_i|$ values are assigned their mean rank.
  4. Calculate the signed rank $R_i = \frac{D_i}{|D_i|} R'_i$

Wilcoxon signed rank test
• If ties exist or the number of observations is "big" enough (10 to 25, depending on the author), then
  ◦ $T = \frac{\sum_i R_i}{\sqrt{\sum_i R_i^2}} \sim N(0, 1)$
• Otherwise
  ◦ Sum the ranks separately for the positive and the negative differences
  ◦ The smaller of the two sums is the Wilcoxon T statistic for a two-sided test
  ◦ For a one-sided test, the smaller sum must be associated with the directionality of the null hypothesis
  ◦ The critical values of T for each sample size and significance level are given in a table.
  ◦ The null hypothesis is rejected if the T statistic is smaller than the critical value given in the table.

Applying the Wilcoxon test
• We reject the null hypothesis that 1-NN yields results better than or equal to those of our algorithm, at a significance level of 0.5% (0.005).
• We reject the null hypothesis that IB3 yields results better than or equal to those of our algorithm, for the overall data, at a significance level of 0.5% (0.005).

Conclusions
• Statistical benchmarking is becoming increasingly important
• The quality of the samples is a fundamental aspect to consider
  ◦ Minimize bias
  ◦ Size
• Selecting and properly applying the appropriate test is crucial

References
• Noel Lopes and Bernardete Ribeiro, An Incremental Class Boundary Preserving Hypersphere Classifier, International Conference on Neural Information Processing, 2011
• Elizabeth Reis, Paulo Melo, Rosa Andrade and Teresa Calapez, Estatística Aplicada, Vol. 2, Edições Sílabo, 1997
• Leonard J. Kazmier, Schaum’s Outline of Business Statistics, Fourth Edition, McGraw-Hill, 2009
• A. Frank and A. Asuncion, UCI machine learning repository, http://archive.ics.uci.edu/ml

Thank you for your interest
• Noel Lopes ([email protected])
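
Backup: a sketch of the Wilcoxon computation
The steps of the Wilcoxon signed rank test described in the slides can be illustrated with a short Python sketch. It is not the author's code. The function names (`signed_ranks`, `rank_sums`, `normal_statistic`, `one_sided_p`) and the `decimals` rounding parameter are hypothetical, introduced only for this illustration. The input values are the mean F-measures from the 1-NN table, and pairing 1-NN with the g=2 column is an assumption, since the slides do not say which Hypersphere's column was tested.

```python
import math

# Mean F-measures (%) copied from the 1-NN table, in row order BC..YE.
one_nn = [95.15, 66.04, 64.38, 68.77, 55.53, 75.30, 85.90,
          95.70, 66.95, 85.60, 49.47, 69.35, 95.90, 56.32]
# Assumption: the g=2 column is the one being compared.
hypersphere_g2 = [96.45, 68.03, 63.55, 69.81, 56.36, 76.19, 92.55,
                  96.04, 70.09, 87.03, 81.21, 68.78, 96.93, 58.75]


def signed_ranks(x, y, decimals=2):
    """Steps 1 to 4: signed ranks R_i of the non-zero differences D_i = Y_i - X_i.

    Differences are rounded to `decimals` places (an assumption matching the
    precision of the table) so that floating-point noise does not hide ties.
    """
    d = [round(b - a, decimals) for a, b in zip(x, y)]   # step 1
    d = [v for v in d if v != 0]                         # step 2
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))  # step 3
    ranks = [0.0] * len(d)
    k = 0
    while k < len(order):
        j = k
        # Extend j over a run of tied |D_i| values.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[k]]):
            j += 1
        mean_rank = (k + j) / 2 + 1  # mean of the positions k+1 .. j+1
        for m in range(k, j + 1):
            ranks[order[m]] = mean_rank
        k = j + 1
    return [math.copysign(r, v) for r, v in zip(ranks, d)]  # step 4


def rank_sums(r):
    """Sums of the ranks of the positive and of the negative differences."""
    pos = sum(v for v in r if v > 0)
    neg = -sum(v for v in r if v < 0)
    return pos, neg


def normal_statistic(r):
    """T = sum(R_i) / sqrt(sum(R_i^2)), approximately N(0, 1) under H0."""
    return sum(r) / math.sqrt(sum(v * v for v in r))


def one_sided_p(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


r = signed_ranks(one_nn, hypersphere_g2)
pos, neg = rank_sums(r)
z = normal_statistic(r)
print(f"rank sums: +{pos}, -{neg}")
print(f"T = {z:.3f}, one-sided p = {one_sided_p(z):.4f}")
```

With these inputs there is one tie (|D| = 0.83 for German credit data and Haberman’s survival), so the normal approximation applies. The sketch gives rank sums of 99.5 and 5.5 and a statistic of about 2.95, which would be consistent with rejecting the one-sided null hypothesis at the 0.005 level, although the slides do not state which variant of the test produced the reported result.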