Estimating the accuracy of a hypothesis

Setting
• Assume a binary classification setting.
• Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y).
• Train a binary classifier h (hypothesis) on a sample S_train of pairs drawn independently (of one another) and identically distributed according to D (an i.i.d. sample).
• The task is to estimate the generalization error (also called true error or expected risk) of h over the entire distribution p:

    R[h] = ∫_{X×Y} δ(y, h(x)) p(x, y) dx dy

  where δ(y, h(x)) = 1 if h(x) ≠ y and 0 otherwise.

Estimating the accuracy of a hypothesis

Problem
• D is unknown ⇒ we need to approximate R[h] with a sample error (or empirical error):

    R_S[h] = (1/|S|) Σ_{(x,y)∈S} δ(y, h(x)) = r/n

  where r is the number of errors and n = |S|.
• Bias in the estimate: accuracy on the training sample S_train is an optimistically biased estimate ⇒ we need an independent i.i.d. test sample S_test.
• Variance in the estimate: the measured accuracy can differ from the true accuracy because of the specificity of the (even unbiased) sample employed ⇒ we need to estimate such variance.

Probability distribution of the empirical error estimator

Binomial distribution
• R_S[h] = r/n can be modeled as a random variable, measuring the probability of observing r "positive" outcomes (errors) in n trials, with p the true unknown probability of error.
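As a concrete illustration (not from the notes), the sketch below computes the empirical error R_S[h] = r/n on a labelled sample and simulates the error count r over many i.i.d. test sets to show its Binomial behaviour; the classifier, the sample, and the choices n = 50, p = 0.2 are illustrative assumptions.

```python
import random

def empirical_error(h, sample):
    """Empirical error R_S[h]: fraction of misclassified examples, r / n."""
    errors = sum(1 for x, y in sample if h(x) != y)
    return errors / len(sample)

# Simulate the sampling distribution of the error count r: with true error
# probability p, the number of errors over n i.i.d. test points is Binomial(n, p),
# so across repeated test sets its mean is ~ n*p and its variance ~ n*p*(1-p).
random.seed(0)
n, p, trials = 50, 0.2, 20000
counts = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]
mean = sum(counts) / trials                           # close to n*p = 10
var = sum((c - mean) ** 2 for c in counts) / trials   # close to n*p*(1-p) = 8
```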
• Such a random variable obeys a Binomial distribution, with E[r] = np and Var[r] = np(1 − p).

Unbiased estimator
• The estimation bias of an estimator Y of a parameter p is E[Y] − p.
• The empirical error is an unbiased estimator of the true error p:

    E[r/n] − p = E[r]/n − p = np/n − p = 0

Probability distribution of the empirical error estimator

Estimator standard deviation
• The standard deviation of the average of n independently drawn tests of a random variable with standard deviation σ is:

    σ_n = σ/√n = √(p(1 − p)/n)   (for the Binomial distribution)

• The empirical standard deviation can be approximated by replacing the true error p with the estimated error R_S[h]:

    σ_{R_S[h]} ≈ √(R_S[h](1 − R_S[h])/n)

Confidence intervals

Definition
An N% confidence interval for some parameter p is the interval expected to contain p with probability N%.

Confidence interval for error estimation
• Supply the estimated error R_S[h] with an N% confidence interval (a typical value is 95%).
• The true error R[h] will fall within such an interval with N% probability.

Confidence intervals

Problem
It is difficult to compute confidence intervals for the Binomial distribution exactly.

Solution
• For large enough n, the Binomial distribution can be approximated with a Normal distribution with the same mean and variance.
• Alternative rules of thumb for a valid approximation:
  – n ≥ 30
  – np(1 − p) ≥ 5

Computing the 95% confidence interval
• A two-sided bound ⇒ 2.5% left-out probability on each tail of the distribution.
• Recover the value z such that P(x ≥ z) ≤ 2.5% from the N(0, 1) distribution table (z = 1.96).
• Multiply z by the standard deviation, obtaining:

    R_S[h] ± z √(R_S[h](1 − R_S[h])/n)

Hypothesis testing

Test statistic
null hypothesis H0: the default hypothesis; evidence should be provided in order to reject it.
alternative hypothesis H1: the hypothesis accepted when H0 is rejected.
test statistic: given a sample of n realizations of random variables X1, ..., Xn, a test statistic is a statistic T = h(X1, ..., Xn) whose value is used to decide whether to reject H0 or not.

Example
Given a set of measurements X1, . . .
, Xn, decide whether the actual value to be measured is zero.
null hypothesis: the actual value is zero.
test statistic: the sample mean

    T = h(X1, ..., Xn) = (1/n) Σ_{i=1}^{n} Xi

Hypothesis testing

Glossary
tail probability: the probability that T is at least as great (right tail) or at least as small (left tail) as the observed value t.
p-value: the probability of obtaining a value of T at least as extreme as the observed value t, assuming H0 is true.
Type I error: rejecting the null hypothesis when it is true.
Type II error: accepting the null hypothesis when it is false.
significance level: the largest acceptable probability of committing a Type I error.
critical region: the set of values of T for which we reject the null hypothesis.
critical values: the values on the boundary of the critical region.

Hypothesis test example

Setting
• A measuring device takes three speed measurements and returns their average.
• The measurement error can be modelled as a Gaussian N(0, 4) (variance 4, i.e. σ = 2).
• The maximum speed limit is 120.
• The acceptable percentage of wrongly fined drivers is 5% (significance level α = 0.05).

Is a measured speed of 121 significantly over the limit?

Z-Test
• Compute the probability that the test statistic T is greater than or equal to 121 when the actual value is 120:
  – Standardize the distribution:

        Z = (T − µ)/√(σ²/n) = (T − 120)/(2/√3)

  – Compute the probability from the N(0, 1) distribution table:

        P(T ≥ 121) = P((T − 120)/(2/√3) ≥ (121 − 120)/(2/√3)) = P(Z ≥ 0.87)

• Verify whether the probability is greater than or equal to the significance level:

    P(Z ≥ 0.87) = 0.1922 ≥ 0.05

  ⇒ the null hypothesis cannot be rejected: 121 is not significantly over the limit.

What is the critical value for the speed limit test?

Procedure
Compute the minimal value c such that P(T ≥ c) = 0.05:

    P(Z ≥ (c − 120)/(2/√3)) = 0.05
    (c − 120)/(2/√3) = 1.645   (from the distribution table)
    c = 120 + 1.645 · 2/√3 ≈ 121.9

t-test

Setting
Given a set of values for X1, ..., Xn i.i.d. random variables, decide whether to reject the null hypothesis that their distribution has mean µ0.
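Before addressing the unknown-variance case, the speed-limit Z-test above can be reproduced numerically. This is a minimal stdlib-only sketch (variable names are illustrative); the right-tail probability of N(0, 1) is obtained from the complementary error function instead of a printed table.

```python
import math

def p_right_tail(z):
    """P(Z >= z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Speed example from the notes: n = 3 measurements, noise N(0, 4) so sigma = 2,
# speed limit 120, observed average 121, significance level 0.05.
sigma, n, limit, measured, alpha = 2.0, 3, 120.0, 121.0, 0.05
z = (measured - limit) / (sigma / math.sqrt(n))   # standardized statistic, ~0.87
p_value = p_right_tail(z)                          # ~0.19, well above alpha
reject = p_value < alpha                           # False: the driver is not fined

# Critical value: smallest c with P(T >= c) = 0.05, using z_0.05 = 1.645
c = limit + 1.645 * sigma / math.sqrt(n)           # ~121.9
```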
Problem
The variance of the distribution is unknown.

Solution
Replace the unknown variance with an unbiased estimate:

    S_n² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄_n)²

t-test

The test
• The test statistic is given by the standardized (also called studentized) mean:

    T = (X̄_n − µ0)/(S_n/√n)

• Assuming the data come from an unknown Normal distribution, the test statistic has a t_{n−1} distribution under the null hypothesis.
• The null hypothesis can be rejected at significance level α if:

    T ≤ −t_{n−1,α/2}   or   T ≥ t_{n−1,α/2}

The t_{n−1} distribution
• A bell-shaped distribution similar to the Normal one.
• Wider and shorter: it reflects the greater variance due to using S_n² instead of the true unknown variance of the distribution.
• n − 1 is the number of degrees of freedom of the distribution (related to the number of independent events observed).
• t_{n−1} tends to the standardized normal z for n → ∞.

Comparing learning algorithms

Task
Given two learning algorithms L_A and L_B, we want to estimate whether they are significantly different in learning function f, i.e. whether the expected difference between their generalization errors when training on a random sample,

    E_{S∼D}[R[L_A(S)] − R[L_B(S)]]

is significantly different from zero.

Comparing learning algorithms

Problem
• D is unknown ⇒ we must rely on a sample D0.

Solution: k-fold cross validation
• Split D0 into k equal-sized disjoint subsets T_i.
• For i ∈ [1, k]:
  – train L_A and L_B on S_i = D0 \ T_i
  – test L_A and L_B on T_i
  – compute δ_i ← R_{T_i}[L_A(S_i)] − R_{T_i}[L_B(S_i)]
• Return

    δ̄ = (1/k) Σ_{i=1}^{k} δ_i

Comparing learning algorithms: t-test

Null hypothesis
The mean error difference is zero.

t-test at significance level α: reject the null hypothesis if

    δ̄/S_δ̄ ≤ −t_{k−1,α/2}   or   δ̄/S_δ̄ ≥ t_{k−1,α/2}

where:

    S_δ̄ = √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

Note
paired test: the two hypotheses were evaluated over identical samples.
two-tailed test: used when no prior knowledge can tell the direction of the difference (otherwise use a one-tailed test).

t-test example

10-fold cross validation
• Test errors:

    Fold   R_{T_i}[L_A(S_i)]   R_{T_i}[L_B(S_i)]   δ_i
    T1     0.81                0.80                 0.01
    T2     0.82                0.77                 0.05
    T3     0.84                0.70                 0.14
    T4     0.78                0.83                -0.05
    T5     0.85                0.80                 0.05
    T6     0.86                0.78                 0.08
    T7     0.82                0.75                 0.07
    T8     0.83                0.80                 0.03
    T9     0.82                0.78                 0.04
    T10    0.81                0.77                 0.04

• Average error difference:

    δ̄ = (1/10) Σ_{i=1}^{10} δ_i = 0.046

• Unbiased estimate of the standard deviation:

    S_δ̄ = √( (1/(10·9)) Σ_{i=1}^{10} (δ_i − δ̄)² ) = 0.0154344

• Standardized mean error difference:

    δ̄/S_δ̄ = 0.046/0.0154344 = 2.98

• t distribution value for α = 0.05 and k = 10:

    t_{k−1,α/2} = t_{9,0.025} = 2.262 < 2.98

• The null hypothesis is rejected: the classifiers are significantly different.

t-Distribution Table
The shaded area is equal to α for t = t_α.
    df    t.100   t.050   t.025   t.010   t.005
    1     3.078   6.314   12.706  31.821  63.657
    2     1.886   2.920   4.303   6.965   9.925
    3     1.638   2.353   3.182   4.541   5.841
    4     1.533   2.132   2.776   3.747   4.604
    5     1.476   2.015   2.571   3.365   4.032
    6     1.440   1.943   2.447   3.143   3.707
    7     1.415   1.895   2.365   2.998   3.499
    8     1.397   1.860   2.306   2.896   3.355
    9     1.383   1.833   2.262   2.821   3.250
    10    1.372   1.812   2.228   2.764   3.169
    11    1.363   1.796   2.201   2.718   3.106
    12    1.356   1.782   2.179   2.681   3.055
    13    1.350   1.771   2.160   2.650   3.012
    14    1.345   1.761   2.145   2.624   2.977
    15    1.341   1.753   2.131   2.602   2.947
    16    1.337   1.746   2.120   2.583   2.921
    17    1.333   1.740   2.110   2.567   2.898
    18    1.330   1.734   2.101   2.552   2.878
    19    1.328   1.729   2.093   2.539   2.861
    20    1.325   1.725   2.086   2.528   2.845
    21    1.323   1.721   2.080   2.518   2.831
    22    1.321   1.717   2.074   2.508   2.819
    23    1.319   1.714   2.069   2.500   2.807
    24    1.318   1.711   2.064   2.492   2.797
    25    1.316   1.708   2.060   2.485   2.787
    26    1.315   1.706   2.056   2.479   2.779
    27    1.314   1.703   2.052   2.473   2.771
    28    1.313   1.701   2.048   2.467   2.763
    29    1.311   1.699   2.045   2.462   2.756
    30    1.310   1.697   2.042   2.457   2.750
    32    1.309   1.694   2.037   2.449   2.738
    34    1.307   1.691   2.032   2.441   2.728
    36    1.306   1.688   2.028   2.434   2.719
    38    1.304   1.686   2.024   2.429   2.712
    ∞     1.282   1.645   1.960   2.326   2.576

Gilles Cazelais. Typeset with LaTeX on April 20, 2006.
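The 10-fold paired t-test example can be reproduced with a few lines of stdlib-only Python; the δ_i values and the critical value t_{9, 0.025} = 2.262 are taken from the example and table above.

```python
import math

# Per-fold error differences delta_i from the 10-fold example in the notes
deltas = [0.01, 0.05, 0.14, -0.05, 0.05, 0.08, 0.07, 0.03, 0.04, 0.04]
k = len(deltas)

# Mean difference and its unbiased standard-deviation estimate S_delta
d_bar = sum(deltas) / k
s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))

# Standardized mean difference, compared against t_{k-1, alpha/2}
t_stat = d_bar / s
t_crit = 2.262   # t_{9, 0.025}, from the t-distribution table
reject = abs(t_stat) >= t_crit   # True: the two classifiers differ significantly
```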