F73DA2 INTRODUCTORY DATA ANALYSIS
ANALYSIS OF VARIANCE

[Figure: two scatter plots of y (20 to 50) against an explanatory variable (1 to 3). Left panel: regression, where x is a quantitative explanatory variable. Right panel: the explanatory variable "type" is a qualitative variable (a factor).]

Illustration
Company 1: 36 28 32 43 30 21 33 37 26 34
Company 2: 26 21 31 29 27 35 23 33
Company 3: 39 28 45 37 21 49 34 38 44

[Figure: the company data (20 to 50) plotted against company number (1 to 3).]

Here the explanatory variable is qualitative, i.e. categorical - a factor. Analysis of variance provides linear models for such comparative experiments.

Using Factor Commands
► The display is different if "type" is declared as a factor: R produces boxplots of company for each of the factor levels 1, 2, 3.
► We could check for significant differences between two companies using t tests:
► t.test(company1,company2)
► This calculates a 95% confidence interval for the difference between the two means. The interval includes 0, so there is no significant difference. Instead we use an analysis of variance technique, taking all the results together.

Taking all the results together
► We calculate the total variation for the system: the sum of squared deviations of the individual values about the overall mean, 32.59259. This gives 1470.519.
► We can also work out the sum of squares within each company; these sum to 1114.431.
► The total sum of squares must be made up of a contribution from variation WITHIN the companies and a contribution from variation BETWEEN the companies.
► This means that the variation between the companies equals 1470.519 - 1114.431 = 356.088.
► This can all be shown in an analysis of variance table, which has the format:

Source of variation            Degrees of freedom   Sum of squares   Mean squares
Between treatments             k - 1                SSB              SSB/(k - 1)
Residual (within treatments)   n - k                SSRES            SSRES/(n - k)
Total                          n - 1                SST

with the F statistic given by F = [SSB/(k - 1)] / [SSRES/(n - k)]. For the company data:

Source of variation            Degrees of freedom   Sum of squares   Mean squares
Between treatments             k - 1                356.088          SSB/(k - 1)
Residual (within treatments)   n - k                1114.431         SSRES/(n - k)
Total                          n - 1                1470.519

► Using the R package, the command is similar to that for linear regression.

Theory
Data: $y_{ij}$ is the $j$th observation using treatment $i$.
Model: $Y_{ij} = \mu_i + \varepsilon_{ij} = \mu + \tau_i + \varepsilon_{ij}$, for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$, where the errors $\varepsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$.
The response variables $Y_{ij}$ are independent, with $Y_{ij} \sim N(\mu + \tau_i, \sigma^2)$.
Constraint: $\sum_{i=1}^{k} n_i \tau_i = 0$.

Derivation of least-squares estimators

$$S = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right)^2$$

$$\frac{\partial S}{\partial \mu} = -2 \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right), \qquad \frac{\partial S}{\partial \tau_i} = -2 \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right)$$

Normal equations for the estimators:

$$y_{\cdot\cdot} = n \hat{\mu} \quad \text{and} \quad y_{i\cdot} = n_i \hat{\mu} + n_i \hat{\tau}_i,$$

with solution $\hat{\mu} = \bar{y}$ and $\hat{\tau}_i = \bar{y}_{i\cdot} - \bar{y}$.

The fitted values are the treatment means: $\hat{Y}_{ij} = \bar{Y}_{i\cdot}$.

Partitioning the observed total variation:

$$\underbrace{\sum_{i}\sum_{j} \left( y_{ij} - \bar{y} \right)^2}_{SST} = \underbrace{\sum_{i}\sum_{j} \left( y_{ij} - \bar{y}_{i\cdot} \right)^2}_{SS_{RES}} + \underbrace{\sum_{i} n_i \left( \bar{y}_{i\cdot} - \bar{y} \right)^2}_{SS_B}$$

i.e. SST = SSB + SSRES.

The following results hold (the first and third under $H_0$):

$$\frac{SST}{\sigma^2} \sim \chi^2_{n-1}, \qquad \frac{SS_{RES}}{\sigma^2} \sim \chi^2_{n-k}, \qquad \frac{SS_B}{\sigma^2} \sim \chi^2_{k-1},$$

and hence

$$\frac{SS_B/(k-1)}{SS_{RES}/(n-k)} \sim F_{k-1,\,n-k}.$$

Back to the example
$\hat{\mu} = 880/27 = 32.593$
$\hat{\tau}_1 = 320/10 - 880/27 = -0.593$, $\hat{\tau}_2 = 225/8 - 880/27 = -4.468$, $\hat{\tau}_3 = 335/9 - 880/27 = 4.630$

Fitted values: Company 1: 320/10 = 32; Company 2: 225/8 = 28.125; Company 3: 335/9 = 37.222
Residuals: Company 1: $e_{1j} = y_{1j} - 32$; Company 2: $e_{2j} = y_{2j} - 28.125$; Company 3: $e_{3j} = y_{3j} - 37.222$

$SST = 30152 - 880^2/27 = 1470.52$
$SS_B = (320^2/10 + 225^2/8 + 335^2/9) - 880^2/27 = 356.09$
$SS_{RES} = 1470.52 - 356.09 = 1114.43$

ANOVA table

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F
Between treatments    2                    356.09           178.04         3.83
Residual              24                   1114.43          46.44
Total                 26                   1470.52

Testing $H_0: \tau_i = 0$, $i = 1, 2, 3$ against $H_1$: not $H_0$ (i.e. $\tau_i \neq 0$ for at least one $i$).
Under $H_0$, $F = 3.83$ on (2, 24) df. P-value $= P(F_{2,24} > 3.83) = 0.036$, so we can reject $H_0$ at levels of testing down to 3.6%.
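The whole analysis can be reproduced in R. A minimal sketch, reusing the variable names from the commands above (the data values are those of the illustration):

```r
company1 <- c(36, 28, 32, 43, 30, 21, 33, 37, 26, 34)
company2 <- c(26, 21, 31, 29, 27, 35, 23, 33)
company3 <- c(39, 28, 45, 37, 21, 49, 34, 38, 44)

# Pairwise comparison: the 95% CI for the difference in means includes 0
t.test(company1, company2)

# Stack the samples and label each observation with its company
company <- c(company1, company2, company3)
type <- factor(c(rep(1, 10), rep(2, 8), rep(3, 9)))

# One-way ANOVA via the linear-model command, just as for regression:
# df = (2, 24), SS = (356.09, 1114.43), F = 3.83, p = 0.036
anova(lm(company ~ type))
```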
Conclusion
Results differ among the three companies (P-value 3.6%).

The fit of the model can be investigated by examining the residuals: the residual for response $y_{ij}$ is $e_{ij} = \hat{\varepsilon}_{ij} = y_{ij} - \bar{y}_{i\cdot}$; this is just the difference between the response and its fitted value (the appropriate sample mean).

Plotting the residuals in various ways may reveal
● a pattern (e.g. lack of randomness, suggesting that an additional, uncontrolled factor is present)
● non-normality (a transformation may help)
● heteroscedasticity (error variance differs among treatments - for example it may increase with the treatment mean: again a transformation - perhaps log - may be required)

[Figure: the company data against type, and boxplots of company by the factor levels of type.]

► In this example the samples are small, but one might question the validity of the assumptions of normality (Company 2) and homoscedasticity (equality of variances, Company 2 v Companies 1/3).

[Figure: residuals(lm(company ~ type)) plotted against observation index.]

► plot(residuals(lm(company~type))~fitted.values(lm(company~type)),pch=8)
► abline(h=0,lty=2)

[Figure: residuals against fitted values for lm(company ~ type), with a dashed horizontal line at zero.]

► It is also possible to compare with an analysis using "type" as a quantitative explanatory variable:
► type=c(rep(1,10),rep(2,8),rep(3,9))
► No "factor" command is used, so type enters as a numeric variable.
Note the low R². The fitted equation is company = 27.666 + 2.510 × type.

[Figure: a further example with six treatments - boxplots of count by spray (levels A to F), and residuals against fitted values for lm(count ~ spray).]

Example
A school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, 6 are used. The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally. One approach to checking whether the graders are using the same scale is to randomly assign exams to the graders at random and have them grade. To illustrate, suppose we have just 24 tests and 3 graders (rather than 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is on the range 1-5, with 5 being the best, and the scores are reported as:

grader 1: 4 3 4 5 2 3 4 5
grader 2: 4 4 5 5 4 5 4 4
grader 3: 3 4 2 4 5 5 4 4

The 5% cut-off for the F distribution with (2, 21) df is 3.467. The observed F statistic falls below this, so the null hypothesis cannot be rejected: no significant difference between the graders.

Another Example

Class I:   151 168 128 167 134
Class II:  152 141 129 120 115
Class III: 175 155 162 186 148
Class IV:  149 148 137 138 169
Class V:   123 132 142 161 152
Class VI:  145 131 155 172 141

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F
Between treatments    5                    3046.7           609.3          2.54
Residual              24                   5766.8           240.3
Total                 29                   8813.5

[Figure: the class data (120 to 180) plotted against class number (1 to 6).]

The normality and homoscedasticity (equality of variance) assumptions both seem reasonable.

We now wish to calculate a 95% confidence interval for the underlying common standard deviation $\sigma$, using $SS_{RES}/\sigma^2$ as a pivotal quantity with a $\chi^2_{n-k}$ distribution.
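A minimal sketch of this interval in R, using $SS_{RES} = 5766.8$ on $n - k = 24$ df from the table above:

```r
ss_res <- 5766.8  # residual sum of squares from the ANOVA table
df_res <- 24      # residual degrees of freedom (n - k = 30 - 6)

# SS_RES / sigma^2 ~ chi-squared(24), so a 95% CI for sigma^2 is
# (SS_RES / qchisq(0.975, 24), SS_RES / qchisq(0.025, 24))
ci_var <- ss_res / qchisq(c(0.975, 0.025), df_res)
sqrt(ci_var)  # 95% CI for sigma itself: roughly (12.1, 21.6)
```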
It can easily be shown that Class III has the largest sample mean, 165.20, and Class II the smallest, 131.40. Consider performing a t test to compare these two classes: with a difference of 33.8 between the means, such a test may well come out significant, even though the overall F statistic of 2.54 falls short of the 5% critical value of $F_{5,24}$ (about 2.62). There is no contradiction between this and the ANOVA results. It is wrong to pick out the largest and the smallest of a set of treatment means, test for significance, and then draw conclusions about the whole set: even if $H_0$ ("all means equal") is true, the sample means will differ, and the largest and smallest sample means may well differ noticeably.
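A small simulation sketch illustrates the point. The setup mirrors this example (six groups of five observations), with every observation drawn from a single normal population, so "all means equal" holds by construction; the mean and standard deviation used are arbitrary illustrative choices:

```r
set.seed(1)
reject <- replicate(10000, {
  # 6 groups x 5 observations, all from the same population (H0 true)
  x <- matrix(rnorm(30, mean = 150, sd = 15), nrow = 6)
  m <- rowMeans(x)
  # t test between the groups with the largest and smallest sample means
  t.test(x[which.max(m), ], x[which.min(m), ])$p.value < 0.05
})
mean(reject)  # far above the nominal 5%, even though H0 is true
```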