Chi-Square Test

• Chi-square analysis methods are among the most commonly used of all statistical techniques.
• The methods introduced here are used to examine “frequency” or “count” data.
• The methods are conceptually simple, although manual computation can be tedious.

Goodness of Fit Test

• Suppose we do not know the underlying distribution of the population.
• We wish to test the hypothesis that a particular distribution will be satisfactory as a population model.
• The purpose of goodness-of-fit tests is to evaluate whether a particular probability distribution is adequate for modeling the behavior of the process under consideration.

Hypotheses

• H0: \(p_1 = p_1^*,\ p_2 = p_2^*,\ \ldots,\ p_k = p_k^*\)
• H1: at least one \(p_i \ne p_i^*\)

Test Statistics

• Suppose we take a random sample of size n from the population whose probability distribution is unknown.
• These n observations are arranged in a histogram having k cells or class intervals.
• Let \(O_i\) be the observed frequency in the ith cell.
• Let \(E_i\) be the expected frequency in the ith cell, computed from the hypothesized probability distribution.
• Each \(E_i\) should be at least 5. If a cell has \(E_i < 5\), group it with another cell.
• Let m be the number of parameters of the hypothesized distribution that are estimated by sample statistics.
• Then

\[ X_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

  has approximately a chi-square distribution with k - m - 1 degrees of freedom.
• Reject H0 if \(X_0^2 > \chi^2_{\alpha,\,k-m-1}\).

Example

Day                Mon   Tue   Wed   Thu   Fri   Sum
No. of accidents    65    43    48    41    73   270

Example

Suppose that historic failure rates are:

• Due to A: .20
• Due to B: .35
• Due to C: .30
• Due to D: .15

The manufacturer has worked on A, B, and C and believes that failures due to these causes have been reduced, so that, while fewer failures will occur, it is more likely that when one occurs it will be due to D.

To examine this claim the manufacturer will sample 200 failed disk drives manufactured since the process changes were made. If the changes had no impact, the EXPECTED numbers of these failed drives due to causes A, B, C, and D would be:

\(E_A = n p_{A0} = 200(.20) = 40\)
\(E_B = n p_{B0} = 200(.35) = 70\)
\(E_C = n p_{C0} = 200(.30) = 60\)
\(E_D = n p_{D0} = 200(.15) = 30\)

H0: \(p_A = .20,\ p_B = .35,\ p_C = .30,\ p_D = .15\)
H1: at least one \(p_i \ne p_{i0}\) for i = A, B, C, D

We can conclude that the historic failure mode distribution no longer applies (reject H0 in favor of H1).

So how has the distribution changed? The answer is embedded in the individual category contributions to \(X_0^2\): larger contributions indicate where the changes have occurred. There are reductions in A and C, no obvious change in B, and the various failures that make up D now comprise a proportionally larger share of the failures.

Example

A sample of 120 minutes selected during rush periods at a store gave the following numbers of customers arriving during each of those 120 minutes:

No. of customers      0    1    2    3    4    5    6
Observed frequency   25   44   35    8    5    2    1

Question 1: Is this data consistent with a Poisson distribution with a mean of 1.7 customers per minute?
Test the appropriate hypothesis at the α = .10 level of significance.

Recall that the Poisson distribution with mean λ is

\[ P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots \]

Customers/minute   Probability         Observed   Expected   (O-E)^2/E
0                  P(X=0) = 0.1827     25         21.922     0.4322
1                  P(X=1) = 0.3106     44         37.267     1.2163
2                  P(X=2) = 0.2640     35         31.677     0.3485
3                  P(X=3) = 0.1496      8         17.950     5.5158
4 or more          P(X≥4) = 0.0932      8         11.182     0.9057
Sum                                                          8.4185

DF = k - m - 1 = 5 - 0 - 1 = 4. Since \(X_0^2 = 8.42 > \chi^2_{.10,4} = 7.779\), H0 is rejected: this data is not consistent with a Poisson distribution with a mean of 1.7.

Question 2: Is this data consistent with a Poisson distribution? Test the appropriate hypothesis at the α = .10 level of significance.

In this case we need to estimate the mean number of customers per minute from the sample: \(\hat{\lambda} = \bar{x} = 174/120 = 1.45\). Because one parameter is estimated, the degrees of freedom for the chi-square distribution are DF = k - m - 1 = 5 - 1 - 1 = 3.

Customers/minute   Probability         Observed   Expected   (O-E)^2/E
0                  P(X=0) = 0.2346     25         28.148     0.3522
1                  P(X=1) = 0.3401     44         40.815     0.2485
2                  P(X=2) = 0.2466     35         29.591     0.9887
3                  P(X=3) = 0.1192      8         14.302     2.7771
4 or more          P(X≥4) = 0.0595      8          7.143     0.1028
Sum                                                          4.4693

Since \(X_0^2 = 4.47 < \chi^2_{.10,3} = 6.251\), H0 is not rejected: the data are consistent with a Poisson distribution.

Example

• Oil and gas exploration is both expensive and risky.
• The average cost of a “dry hole” is in excess of $20 million.
• New technologies are always under development in an effort to reduce the likelihood of drilling a “dry hole”, and so to increase profitability.
• Suppose an experimental technology has been developed that claims to have an 80% success rate.
• This technology was tested by drilling four holes and counting the number of productive wells. This was done 100 times. The data are recorded below:

Number of productive wells    0    1    2    3    4
Observed frequency            3    6   22   41   28

Test the appropriate hypothesis at the α = .01 level of significance.

H0: the new technology delivers success according to a binomial distribution with p = .8
or equivalently
H0: p(0 or 1) = .0272, p(2) = .1536, p(3) = .4096, p(4) = .4096
H1: the new technology does not deliver success according to a binomial distribution with p = .8

No. of productive wells   Observed (Oi)   Expected (Ei)   (O-E)^2/E
0 or 1                     9               2.72           14.50
2                         22              15.36            2.87
3                         41              40.96            0.00
4                         28              40.96            4.10
Sum                                                       21.47

DF = k - m - 1 = 4 - 0 - 1 = 3. Since \(X_0^2 = 21.47 > \chi^2_{.01,3} = 11.345\), H0 is rejected: there is evidence that the new technology does not deliver success according to a binomial distribution with p = .8.

If p were unknown, it would have to be estimated from the data as the proportion of productive wells among all wells drilled, \(\hat{p} = 285/400 = 0.7125\), and one degree of freedom would be lost: df = k - m - 1 = 4 - 1 - 1 = 2. The calculations are then repeated with \(\hat{p}\) in place of .8.

Example

The following data are tensile strengths of concrete:

320 380 340 410 380 340 360 350 320 370
350 340 350 360 370 350 380 370 300 420
370 390 390 440 330 390 330 360 400 370
320 350 360 340 340 350 350 390 380 340
400 360 350 390 400 350 360 340 370 420
420 400 350 370 330 320 390 380 400 370
390 330 360 380 350 330 360 300 360 360
360 390 370 370 370 350 390 370 370 340
370 400 360 350 380 380 360 340 330 370
340 360 390 400 370 410 360 400 340 360

Test whether a normal distribution can safely be assumed for the data.

From the data, estimate the mean and the standard deviation and use these for standardization. We make 10 cells and compute the chi-square statistic over them. Two parameters are estimated, so DF = 10 - 2 - 1 = 7. Thus, we can assume a normal distribution for this data.

Independence Test

• A two-way contingency table is a set of frequencies that summarizes how a set of objects is simultaneously classified under two different categorizations.
• Two factors A and B are independent if P(A and B) = P(A) P(B).
• Suppose A has categories \(A_1, A_2, \ldots, A_a\) and B has categories \(B_1, B_2, \ldots, B_b\). Then we must have \(P(A_i B_j) = P(A_i) P(B_j)\) for i = 1, 2, ..., a and j = 1, 2, ..., b, i.e. micro-independence is required to establish macro-independence.
• The Chi-Square Test of Independence is used to determine whether two factors are related to one another and, if so, to identify the nature of the relationship.

Hypotheses

• Let A = a “row” factor with a categories.
• Let B = a “column” factor with b categories.
• Let the probability that an observation is classified into the ith row be \(p_{i\cdot}\).
• Let the probability that an observation is classified into the jth column be \(p_{\cdot j}\).
• Let \(p_{ij}\) be the probability that an observation is classified into the ith row and jth column.
• Given the preceding development, the concept of independence is formally expressed as:
  H0: \(p_{ij} = p_{i\cdot}\, p_{\cdot j}\) for all i, j combinations
  H1: \(p_{ij} \ne p_{i\cdot}\, p_{\cdot j}\) for at least one i, j combination

Test Statistics

• Let \(n_{i\cdot}\) be the number of items in the ith row.
• Let \(n_{\cdot j}\) be the number of items in the jth column.
• Let n be the total number of sample items.
• Let \(O_{ij}\) be the number of items at the intersection of the ith row and jth column.
• \(p_{i\cdot}\) is estimated by \(\hat{p}_{i\cdot} = n_{i\cdot} / n\).
• \(p_{\cdot j}\) is estimated by \(\hat{p}_{\cdot j} = n_{\cdot j} / n\).
• \(p_{ij}\) is estimated by \(\hat{p}_{ij} = O_{ij} / n\) if the two factors are dependent.
• \(p_{ij}\) is estimated by \(\hat{p}_{ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}\) if the two factors are independent.
• To test H0 vs H1 we compare the two estimates of \(p_{ij}\), after each estimate has been weighted by the amount of evidence that we have, n.
• Doing this, squaring the comparisons and standardizing the result yields the \(\chi^2\) statistic for independence.

The chi-square statistic for independence is:

\[ X_0^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where \(E_{ij} = n\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = n_{i\cdot}\, n_{\cdot j} / n\) is the expected number of values at the intersection of the ith row and jth column under independence.

• \(X_0^2\) will be “small” in value if \(O_{ij}\) and \(E_{ij}\) are close, which would support independence.
• If \(X_0^2\) is “large”, it is due to discrepancy between \(O_{ij}\) and \(E_{ij}\) in at least one cell, and perhaps numerous cells. This supports dependence.
• The cells which contribute most to \(X_0^2\) likely have the most to say about the nature of any dependence structure.
• This test has (a-1)(b-1) degrees of freedom.
• Critical values of \(X_0^2\) are found from the chi-square distribution with (a-1)(b-1) degrees of freedom.

Example

Suppose that we wish to determine whether the opinions of the voting residents of the state of Illinois concerning a new tax reform are independent of their levels of income. A random sample of 1000 registered voters from the state of Illinois is classified as to whether they are in a low, medium or high income bracket and whether or not they favor a new tax reform. The observed frequencies are presented below, with the expected frequencies under independence in parentheses (for example, 598*336/1000 = 200.93).

                       Income Level
Tax Reform   Low             Medium          High            Total
For          182 (200.93)    213 (209.90)    203 (187.17)     598
Against      154 (135.07)    138 (141.10)    110 (125.83)     402
Total        336             351             313             1000

DF = (3-1)(2-1) = 2. The six cell contributions sum to \(X_0^2 \approx 7.88\). The significance level is not stated; at α = .01 the critical value is \(\chi^2_{.01,2} = 9.210\), so H0 is not rejected, whereas at α = .05 (critical value 5.991) it would be rejected. That is, at the .01 level there is no evidence that the opinions of the voting residents of the state of Illinois concerning a new tax reform are not independent of their levels of income.

Example

The organization responsible for administration of the Customer Satisfaction Index in Sweden examined customer satisfaction and employee empowerment indices for a sample of 500 Swedish companies. Categories for each index were “very low”, “low”, “moderate”, “high” and “very high”. The table cross-classifies the companies by employee empowerment and customer satisfaction category.

             Very low   Low   Moderate   High   Very high   Total
Very low        13       11       8        5        3         40
Low             10       18      19       12        6         65
Moderate        18       32      42       44       34        170
High            12       16      34       57       61        180
Very high        1        3       8       14       19         45
Total           54       80     111      132      123        500

For example, the expected frequency in the first cell is E = 54*40/500 = 4.32.

Is there a discernible relationship between customer satisfaction and employee empowerment?
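The independence test described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slides: the function names `independence_test` and `upper_tail_even_df` are hypothetical, only the standard library is used, and the p-value helper relies on a closed form that holds only when the degrees of freedom are even (true for both tables here, with 2 and 16). For the 5x5 table the bottom-right cell is entered as 19, the value implied by the stated row total of 45 and column total of 123; that is an assumption about the data.

```python
import math


def independence_test(observed):
    """Return (statistic, df, expected) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: E_ij = n_i. * n_.j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


# Rows: For, Against; columns: Low, Medium, High income.
illinois = [[182, 213, 203],
            [154, 138, 110]]

# 5x5 satisfaction/empowerment table, rows and columns in the order given.
# Assumption: last cell entered as 19 so that the totals 45 and 123 are met.
sweden = [[13, 11, 8, 5, 3],
          [10, 18, 19, 12, 6],
          [18, 32, 42, 44, 34],
          [12, 16, 34, 57, 61],
          [1, 3, 8, 14, 19]]

for name, table in [("Illinois", illinois), ("Sweden", sweden)]:
    stat, df, expected = independence_test(table)
    p_value = upper_tail_even_df(stat, df)
    smallest = min(min(row) for row in expected)
    print(f"{name}: X0^2 = {stat:.3f}, df = {df}, "
          f"p = {p_value:.4f}, smallest E = {smallest:.2f}")
```

For the Illinois table this gives a statistic of about 7.88 on 2 degrees of freedom. The "smallest E" column is printed because the 5x5 table has expected counts below 5 (for example 4.32 in the first cell), so grouping adjacent categories before testing may be worth considering, in line with the rule of thumb given for the goodness-of-fit test.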