15. Nonparametric Methods: Chi-Square Applications
(Nonparametric Statistical Methods I: Contingency Table Applications)
Outline
• Chi-square distribution
• Chi-square tests
– Goodness-of-fit test: whether the data are drawn from a population with
• a uniform distribution (all categories equally likely),
• some specified distribution, or
• a normal distribution.
– Independence test: whether two discrete variables are independent or not.
• Nonparametric or distribution-free tests
– No assumption about the population distribution is required.
– Recall: when testing a population mean with unknown population variance, the t-test is available only when X ~ normal.
• Chi-square tests are available for categorical data.
• Categorical data have a nominal or an ordinal scale.
– Nominal: classification, e.g. gender, marital status.
– Ordinal: classification + rank, e.g. performance = {bad, fair, good}.
– A categorical data set is always summarized into a frequency/contingency table.
Population: 1 1 1 1 ..., 2 2 2 2 ..., ....., k k k k ...

The population data can be summarized into a frequency table:

category     1     2     …     k     total
frequency    f1    f2    …     fk    N

Further, the population distribution can be calculated:

category     1            2            …     k            total
Prob.        π1 = f1/N    π2 = f2/N    …     πk = fk/N    1

[Figure: bar chart of the population distribution over categories 1, 2, 3, ....., k]
Population (1 1 1 1 ..., 2 2 2 2 ..., ....., k k k k ...) → sampling → Sample (1 1 1 1 ..., 2 2 2 ..., …, k k k k ...)

A sample can also be summarized into a frequency table:

category     1      2      …     k      total
frequency    fo1    fo2    …     fok    n
A goodness-of-fit test: whether the population distribution is equal to a specific null distribution.

[Figure: the population distribution over categories 1, 2, 3, ....., k is tested against "H0: some distribution", using the distribution of a sample drawn from that population]
Chi-square (χ²) distribution
• The chi-square distribution is the sampling distribution of the chi-square test statistic.
– The statistic is never negative.
– There is a family of chi-square distributions, one for each number of degrees of freedom.
– The chi-square distribution is positively skewed.
– Theoretically, if $X_1, \ldots, X_n$ are independent N(0,1), then $\sum_{i=1}^{n} X_i^2 \sim \chi^2_n$.
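The theoretical definition above can be checked by simulation. The following is a minimal sketch, assuming only Python's standard library; the function name `simulate_chi_square`, the seed and the number of draws are illustrative choices, not part of the lecture. It relies on the standard fact that a chi-square variable with n degrees of freedom has mean n.

```python
import random


def simulate_chi_square(df, draws=20000, seed=0):
    """Draw `draws` values of X_1^2 + ... + X_df^2 with X_i ~ independent N(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]


samples = simulate_chi_square(df=5)
mean = sum(samples) / len(samples)
median = sorted(samples)[len(samples) // 2]

print("smallest value is non-negative:", min(samples) >= 0)
print("sample mean (should be close to df = 5):", round(mean, 2))
print("median below mean (positive skew):", median < mean)
```

Repeating the run with a larger `df` should show the skew becoming less pronounced, which is why the slides speak of a family of distributions.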
I. Goodness-of-fit (GOF) test
• Data:
– A sample of n categorical observations is summarized into a frequency/contingency table with k categories.
• Goal: to test
– whether the data are sampled from a specified distribution, that is,
– whether the population has a specified distribution.
• Hypotheses:
– H0: the population of the data follows a specified distribution.
– H1: H0 is not true.
Data: an observed frequency table, {foi, i = 1, ..., k}

Category   Observed frequency   Hypothetical distribution (H0)   Expected frequency
1          fo1                  π1                               fe1 = nπ1
2          fo2                  π2                               fe2 = nπ2
…          …                    …                                …
k          fok                  πk                               fek = nπk
Sum        n                    1                                n

Expected frequency: the frequency expected in each category when H0 is true,
$f_{ei} = n \times \pi_i$
• Strategy:
– Compare the observed frequency with the expected frequency in each category.
• Test statistic: the chi-square test statistic
$$\chi^2 = \sum_{\text{category}} \left\{ \frac{(f_o - f_e)^2}{f_e} \right\}$$
– When the null hypothesis is true, the test statistic has a chi-square distribution with (k-1-p) degrees of freedom.
– k = number of categories.
– p = number of estimates used in calculating the expected frequencies fe.
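The statistic and its degrees of freedom can be sketched in a few lines. This is only an illustration: the helper names `chi_square_statistic` and `degrees_of_freedom` and the three-category counts are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
def chi_square_statistic(observed, probabilities):
    """Return (chi-square statistic, expected frequencies) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probabilities]          # fe_i = n * pi_i
    statistic = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return statistic, expected


def degrees_of_freedom(k, p=0):
    """k categories, p parameters estimated from the sample."""
    return k - 1 - p


# Hypothetical data: 40 observations in 3 categories, H0: pi = (0.25, 0.25, 0.5)
observed = [12, 8, 20]
stat, expected = chi_square_statistic(observed, [0.25, 0.25, 0.5])

print(expected)                            # expected frequencies under H0
print(round(stat, 4))                      # (2^2)/10 + (-2)^2/10 + 0 = 0.8
print(degrees_of_freedom(len(observed)))   # k - 1 - p with p = 0
```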
• Why (k-1-p) degrees of freedom? In the test statistic,
– Observed frequencies: with total sample size n and k categories,
• there are k observed frequencies, $f_{o1}, \ldots, f_{ok}$;
• however, they satisfy one restriction, $\sum_{i=1}^{k} f_{oi} = n$;
• thus the k observed frequencies carry k-1 degrees of freedom.
– Expected frequencies: all k of them are computed from p estimates,
• thus they use up p degrees of freedom.
• In summary, d.f. = (k-1) - p.
Decision rule:
• Rejection region: a right-tailed test!
– H0 should be rejected if the test statistic χ² is significantly large.
• Critical value = ?
– Under the null hypothesis, χ² ~ a chi-square distribution with (k-1-p) d.f.
→ Check the chi-square distribution table in Appendix I.
– At significance level α, H0 is rejected if
$$\chi^2 \ge \chi^2_{(k-1-p,\,\alpha)}$$
where $\chi^2_{(k-1-p,\,\alpha)}$ can be found in Appendix I with d.f. (k-1-p) and right-tailed probability α.
Example. Appendix I, Table 15-3
$\chi^2_{(5,\,0.05)} = ?$   $\chi^2_{(10,\,0.1)} = ?$   $\chi^2_{(9,\,0.01)} = ?$
Example. Chart 15-1
$\chi^2_{(5,\,0.05)} = 11.070$
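When the printed table is not at hand, the same critical values can be approximated numerically. The sketch below is one possible approach, not the textbook's method: `chi2_sf` and `chi2_critical` are hypothetical helper names, the right-tail probability is built up two degrees of freedom at a time from the known tails for 1 and 2 d.f., and the critical value is found by bisection. It assumes an integer number of degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(chi-square with `df` d.f. >= x), integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)                 # exact tail for 2 d.f.
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))      # exact tail for 1 d.f. (Z squared)
    while a < df / 2.0 - 1e-9:                   # each pass adds 2 d.f.
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical(df, alpha):
    """Value c with P(chi-square_df >= c) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:               # widen until the tail is small enough
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for df, alpha in [(5, 0.05), (10, 0.10), (9, 0.01)]:
    print(f"df={df}, alpha={alpha}: critical value {chi2_critical(df, alpha):.3f}")
```

The first line of output should agree with the value 11.070 quoted from Chart 15-1; the other two can be compared against Appendix I.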
• Types of problems with GOF tests:
– Categorical data:
• Equal expected frequencies:
– The population has a "uniform" distribution.
$H_0: \pi_1 = \pi_2 = \cdots = \pi_k = 1/k$
[Figure: bar chart with equal bars over categories 1, 2, 3, ....., k]
• Unequal expected frequencies: the general case.
$H_0: \pi_1 = \pi_{10},\ \pi_2 = \pi_{20},\ \ldots,\ \pi_k = \pi_{k0}$
[Figure: bar chart with unequal bars over categories 1, 2, 3, ....., k]
• Types of problems with GOF tests (continued):
– Continuous data:
• Normal distribution.
– Test the normality assumption before further analysis.
• Example. The following procedures need the normality assumption:
– Testing a single population mean with unknown variance and a small sample size (t-test).
– Testing the difference of two population means with a common unknown variance and small sample sizes (t-test).
– ANOVA
– Linear regression
Type I: Categorical data
• Step 1. State hypotheses
– $H_0: \pi_1 = \pi_{10},\ \pi_2 = \pi_{20},\ \ldots,\ \pi_k = \pi_{k0}$
• Step 2. Select the significance level, α = 0.05
• Step 3. Select the test statistic: chi-square test
$$\chi^2 = \sum_{\text{category}} \left\{ \frac{(f_o - f_e)^2}{f_e} \right\}$$
– foi: observed frequency of the ith category
– fei: expected frequency of the ith category = n×πi0
• Step 4. Formulate the decision rule:
– Under H0, the distribution is completely determined.
– No estimate is needed for calculating the expected frequencies.
– Thus, p = 0.
– Then, df = k-1.
– H0 is rejected if $\chi^2 \ge \chi^2_{(k-1,\,\alpha)}$.
• Step 5. Collect data, calculate the test statistic and draw a conclusion.
Example. P523 Equal expected frequencies
Jan plans to begin a series of sports cards. One of the problems is the selection of the former players. At the end of a weekend she had sold a total of n = 120 cards. The number of cards sold for each player is given below. Can she conclude that the sales are not the same for each player?
X = player's card = {T, N, Ty, G, H, J}, a nominal variable

Player   Cards sold (observed frequency)   Hypothetical distribution   Expected frequency
T        13                                1/6                         nπ1 = 120(1/6) = 20
N        33                                1/6                         nπ2 = 120(1/6) = 20
Ty       14                                1/6                         nπ3 = 120(1/6) = 20
G        7                                 1/6                         nπ4 = 120(1/6) = 20
H        36                                1/6                         nπ5 = 120(1/6) = 20
J        17                                1/6                         nπ6 = 120(1/6) = 20
Total    120                               1                           120
• Step 1. State hypotheses
– H0: equal sales proportions. H1: H0 is not true.
– $H_0: \pi_T = \pi_N = \pi_{Ty} = \pi_G = \pi_H = \pi_J = 1/6$
• Step 2. Select the significance level, α = 0.05
• Step 3. Select the test statistic: chi-square test, χ²
• Step 4. Formulate the decision rule:
– k = 6
– p = 0, no estimate is needed for the expected frequencies.
– df = k-1-p = 6-1 = 5
– H0 is rejected if $\chi^2 \ge \chi^2_{(5,\,0.05)} = 11.07$
• Step 5. Draw conclusion:

Player   fo (observed)   π (hypothetical)   fe = nπ (expected)   fo-fe   (fo-fe)²   (fo-fe)²/fe
T        13              1/6                120(1/6) = 20        -7      49         2.45
N        33              1/6                120(1/6) = 20        13      169        8.45
Ty       14              1/6                120(1/6) = 20        -6      36         1.80
G        7               1/6                120(1/6) = 20        -13     169        8.45
H        36              1/6                120(1/6) = 20        16      256        12.80
J        17              1/6                120(1/6) = 20        -3      9          0.45
Total    120             1                  120                  0                  34.40

– Since χ² = 34.40 > 11.07, the null hypothesis is rejected at α = 0.05.
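The arithmetic of the sports card example can be reproduced as follows. This is a minimal sketch: the counts are the ones given in the example, the critical value 11.07 is taken from the slides rather than computed, and the variable names are illustrative.

```python
# Observed card sales per player, as given in the example
observed = {"T": 13, "N": 33, "Ty": 14, "G": 7, "H": 36, "J": 17}

n = sum(observed.values())          # total cards sold
k = len(observed)                   # number of categories
expected = n / k                    # equal expected frequency under H0

chi_square = sum((fo - expected) ** 2 / expected for fo in observed.values())
critical_value = 11.07              # chi-square(5, 0.05), quoted from the slides
reject_h0 = chi_square >= critical_value

print(f"n = {n}, fe = {expected}, df = {k - 1}")
print(f"chi-square = {chi_square:.2f}, reject H0: {reject_h0}")
```

Most of the statistic comes from players N, G and H, whose sales differ most from the expected 20 cards.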
Example. P530 Unequal expected frequencies
Table 15-4 gives the result of the AHAA report on the admissions of senior citizens to hospitals in a one-year period. The community of Bartow Estates would like to compare itself with AHAA's result. A sample of n = 150 local senior citizens was selected, and the number of admissions of each individual was surveyed. See Table 15-5.
Use significance level α = 0.05 to determine whether there is a difference between the national and the local pattern.
X = the number of admissions = {0, 1, 2, 3 or more}, a discrete variable
Table 15-4 Summary of study by AHAA and a survey of Bartow Estates residents

Number of times admitted   AHAA (%)   Number of Bartow residents (fo)   Expected number of residents (fe)
0                          40         55                                60
1                          30         50                                45
2                          20         32                                30
3 or more                  10         13                                15
Total                      100        150                               150
• Step 1. State hypotheses
– H0: no difference between the local and the national pattern. H1: H0 is not true.
– $H_0: \pi_0 = 0.4,\ \pi_1 = 0.3,\ \pi_2 = 0.2,\ \pi_{\ge 3} = 0.1$
• Step 2. Select the significance level, α = 0.05
• Step 3. Select the test statistic: chi-square test, χ²
• Step 4. Formulate the decision rule:
– k = 4
– p = 0, no estimate is needed for the expected frequencies.
– df = k-1-p = 4-1 = 3
– H0 is rejected if $\chi^2 \ge \chi^2_{(3,\,0.05)} = 7.815$
• Step 5. Draw conclusion.

No. of admissions   fo (observed)   π (hypothetical)   fe = nπ (expected)   fo-fe   (fo-fe)²   (fo-fe)²/fe
0                   55              0.40               150(0.4) = 60        -5      25         0.4167
1                   50              0.30               150(0.3) = 45        5       25         0.5556
2                   32              0.20               150(0.2) = 30        2       4          0.1333
≥3                  13              0.10               150(0.1) = 15        -2      4          0.2667
Total               150             1                  150                  0                  1.3723

• Since χ² = 1.3723 < 7.815, the null hypothesis of no difference is not rejected at level 0.05.
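The same calculation with unequal expected frequencies looks like this. It is a sketch under the example's own numbers; the critical value 7.815 is quoted from the slides. Note that summing the unrounded terms gives about 1.3722, while the table's 1.3723 comes from adding the four terms after rounding each to four decimals.

```python
observed = [55, 50, 32, 13]             # Bartow residents admitted 0, 1, 2, 3+ times
national = [0.40, 0.30, 0.20, 0.10]     # AHAA proportions under H0

n = sum(observed)
expected = [n * p for p in national]    # 60, 45, 30, 15
terms = [(fo - fe) ** 2 / fe for fo, fe in zip(observed, expected)]
chi_square = sum(terms)

critical_value = 7.815                  # chi-square(3, 0.05), quoted from the slides
reject_h0 = chi_square >= critical_value

print([round(t, 4) for t in terms])
print(f"chi-square = {chi_square:.4f}, reject H0: {reject_h0}")
```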
Limitations of the chi-square test: P531
• The chi-square test is an approximate (asymptotic) test.
• The approximation is valid only when n, and hence every expected frequency fe, is sufficiently large.
– When some expected frequencies are small (a small denominator in the statistic), the chi-square statistic can become very large and unstable, which may lead to an erroneous conclusion.
• Rule of thumb: when are the expected frequencies insufficient?
1. If k = 2 cells: when either fe < 5.
2. If k > 2 cells: when more than 1/5 of the cells have fe < 5.
In these cases the chi-square test should not be used, or categories must be combined.
• What to do when some fe are not large enough?
– Combine categories.
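The rule of thumb is easy to encode. The function below is a hypothetical helper written for illustration (its name and the `threshold` parameter are not from the textbook); it is applied to the expected frequencies of the management-level example in these slides, before and after combining categories.

```python
def expected_counts_adequate(expected, threshold=5):
    """Apply the rule of thumb: True if the chi-square approximation is acceptable."""
    k = len(expected)
    small = sum(1 for fe in expected if fe < threshold)
    if k == 2:
        return small == 0               # with 2 cells, no fe may be below 5
    return small * 5 <= k               # with k > 2 cells, at most 1/5 may be below 5


before = [32, 113, 87, 24, 2, 4, 1]     # three of seven cells have fe < 5
after = [32, 113, 87, 24, 7]            # the three small cells combined

print(expected_counts_adequate(before))
print(expected_counts_adequate(after))
```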
Example. P531-532 (Page 560)
• Before combining categories:

Level of management        fo    fe    fo-fe   (fo-fe)²   (fo-fe)²/fe   % of chi-square
Foreman                    30    32    -2      4          0.125         0.89
Supervisor                 110   113   -3      9          0.080         0.57
Manager                    86    87    -1      1          0.011         0.08
Middle management          23    24    -1      1          0.042         0.30
Assistant vice president   5     2     3       9          4.500         32.12
Vice president             5     4     1       1          0.250         1.78
Senior vice president      4     1     3       9          9.000         64.25
Total                      263   263                      14.008        100.00

$\chi^2 = 14.008 > \chi^2_{(6,\,0.05)} = 12.592$
• p-value = 0.0295, so H0 is rejected at α = 0.05.
• However, about 98% (32.1% + 1.8% + 64.3%) of the chi-square value is due to the three vice president categories, and all three have fe < 5.
Example. P532 (Page 561), after combination

Level of management   fo    fe    fo-fe   (fo-fe)²   (fo-fe)²/fe   % of chi-square
Foreman               30    32    -2      4          0.125         1.72
Supervisor            110   113   -3      9          0.080         1.10
Manager               86    87    -1      1          0.011         0.16
Middle management     23    24    -1      1          0.042         0.57
Vice president        14    7     7       49         7.000         96.45
Total                 263   263                      7.258         100.00

• After combining the three vice president categories,
$\chi^2 = 7.258 < \chi^2_{(4,\,0.05)} = 9.488$, p-value = 0.1229.
• H0 is not rejected at α = 0.05.
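The effect of combining categories can be reproduced directly from the two tables. The sketch below uses illustrative helper names; `chi2_tail_even_df` relies on the closed-form right tail of the chi-square distribution that holds when the degrees of freedom are even, which happens to be the case here (df = 6 and df = 4).

```python
import math


def chi_square(observed, expected):
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def chi2_tail_even_df(x, df):
    """P(chi-square_df >= x) for EVEN df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!"""
    h = x / 2.0
    return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))


fo = [30, 110, 86, 23, 5, 5, 4]
fe = [32, 113, 87, 24, 2, 4, 1]

# Combine the last three (vice president) categories into one
fo_combined = fo[:4] + [sum(fo[4:])]
fe_combined = fe[:4] + [sum(fe[4:])]

before = chi_square(fo, fe)
after = chi_square(fo_combined, fe_combined)
p_before = chi2_tail_even_df(before, df=len(fo) - 1)
p_after = chi2_tail_even_df(after, df=len(fo_combined) - 1)

print(f"before: chi-square = {before:.3f}, p-value = {p_before:.4f}")
print(f"after:  chi-square = {after:.3f}, p-value = {p_after:.4f}")
```

The conclusion flips from "reject" to "do not reject" once the three sparse cells stop dominating the statistic.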
Type II: Continuous data (Optional)
• Data: a continuous variable
• Check an assumption about the population distribution, for example:
– Normal
– t-distribution
– F-distribution
• Strategy:
– Data processing:
• Summarize the continuous-scale observations into a frequency table.
• Original continuous data → grouping → frequency table (Ch. 2) → fo
– Hypothetical distribution:
• Calculate the probability of each group under the assumed distribution.
• The expected frequency of each group = n × prob. → fe
Example. Normal assumption
The president of Duval University collected data on the annual salaries of full professors at 160 colleges.
The sample mean salary = 54.03 K and the sample standard deviation = 13.76 K.
With proper grouping, the frequency distribution of these annual salaries is given in Table 15-7.
Do the observed frequencies coincide with the expected frequencies based on the normal distribution? Can we conclude that the distribution of salaries is normal?
• Step 1. State hypotheses
– H0: X ~ normal. H1: H0 is not true.
• Step 2. Select the significance level, α = 0.05
• Step 3. Select the test statistic: chi-square test, χ²
– fo: obtained by summarizing the original data into a frequency table (see Table 15-7).
– fe: estimated under the null hypothetical distribution.
Summarize the original data into a frequency table

Table 15-7
Salary    fo
20~30     4
30~40     20
40~50     41
50~60     44
60~70     29
70~80     16
80~90     2
90~100    4
Total     160
How to estimate fe?
1. Estimate the hypothetical probability in each cell.

Table 15-7-1
Salary   Hypothetical probability
<30      P(X<30)
30~40    P(30<X<40)
40~50    P(40<X<50)
50~60    P(50<X<60)
60~70    P(60<X<70)
70~80    P(70<X<80)
≥80      P(80<X)
Total

P(a < X < b) = ?
How to estimate P(a < X < b)?
Under H0, X ~ Normal. If µ and σ are known, then after standardization the probability can be found in Appendix D:
$$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right)$$
Standardize, then check the table!
How to estimate P(a < X < b)? (continued)
However, µ and σ are unknown. We use the sample mean and the sample standard deviation to estimate the true values.
• Example. $\bar{x} = 54.03 \approx \mu$, $s = 13.76 \approx \sigma$
$$\begin{aligned}
P(70 < X < 80) &= P\left(\frac{70-\mu}{\sigma} < Z < \frac{80-\mu}{\sigma}\right) \\
&\approx P\left(\frac{70-54.03}{13.76} < Z < \frac{80-54.03}{13.76}\right) \\
&= P(1.16 < Z < 1.89) \\
&= 0.4706 - 0.3770 = 0.0936
\end{aligned}$$
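The table lookup can be mimicked with the error function from Python's standard library. This is a sketch, not the textbook's procedure: `normal_cdf` and `interval_probability` are hypothetical helper names, and z is rounded to two decimals only so that the result matches what a printed normal table gives.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probability(a, b, mean, sd, round_z=True):
    """P(a < X < b) for X ~ Normal(mean, sd), standardizing first."""
    z_a, z_b = (a - mean) / sd, (b - mean) / sd
    if round_z:                          # mimic a table with two-decimal z values
        z_a, z_b = round(z_a, 2), round(z_b, 2)
    return normal_cdf(z_b) - normal_cdf(z_a)


p_table = interval_probability(70, 80, mean=54.03, sd=13.76)
p_exact = interval_probability(70, 80, mean=54.03, sd=13.76, round_z=False)

print(f"with z rounded to 2 decimals: {p_table:.4f}")
print(f"without rounding z:           {p_exact:.4f}")
```

The small gap between the two results shows how much of the tabulated 0.0936 is due to rounding z.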
Table 15-7-2
Salary   Estimated hypothetical probability
<30      P(X<30) ≈ P(Z<-1.75) = 0.0401
30~40    P(30<X<40) ≈ P(-1.75<Z<-1.02) = 0.1138
40~50    P(40<X<50) ≈ P(-1.02<Z<-0.29) = 0.2320
50~60    P(50<X<60) ≈ P(-0.29<Z<0.43) = 0.2805
60~70    P(60<X<70) ≈ P(0.43<Z<1.16) = 0.2106
70~80    P(70<X<80) ≈ P(1.16<Z<1.89) = 0.0936
≥80      P(80<X) ≈ P(1.89<Z) = 0.0294
Total    1.0000
For n = 160,

Table 15-7-3
Salary   Estimated hypothetical probability   Expected frequency (fe = 160 × prob.)
<30      0.0401                               6.416
30~40    0.1138                               18.208
40~50    0.2320                               37.120
50~60    0.2805                               44.880
60~70    0.2106                               33.696
70~80    0.0936                               14.976
≥80      0.0294                               4.704
Total    1.0000                               160.000
• Step 4. Formulate the decision rule.
– Rejection region and critical value.
– There are k = 7 categories in the frequency table.
– The sample values $\bar{x}$ and $s$ are used to estimate the population µ and σ and, further, the expected frequencies. Thus p = 2.
– The d.f. = k-1-p = 7-1-2 = 4.
– At level α = 0.05, the null hypothesis is rejected if $\chi^2 \ge \chi^2_{(4,\,0.05)} = 9.488$.
• Step 5. Draw conclusion

Table 15-9
Salary   fo    fe        fo-fe    (fo-fe)²   (fo-fe)²/fe
<30      4     6.416     -2.416   5.837      0.910
30~40    20    18.208    1.792    3.211      0.176
40~50    41    37.120    3.880    15.054     0.406
50~60    44    44.880    -0.880   0.774      0.017
60~70    29    33.696    -4.696   22.052     0.654
70~80    16    14.976    1.024    1.049      0.070
≥80      6     4.704     1.296    1.680      0.357
Total    160   160.000                       2.590

• Since $\chi^2 = 2.59 < \chi^2_{(4,\,0.05)} = 9.488$, the null hypothesis is not rejected at α = 0.05. The data are consistent with a normal distribution of full professors' salaries.
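The whole normality check can be put together in a few lines. The sketch below takes the class probabilities from Table 15-7-2 as given rather than recomputing them, and quotes the critical value 9.488 from the slides; the variable names are illustrative.

```python
# Class probabilities under H0 (Table 15-7-2) and observed counts (Table 15-9)
probabilities = [0.0401, 0.1138, 0.2320, 0.2805, 0.2106, 0.0936, 0.0294]
observed = [4, 20, 41, 44, 29, 16, 6]

n = sum(observed)
expected = [n * p for p in probabilities]
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

k = len(observed)
p_estimated = 2                      # mu and sigma were estimated from the sample
df = k - 1 - p_estimated

critical_value = 9.488               # chi-square(4, 0.05), quoted from the slides
reject_h0 = chi_square >= critical_value

print(f"df = {df}, chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```

Forgetting to subtract the two estimated parameters would give df = 6 and a larger critical value, which makes the test less likely to detect a departure from normality.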
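How to obtain $\bar{x}$ and $s$ with grouped data?
• Sometimes the data are grouped already.
• When the original data have been grouped, use the formulas from Chap. 3 and 4:
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}}{n-1}}$$
• where $x_i$ = class mark and $f_i$ = frequency of class i.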
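Both forms of the grouped-data standard deviation can be checked against each other. In this sketch the frequencies are those of Table 15-7, while taking each class mark as the class midpoint (25, 35, ..., 95) is an assumption made for illustration; `grouped_mean_sd` is a hypothetical helper name.

```python
import math


def grouped_mean_sd(class_marks, frequencies):
    """Return (mean, sd by definition, sd by shortcut formula) for grouped data."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(class_marks, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(class_marks, frequencies))
    mean = sum_fx / n
    sd_definition = math.sqrt(
        sum(f * (x - mean) ** 2 for x, f in zip(class_marks, frequencies)) / (n - 1)
    )
    sd_shortcut = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))
    return mean, sd_definition, sd_shortcut


class_marks = [25, 35, 45, 55, 65, 75, 85, 95]   # assumed midpoints of 20~30, ..., 90~100
frequencies = [4, 20, 41, 44, 29, 16, 2, 4]      # Table 15-7

mean, sd_def, sd_short = grouped_mean_sd(class_marks, frequencies)
print(f"mean = {mean:.3f}, s = {sd_def:.3f} (definition), s = {sd_short:.3f} (shortcut)")
```

The grouped estimates differ somewhat from the 54.03 and 13.76 obtained from the raw data, because grouping replaces every salary by its class mark.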
II. Independence test: an r×c contingency table analysis
• Data:
– For each subject, two discrete variables, X1 and X2, are studied.
– The data are summarized in an r×c contingency table.
• Example: provincial origin (4 levels) vs. party affiliation (2 levels)

                          Party affiliation
Provincial origin         Blue   Green   total
Taiwanese (benshengren)   400    400     800
Mainlander (waishengren)  180    20      200
Hakka                     50     50      100
Indigenous                50     50      100
total                     480    520     1000

• Research question: "Is there a relationship between X1 and X2?"
• Example (P534). "Does a released prisoner make a different adjustment to civilian life if
1. he returns to his hometown to live, or
2. he goes elsewhere to live?"
• Is there a relationship between place of residence (X1) and adjustment to civilian life (X2) after release from prison?
• Data: a 2×4 table by X1 = "residence", X2 = "adjustment"
• Determine:
– Is there a relationship between "residence" and "adjustment"?
– Is "residence" independent of "adjustment"?

Table 15-10 Adjustment to civilian life and place of residence

                 Adjustment to civilian life (X2)
Residence (X1)   outstanding   good   fair   unsatisfactory   total
hometown         27            35     33     25               120
elsewhere        13            15     27     25               80
total            40            50     60     50               200
Data: an r×c contingency table
Is there a relationship between X1 and X2?

           X2
X1         1       …    c       total
1          fo11    …    fo1c    n11
…          …       …    …       …
r          for1    …    forc    n1r
total      n21     …    n2c     n
• Step 1. Hypotheses
– H0: X1 and X2 are independent
– H1: X1 and X2 are not independent
• Step 2. Significance level α
• Step 3. Test statistic: chi-square test
$$\chi^2 = \sum_{i:\,\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$$
– where foi = observed frequency
– and fei = expected frequency = ?
Expected frequency, fe
• fei = the expected frequency under the null hypothesis of independence.
• Recall: if A and B are independent events, then
P(A∩B) = P(A and B both occur) = P(A) P(B)
• If X1 and X2 are independent, the probability of each cell is
– joint prob. = (row marginal prob.) × (column marginal prob.)
$$P(X_1 = i, X_2 = j) = P(X_1 = i)\,P(X_2 = j) \approx \left(\frac{n_{1i}}{n}\right)\left(\frac{n_{2j}}{n}\right)$$
– where $n_{1i}$ = row marginal total
– and $n_{2j}$ = column marginal total
• Thus, the expected frequency fe = n × (cell joint prob.):
$$f_e = n \times \left(\frac{n_{1i}}{n}\right)\left(\frac{n_{2j}}{n}\right) = \frac{(\text{row total})(\text{column total})}{n}$$
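The formula (row total)(column total)/n can be applied to a whole table at once. The sketch below uses the counts of Table 15-10; `expected_frequencies` is a hypothetical helper name.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]


# Table 15-10: rows = hometown, elsewhere; columns = outstanding, good, fair, unsatisfactory
observed = [
    [27, 35, 33, 25],
    [13, 15, 27, 25],
]

for row in expected_frequencies(observed):
    print(row)
```

Each row and each column of the expected table has the same total as in the observed table, which is why only (r-1)(c-1) cells are free.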
Degrees of freedom: k – 1 – p
• For a chi-square test, df = k-1-p.
• If r = number of rows and c = number of columns, then
k = total number of cells = (number of rows) × (number of columns) = r × c
• And p = number of estimates needed to calculate fe = ?
– All fe's are determined by the row and column marginal probabilities.
– For r rows, there are (r-1) free row marginal probabilities, because they sum to 1.
– For c columns, there are (c-1) free column marginal probabilities.
– Thus, p = (r-1) + (c-1).
• Thus, df = k – 1 – p = rc – 1 – (r–1) – (c–1) = (r-1)(c-1)
• Step 4. Rejection region
– A one-sided (right-tailed) chi-square test with df = (r-1)(c-1)
– H0 is rejected at level α if
$$\chi^2 \ge \chi^2_{((r-1)(c-1),\,\alpha)}$$
• Step 5. Conclusion: calculate the chi-square test statistic and draw a conclusion based on the decision rule in Step 4.
Example. P566
• Step 1. Hypotheses
– H0: "Residence" and "Adjustment" are independent
– H1: "Residence" and "Adjustment" are not independent
• Step 2. Significance level α = 0.01
• Step 3. Test statistic: chi-square test
$$\chi^2 = \sum_{i:\,\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$$
– where foi = observed frequency
– and fei = expected frequency
• Step 4. Rejection region: r = 2, c = 4
– A one-sided (right-tailed) chi-square test with df = (r-1)(c-1) = 3
– H0 is rejected at level α = 0.01 if $\chi^2 \ge \chi^2_{(3,\,0.01)} = 11.345$
Step 5. Conclusion:

                        Adjustment
Residence               outstanding        good               fair               unsatisfactory     total
hometown (observed)     27                 35                 33                 25                 120
hometown (expected)     120×40/200 = 24    120×50/200 = 30    120×60/200 = 36    120×50/200 = 30
elsewhere (observed)    13                 15                 27                 25                 80
elsewhere (expected)    80×40/200 = 16     80×50/200 = 20     80×60/200 = 24     80×50/200 = 20
total                   40                 50                 60                 50                 200

$$\chi^2 = \sum_{i:\,\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}} = \frac{(27-24)^2}{24} + \frac{(35-30)^2}{30} + \cdots + \frac{(25-20)^2}{20} = 5.729 < 11.345$$
• H0 is not rejected at α = 0.01: there is insufficient evidence of a relationship between adjustment to civilian life and place of residence.
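The full independence test for Table 15-10 can be sketched as follows. `independence_test` is a hypothetical helper name, and the critical value 11.345 is quoted from the slides rather than computed.

```python
def independence_test(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / n    # expected under independence
            statistic += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


observed = [
    [27, 35, 33, 25],    # hometown
    [13, 15, 27, 25],    # elsewhere
]

chi_square, df = independence_test(observed)
critical_value = 11.345              # chi-square(3, 0.01), quoted from the slides
reject_h0 = chi_square >= critical_value

print(f"df = {df}, chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```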
Exercise.
• GOF test:
– Uniform distribution : 19
– Some specified distribution : 21, 23
– Poisson distribution : 22 (optional)
• Independence test: 25, 27
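Bonus (+1%)
During this year's Lunar New Year, Taipei Bank issued the "Lucky Fruit" instant lottery ticket and claimed a winning rate of 30%.
1. Xiao-Ming randomly buys n = 90 tickets, of which 38 are winners. At significance level α = 5%, is the claim true? p-value = ?
2. Suppose Xiao-Ming changes the way he investigates: each time he randomly buys 3 tickets and records how many of them win. The results of 30 such random experiments are
0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3
At significance level α = 5%, is the claim true?
Hint: under the claimed winning rate of 30%, let X = the number of winners among three tickets; then
X ~ Binomial(3, 0.3)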