Chi Square Tests
> chisq.test(rbind(res.fair,res.bias))

        Pearson's Chi-squared test

data:  rbind(res.fair, res.bias)
X-squared = 10.7034, df = 5, p-value = 0.05759
Notice the p-value is fairly small, but as it is larger than 0.05 we still accept the null hypothesis by the usual 5% standard in this numeric example.
If you wish to see some of the intermediate steps, you may: the result of the test contains more information than
is printed. As an illustration, if we want just the expected counts we can ask for the expected component of the test result
> chisq.test(rbind(res.fair,res.bias))[['expected']]
                1  2        3  4        5        6
res.fair 33.33333 20 28.66667 34 32.66667 51.33333
res.bias 16.66667 10 14.33333 17 16.33333 25.66667
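The result of the test stores its pieces by name, so all of the available components can be listed with names. As a small sketch (the name res is arbitrary, and res.fair and res.bias are assumed to be defined as above)

> res = chisq.test(rbind(res.fair,res.bias))   # store the result of the test
> names(res)                                   # list all of the stored components
> res$p.value                                  # extract just the p-value
> res$expected                                 # the expected counts again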
Problems
12.1 In an effort to increase student retention, many colleges have tried block programs. Suppose 100 students are
broken into two groups of 50 at random. One half are in a block program, the other half not. The number of
years in attendance is then measured. We wish to test if the block program makes a difference in retention.
The data is:
Program      1 yr  2 yr  3 yr  4 yr  5+ yrs
Non-Block      18    15     5     8       4
Block          10     5     7    18      10
Do a test of hypothesis to decide if there is a difference between the two types of programs in terms of retention.
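As a sketch of one way to enter the data (the names non.block and block are not part of the problem), the rows can be typed in and then bound together for chisq.test:

> non.block = c(18,15,5,8,4)              # counts for 1,2,3,4,5+ years
> block = c(10,5,7,18,10)
> chisq.test(rbind(non.block,block))      # chi-squared test of homogeneity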
12.2 A survey of drivers was taken to see if they had been in an accident during the previous year and, if so, whether it
was a minor or major accident. The results are tabulated by age group:
                                AGE
Accident Type   under 18  18-25  26-40  40-65  over 65
  None                67     42     75     56       57
  minor               10      6      8      4       15
  major                5      5      4      6        1
Do a chi-squared hypothesis test of homogeneity to see if there is a difference in distributions based on age.
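One possible way to enter a table like this (just a sketch; the name accidents and the labels are a choice, not part of the problem) is as a matrix filled by row:

> counts = c(67,42,75,56,57, 10,6,8,4,15, 5,5,4,6,1)
> accidents = matrix(counts, nrow=3, byrow=TRUE)      # rows: none, minor, major
> rownames(accidents) = c("none","minor","major")
> colnames(accidents) = c("under 18","18-25","26-40","40-65","over 65")
> chisq.test(accidents)                               # chi-squared test of homogeneity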
12.3 A fish survey is done to see if the proportion of fish types is consistent with previous years. Suppose the 3 types
of fish recorded (parrotfish, grouper, and tang) are historically in a 5:3:4 proportion, and in a survey the following
counts are found
                 Type of Fish
           Parrotfish  Grouper  Tang
observed           53       22    49
Do a test of hypothesis to see if this survey of fish has the same proportions as the historical ones.
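The historical 5:3:4 proportions can be given to chisq.test through its p argument, which expects probabilities summing to 1. A minimal sketch (the names freq and probs are arbitrary):

> freq = c(53,22,49)             # parrotfish, grouper, tang
> probs = c(5,3,4)/12            # 5:3:4 rescaled to sum to 1
> chisq.test(freq, p=probs)      # goodness-of-fit test against the historical proportions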
12.4 The R dataset UCBAdmissions contains data on admission to UC Berkeley by gender. We wish to investigate
if the distribution of males admitted is similar to that of females.
To do so, we first need to do some spade work, as the data set is presented as a complex contingency table. The
ftable (flatten table) command is needed. To use it, try
> data(UCBAdmissions)           # read in the dataset
> x = ftable(UCBAdmissions)     # flatten
> x                             # what is there
                Dept   A   B   C   D   E   F
Admit    Gender
Admitted Male        512 353 120 138  53  22
         Female       89  17 202 131  94  24
Rejected Male        313 207 205 279 138 351
         Female       19   8 391 244 299 317
We want to compare rows 1 and 2. Treating x as a matrix, we can access these with x[1:2,].
Do a test for homogeneity between the two rows. What do you conclude? Repeat for the rejected group.
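As a sketch of the last step (assuming x is the flattened table from above), the admitted rows can be extracted and then tested:

> admitted = x[1:2,]             # rows 1 and 2: admitted males and females
> chisq.test(admitted)           # test of homogeneity between the two rows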
Section 13: Regression Analysis
Regression analysis forms a major part of the statistician's tool box. This section discusses statistical inference
for the regression coefficients.
Simple linear regression model
R can be used to study the linear relationship between two numerical variables. Such a study is called linear
regression for historical reasons.
The basic model for linear regression is that pairs of data, (xi, yi), are related through the equation

    yi = β0 + β1 xi + εi

The values of β0 and β1 are unknown and will be estimated from the data. The value of εi is the amount the y
observation differs from the straight-line model.
In order to estimate β0 and β1 the method of least squares is employed. That is, one finds the values of (b0, b1) which
make the differences b0 + b1 xi − yi as small as possible (in some sense). To streamline notation, define ŷi = b0 + b1 xi
and let ei = ŷi − yi be the residual amount of difference for the ith observation. Then the method of least squares finds
(b0, b1) to make Σ ei² as small as possible. This mathematical problem can be solved and yields the values

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,    ȳ = b0 + b1 x̄
Note the latter says the line goes through the point (x̄, ȳ) and has slope b1 .
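These formulas are easy to check numerically against what lm computes. A small sketch with made-up data (the names u and v are arbitrary):

> u = c(1,2,3,4,5); v = c(2.1,3.9,6.2,7.8,10.1)              # made-up data for illustration
> b1 = sum( (u-mean(u))*(v-mean(v)) ) / sum( (u-mean(u))^2 )
> b0 = mean(v) - b1*mean(u)
> c(b0,b1)                       # least-squares estimates by the formulas above
> coef(lm(v ~ u))                # should agree with b0 and b1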
In order to make statistical inference about these values, one needs to make assumptions about the errors εi. The
standard assumptions are that these errors are independent, normally distributed with mean 0 and common variance σ². If these
assumptions are valid, then various statistical tests can be made, as will be illustrated below.
Example: Linear Regression with R
The maximum heart rate of a person is often said to be related to age by the equation
Max = 220 − Age.
Suppose this is to be empirically proven and 15 people of varying ages are tested for their maximum heart rate. The
following data^14 is found:
Age         18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
Max Rate   202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
In a previous section, it was shown how to use lm to do a linear model, and the commands plot and abline to
plot the data and the regression line. Recall, this could also be done with the simple.lm function. To review, we
can plot the regression line as follows
> x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
> y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)
> plot(x,y)                      # make a plot
> abline(lm(y ~ x))              # plot the regression line
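To compare the fitted line against the claimed 220 − Age rule, the estimated coefficients can be pulled out with coef (just a quick look; formal inference about the coefficients is taken up below):

> coef(lm(y ~ x))                # compare the estimates to 220 and -1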
14 This data is simulated; however, the following article suggests a maximum rate of 207 − 0.7(age): "Age-predicted maximal heart rate
revisited," Hirofumi Tanaka, Kevin D. Monahan, Douglas R. Seals, Journal of the American College of Cardiology, 37(1):153-156.