Assignment1

2012 --STAT1010 Assignment 2--
Prof. Tsang
1. For the following population of size N = 15 scores: 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 10, 11;
a. Sketch a histogram showing the population distribution.
b. Find and locate the value of the population mean in your sketch.
c. Compute the variance and standard deviation for the population.
d. Choose any sample of size n=8 (you can include any 8 of the 15 scores above in your
sample) from the population and compute the mean, variance, and standard deviation for this
sample.
e. Compare the variance and standard deviation you got in (c) and (d). Are they the same? If
not, which set is larger?
f. Now suppose you make a mistake in your calculation in (d). You divide the sum of squared
deviations by n instead of n-1 in the formula for the variance. What do you get for the sample
variance and standard deviation? Is the result larger or smaller than the result in (d)?
g. Compare the variance and standard deviation you got in (c) and (f). Are they the same? If
not, which set is larger?
h. Which set of results for the sample variance and standard deviation, (d) or (f), is a better
estimate of the population variance and standard deviation?
Solution: a. Number of classes for the histogram: $2^k \ge 15 \Rightarrow k = 4$
b. Mean = 6.93
c. Population variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 = 5.26$
Population S.D.: $\sigma = \sqrt{\sigma^2} = 2.29$
d. We can choose the first 8 values as a sample:
3, 4, 4, 5, 6, 6, 6, 7
Statistics (VAR00001)
N Valid           8
N Missing         0
Mean              5.1250
Std. Deviation    1.35620
Variance          1.839
e. They are not the same; the population values in (c) are larger.
f. Dividing the sum of squared deviations (12.875) by n = 8 instead of n-1 = 7:
Variance: 1.609
Std. Deviation: 1.269
So, the result is smaller than the result in (d).
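g. The results in (c) are larger.
h. (d) is better: dividing by n-1 corrects the downward bias that dividing by n introduces into the sample variance.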
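The values in (b), (c), (d) and (f) can be checked with a short script. This is only a minimal sketch using Python's standard `statistics` module, not the SPSS run the solution relies on; the variable names are illustrative.

```python
import statistics

# Population of N = 15 scores from problem 1.
population = [3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 10, 11]

# (b), (c): population mean, variance and S.D. (divisor N).
pop_mean = statistics.mean(population)
pop_var = statistics.pvariance(population)
pop_sd = statistics.pstdev(population)

# (d): sample made of the first 8 scores, divisor n - 1.
sample = population[:8]
sample_mean = statistics.mean(sample)
sample_var = statistics.variance(sample)
sample_sd = statistics.stdev(sample)

# (f): the "mistaken" calculation, same sample but divisor n.
biased_var = statistics.pvariance(sample)
biased_sd = statistics.pstdev(sample)

print(f"population:    mean={pop_mean:.2f} var={pop_var:.2f} sd={pop_sd:.2f}")
print(f"sample (n-1):  mean={sample_mean:.4f} var={sample_var:.3f} sd={sample_sd:.4f}")
print(f"sample (n):    var={biased_var:.3f} sd={biased_sd:.3f}")
```

For any given sample the divisor n is larger than n-1, so the last line is always smaller than the line above it.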
2. Use the same set of population data of size N = 15 as in problem (1) to determine the following:
a. The smallest & largest scores, and range of the dataset
b. The first quartile, median and the third quartile (using interpolation)
c. The interquartile range (IQR)
d. Draw the Box-and-whisker plot for the dataset.
e. The skewness of the distribution using the formula: Sk=(mean – median)/σ
Solution: a.
Statistics (VAR00001)
N Valid      15
N Missing    0
Range        8.00
Minimum      3.00
Maximum      11.00
b. Using interpolation
The first quartile: i=15*25%=3.75
the 3rd value is 4, the 4th value is 5, Q1=4+(5-4)*0.75=4.75
Median: i=15*50%=7.5
the 7th value is 6, the 8th value is 7, Median=6+(7-6)*0.5=6.5
The third quartile: i=15*0.75=11.25
the 11th value is 9, the 12th value is 9, Q3=9
c. IQR=Q3-Q1=9-4.75=4.25
d. (Box-and-whisker plot: whiskers at 3 and 11, box from Q1=4.75 to Q3=9, median line at 6.5.)
e. Sk=(mean – median)/σ=(6.93-6.5)/2.29=0.19
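The interpolation used for the quartiles in (b) can be written as a small function. This is a sketch that assumes the position rule i = N*p from the solution (1-based position in the sorted data); the function name `value_at_position` is hypothetical. Other software uses different position rules, such as (N+1)*p, and can return slightly different quartiles.

```python
def value_at_position(sorted_scores, p):
    """Interpolated value at the 1-based position i = N * p (assumed rule)."""
    i = len(sorted_scores) * p
    k = int(i)          # whole part: 1-based index of the lower neighbour
    frac = i - k        # fractional part, used as the interpolation weight
    if k < 1 or k >= len(sorted_scores):
        raise ValueError("position falls outside the data")
    lower = sorted_scores[k - 1]
    upper = sorted_scores[k]
    return lower + (upper - lower) * frac


scores = sorted([3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 10, 11])
q1 = value_at_position(scores, 0.25)   # position 3.75
q3 = value_at_position(scores, 0.75)   # position 11.25
iqr = q3 - q1
print(q1, q3, iqr)   # 4.75 9.0 4.25
```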
3. The following is a table for countries with the highest CO2 emissions in 2006 and their GDP
Country           CO2 emissions in 2006 (MTon)   GDP at 2006 (in trillions US$)
China             6103                           2.7
US                5752                           13.4
Russia            1564                           1.0
India             1510                           0.9
Japan             1293                           4.4
Germany           805                            2.9
United Kingdom    568                            2.1
Canada            544                            1.3
South Korea       475                            1.3
Italy             474                            1.9
a. Find the mean, variance and standard deviation of CO2 emissions for these countries
b. Find the Coefficients of Variation for the CO2 emission data
c. Find the “Five-number summary” of CO2 emissions for these countries
d. Find the mean, variance and standard deviation of GDP for these countries
e. Find the Coefficients of Variation for the GDP data
f. Find the “Five-number summary” of GDP for these countries
g. Construct a scatter-plot using the GDP as the horizontal variable and the CO2 emissions as the vertical
variable. Can you observe any relationship between these 2 variables from your plot?
h. Find the covariance between the CO2 emissions and the GDP for these countries and their Correlation
Coefficient
Solution: Answers are given by SPSS.
a.
Statistics (VAR00001)
N Valid           10
N Missing         0
Mean              1908.80
Std. Deviation    2160.55
Variance          4667985.51
b. Coefficient of Variation = S.D./Mean = 2160.55/1908.80 = 1.132
c.
Statistics (VAR00001)
N Valid           10
N Missing         0
Minimum           474.00
Maximum           6103.00
Percentiles 25    526.7500
Percentiles 50    1049.0000
Percentiles 75    2611.0000
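The percentiles reported for the CO2 data are consistent with interpolating at position (n+1)*p in the sorted data. Assuming that rule, Python's `statistics.quantiles` with `method="exclusive"` gives the same quartiles; the sketch below is a cross-check, not the SPSS procedure itself.

```python
import statistics

# CO2 emissions in 2006 (MTon) for the ten countries in problem 3.
co2 = [6103, 5752, 1564, 1510, 1293, 805, 568, 544, 475, 474]

# "exclusive" interpolates at position (n + 1) * p in the sorted data.
q1, median, q3 = statistics.quantiles(co2, n=4, method="exclusive")

five_number = (min(co2), q1, median, q3, max(co2))
print(five_number)   # (474, 526.75, 1049.0, 2611.0, 6103)
```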
d.
Statistics (VAR00002)
N Valid           10
N Missing         0
Mean              3.190
Std. Deviation    3.7427
Variance          14.008
e. Coefficient of Variation = S.D./Mean = 3.7427/3.190 = 1.173
f.
Statistics (VAR00002)
N Valid           10
N Missing         0
Minimum           .9
Maximum           13.4
Percentiles 25    1.225
Percentiles 50    2.000
Percentiles 75    3.275
g. The plot shows some relationship between these 2 variables, but not a strong one.
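h. Covariance: $\operatorname{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = 5225.44$
Correlation coefficient: $\frac{\operatorname{cov}(x,y)}{S_x S_y} = \frac{5225.44}{2160.55 \times 3.7427} = 0.646$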
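The covariance and correlation coefficient in (h) can be recomputed directly from the table. This is a minimal sketch in plain Python, with x taken as GDP and y as CO2 emissions to match the scatter-plot axes in (g); the variable names are illustrative.

```python
import math

# Data from the table in problem 3, in the same country order.
gdp = [2.7, 13.4, 1.0, 0.9, 4.4, 2.9, 2.1, 1.3, 1.3, 1.9]            # x
co2 = [6103, 5752, 1564, 1510, 1293, 805, 568, 544, 475, 474]        # y

n = len(gdp)
mean_x = sum(gdp) / n
mean_y = sum(co2) / n

# Sample covariance and sample standard deviations (divisor n - 1).
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(gdp, co2)) / (n - 1)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in gdp) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in co2) / (n - 1))

# Correlation coefficient = covariance / (Sx * Sy).
r = cov / (s_x * s_y)
print(f"covariance = {cov:.2f}, r = {r:.3f}")   # covariance = 5225.44, r = 0.646
```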
4. The following table contains the raw scores of 45 students in the mid-term test and the final examination. Use
an Excel spreadsheet or other software you know to:
(a) Calculate the mean, median, mode, variance, standard deviation and coefficient of variation for the two sets
of scores. What do you learn from these descriptive statistics?
(b) Construct histograms with a bin width of 10 to show the distributions of the two sets of scores.
(c) Find the Five Number Summary and the IQR of the two sets of scores from the histogram.
(d) Construct a scatter-plot using the final examination scores as the horizontal variable and the mid-term test
scores as the vertical variable. Can you observe any relationship between these 2 variables from your plot? Find the
covariance and correlation coefficient between these 2 variables.
(e) Transform the raw scores of the two datasets to standardized distributions using z-scores and compare the 2
distributions. What conclusion can you draw from it?
(f) The student with ID=10 got a raw score of 85 in both the mid-term and final exam. Can you determine if the
student has improved his/her standing in the class or not?
Student ID    Mid-term test    Final Exam
1             83.3             93.5
2             89.5             95
3             82.9             89
4             97.4             86
5             71.7             77.5
6             81.0             82.7
7             94.0             95
8             82.2             96.2
9             84.3             95
10            85.0             85
11            73.6             66
12            77.4             93.0
13            87.2             83.9
14            87.7             73
15            78.0             87.6
16            84.4             92.5
17            81.4             85
18            89.6             82.5
19            82.5             81.2
20            86.4             100
21            80.0             79.2
22            91.0             92.5
23            78.9             81.4
24            87.4             82.5
25            73.3             81.2
26            76.5             77.5
27            89.1             76.2
28            69.0             75
29            75.0             80.0
30            71.0             63.7
31            84.1             54
32            70.7             87.5
33            65.8             70
34            65.6             73.7
35            68.5             76.2
36            61.8             58
37            53.4             66.2
38            59.8             83.0
39            44.0             65
40            59.3             56.2
41            46.8             57.5
42            52.2             72.5
43            49.0             75
44            64.0             57
45            36.0             43.8
Solution:
a.
Statistics        Mid-term test    Final Exam
N Valid           45               45
N Missing         0                0
Mean              74.482           78.320
Median            78.000           81.200
Mode              36.0 (a)         95.0
Std. Deviation    14.3558          13.0473
Variance          206.088          170.233
(a) Multiple modes exist. The smallest value is shown.
Coefficients of variation: for Mid-term test: 14.3558/74.482=0.193
for Final Exam: 13.0473/78.320=0.167
We learn: (i) students' scores improved a bit in the final exam, (ii) the variation in scores also decreased.
b. Both histograms (mid-term test and final exam) are left-skewed.
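c.
Statistics        Mid-term test    Final Exam
N Valid           45               45
N Missing         0                0
Minimum           36.0             43.8
Maximum           97.4             100.0
Percentiles 25    65.700           71.250
Percentiles 50    78.000           81.200
Percentiles 75    84.700           87.550
IQR = Q3-Q1: 84.700-65.700=19.000 for the mid-term test, 87.550-71.250=16.300 for the final exam.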
d. There is some relationship between these 2 variables in the plot. In general, students' scores in the mid-term and
the final are positively related: students with a better mid-term score tend to have a better final score.
Covariance: 130.09
Correlation coefficient: 0.695
e. More students are at least one σ above the mean in the final exam than in the mid-term test.
f. In the mid-term, $z_m = \frac{85 - 74.482}{14.3558} = 0.733$
In the final exam, $z_f = \frac{85 - 78.320}{13.0473} = 0.512$
Since $z_f < z_m$, the student has not improved his/her standing in the class.
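The comparison in (f) is a direct application of the z-score formula z = (x - mean)/S.D. A minimal sketch, using the means and standard deviations reported in part (a); the function name `z_score` is illustrative.

```python
def z_score(raw, mean, sd):
    """Standardized score: distance from the mean in units of S.D."""
    return (raw - mean) / sd


# Student ID=10 scored 85 in both tests; means and S.D.s come from part (a).
z_mid = z_score(85, 74.482, 14.3558)
z_final = z_score(85, 78.320, 13.0473)

# A higher z-score in the final would mean a better standing in the class.
improved = z_final > z_mid
print(f"{z_mid:.3f} {z_final:.3f} improved={improved}")   # 0.733 0.512 improved=False
```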
5. A manufacturer of tires wants to advertise a mileage interval that excludes no more than 5% of the mileage on
tires he sells (at least 95% of the tires he sells will be able to achieve a mileage in that interval). Suppose that μ =
25000 and σ = 4000. What interval would you suggest by applying Chebyshev's theorem?
What would be the answer if it is known that the mileage follows the normal distribution? Give your best estimate.
Solution: Applying Chebyshev's theorem, $1 - \frac{1}{k^2} = 0.95 \Rightarrow k = \sqrt{20}$
The interval is $[25000 \pm 4000\sqrt{20}] = [7111.46, 42888.54]$
If the mileage follows the normal distribution, more than 95% of the tires will be able to achieve a mileage in
the interval [µ-2σ, µ+2σ], or [17000, 33000].
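Both intervals can be reproduced numerically. The sketch below solves 1 - 1/k^2 = 0.95 for k to get the Chebyshev interval, then uses `statistics.NormalDist` from the Python standard library to check how much of a normal distribution falls within two standard deviations; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma = 25000, 4000
coverage = 0.95

# Chebyshev: at least 1 - 1/k^2 of any distribution lies within k S.D.s of the mean.
k = math.sqrt(1 / (1 - coverage))            # k = sqrt(20)
cheb_low, cheb_high = mu - k * sigma, mu + k * sigma

# Normal case: the 2-sigma interval suggested in the solution.
norm_low, norm_high = mu - 2 * sigma, mu + 2 * sigma
mileage = NormalDist(mu, sigma)
norm_coverage = mileage.cdf(norm_high) - mileage.cdf(norm_low)

print(f"Chebyshev: [{cheb_low:.2f}, {cheb_high:.2f}]")
print(f"Normal:    [{norm_low}, {norm_high}] covers {norm_coverage:.4f}")
```

The Chebyshev interval is much wider because it has to hold for every distribution with this mean and standard deviation, not only the normal one.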