
METR702
YILIN LU
LAB 2
FALL 2014
Lab 2 – Exploring the Central Limit Theorem
Part A: Thought experiments with synthetic distributions:
1. To get an extreme value as the average of two numbers, both numbers have to lie near the
same extreme (either both close to 1 or both close to 0). The chance of randomly drawing two
numbers that are both extreme on the same side is very small. As a result, when we average 2
random numbers, extreme values are under-represented and the middle values are over-represented.
2. The distributions of "avg of 10" and "avg of 20" become more bell-shaped and narrower than
the "avg of 2" distribution, as the simulation sketch below illustrates.
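The following is a quick simulation sketch, not part of the original JMP exercise; it assumes a Uniform(0, 1) parent distribution and illustrates both effects: extreme values become rare and the spread shrinks as more numbers go into each average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of averages to simulate for each m

for m in (1, 2, 10, 20):
    avgs = rng.uniform(0, 1, size=(n, m)).mean(axis=1)  # n averages of m draws each
    extreme = np.mean((avgs < 0.05) | (avgs > 0.95))    # fraction in the extreme bins
    print(f"avg of {m:2d}: stdev = {avgs.std():.4f}, "
          f"fraction below 0.05 or above 0.95 = {extreme:.4f}")

# For m = 1 about 10% of the values are extreme; for m = 2 only about 1% are
# (both draws must be extreme on the same side), and the stdev keeps shrinking as m grows.
```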
[Figure: "Distributions" output for the uniform parent. Panels and fitted Normal(mean, stdev) curves, on a 0 to 1 axis:
Parent distribution: Normal(0.54223, 0.30044)
avg of 2: Normal(0.55077, 0.20357)
avg of 10: Normal(0.52916, 0.09646)
avg of 20: Normal(0.54038, 0.0697)
For the averaged panels the fitted parameters are (average of possible means, SE).]
3. The means are all around 0.5 and are not substantially different. This is because the mean of
measurements drawn from the uniform parent distribution is already close to the true mean, and
averaging does not shift it.
4. The standard deviations decrease because averaging more points reduces the spread of the
possible means: the more points go into each average, the less likely that average is to land far
from the true mean, so the distribution of the averages gets narrower.
5. Stdev(x) of parent distribution = 0.30044
SE (m-avg mean) = Stdev(x) / SQRT(m)
m = 2: SE (2-avg mean) = 0.30044 / SQRT(2) = 0.2124, close to 0.20357
m = 10: SE (10-avg mean) = 0.30044 / SQRT(10) = 0.0950, close to 0.09646
m = 20: SE (20-avg mean) = 0.30044 / SQRT(20) = 0.0672, close to 0.0697
The observed standard errors are close to the predictions of the Central Limit Theorem rule.
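This check can also be scripted; here is a minimal sketch using the values read off the figure above (the parent stdev 0.30044 and the observed SEs come from the fitted Normal curves):

```python
import math

sigma = 0.30044                                      # stdev of the uniform parent (from the fit)
observed_se = {2: 0.20357, 10: 0.09646, 20: 0.0697}  # SEs read from the figure

for m, se_obs in observed_se.items():
    se_clt = sigma / math.sqrt(m)  # CLT rule: SE(m-avg mean) = Stdev(x) / SQRT(m)
    print(f"m = {m:2d}: CLT prediction {se_clt:.4f}, observed {se_obs:.4f}")
```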
6. [Figure: "Distributions" output for the skewed (non-uniform) parent. Panels and fitted Normal(mean, stdev) curves, on a 0 to 1 axis:
Parent distribution: Normal(0.22402, 0.26943)
avg of 2: Normal(0.21219, 0.18546)
avg of 10: Normal(0.23125, 0.08641)
avg of 20: Normal(0.21925, 0.05751)
For the averaged panels the fitted parameters are (average of possible means, SE).]
7. The distribution of the 2-point average does not look like a normal distribution: it is not
bell-shaped and is still clearly skewed. The 10-point average looks more like a normal distribution,
and the 20-point average is acceptably "normal". This is because, as the number of points in the
average increases, the skew of the parent distribution is averaged out and the distribution of the
means becomes both narrower and more symmetric.
8. Stdev(x) of parent distribution = 0.22402
m = 2: SE (2-avg mean) = 0.22402 / SQRT(2) = 0.1584, close to 0.18546
m = 10: SE (10-avg mean) = 0.22402 / SQRT(10) = 0.0708, close to 0.08641
m = 20: SE (20-avg mean) = 0.22402 / SQRT(20) = 0.0501, close to 0.05751
The observed standard errors are close to the predictions of the Central Limit Theorem rule, but
the agreement is not as good as for the uniform parent distribution.
9. Yes, I do think averaging more measurements makes the distribution more “normal”.
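As a rough check of this claim, here is a simulation sketch. The skewed parent distribution used in the exercise is not specified in this report, so a Beta(0.3, 1.1) distribution (mean about 0.21 and stdev about 0.27, roughly matching the fitted values above) is assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # number of simulated averages for each m

for m in (2, 10, 20):
    # Hypothetical skewed parent: Beta(0.3, 1.1) roughly matches the fitted mean/stdev above.
    avgs = rng.beta(0.3, 1.1, size=(n, m)).mean(axis=1)
    skew = np.mean((avgs - avgs.mean()) ** 3) / avgs.std() ** 3
    print(f"avg of {m:2d}: mean = {avgs.mean():.3f}, stdev = {avgs.std():.3f}, "
          f"skewness = {skew:+.2f}")

# The skewness shrinks toward 0 as m grows, i.e. the distribution of the averages
# becomes more symmetric and more "normal", as the Central Limit Theorem predicts.
```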
Part B: Can you estimate the average flow from a watershed?
10. [Figure: "Distributions" output for the flow data. Panels and fitted Normal(mean, stdev) curves, on a 0 to 10 000 L/s axis:
Flow (L/s) - Parent Distribution: Normal(1491.85, 2052.6)
avg of 10 (L/s): Normal(1487.53, 636.159)
avg of 50 (L/s): Normal(1485.76, 290.327)
For the averaged panels the fitted parameters are (average of possible means, SE).]
As the number of samples being averaged increases, the distribution of the averages gets narrower
and narrower and looks more and more "normal".
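The same experiment can be reproduced by resampling the flow measurements directly. The sketch below assumes the measurements are available as a plain text file (hypothetical name flow_measurements.txt, one value in L/s per line); the file name and the number of resampled means are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
flow = np.loadtxt("flow_measurements.txt")  # hypothetical file of flow measurements [L/s]

print(f"parent: mean = {flow.mean():.1f} L/s, stdev = {flow.std():.1f} L/s")

for m in (10, 50):
    # Draw 5000 random samples of size m (with replacement) and average each one.
    idx = rng.integers(0, flow.size, size=(5000, m))
    means = flow[idx].mean(axis=1)
    print(f"avg of {m:2d}: mean = {means.mean():.1f} L/s, SE = {means.std():.1f} L/s")
```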
11. CLT Rule: SE (m-avg mean) = Stdev(x) / SQRT(m)
Stdev(x) = 2052.6 [L/s]
Number of points in the average (m) | Standard error SE [L/s] | SE expected from the CLT rule [L/s]
  1 | 2052.6  | 2052.6
 10 |  636.159 |  649.089
 20 |  460.992 |  458.975
 50 |  290.327 |  290.281
100 |  197.173 |  205.26
The rule still seems reasonably accurate: the predicted and observed standard errors agree to
within roughly plus or minus 10 L/s.
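The CLT column of the table can be generated directly from the rule; a minimal sketch using the parent standard deviation 2052.6 L/s:

```python
import math

sigma = 2052.6  # stdev of the parent flow distribution [L/s]

for m in (1, 10, 20, 50, 100):
    se = sigma / math.sqrt(m)  # SE expected from the CLT rule
    print(f"m = {m:3d}: SE = {se:8.2f} L/s")
```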
12. Roughly, I need about 100 samples in the average to get a standard error of about 200 L/s.
13. SE (m-avg mean) = Stdev(x) / SQRT(m)
SQRT(m) = Stdev(x) / SE (m-avg mean)
m = [ Stdev(x) / SE (m-avg mean) ]^2
Stdev(x) = 2052.6 [L/s]
SE (m-avg mean) = 15 [L/s]
m = [ 2052.6 (L/s) / 15 (L/s) ]^2 = 18725
So, to get a standard error of about 15 L/s (about 1% of the mean flow), we need roughly 18,700
samples in the average.
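This inversion of the rule can be wrapped in a small helper; a sketch (the function name is my own):

```python
import math

def samples_needed(sigma, target_se):
    """Smallest m such that sigma / sqrt(m) is no larger than target_se (CLT rule)."""
    return math.ceil((sigma / target_se) ** 2)

print(samples_needed(2052.6, 15))  # 18726, i.e. roughly 18,700 samples
```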
14. This result implies that to detect the effects of climate change or land-use change with high
precision, I need a very large number of samples. I would have to repeat the random measurement
continuously, which is very hard for one person to do by hand, and I still would not know whether
the sampled measurements are representative of the whole population. I could try to detect the
effects with a larger standard error, but that would give a much less precise, and possibly
inaccurate, average relative to the true mean if the parent distribution (the population
distribution, which we do not know) is not uniform; that is, there is a risk of obtaining
biased results.