Unit 3 Summary Statistics (Descriptive Statistics)

Unit 6
Sampling Distributions
and Statistical Inference - 1
FPP Chapters 16-18, 20-21, 23
The Law of Averages (Ch 16)
Box Models (Ch 16)
Sampling Distribution
Probability Histogram (Ch 17)
Sampling Distribution
Central Limit Theorem (Ch 17, 18)
Expected Value (Ch 17, 18)
for average (mean), sum, percentage, count
Standard Error (Ch 17, 18)
for average (mean), sum, percentage, count
Chance Error
Confidence Intervals (Ch 21)
6-1
Stats
A.05
The Law of Averages
•Toss a coin 10,000 times.
•At each toss we expect 50% to be
heads.
•At each toss let’s note
–the number of heads
–the percentage of heads
[Figure: number of heads minus expected number of heads, plotted against the number of tosses (10 to 10,000).]
[Figure: percentage of heads minus 50%, plotted against the number of tosses (10 to 10,000).]
The Law of Averages
With a large number of tosses, the
percentage of heads is likely to be
close to 50%, although it is not
likely to be exactly equal to 50%.
The Law of Averages
does NOT say …
“The ___________________ team
has had such a long string of losses,
they are due to get a win. Therefore
their chances of winning the next
game are greater.”
“I have tossed a coin many times,
and now have a string of 5 heads.
So the chances of getting tails on the
next toss must be greater than 50%.”
Number of Heads,
Chance Error
•Number of heads =
50% of the number of tosses +
chance error
•Can we assess what the chance error is?
Coin toss example
•It turns out that
- after 100 tosses, the likely size of the
chance error is about 5
- after 10,000 tosses, the likely size of
the chance error is about 50
- increasing the number of tosses by a
factor of 100 increases the chance error
_______ times.
•Why does the percentage go to
50%?
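A small simulation can make this pattern concrete. The sketch below is an illustration only: the helper name `rms_chance_error`, the repetition count, and the seed are assumptions made for the example. It repeats the coin-tossing experiment many times and reports the typical (root-mean-square) chance error, both as a number of heads and in percentage points.

```python
import random

def rms_chance_error(n_tosses, n_repeats=400, seed=0):
    """Estimate the typical chance error in the number of heads.

    Hypothetical helper: repeats the experiment n_repeats times and
    returns the root-mean-square of (number of heads - n_tosses / 2).
    """
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(n_repeats):
        # Each of the n_tosses random bits is one fair toss (1 = heads).
        heads = bin(rng.getrandbits(n_tosses)).count("1")
        total_sq += (heads - n_tosses / 2) ** 2
    return (total_sq / n_repeats) ** 0.5

for n in (100, 10_000):
    err = rms_chance_error(n)
    print(f"{n:>6} tosses: chance error about {err:.1f} heads, "
          f"or {100 * err / n:.2f} percentage points")
```

The error in the count grows with more tosses, but more slowly than the number of tosses, so the error measured as a percentage shrinks.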
$\sqrt{100} = 10$
Example
We have the choice of tossing a coin
10 times or 100 times. We win if
–we get more than 60% heads.
–we get more than 40% heads.
–we get between 40% and 60% heads.
–we get exactly 50% heads.
Should we toss 10 or 100 times?
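One way to explore these four rules is to compute the exact chances for a fair coin. The sketch below is an illustration under stated assumptions: the function `chance` and the `rules` table are hypothetical names, and "between 40% and 60%" is read as including both ends, which the slide does not specify.

```python
from math import comb

def chance(n_tosses, wins):
    """Exact chance that the number of heads k satisfies wins(k, n_tosses)."""
    favourable = sum(comb(n_tosses, k) for k in range(n_tosses + 1)
                     if wins(k, n_tosses))
    return favourable / 2 ** n_tosses

# Integer arithmetic avoids rounding trouble at the 40%, 50%, 60% cut-offs.
rules = {
    "more than 60% heads": lambda k, n: 10 * k > 6 * n,
    "more than 40% heads": lambda k, n: 10 * k > 4 * n,
    # Assumption: "between" is read as inclusive of both ends.
    "between 40% and 60% heads": lambda k, n: 4 * n <= 10 * k <= 6 * n,
    "exactly 50% heads": lambda k, n: 2 * k == n,
}
for name, rule in rules.items():
    print(f"{name}: 10 tosses {chance(10, rule):.3f}, "
          f"100 tosses {chance(100, rule):.3f}")
```

Rules that require the percentage to stay near 50% favour the longer run, while rules that require it to be far from 50% favour the shorter one.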
Baseball series
•Team A believes that on any day
they have a 60% chance of beating
Team B.
•They have the option of playing
–1 game, or
–best 2 out of 3
•Which format should they choose?
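The two formats can be compared by listing every possible sequence of game results. This is a minimal sketch that assumes the games are independent with the same 60% chance each time; the function name `chance_of_series_win` is hypothetical. Playing out all three games gives the same chance as stopping once a team has two wins.

```python
from itertools import product

P_WIN = 0.6   # Team A's chance of winning any single game

def chance_of_series_win(n_games):
    """Chance that Team A wins more than half of n_games independent games."""
    total = 0.0
    for outcome in product([True, False], repeat=n_games):
        if sum(outcome) > n_games / 2:
            prob = 1.0
            for won in outcome:
                prob *= P_WIN if won else 1 - P_WIN
            total += prob
    return total

print(f"1 game:          {chance_of_series_win(1):.3f}")
print(f"best 2 out of 3: {chance_of_series_win(3):.3f}")
```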
Where we are headed
•We want to perform a political
survey and randomly sample
citizens.
•We want to quantify the chance
variability of our sample. (We don’t
want everyone sampled to be Republican.)
•We can solve variability questions
like these by analogy with drawing
from a box.
Making a Box Model
In specifying a box model, we would
like to know
- What numbers go into the box
- How many of each kind
- How many draws (sample size)
In practice, what do we really know /
not know?
Why do we make box models?
Variability in
the box model
1 2 3 4 5 6
•Sample 25 tickets with replacement.
•Record the sum of the 25 tickets.
3 2 3 2 6
4 6 5 1 5
6 1 5 3 1
3 5 2 4 2
2 6 5 3 4
•Their sum is 89.
Try again
4 4 6 1 4
1 6 1 5 2
1 4 5 2 1
4 5 2 2 5
4 3 3 2 6
•sum is 83
3 2 3 5 1
4 4 6 5 1
2 1 5 2 1
2 4 3 4 6
1 6 3 1 3
•sum is 78
•Other tries: 82, 92, 71, 73, 90
•Range is 25 to 150 but we only
observed 71 to 92.
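The experiment above can be repeated many more times by computer. The sketch below is an illustration, with the seed and the number of repetitions (1,000) chosen arbitrarily; it draws 25 tickets with replacement from the box 1 to 6 and compares the possible range of the sum with the range actually observed.

```python
import random

BOX = [1, 2, 3, 4, 5, 6]   # the tickets from the slide
N_DRAWS = 25

def sample_sum(rng):
    """Draw N_DRAWS tickets with replacement and return their sum."""
    return sum(rng.choice(BOX) for _ in range(N_DRAWS))

rng = random.Random(1)      # seed chosen arbitrarily, for repeatability
sums = [sample_sum(rng) for _ in range(1000)]

print("smallest possible sum:", N_DRAWS * min(BOX))
print("largest possible sum: ", N_DRAWS * max(BOX))
print("smallest observed sum:", min(sums))
print("largest observed sum: ", max(sums))
```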
Roulette
•A roulette wheel has 38 pockets
–18 red numbers
–18 black numbers
–2 green (0 and 00)
•We put a dollar on red. What are
the chances of winning?
•What numbers are in the box?
Net gain
•Net gain is the amount that we have
won or lost.
•Let’s play 10 times…
Play       1   2   3   4   5   6   7   8   9  10
Outcome    R   R   R   B   G   R   R   B   B   R
Gain      +1  +1  +1  –1  –1  +1  +1  –1  –1  +1
Net gain  +1  +2  +3  +2  +1  +2  +3  +2  +1  +2
So, Our Box Model is …
Which game?
You win if you draw a “1”.
•A box has 1 “0” ticket and 9 “1”
tickets.
Or
•A box has 10 “0” tickets and 90 “1”
tickets.
Or
•You draw 10 times with
replacement. If the sum is 10 then
you win.
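The three games can be compared by working out each chance of winning. This is a sketch under one assumption the slide does not state: in the third game the ten draws are taken from the first box (1 “0” ticket and 9 “1” tickets).

```python
from fractions import Fraction

# Game 1: one draw from a box with 1 "0" ticket and 9 "1" tickets.
game1 = Fraction(9, 1 + 9)
# Game 2: one draw from a box with 10 "0" tickets and 90 "1" tickets.
game2 = Fraction(90, 10 + 90)
# Game 3: all ten draws must be "1" for the sum to be 10.
# Assumption: the draws come from the first box (the slide does not say).
game3 = Fraction(9, 10) ** 10

for name, chance in [("game 1", game1), ("game 2", game2), ("game 3", game3)]:
    print(f"{name}: {float(chance):.3f}")
```

Only the fractions of each kind of ticket matter for a single draw, not the total number of tickets in the box.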
Our Box Model is …
Expected Value
Ch 17
“The expected value for the sum of draws
made at random with replacement from a
box”
equals
the expected value for a sample sum
equals
A sample sum is likely to be around its
expected value, but to be off by a chance
error similar in size to the standard error
for sum.
Standard Error for Sum
The standard error for sum, SE(sum), for a
random sample of a given sample size
is
$\sqrt{\text{sample size}} \times (\text{population SD})$.
In FPP, this is
$\sqrt{\text{number of draws}} \times (\text{SD of box})$.
A Sample Sum is Likely ...
The sample sum is likely to be around
____________, give or take
____________ or so.
The expected value for the sum, EV(sum),
fills the first blank.
The standard error for sum, SE(sum), fills
the second blank.
Observed values are rarely more than 2 or
3 SE’s away from the expected value.
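The "give or take" statement can be checked against the sums observed earlier when drawing 25 tickets from the box 1 to 6. In this sketch the function name `ev_and_se_of_sum` is hypothetical, and the expected-value formula (number of draws times the average of the box) is supplied as an assumption from FPP Chapter 17, since the slide leaves it blank.

```python
import math
import statistics

def ev_and_se_of_sum(box, n_draws):
    """EV(sum) = n * (average of box); SE(sum) = sqrt(n) * (SD of box)."""
    ev = n_draws * statistics.fmean(box)
    se = math.sqrt(n_draws) * statistics.pstdev(box)  # pstdev = population SD
    return ev, se

ev, se = ev_and_se_of_sum([1, 2, 3, 4, 5, 6], 25)
print(f"EV(sum) = {ev}, SE(sum) = {se:.2f}")

# Sums observed on the earlier slides.
observed = [89, 83, 78, 82, 92, 71, 73, 90]
for s in observed:
    print(f"sum {s}: {(s - ev) / se:+.2f} SEs from the expected value")
```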
A Reminder
The formulas here are for simple random
samples. They likely do not apply to
other kinds of samples.
Example - Keno
In Keno, if you bet $1 on one number, you
win $2 if the number comes up and lose
your $1 if it does not.
The chance of winning is … ________.
What does the box model look like?
What is the expected net gain after 100
plays?
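A box model for this bet can be sketched as follows. The chance of winning used here, 1 in 4, is an assumption made for illustration; substitute the value that belongs in the blank. With that assumption the box holds one +$2 ticket for every three –$1 tickets.

```python
import math
import statistics

# Assumption for illustration: 1 chance in 4 of winning.
# Box: one ticket worth +$2 (win) and three tickets worth -$1 (lose).
box = [2, -1, -1, -1]
n_plays = 100

ev = n_plays * statistics.fmean(box)
se = math.sqrt(n_plays) * statistics.pstdev(box)
print(f"expected net gain after {n_plays} plays: ${ev:.2f}")
print(f"give or take about ${se:.2f}")
```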
Example
Washington State Lottery
In MegaMillions, you pay $1 to play. You
select 5 numbers between 1 and 56,
and one MegaBall number between 1
and 46. If you match all 5 numbers AND
the MegaBall number, you win the
jackpot (starts at $12 million).
The chance of winning is … _____.
What does the box model look like?
What is the expected net gain after 100
plays?
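The chance of the jackpot follows from counting the equally likely tickets. The sketch below rests on simplifying assumptions that are not in the slide: only the jackpot pays, it sits at its $12 million starting value, and it is never shared.

```python
import math

# Ways to choose 5 numbers out of 56, times 46 choices for the MegaBall.
n_combinations = math.comb(56, 5) * 46
chance = 1 / n_combinations
print(f"chance of the jackpot: 1 in {n_combinations:,}")

# Simplifying assumptions: only the jackpot pays, it is the $12 million
# starting value, and it is never shared.
jackpot = 12_000_000
ticket_price = 1
ev_per_play = chance * jackpot - ticket_price
print(f"expected net gain after 100 plays: ${100 * ev_per_play:.2f}")
```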
Washington State Lottery
continued
Today’s jackpot is ___________.
Suppose you play 10 times.
We want to know about your net gain.
What is the relevant box model?
Washington State Lottery
continued
What is the expected net gain if you buy
100 tickets?
What does that mean?
What is the standard error for your net
gain?
What does that tell us?
Probability histogram
Earlier in the course we displayed
data in histograms.
• Probability histograms represent the
true chance of each outcome (as opposed
to the observed data).
• Example: rolling a die
[Figure: probability histogram for one roll of a die; x from 1 to 6.]
Sum of two dice
[Figure: four histograms of the sum of two dice (x from 2 to 12): empirical histograms after 100, 1,000, and 10,000 rolls, and the true probability histogram.]
Empirical vs. truth
After rolling 100 times we see that we
never rolled a 2. But we know a 2 is
possible.
After rolling 1,000 times the
distribution seems more symmetric.
After 10,000 rolls the histogram is
nearly symmetric.
The empirical histogram converges to
the true histogram.
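The convergence described above can be watched in a simulation. This is a sketch for illustration: the function name `empirical` and the seed are assumptions, and the "true" chances are obtained by counting the 36 equally likely pairs of faces.

```python
import random
from collections import Counter
from fractions import Fraction

# True chances for the sum of two dice: count the 36 equally likely pairs.
pairs = Counter(a + b for a in range(1, 7) for b in range(1, 7))
truth = {total: Fraction(count, 36) for total, count in pairs.items()}

def empirical(n_rolls, rng):
    """Observed fraction of each sum in n_rolls rolls of two dice."""
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6)
                     for _ in range(n_rolls))
    return {total: counts[total] / n_rolls for total in truth}

rng = random.Random(0)  # arbitrary seed, for repeatability
for n_rolls in (100, 1_000, 10_000):
    observed = empirical(n_rolls, rng)
    worst = max(abs(observed[t] - float(truth[t])) for t in truth)
    print(f"{n_rolls:>6} rolls: largest gap from the true chance = {worst:.3f}")
```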
Caution
There are two counts that may be
confused
–the number of things added together
–the number of repetitions of the
experiment
As the number of repetitions increases,
the empirical distribution converges to
the true histogram.
What happens when the number of things
added together increases?
Expected Value
Ch 23
“The expected value for the average of
draws made at random with replacement
from a box”
equals
the expected value for a sample mean
equals
A sample average (mean) is likely to be
around its expected value, but to be off
by a chance error similar in size to the
standard error for average.
Standard Error for
Average
The standard error for average,
SE(avg), for a random sample of a
given sample size is
$\dfrac{\text{population SD}}{\sqrt{\text{sample size}}}$.
In FPP, this is
$\dfrac{\text{SD of box}}{\sqrt{\text{number of draws}}}$.

A Sample Average is Likely ...
The sample average is likely to be around
____________, give or take
____________ or so.
The expected value for the average,
EV(avg), fills the first blank.
The standard error for average, SE(avg),
fills the second blank.
Observed values are rarely more than 2 or
3 SE’s away from the expected value.
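The effect of sample size on the average can be sketched numerically. The box of tickets 1 to 6 and the function name `ev_and_se_of_average` are hypothetical choices for this example; the expected value is taken to be the average of the box.

```python
import math
import statistics

def ev_and_se_of_average(box, n_draws):
    """EV(avg) = average of box; SE(avg) = (SD of box) / sqrt(n)."""
    ev = statistics.fmean(box)
    se = statistics.pstdev(box) / math.sqrt(n_draws)
    return ev, se

box = [1, 2, 3, 4, 5, 6]   # hypothetical example box
for n in (25, 100, 400):
    ev, se = ev_and_se_of_average(box, n)
    print(f"{n:>3} draws: EV(avg) = {ev}, SE(avg) = {se:.3f}")
```

Multiplying the number of draws by 4 cuts the standard error for the average in half, while the expected value stays put.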
A Warning
The formulas here are for simple random
samples. They likely do not apply to other
kinds of samples.
Probability histograms
and the normal curve
Toss a coin 100 times
Average = 50
SD = 5
[Figure: probability histogram for the number of heads in 100 tosses (x from 35 to 65), with the normal curve.]
Using the Normal
• A coin is tossed 100 times. Use the
normal curve to estimate the chances of
–exactly 50 heads (7.96%)
–between 45 and 55 heads inclusive
(72.87%)
–between 45 and 55 heads exclusive
(63.19%)
• Probability histograms can be difficult to
compute but the normal curve is easy.
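These estimates can be reproduced with the normal curve, using the average of 50 and SD of 5 for 100 tosses. The sketch below is one way to do it: the function names are hypothetical, and each whole number of heads is treated as a block half a unit wide on either side. Results may differ from table-based figures in the last digit.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def chance_between(lo, hi, ev=50.0, se=5.0):
    """Normal-curve estimate of the chance that lo <= heads <= hi.

    Each whole number k covers the block from k - 0.5 to k + 0.5 on the
    probability histogram, so the endpoints are widened by 0.5.
    """
    return normal_cdf((hi + 0.5 - ev) / se) - normal_cdf((lo - 0.5 - ev) / se)

print(f"exactly 50 heads:   {chance_between(50, 50):.2%}")
print(f"45 to 55 inclusive: {chance_between(45, 55):.2%}")
print(f"45 to 55 exclusive: {chance_between(46, 54):.2%}")
```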
Drawing from
a lopsided box
Assume that the box has tickets
1,9,5,5,5
[Figure: three probability histograms, with x-axis ranges of about 2 to 8, 400 to 550, and 4800 to 5300.]
Central Limit
Theorem
When drawing
• a LARGE sample
• at random
• with replacement from a box,
And computing
the sample sum of draws (net gain),
the sample count (# heads),
the sample average, or
the sample percent,
the probability histogram will follow a
normal curve.
Central Limit Theorem
When the sample size is large enough, to
use a normal curve to make probability
calculations we simply need
–the expected value of the sum
–(This can tell us about the
)
–the standard error of the sum
–(This can tell us about the
)
Central Limit Theorem
When drawing
• a LARGE sample
• at random
• with replacement from a box,
the probability histogram for the sample
sum will follow a normal curve.
The average of this probability histogram
is the
EV(sum),
and the SD of this probability histogram is
SE(sum).
Central Limit Theorem
When drawing
• a LARGE sample
• at random
• with replacement from a box,
And computing
the average of draws,
the probability histogram for the sample
average (mean) will follow a normal curve.
The average of this probability histogram
is the
EV(avg) = the population mean,
and the SD of this probability histogram is
SE(avg).
Using the normal curve
In practice
68% of the time the observed sum will
be between expected value ± 1 SE
95% of the time the observed sum will
be between expected value ± 2 SEs
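The 68% and 95% figures can be checked by simulation. This sketch reuses the box 1, 9, 5, 5, 5 from the earlier slide; the sample size of 100 draws, the 2,000 repetitions, and the seed are assumptions chosen for the illustration. Because the sum moves in discrete steps, the observed fractions only approximate the normal-curve values.

```python
import math
import random
import statistics

BOX = [1, 9, 5, 5, 5]   # the box from the "lopsided box" slide
N_DRAWS = 100           # assumed sample size for this illustration
N_REPEATS = 2000        # assumed number of repetitions

ev = N_DRAWS * statistics.fmean(BOX)
se = math.sqrt(N_DRAWS) * statistics.pstdev(BOX)

rng = random.Random(0)
sums = [sum(rng.choices(BOX, k=N_DRAWS)) for _ in range(N_REPEATS)]

within_1 = sum(abs(s - ev) <= 1 * se for s in sums) / N_REPEATS
within_2 = sum(abs(s - ev) <= 2 * se for s in sums) / N_REPEATS
print(f"EV(sum) = {ev:.0f}, SE(sum) = {se:.1f}")
print(f"within 1 SE: {within_1:.1%}   within 2 SEs: {within_2:.1%}")
```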
Using Normal Curves
to figure probabilities
Example: Roulette
There are 161 students, 3 TA’s, and one
professor for this course.
Suppose that we each play ten $1 games
of roulette, always betting on red.
Recall that a roulette wheel has 18 red, 18
black, and 2 green pockets.
If the ball lands in a red pocket, we get
back our $1 and win an additional $1.
If the ball lands in a black or green pocket,
we lose our $1.
Roulette example
• Box model
• Expected value of sum
• Standard error
• Probability
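The four steps can be sketched for one person's ten games. The slide does not say which probability is wanted, so the question used here (the chance of losing money) is hypothetical. Ten draws is also a small sample, so the normal-curve estimate should be read as rough.

```python
import math
import statistics

# Box for a $1 bet on red: 18 tickets of +1 (red), 20 tickets of -1
# (18 black + 2 green).
box = [+1] * 18 + [-1] * 20
n_plays = 10                      # ten $1 games per person

ev = n_plays * statistics.fmean(box)
se = math.sqrt(n_plays) * statistics.pstdev(box)

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical question: what is the chance of losing money over the ten
# games? Net gain moves in steps of $2, so "below 0" is cut at -1.
p_lose = normal_cdf((-1 - ev) / se)
print(f"EV(net gain) = ${ev:.2f}, SE = ${se:.2f}")
print(f"normal-curve estimate of the chance of losing money: {p_lose:.0%}")
```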
A short cut to SE
When there are only two different
numbers in the box
$\text{SD} = (\text{big number} - \text{small number}) \times \sqrt{(\text{fraction with big number}) \times (\text{fraction with small number})}$
$\text{SE} = \sqrt{\text{number of draws}} \times \text{SD}$
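The shortcut can be checked against the ordinary SD calculation. In this sketch the function name `shortcut_sd` and the three example boxes are hypothetical; `statistics.pstdev` computes the SD of the list of tickets directly.

```python
import math
import statistics

def shortcut_sd(big, small, n_big, n_small):
    """SD of a box holding only two different numbers (shortcut formula)."""
    total = n_big + n_small
    return (big - small) * math.sqrt((n_big / total) * (n_small / total))

# Check the shortcut against the ordinary SD on a few hypothetical boxes.
examples = [
    (1, -1, 18, 20),   # $1 on red at roulette
    (1, 0, 1, 1),      # fair coin as a 0-1 box
    (2, -1, 1, 3),     # one +2 ticket and three -1 tickets
]
for big, small, n_big, n_small in examples:
    box = [big] * n_big + [small] * n_small
    print(f"shortcut {shortcut_sd(big, small, n_big, n_small):.4f}  "
          f"direct {statistics.pstdev(box):.4f}")
```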
Classifying & Counting
For percentages or counts (number of
occurrences of something), we can use a
special Box Model.
For classifying and counting (looking at
percentages or counts) use a box with 0’s
and 1’s on the tickets.
Tickets marked ‘1’ signify a “special” item.
Tickets marked ‘0’ signify a “non-special”
item.
Classifying & Counting
continued
What is the average of all of the ticket
values in a 0-1 box?
What is the SD of all of the ticket values in
a 0-1 box?
Classifying & Counting
continued further
What is the sum of a sample of n draws
from a 0-1 box?
Expected Value for the sum of a sample of
n draws from a 0-1 box?
What is the SD for the sum of a sample of
n draws from a 0-1 box?
Expected Value and
Standard Error for
Sample Counts
What is the Expected Value of the number
of 1’s drawn from a 0-1 box?
(This is the Expected Value for a sample
count drawn from a population with _____
“special” items and _______ “non-special”
items.)
What is the Standard Error for the count of
1’s drawn from a 0-1 box?
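Since a count is the sum of draws from a 0-1 box, the formulas for a sum carry over. The sketch below is one way to write that down, under the assumption that the average of a 0-1 box is the fraction of 1's and its SD comes from the two-number shortcut; the function name is hypothetical. The coin-toss numbers from earlier in the unit serve as the example.

```python
import math

def ev_and_se_of_count(fraction_of_ones, n_draws):
    """EV and SE for the number of 1's in n_draws from a 0-1 box."""
    p = fraction_of_ones
    sd_box = math.sqrt(p * (1 - p))          # shortcut SD for a 0-1 box
    return n_draws * p, math.sqrt(n_draws) * sd_box

for n in (100, 10_000):
    ev, se = ev_and_se_of_count(0.5, n)      # fair coin: half the tickets are 1
    print(f"{n:>6} tosses: EV(count) = {ev:.0f}, SE(count) = {se:.0f}")
```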
A Sample Count is Likely
...
The sample count is likely to be around
____________, give or take
____________ or so.
The expected value for the count,
EV(count), fills the first blank.
The standard error for count, SE(count),
fills the second blank.
Observed values are rarely more than 2 or
3 SE’s away from the expected value.
Remember ...
The formulas here are for simple
random samples. They likely do not
apply to other kinds of samples.
Expected Value and
Standard Error for
Sample Proportions
What is the Expected Value of the
percentage of 1’s drawn from a 0-1 box?
(This is the Expected Value for a sample
percentage drawn from a population with
_____ “special” items and _______
“non-special” items.)
What is the Standard Error for the
percentage of 1’s drawn from a 0-1 box?
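A percentage is a count divided by the number of draws, times 100, so its expected value and standard error can be obtained by rescaling those of the count. This is a sketch under that assumption, with a hypothetical function name and the fair coin as the example.

```python
import math

def ev_and_se_of_percentage(fraction_of_ones, n_draws):
    """EV and SE, in percentage points, for the percentage of 1's drawn."""
    p = fraction_of_ones
    se_count = math.sqrt(n_draws) * math.sqrt(p * (1 - p))
    return 100 * p, 100 * se_count / n_draws

for n in (100, 10_000):
    ev, se = ev_and_se_of_percentage(0.5, n)
    print(f"{n:>6} tosses: EV(%) = {ev:.0f}%, SE(%) = {se:.1f} percentage points")
```

The SE for the count grows as the number of draws increases, while the SE for the percentage shrinks.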
A Sample Percentage
is Likely ...
The sample percentage is likely to be
around
____________, give or take
____________ or so.
The expected value for the percentage,
EV(%), fills the first blank.
The standard error for the percentage,
SE(%), fills the second blank.
Observed values are rarely more than 2 or
3 SE’s away from the expected value.
Central Limit Theorem
for Percentages & Counts
When drawing a LARGE sample at
random with replacement from a box, the
probability histogram for the sample
percentage will follow a normal curve.
The average of this probability histogram
is the
EV(%) = the population %,
and the
SD of this probability histogram is
SE(%) = .
Central Limit Theorem
for Percentages & Counts
When drawing a LARGE sample at
random with replacement from a box, the
probability histogram for the sample count
will follow a normal curve.
The average of this probability histogram
is the
EV(count) =
and the
SD of this probability histogram is
SE(count) =
Summarizing …
Expected Values and Standard Errors
Shape of the
Sampling Distribution
and Sample Size
What happens to the Shape of the
Sampling Distribution as the Sample Size
gets large?
Expected Values,
Standard Errors,
and Sample Size
What happens to Expected Values and
Standard Errors as Sample Size
increases?
Summarizing the
Central Limit Theorem
As the sample size (# of draws from the
box, n) gets large, …
Estimation
Box models:
If we know what goes in the box, then we
can say how likely various outcomes are.
In practice,
We do not know what is in the box.
That is,
We do not know the population
parameters.
Instead
We use data to estimate the population
parameters, such as the average, %, SD, …
Confidence Intervals
Point estimate:
To estimate the population average
(mean) with a single value, use
The likely size of your estimation error is
Interval estimate:
To estimate the population average
(mean) with an interval of values, the
width of your interval depends upon how
confident you want to be that your interval
includes the population mean.
Confidence Intervals
A confidence interval is used when
estimating an unknown parameter from
sample data. The interval gives a range
for the parameter - and a confidence level
that the range covers the true value.
Chances are in the sampling procedure,
not in the parameter.
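The statement that the chances are in the sampling procedure can be illustrated by repeating the procedure many times against a box whose contents are known. Everything specific in this sketch is an assumption: the box with 30% 1's, the sample size of 400, the interval "sample percentage ± 2 SE", and the function name.

```python
import math
import random

def interval_for_percentage(sample, n_se=2):
    """Sample percentage plus or minus n_se standard errors."""
    n = len(sample)
    p_hat = sum(sample) / n
    # SD of the box estimated from the sample's own fractions of 0's and 1's.
    se_pct = 100 * math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)
    return 100 * p_hat - n_se * se_pct, 100 * p_hat + n_se * se_pct

TRUE_PCT = 30.0          # hypothetical box: 30% of the tickets are 1's
rng = random.Random(0)
n_repeats, covered = 1000, 0
for _ in range(n_repeats):
    sample = [1 if rng.random() < TRUE_PCT / 100 else 0 for _ in range(400)]
    low, high = interval_for_percentage(sample)
    covered += low <= TRUE_PCT <= high
print(f"{covered / n_repeats:.1%} of the intervals cover the true percentage")
```

The true percentage never changes between repetitions; the interval does, and most of the intervals produced this way cover the true value.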
Confidence Interval
Example
Pennies
Confidence Intervals
Point estimate:
To estimate the population percentage
with a single value, use
The likely size of your estimation error is
Interval estimate:
To estimate the population percentage
with an interval of values, the width of your
interval depends upon how confident you
want to be that your interval includes the
population percentage.
Confidence Interval
Example
Pennies
The Bootstrap
When estimating a population
percentage (i.e. when sampling from
a 0-1 box), the fraction of 0’s and 1’s
in the box is unknown.
The SD of the box can be estimated
by substituting the fraction of 0’s and
1’s in the sample for the unknown
fractions in the box.
The estimate is good when the
sample is reasonably large.
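The substitution described above can be sketched as a short calculation. The survey numbers (220 of 400) are hypothetical, the function name is an assumption, and taking 2 SEs for roughly 95% confidence is a convention assumed here rather than something stated on this slide.

```python
import math

def bootstrap_interval(n_ones, n_draws, n_se=2):
    """Approximate interval for a population percentage.

    The unknown SD of the 0-1 box is replaced by the SD computed from the
    fractions of 1's and 0's in the sample (the bootstrap idea above).
    Using n_se=2 for roughly 95% confidence is an assumption here.
    """
    p_hat = n_ones / n_draws
    sd_box_estimate = math.sqrt(p_hat * (1 - p_hat))
    se_pct = 100 * sd_box_estimate / math.sqrt(n_draws)
    return 100 * p_hat - n_se * se_pct, 100 * p_hat + n_se * se_pct

# Hypothetical survey: 220 of 400 sampled people are "special".
low, high = bootstrap_interval(220, 400)
print(f"sample percentage: {100 * 220 / 400:.0f}%")
print(f"approximate 95% interval: {low:.1f}% to {high:.1f}%")
```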
Basic Method for
Constructing
Confidence Intervals
Interpreting a
Confidence Interval
Margin of Error
Sample Size Computations