Project 1. Advice to Teachers and Others

In this project we explore two articles. The first is advice on choosing a textbook, but it is
also advice on what is important in a statistics course.
Robert W. Hayden, Advice to Mathematics Teachers on Evaluating Introductory
Statistics Textbooks in Teaching Statistics. “Resources for Undergraduate In The
Introductory Statistics Course, The Entity-Property-Relationship Approach,” 38
(2000), available at http://www.statland.org/MAAFIXED.PDF.
The second article comments on cancer and the meaning of the median. The URL below is
just one of many (many!) links to copies of this useful article on the web.
Stephen Jay Gould, The Median Isn’t the Message, Discover, 1985, available at
http://www.edwardtufte.com/tufte/gould.
1. List four things that you learned by carefully reading Hayden’s article.
2. (*) Get a copy of the text used for statistics at your school (or district) and evaluate it
in the manner the author evaluated texts in Table I (including the norm). Be sure to
state what text you are using with a complete reference.
3. Evaluate the TN State standards for statistics
http://state.tn.us/education/ci/math/index.shtml.
Do they appear to be written by statisticians or not? (E.g., are
the four r’s present? Are exploratory techniques unimportant, but exploratory attitudes
essential? Is data essential and central? In general, compare the standards to the
principles suggested in the article.)
4. Read the Jimmy Nut Company problem cited in Hayden’s article.
a) Pretend this incorrectly worded nut problem makes sense and find the χ² values
for pounds, for tons (2000 pounds) and grams (?? pounds).
b) Explain why test statistics must be dimensionless (i.e., unitless).
c) Explain how to correctly test the company’s claim and do so. (Indicate any missing
information necessary, and for the purposes of this example, supply reasonable
estimates so that you can complete your calculation.)
5. Read “The Median Isn’t the Message” and summarize it in (at least) three sentences
for your students.
6. (*) Gould vaguely describes the probability distribution of “survival time,” but of course
that curve would depend on the treatment method. Using the first line (Sugarbaker) of
Table 2: “Results of treatment of cytoreductive surgery combined with perioperative
intraperitoneal chemotherapy for diffuse malignant peritoneal mesothelioma” from
http://www.surgicaloncology.com/meso.htm, sketch the five number summary for
survival lengths. Clearly indicate the scale! (Note: this will take some estimating; there
is not sufficient evidence there to calculate most of the five numbers. Just show me you
understand what they represent by making reasonable guesses.)
7. In the previous exercise I had you use just the first row of Table 2. Now look at all
of the rows and give three things that might account for the differences between the
results these rows summarize.
8. Comment on the Belloc quote from this article:
Statistics are the triumph of the quantitative method, and the quantitative
method is the victory of sterility and death.
(For example: what does it mean? Do you agree with it? Did Gould?)
Though I heartily agree with Hayden’s suggestions, in this course I will take an approach
between what he recommends (e.g., as described by the AP Statistics Course) and the more
common mundane grunt-work approach (which is likely what our state will be testing high
school students for).
Project 2.
Fathoming π with χ
It appears that the digits of π behave like a “random” sequence: that is, in any base, every
pattern (such as pairs, full-houses, . . . ) should appear with the frequency expected. Such
numbers are called normal numbers. Warning: this is a very different meaning of the word
’normal’ than is used in statistics! In general, it is very difficult to prove that numbers are
normal, but we can present two preliminary tests. Open Fathom and open the collection of
the first 5000 digits of π:
Sample Documents | Mathematics | Number Theory | Pi 5000 Digits.ftm
Let’s test (part of) the assumption that π is normal. (Our purpose here is threefold: to practice
using Fathom, to calculate a few probabilities, and to apply a couple of hypothesis tests.)
1. If π is normal, what should be the distribution of the counts of digits of π (how many
0s, how many 1s, how many 2s, . . . )? (This can be answered with one word, but expand
your answer into a concise and accurate sentence.)
2. Graph the frequency histogram of the digit counts in Fathom with an appropriate title
for the digits and compare it with your previous answer. (For example, you can do this
in Fathom and use Snagit to insert it into your document.)
3. (*) Use an appropriate statistical test to see if the sample digit sequence (first 5000 digits of π)
provides evidence against the claim that the counts of digits follow the distribution
that you proposed.
a) Which, if any, of the assumptions required for this hypothesis test are questionable
in this case?
b) State the null and alternative hypotheses; always define any symbols that you use.
c) State the test statistic and then draw a graph with the appropriate tail shaded.
(Draw by hand?)
d) Calculate the p-value and then use it in a sentence which explains what it means.
(Convince me that you know what the p-value represents.)
e) Write your conclusion in the context of the problem.
4. The Fathom function runLength will show the number of matching values immediately
prior to and including the current value. For example 23314333 would return 11211123;
and 3333333 would return 1234567. Add runLength as an attribute to the collection
(not the table) and define it by the formula runLength(digit).
5. If π is normal, what should the expected frequency distribution of run lengths (the
measure runLength) look like? That is, for any randomly chosen 5000 consecutive
digits from any normal number, fill in the middle two columns of the table below.
6. (**) Use an appropriate statistical test to see if the sample digit sequence (first 5000 digits
of π) provides evidence against your claim about the runLength frequency distribution.
a) What assumptions required for this test are questionable in this case?
b) Fill in the final column of the table below. (Hint: One method is to open the
collection and add the measure named one defined by count(runLength=1), the
measure two defined by count(runLength=2). After you do the same for three,
add the measure more (the count of runLengths > 3).)
c) State the null and alternative hypotheses; always define any symbols that you use.
d) State the test statistic and draw a graph with the appropriate tail shaded. (You
may do the drawing by hand.)
e) Calculate the p-value and then use it in a sentence which explains what it means.
f) Write your conclusion in the context of the problem.
7. Why did I ask for “evidence against this claim” in the two problems above when what
we would really like is “evidence for this claim?” Explain why we can’t find evidence
for it more directly.
8. Bonus Question: Our table below has just four rows. Let’s consider the number of
rows.
a) Find which of the digits has the longest runLength in the first 5,000 digits of π.
b) This runLength is greater than four, so why did I stop the table below at “> 3?”
c) In 2010 Yee, Kondo, et al. calculated the first 5,000,000,000,000 digits of π
(http://www.numberworld.org/misc_runs/pi-5t/details.html). What would
be an appropriate number of rows for the corresponding table for this much longer
sequence of digits?
run length | probability | expected number | actual number
1          |             |                 |
2          |             |                 |
3          |             |                 |
>3         |             |                 |

We end with a few digits of π for your amusement.
π = 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679
8214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196
4428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273
7245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094
3305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912
9833673362440656643086021394946395224737190702179860943702770539217176293176752384674818467669405132
0005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235
4201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859
5024459455346908302642522308253344685035261931188171010003137838752886587533208381420617177669147303
5982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989
3809525720106548586327886593615338182796823030195203530185296899577362259941389124972177528347913151
5574857242454150695950829533116861727855889075098381754637464939319255060400927701671139009848824012
8583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912
9331367702898915210475216205696602405803815019351125338243003558764024749647326391419927260426992279
6782354781636009341721641219924586315030286182974555706749838505494588586926995690927210797509302955
3211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000
8164706001614524919217321721477235014144197356854816136115735255213347574184946843852332390739414333
4547762416862518983569485562099219222184272550254256887671790494601653466804988627232791786085784383
8279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863
0674427862203919494504712371378696095636437191728746776465757396241389086583264599581339047802759009
9465764078951269468398352595709825822620522489407726719478268482601476990902640136394437455305068203
4962524517493996514314298091906592509372216964615157098583874105978859597729754989301617539284681382
6868386894277415599185592524595395943104997252468084598727364469584865383673622262609912460805124388
4390451244136549762780797715691435997700129616089441694868555848406353422072225828488648158456028506
0168427394522674676788952521385225499546667278239864565961163548862305774564980355936345681743241125
1507606947945109659609402522887971089314566913686722874894056010150330861792868092087476091782493858
9009714909675985261365549781893129784821682998948722658804857564014270477555132379641451523746234364
5428584447952658678210511413547357395231134271661021359695362314429524849371871101457654035902799344
0374200731057853906219838744780847848968332144571386875194350643021845319104848100537061468067491927
8191197939952061419663428754440643745123718192179998391015919561814675142691239748940907186494231961
5679452080951465502252316038819301420937621378559566389377870830390697920773467221825625996615014215
0306803844773454920260541466592520149744285073251866600213243408819071048633173464965145390579626856
1005508106658796998163574736384052571459102897064140110971206280439039759515677157700420337869936007
2305587631763594218731251471205329281918261861258673215791984148488291644706095752706957220917567116
7229109816909152801735067127485832228718352093539657251210835791513698820914442100675103346711031412
6711136990865851639831501970165151168517143765761835155650884909989859982387345528331635507647918535
8932261854896321329330898570642046752590709154814165498594616371802709819943099244889575712828905923
2332609729971208443357326548938239119325974636673058360414281388303203824903758985243744170291327656
1809377344403070746921120191302033038019762110110044929321516084244485963766983895228684783123552658
2131449576857262433441893039686426243410773226978028073189154411010446823252716201052652272111660396
6655730925471105578537634668206531098965269186205647693125705863566201855810072936065987648611791045
3348850346113657686753249441668039626579787718556084552965412665408530614344431858676975145661406800
7002378776591344017127494704205622305389945613140711270004078547332699390814546646458807972708266830
6343285878569830523580893306575740679545716377525420211495576158140025012622859413021647155097925923
0990796547376125517656751357517829666454779174501129961489030463994713296210734043751895735961458901
9389713111790429782856475032031986915140287080859904801094121472213179476477726224142548545403321571
8530614228813758504306332175182979866223717215916077166925474873898665494945011465406284336639379003
9769265672146385306736096571209180763832716641627488880078692560290228472104031721186082041900042296
6171196377921337575114959501566049631862947265473642523081770367515906735023507283540567040386743513
6222247715891504953098444893330963408780769325993978054193414473774418426312986080998886874132604721
5695162396586457302163159819319516735381297416772947867242292465436680098067692823828068996400482435
4037014163149658979409243237896907069779422362508221688957383798623001593776471651228935786015881617
5578297352334460428151262720373431465319777741603199066554187639792933441952154134189948544473456738
3162499341913181480927777103863877343177207545654532207770921201905166096280490926360197598828161332
3166636528619326686336062735676303544776280350450777235547105859548702790814356240145171806246436267
9456127531813407833033625423278394497538243720583531147711992606381334677687969597030983391307710987
0408591337464144282277263465947047458784778720192771528073176790770715721344473060570073349243693113
8350493163128404251219256517980694113528013147013047816437885185290928545201165839341965621349143415
9562586586557055269049652098580338507224264829397285847831630577775606888764462482468579260395352773
4803048029005876075825104747091643961362676044925627420420832085661190625454337213153595845068772460
2901618766795240616342522577195429162991930645537799140373404328752628889639958794757291746426357455
2540790914513571113694109119393251910760208252026187985318877058429725916778131496990090192116971737
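If you would like to check how runLength behaves outside of Fathom (problem 4 of this project), the following is a minimal Python sketch. The names run_length and tally are my own, chosen for illustration; this is not Fathom's implementation, only a function that reproduces the two examples given in problem 4 and the four counts described in the hint to problem 6.

```python
from collections import Counter

def run_length(digits):
    """For each position, count the matching values immediately prior to
    and including the current value (mimics the described runLength)."""
    lengths = []
    for i, d in enumerate(digits):
        if i > 0 and d == digits[i - 1]:
            lengths.append(lengths[-1] + 1)   # the run continues
        else:
            lengths.append(1)                 # a new run starts here
    return lengths

def tally(lengths):
    """Count positions with run length 1, 2, 3 and more than 3
    (the four rows of the table in this project)."""
    counts = Counter(min(n, 4) for n in lengths)   # 4 stands for ">3"
    return {"1": counts[1], "2": counts[2], "3": counts[3], ">3": counts[4]}

# The two examples from problem 4.
print(run_length("23314333"))
print(run_length("3333333"))
print(tally(run_length("23314333")))
```

To use it on π, paste the digits printed above into a single string (dropping the leading "3." if you want only the decimal digits) and pass that string to run_length.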
Project 3.
Brushing teeth
In this project we will complete a highly modified version of Investigation #9: Planning and
conducting a survey in the “Making Sense of Statistical Studies” text by modifying it to
parallel the Gallup article Teens and Teeth: Most Brush, Few Floss
(http://www.gallup.com/poll/13009/teens-teeth-most-brush-few-floss.aspx).
Warning: this involves collecting survey information, so start early enough to
get this done!
Note that I refer to the questions in the “investigation” so that you may more fully understand
what I am asking. Read the Gallup article before starting. Do the following.
1. Present a survey form designed to collect (a) gender, (b) “on the average, about how
many times do you brush your teeth in a typical day,” and (c) “how often, if ever, do
you floss between your teeth in a typical week;” using the possible answers found in the
Gallup poll. (Similar to investigation questions 1 and 2).
2. Discuss how you will choose your sample. For example, an ideal sample would be a simple
random sample; can you get one? If this is not possible, be sure to explain why the sampling
method you choose will give the most representative sample you can reasonably obtain.
(Similar to investigation questions 3 and 4).
3. Discuss what you can do to increase the probability that the responses you get from
students are truthful. That is, how will you avoid them answering what they think
you want to hear, or what they would like you to believe about them, or what might
impress their buddy. . .
4. Record the raw results of your survey and then construct bar charts of your data which
parallel those in the Gallup article. (These may be drawn by hand, in Fathom, in
Excel,. . . , just be sure to label and title appropriately.) Intuitively (visually) compare
your results to the Gallup results. (Similar to investigation questions 5 and 6).
5. Do you think that your results are representative? (Part of your answer must state
what population you are sampling.) If not representative, how do you think they
vary from the actual results? What aspects of your survey design support your answers?
6. Do you think there is a difference between what your students say they do and what
they actually do? Could a different design help? (Similar to investigation question 8.)
7. (*) Appropriately conduct a hypothesis test using the Gallup data to test the article’s
claim “Girls are somewhat more likely than boys to brush twice a day or more.” (Discuss
the assumptions, identify any variables, calculate the test statistic and a p-value, present
your conclusion in the context of the question.)
8. (*) Repeat the previous question with your sample (instead of using the Gallup sample).
9. (*) For each of the five responses to the question about flossing, construct a confidence
interval. For the answer “never”, write a sentence (or two) that explains what this
confidence interval represents. (Convince me that you understand!)
Project 4.
Confidence Interval Pedagogy
In this project you will develop a lesson plan to teach the confidence interval for a single
mean and develop a worksheet to illustrate it for your students. We’ll assume your students
can calculate areas under the standard normal curve, find critical values $z_c$, and calculate
z-scores. I know most of you are not teaching statistics right now, but let’s pretend that
you are.
1. (**) Start by working through the student project “100 Confidence Intervals.” Complete
it and turn your work in as part of this project.
2. (**) Write a lesson plan which introduces the “large sample” estimate:
\[ \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}. \tag{1} \]
I want to see that you can make this derivation clear to a high school student. One
possibility: start with the central limit theorem, shade the correct region under the
appropriate graph, define the “critical values” $z_{\alpha/2}$, use these with
\[ z = \frac{x - \bar{x}}{s/\sqrt{n}} \]
to find the endpoints of the interval. Be sure to discuss the necessary assumptions and
the correct interpretation of the final result.
3. Briefly extend your teaching plan to include the interval estimate that uses Student’s
t-distribution. This is usually done with “hand-waving,” but explain as best as possible
while your hands wave.
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \quad \text{with } df = n - 1. \tag{2} \]
4. (**) Rewrite the attached student project “100 Confidence Intervals” that I quickly
drafted to make it better and more appropriate for your students (e.g., just give them
a distribution to start with? ask better lead-in questions? . . . .) Clean up any parts
you found difficult to follow and add whatever you think might help. If you would like,
you can change the technology from Fathom to whatever you prefer.
5. Bonus Question: Write a very short lesson (15-20 minutes) to introduce the one-sided
confidence interval (again focus on the large sample case because it follows easily from
the central limit theorem, then briefly hand wave a t-interval).
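As a quick arithmetic check for your lesson plan, here is a hedged Python sketch of the large-sample estimate (1) from question 2. The function name z_interval and the sample values are hypothetical, made up only to exercise the formula; the critical value comes from the standard library's NormalDist, which is just one of several ways to look it up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_interval(data, confidence=0.95):
    """Large-sample interval x̄ ± z_{α/2} · s/√n for the population mean."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)                            # sample standard deviation
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical value z_{α/2}
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical data: 40 made-up measurements, not from any real study.
sample = [48 + (7 * i) % 11 for i in range(40)]
low, high = z_interval(sample)
print(f"95% interval: ({low:.2f}, {high:.2f})")
```

Changing the confidence argument (say to 0.90 or 0.99) lets students see how the critical value, and so the width of the interval, responds.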
Student Project: 100 Confidence Intervals
The goal of this project is to better understand what “confidence” means when constructing
confidence intervals. To do this we will use Fathom to construct 100 confidence intervals
for the mean, each with 95% confidence, and then count how many of them contain the
population mean.
Warmup:
1. Describe, in words, what a 95% confidence interval for the mean of a population is.
Be sure to use “95%,” “sample mean,” “population mean,” and “interval” in your
answer. Do not give (or describe) the formula for such an interval—just explain what
it represents. This is a very important problem!
2. If we take 100 different large samples of a population, and form the 100 different
95% confidence intervals for the mean that go with these samples, how many of the
confidence intervals would we expect to contain the sample mean? How many of the confidence
intervals would we expect to contain the population mean? Explain why your answers
are correct.
3. Suppose you had chosen other numbers (say 1,000,000 times as large), would that
change your answer to the previous problem?
For our population we will choose a normal population. Any other population could be used
if the sample size were large enough, but this is a very common choice.
3. Choose an integer (such as 5) for the mean, and a positive integer (such as 3) for the
standard deviation. Record your choices: µ = ____, σ = ____. (Why can’t the standard
deviation be negative?)
4. Create a collection named Sample by dragging the collection icon onto the Fathom
document (and then renaming it Sample).
5. In your collection Sample add an attribute x defined by the formula
randomNormal(mean,standard deviation)
where mean and standard deviation are the values that you choose above.
6. Now left click on your collection and add 50 new cases.
This makes Sample a sample of 50 values of x chosen from your normal distribution. Now
let’s form a confidence interval. You might also graph the x values from this sample as a dot
plot or histogram to better understand the sample.
7. Drag an estimate from the menu bar onto the Fathom document. Select “Estimate
Mean” and drag the attribute x from your collection Sample onto the appropriate
region of the Estimate.
8. This gives us one confidence interval, but we’d like 100, so right click on the estimate
and select “Collect Results As measures”. This will pop up a small window in which
you can check “Replace existing cases” and change the 5 measures to 100 measures.
Now you should have one-hundred 95% confidence intervals, each created from a different
sample of 50 cases from your original population.
9. To see these intervals better, left click on the “Measures from Estimate of Sample”
(your collection of confidence intervals) and then drag a table from the menu onto the
document. This will display the 100 confidence intervals in a table. The first column is
the sample mean, the next two are the endpoints of the confidence interval.
10. Count how many do not contain the population mean. (Hint: right click on
lowerConfidenceBound
and select “sort descending”; this will make it easy to count the cases where the
population mean is too small to be in the interval. Then “sort ascending” the upper
bound. . . .) Record the count. Did you find more or fewer intervals that do not contain
the population mean than you expected (question 2)?
11. Right click on your collection of confidence intervals (“Measures from Estimate of
Sample”) and select “Collect more measures.” (This will resample and form new
confidence intervals.) Repeat the previous exercise with this new set of 100 confidence
intervals.
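If Fathom is not available, the same experiment can be sketched in Python. This is only an illustration under stated assumptions: the name count_misses is my own, and the interval here is the large-sample z interval, which may differ slightly from whatever Fathom's “Estimate Mean” computes internally.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def count_misses(mu, sigma, n=50, intervals=100, confidence=0.95, seed=None):
    """Build `intervals` confidence intervals, each from a fresh sample of
    n normal values, and count how many do NOT contain the population mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    misses = 0
    for _ in range(intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = mean(sample)
        margin = z * stdev(sample) / sqrt(n)
        if not (x_bar - margin <= mu <= x_bar + margin):
            misses += 1
    return misses

# mu = 5 and sigma = 3 are the example choices suggested in step 3.
print(count_misses(mu=5, sigma=3, seed=1))
```

Calling the function again with a different seed plays the role of “Collect more measures”: each call resamples and forms a new set of 100 intervals.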
Project 5.
Answers may vary
. . . by how much?
A very common exercise that teachers assign their students (and one that I would recommend
that you use) is to have the students use technology to construct 95% confidence intervals
from 100 different samples (often of some set size, e.g., n = 100) from a given population
for which you know the mean (for example a normal distribution with mean 200 and standard
deviation 50).[1] Then once they have the one-hundred 95% confidence intervals, have them
circle those that do not contain the population mean, and then count how many they circled.
In this project I want you to explore what the set of your students’ responses should look
like. That is, your students’ responses form a sample of the counts of 95% confidence intervals
that do not contain the mean. What does this sampling distribution look like?
1. Consider the random variable x, “the count of circled intervals in a random sample of
100 confidence intervals”, what type of distribution does this have? (Hint: it is one of
the standard ones. Start by asking yourself “discrete or continuous?” Once you decide
which distribution it is, name it, then list and verify the assumptions for that type of
distribution.)
2. What will the mean and standard deviation of this variable x be? That is, on the
average, how many intervals will each student circle? How much variation should we
expect? (Hint: these answers do not depend on the distribution that they are sampling
from or on the size of the samples they take each of the 100 times–at least as long as
the samples are each large enough to reasonably form the confidence intervals.)
3. Describe the probability distribution for x. (In your table, just list those for which the
probability is at least 0.001).
A second common exercise is to have the students flip a coin n times and count the number of
heads. We usually do this with technology so we can have the students flip 1,000 or 100,000
times. When students decide to just make up an answer, rather than do the exercise, they
often make up a number very close to n/2. Let’s use Fathom to explore how close to n/2 the
answers should be. So open Fathom and below we will create a collection in which each case
represents a random sample of coin flips. We will also use a random number of flips (rather
than a fixed n).
[1] We did this in the previous project!
4. (*) Drag a new collection onto your Fathom worksheet and proceed as follows.
a) Add an attribute flips (which will be the number of flips). This should be a
random integer between 30 and 10,000,000 (inclusive). You can do this with the
formula randomInteger(30,10000000).
b) Add an attribute heads to count the number of heads defined by the formula
randomBinomial(flips). (Note that this does not indicate the probability of
success explicitly, but that probability defaults to 1/2.)
c) Add an attribute expected (defined by the appropriate formula) to give how many
heads we expect for that many flips.
d) Add an attribute std dev (defined by the appropriate formula) to give the standard
deviation of the binomial distribution with this number of flips.
e) Now add the attribute we would like to understand: abs difference defined by
the formula abs(expected-heads). This is the measure of how far the count of
heads in this case varies from what we expected it to be.
f) Now add cases to your collection so that you have 2000 cases.
5. Construct (and submit) a scatter plot of the absolute difference (vertical) versus the
number of flips (horizontal). It should look something like the following.
6. To that graph, add the plot of three functions at one standard deviation, two standard
deviations and three standard deviations (remember you have found a formula for one
standard deviation above; use that formula, not the attribute std dev).
7. Now let’s count what is under each of those curves. To the collection add measures:
one, two, and three defined by
count(abs difference < std dev)/count(),
count(abs difference < 2std dev)/count(), and
count(abs difference < 3std dev)/count().
These count the number of cases that meet the criteria, then divide by the total number
of cases to make the result a proportion. Record what these proportions equal.
8. (*) Click on rerandomize a couple times to see how the graph changes. Look at the
measures defined in the last problem (you should recognize them), explain why they
have the values that they do. What rule do they illustrate?
9. Bonus Question: Add the new function $\sqrt{\frac{\text{flips}}{2\pi}}$ to your graph (this is the asymptotic
mean absolute deviation). Add a measure to the collection that calculates the proportion
of cases for which the absolute deviation is less than this value. Report what this
proportion equals and explain why it is not approximately 1/2 (that is, why are not half
of the absolute deviations on each side of the mean absolute deviation?).
In the next project we will discover why $\sqrt{\frac{\text{flips}}{2\pi}}$ is the correct mean absolute deviation from
the mean for the normal approximation to the binomial distribution.
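For those who prefer to check the coin-flip collection of problems 4 and 7 outside Fathom, here is a rough Python sketch. Several things are assumptions of mine rather than part of the project: the names simulate and proportion_within, the reduced upper limit of 100,000 flips (chosen only so the sketch runs quickly), and the trick of counting the 1 bits of a random integer to stand in for randomBinomial(flips) with probability 1/2.

```python
import random
from math import sqrt

def simulate(cases=2000, min_flips=30, max_flips=100_000, seed=None):
    """Return one (flips, abs_difference, std_dev) row per case."""
    rng = random.Random(seed)
    rows = []
    for _ in range(cases):
        flips = rng.randint(min_flips, max_flips)        # inclusive range
        # Each random bit is a fair coin; the number of 1 bits counts heads.
        heads = bin(rng.getrandbits(flips)).count("1")
        expected = flips / 2                             # binomial mean, p = 1/2
        std_dev = sqrt(flips * 0.5 * 0.5)                # binomial std dev, p = 1/2
        rows.append((flips, abs(expected - heads), std_dev))
    return rows

def proportion_within(rows, k):
    """Proportion of cases whose absolute difference is below k std devs."""
    return sum(1 for _, diff, sd in rows if diff < k * sd) / len(rows)

rows = simulate(seed=0)
for k in (1, 2, 3):
    print(k, proportion_within(rows, k))
```

Rerunning with a different seed corresponds to clicking rerandomize in Fathom; the three proportions move a little each time but stay close to the same values.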
Project 6.
Continuous Distributions
This summer we looked at ways of calculating the mean for two different situations. First, for a
frequency distribution the mean is defined by
\[ \mu = \frac{\sum x_i}{n} = \sum_{i=1}^{n} x_i \cdot \frac{1}{n}. \tag{3} \]
We remember the first of these sums as “add’em up, divide by how many;” but the second
may be more informative: “multiply each value by its frequency of occurrence (its density),
then add them up.” In all such formulas the sums run over every possible value of $x_i$ (that is,
every possible value of i); but since this is always the case, we usually leave the limits off of
the sums and often also leave off the subscripted ‘i’s.
When we have a discrete probability distribution, where $x_i$ occurs with probability
$p(x_i)$, then the natural generalization of the above is to replace the density $\frac{1}{n}$ with the
probability $p(x_i)$ as follows.
\[ \mu = \sum x_i \, p(x_i). \tag{4} \]
So if we switch to a continuous random variable x with probability density function
f(x) (i.e., the probability x is between a and b is $p(a < x < b) = \int_a^b f(x)\,dx$), we just change
this sum above to an integral
\[ \mu = \int x f(x)\,dx. \tag{5} \]
Again we integrate over every possible x value.
Ideally you should see these formulas for the mean as all versions of the same simple
concept. So, for example, from the first you should be able to determine the others. Remember:
whenever possible try to understand well enough that you do not need to memorize formulas!
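To see formulas (3), (4) and (5) as versions of one idea, here is a small Python sketch. The data set and the density f(x) = 3x² on [0, 1] are hypothetical, chosen only for illustration (the density is deliberately not one of those used in the problems below), and the integral is approximated with a midpoint sum rather than evaluated exactly.

```python
from collections import Counter
from fractions import Fraction

data = [2, 3, 3, 5, 5, 5, 8]          # hypothetical data set
n = len(data)

# Formula (3): add them up, divide by how many.
mean_direct = Fraction(sum(data), n)

# Formula (4): each distinct value times its relative frequency p(x_i).
p = {x: Fraction(count, n) for x, count in Counter(data).items()}
mean_weighted = sum(x * px for x, px in p.items())

# Formula (5): the sum becomes an integral, approximated by a midpoint sum.
def continuous_mean(f, a, b, steps=100_000):
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width      # midpoint of the i-th subinterval
        total += x * f(x)
    return total * width

def f(x):
    return 3 * x**2                    # hypothetical density on [0, 1]

print(mean_direct, mean_weighted)
print(continuous_mean(f, 0, 1))
```

The first two results agree exactly, which is the point of the second reading of formula (3); the third is the same weighted sum taken over very thin slices.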
For the variance (which gives the standard deviation), the formulas for these three
situations are as follows.
frequency distribution:
\[ \sigma^2 = \sum (x_i - \mu)^2 \, \frac{1}{n} \tag{6} \]
discrete probability distribution:
\[ \sigma^2 = \sum (x_i - \mu)^2 \, p(x_i) \tag{7} \]
continuous probability distribution:
\[ \sigma^2 = \int (x - \mu)^2 f(x)\,dx \tag{8} \]
Again, from the first of these (which is worth memorizing), you should be able to figure out
the other two. Let’s try these formulas for the mean and variance out on some continuous
distributions.
First let’s consider the uniform distribution on the interval [a, b].
1. (*) Calculate the mean µ as follows.
a) Use the fact that the sum of all probabilities, $\int_a^b f(x)\,dx$, must be one to show
$f(x) = \frac{1}{b-a}$. (Hint: do you understand why “the distribution is uniform” means that
f(x) is a constant?)
b) By integrating the appropriate integral, find the mean µ for this distribution (it
will be a function of a and b).
c) Explain how to show a high-school student that your formula for the mean is
correct without using an integral. (Hint: use a simple geometry argument.)
d) Find the median of this distribution. How do you know you are correct? (Hint: no
integral is necessary, but if you want to know how one could use integrals, solve
$\int_a^m f(x)\,dx = \int_m^b f(x)\,dx$ for the median m.)
2. (*) By integrating the appropriate integral, find the variance σ² for this distribution
(it will be a function of a and b). (Hint: this will just be the integral of a polynomial,
multiply it out and integrate.) Also find the standard deviation σ.
(It is possible to find the answers to the above if you look around, but what I am asking for
here is the work necessary to find them.)
Next let’s consider the triangular distribution f(x) = 2x defined on the interval [0, 1].
3. (*) Calculate the mean µ as follows.
a) By integrating the appropriate integral from 0 to 1, find the mean µ for this
distribution (it will be a real number).
b) Explain how to show a high-school student that your formula for the mean is
correct without using an integral. (Hint: use a simple geometry argument.)
c) Find the median of this distribution. How do you know you are correct? (Hint:
again no integral is necessary.)
4. (*) By integrating the appropriate integral, find the variance σ² for this distribution (it
will be a real number). (Hint: this will just be the integral of a polynomial, multiply it
out and integrate.) Also find the standard deviation σ.
Finally let’s consider the normal curve. This is defined by the following density function
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{9} \]
5. Write down the integral to determine the mean µ of the standard normal distribution.
Explain why we know the value without integrating. (Hints: standard, symmetry; if
stuck, graph the pdf f (x).)
6. Now let’s find the mean value of |x| for the standard normal distribution. One way to
do this is to just put |x|f(x), not xf(x), in equation (5). To evaluate this integral, use
the fact that the function in this integral is even, so we can integrate the positive half
(from 0 to ∞) and double the answer. On this interval |x|f(x) = xf(x), so just work
out this integral. (Hint: Consider the substitution u = x².)
An odd example for those up to it. Consider the probability density function
$$f(x) = \frac{2}{\pi^2 + x^2}$$
defined on the non-negative real numbers [0, ∞).
7. Bonus Question: Set up and evaluate the integral necessary to find the mean value of
this distribution. Comment appropriately.
By the way, the median of this distribution is π, that is,
$$\int_0^\pi \frac{2\,dx}{\pi^2 + x^2} = \int_\pi^\infty \frac{2\,dx}{\pi^2 + x^2} = \frac{1}{2}.$$
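The median claim can be checked with a few lines of Python. This is only a sketch: it assumes the arctangent antiderivative of 2/(π² + t²), namely (2/π)·arctan(t/π), and the function name `cdf` is made up for this illustration.

```python
from math import atan, pi

def cdf(x):
    """Area under f(t) = 2 / (pi^2 + t^2) from 0 to x, via the arctangent antiderivative."""
    return (2 / pi) * atan(x / pi)

left = cdf(pi)                # area from 0 to pi
total = (2 / pi) * (pi / 2)   # limit of cdf(x) as x grows, since atan tends to pi/2
right = total - left          # area from pi to infinity
print(left, right, total)
```

Both areas come out to one half (up to rounding) and the total area is 1, which is what it means for π to be the median of a density.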
Project 7.
Surveys and Experimental Studies
Read chapters one and two of “An Introduction to Statistical Methods and Data Analysis” by
Ott and Longnecker (6th edition, 2010). We discussed most of this material last summer, but
it has been a while, and this text does go into more detail. Complete the following problems.
• 1.5
• 2.1
• 2.4
• 2.5
• 2.10
• 2.12
• (*) 2.16
• 2.18
• 2.19
• Bonus Question: 2.22
• Bonus Question: 2.24
Project 8.
Describing Data and Determining
Probability
Read chapters three and four of “An Introduction to Statistical Methods and Data Analysis”
by Ott and Longnecker (6th edition, 2010). We discussed most of this material last summer,
but it has been a while, and this text does go into more detail. Complete the following
problems. (You may skip section 4.14 for now; we will look at it in detail in the next project.)
• 3.4
• (*) 3.9
Before doing this problem, correct and extend the table: 2003: 405 billion, 3.7%; 2004:
454 billion, 3.8%; 2005: 494 billion, 4.0%; 2006: 520 billion, 3.9%; 2007: 548 billion,
3.9%; 2008: 612 billion, 4.2%; 2009: 656 billion, 4.6%. (Source:
http://www.cbo.gov/ftpdocs/108xx/doc10871/historicaltables.pdf, tables F7 and F8; there is
also a fine (but dated) example graph at
http://www.defense.gov/news/FY10%20Budget%20Request.pdf.)
• Bonus Question: 3.13
• 3.19
• 3.39
• 3.54
• 4.26
• 4.34
• 4.36
• 4.64
• 4.79
• 4.88
Project 9.
Recognizing Normality
Many statistical tests require that the data set be nearly normal. There are two common
ways to check whether data are normal: visually inspecting a histogram (not very useful for
small data sets) and using a normal quantile plot. We will explore both of these in this
project. The first problem contains a lot of grunt work, but I think that work suggests some
important ideas and should make the third problem make more sense.
1. (**) Let's begin by making sure we recall what measures of position are. Consider the
sample of ten US women’s heights
S = {184, 183, 152, 172, 155, 151, 158, 173, 155, 161}.
a) Find the following percentiles: 6th, 7th, 8th, 9th, 10th, 11th, 12th, 13th.
b) Find the following percentiles: 5th, Q1, Q2, median, 0%, 100%. Was the way you
found the zeroth and 100th percentiles the same as the way you found all of the
others?
c) Repeat the previous problem with the standard normal curve.
d) The z-value with 5% to the left is the 5th percentile, but it is also called the
1/20th quantile. In general, the xth quantile is the z-value for which the area to
the left under the normal curve is x. Find the following quantiles.
$$y_1 = \frac{1 - \frac{1}{2}}{10},\quad y_2 = \frac{2 - \frac{1}{2}}{10},\quad y_3 = \frac{3 - \frac{1}{2}}{10},\quad \ldots,\quad y_{10} = \frac{10 - \frac{1}{2}}{10}$$
e) Finally, sort the data items x_i above and draw a scatter plot of (x_i, y_i) for
i = 1, 2, ..., 10.
2. Look in the review texts for the AP Statistics test and explain how they suggest
visually inspecting a histogram for normality. (This shows up in many of the applications of
statistical tests, confidence intervals, ...)
3. (***) Read section 4.14 of the Ott text and complete exercise 4.114. This problem
asks for a graphical and a quantitative assessment; we will do both with Fathom.
a) First put the data set in Fathom. I found it helpful to create a collection with
just one attribute (at first), say mass, and the right number of cases, then create
a table (select the collection and drag a table to the document)—fill in the table
with the data.
b) To provide the graphical assessment, plot the mass using both a histogram and a
normal quantile plot. Submit both graphs and what you conclude from them.
The quantitative assessment is more problematic because it uses a correlation coefficient.
Fathom does display a regression line and its equation, but not the correlation coefficient
we need to complete the analysis. In a way this is good—we will recreate the graph
ourselves and hopefully better understand the normal quantile plot that way.
Read page 197 of Ott very carefully; we will use the method there to redraw
Fathom's normal quantile plot. The idea is to first sort the data items, and then draw
a scatter plot of points (x, y) where x is the ith data point and y is the value
on the standard normal curve where you would expect the ith term of a sample
to occur. For example, the 5th of 12 items is about 5/12 of the way through the data
set of 12 items, so graph it as the x value along with a z value which has 5/12 of the
data to its left (a.k.a. z_{1−5/12}). This is a fine idea, except it will not work. The largest
of the 12 data points would be 12/12 of the way through the data, so it is the
100th percentile, which for a normal distribution would be at infinity. So we use the
sorted data points
mass_1, mass_2, ..., mass_i, ..., mass_n
as the x-values, and for the ith y value, find the z-value which has (not i/n of the
normal curve area to the left, but) (i − 0.567)/(n − 0.134) of the area to its left (see
footnote 2). This differs from the text in two ways: first, they use the data as the
y-value, not the x (but this does not affect the correlation coefficient); second, they
suggest first (i − 0.5)/n, then (i − 0.375)/(n + 0.25) as the correct quantile. There are
at least a dozen such suggestions in the literature, each normalized in a different way.
c) To your collection of data first add the index (attribute) i (hint: caseIndex, which
is under special in the function menu), and then the normal quantile: add another
new attribute normalQuantile and define it by
$$\text{normalQuantile}\left(\frac{i - 0.567}{\text{count}() - 0.134}\right).$$
Note that n = count() is the number of data points.
d) Now draw a scatter plot of this data set using mass on the horizontal axis and
normalQuantile on the vertical. Add a regression line. If all went well, it should
look very close to the normal quantile plot drawn in step (b) by Fathom. If it
does, submit it; if not, contact me for help!
Footnote 2: I think this is what Fathom uses; I will try to find out.
e) Finally, use the correlation coefficient and table 16 in our text to give a
quantitative assessment of the normality of the mass data. (Hint: see table 4.13 on
page 197; there is an example of this at the top of page 199.)
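For readers who want to see the same construction outside Fathom, here is a minimal Python sketch of steps (c) through (e). It assumes the plotting position (i − 0.567)/(n − 0.134) described above. The function names `normal_scores` and `pearson_r` and the eight data values are hypothetical, made up for this illustration; they are not the data from exercise 4.114.

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_scores(n, a=0.567, b=0.134):
    """z-values with area (i - a)/(n - b) to their left, for i = 1..n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - a) / (n - b)) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Correlation coefficient of the paired values (xs[i], ys[i])."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical sample, made up for illustration (not the data from exercise 4.114).
data = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5, 4.4, 5.0]
xs = sorted(data)                 # sorted data on the horizontal axis
ys = normal_scores(len(xs))       # expected z-values on the vertical axis
r = pearson_r(xs, ys)

for x, y in zip(xs, ys):
    print(f"{x:5.1f}  {y:7.3f}")
print("r =", round(r, 3))
```

The printed pairs are the points of the normal quantile plot, and r is the correlation coefficient to compare against the table in the text. Changing `a` and `b` to 0.5 and 0, or to 0.375 and −0.25, gives the two plotting positions suggested in the text.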
Project 10.
...
.
Project 11.
...
.
Project 12.
...
[after Thanksgiving]
Project 13.
...
.
Project 14.
Final
As we discussed this summer, we will end with a final, a subset (possibly all) of an AP
Statistics exam. This means that I do not need to say any more about it because you have
two books designed just to review for the final:
1. Barron’s AP Statistics with CD-ROM
2. Cracking the AP Statistics Exam, 2010 Edition
We worked through much of the theory in the first one during the summer institute.