Problem Set 2
Math 243 – Summer 2012
1 Answer each of the following based on your reading of Chapter 2.
a) What are the two major different kinds of variables?
b) How are variables and cases arranged in a data frame?
c) How are these things related: population, sampling frame, sample, census?
d) What’s the difference between a longitudinal and a cross-sectional sample?
e) Describe some types of sampling that can lead to the sample being unrepresentative of the population.

2 This problem is based on the KidsFeet data set in the mosaic package. For all the parts of this problem, give both the R statements to find what is asked for and the output of the command. (Use copy and paste.)
a) The names of the variables.
b) The mean of the foot width variable.
c) Which of the cases are girls.
d) The mean foot width for the subset of data for the girls.
The following give some practice using logical operators.
e) The mean foot width for the subset of data for people whose bigger foot is left and dominant hand is also left.
f) The mean foot width for the subset of data for people who are either male or whose bigger foot matches the dominant hand.
g) The mean foot width for the subset of data for people whose bigger foot does NOT match the dominant hand.

3 Here is a small data frame about automobiles.

Make and model   Vehicle type   Trans. type   # of cyl.   City MPG   Hwy MPG
Kia Optima       compact        Man.          4           21         31
Kia Optima       compact        Auto.         6           20         28
Saab 9-7X AWD    SUV            Auto.         6           14         20
Saab 9-7X AWD    SUV            Auto.         8           12         16
Ford Focus       compact        Man.          4           24         35
Ford Focus       compact        Auto.         4           24         33
Ford F150 2WD    pickup         Auto.         8           13         17

a) What are the cases in the data frame?
A  Individual car companies
B  Individual makes and models of cars
C  Individual configurations of cars
D  Different sizes of cars
b) Identify which variables are quantitative and which are categorical. Note that some are not variables at all.
• Kia Optima: not.a.variable categorical quantitative
• City MPG: not.a.variable categorical quantitative
• Vehicle type: not.a.variable categorical quantitative
• SUV: not.a.variable categorical quantitative
• Trans. type: not.a.variable categorical quantitative
• # of cyl.: not.a.variable categorical quantitative
c) Why are some cars listed twice? Is this a mistake in the table?
A  Yes, it’s a mistake.
B  A car brand might be listed more than once, but the cases have different attributes on other variables.
C  Some cars are more in demand than others.

4 Here is a data set from an experiment about how reaction times change after drinking alcohol.[?] The measurements give how long it took for a person to catch a dropped ruler. One measurement was made before drinking any alcohol. Successive measurements were made after one standard drink, two standard drinks, and so on. Measurements are in seconds.

Before   After 1   After 2   After 3
0.68     0.73      0.80      1.38
0.54     0.64      0.92      1.44
0.71     0.66      0.83      1.46
0.82     0.92      0.97      1.51
0.58     0.68      0.70      1.49
0.80     0.87      0.92      1.54
and so on ...

(a) What are the rows in the above data set?
A  Individual measurements of reaction time.
B  An individual person.
C  The number of drinks.
(b) How many variables are there?
A  One: the reaction times.
B  Two: the reaction times with and without alcohol.
C  Four: the reaction times at four different levels of alcohol.
The format used for these data has several limitations:
• It provides no flexibility for different levels of alcohol, for example 1.5 standard drinks, or for taking into account how long the measurement was made after the drink.
• It leaves no room for multiple measurements of an individual at one level of alcohol, for example, two or three baseline measurements, or two or three measurements after one standard drink.
Another format, which would be better, is this:

SubjectID   ReactionTime   Drinks
S1          0.68           0
S1          0.73           1
S1          0.80           2
S1          1.38           3
S2          0.54           0
S2          0.64           1
S2          0.92           2
and so on ...

What are the cases in the reformatted data frame?
A  Individual measurements of reaction time.
B  An individual person.
C  The number of drinks.
How many variables are there?
A  The same as in the original version. It’s the same data!
B  Three: the subject, the reaction time, the alcohol level.
C  Four: the reaction times at four different levels of alcohol.

The lack of flexibility in the original data format indicates a more profound problem. The response to alcohol is not just a matter of quantity, but of timing. Drinks spread out over time have less effect than drinks consumed rapidly, and the physiological response to a drink changes over time as the alcohol is first absorbed into the blood and then gradually removed by the liver. Nothing in this data set indicates how long after the drinks the measurements were taken. The small change in reaction time after a single drink might reflect that there was little time for the alcohol to be absorbed before the measurement was taken; the large change after three drinks might actually be the response to the first drink finally kicking in. Perhaps it would have been better to make a measurement of the blood alcohol level at each reaction-time trial.
It’s important to think carefully about how to measure your variables effectively, and what you should measure in order to capture the effects you are interested in.

5 Sometimes categorical information is represented numerically. In the early days of computing, it was very common to represent everything with a number. For instance the categorical variable for sex, with levels male or female, might be stored as 0 or 1. Even categorical variables like race or language, with many different levels, can be represented as a number. The codebook provides the interpretation of each number (hence the word “codebook”).
Here is a very small part of a dataset from the 1960s used to study the influence of smoking and other factors on the weights of babies at birth.[?] gestation.csv

gest.   wt    race   ed   wt.1   inc   smoke   number
284     120   8      5    100    1     0       0
282     113   0      5    135    4     0       0
279     128   0      2    115    2     1       1
244     138   7      2    178    98    0       0
245     132   7      1    140    2     0       0
351     140   0      5    120    99    3       2
282     144   0      2    124    2     1       1
279     141   0      1    128    2     1       1
281     110   8      5    99     2     1       2
273     114   7      2    154    1     0       0
285     115   7      2    130    1     0       0
255     92    4      7    125    1     1       5
261     115   3      2    125    4     1       5
261     144   0      2    170    7     0       0

At first glance, all of the data seems quantitative. But read the codebook:

gest. - length of gestation in days
wt - birth weight in ounces (999 unknown)
race - mother’s race
  0-5=white 6=mex 7=black 8=asian
  9=mixed 99=unknown
ed - mother’s education
  0= less than 8th grade,
  1 = 8th -12th grade - did not graduate,
  2= HS graduate--no other schooling ,
  3= HS+trade,
  4=HS+some college
  5= College graduate,
  6&7 Trade school HS unclear,
  9=unknown
marital 1=married, 2= legally separated, 3= divorced,
  4=widowed, 5=never married
inc - family yearly income in $2500 increments
  0 = under 2500, 1=2500-4999, ...,
  8= 12,500-14,999, 9=15000+,
  98=unknown, 99=not asked
smoke - does mother smoke? 0=never, 1= smokes now,
  2=until current pregnancy, 3=once did, not now,
  9=unknown
number - number of cigarettes smoked per day
  0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19,
  5=20-29, 6=30-39, 7=40-60,
  8=60+, 9=smoke but don’t know,
  98=unknown, 99=not asked
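
The codebook reserves some codes, such as 98 and 99 in inc, for answers that are missing rather than measured. The following is a minimal sketch, written in Python rather than R purely for illustration, of how such reserved codes might be set aside before the column is summarized. The names INC_MISSING_CODES and recode_missing are hypothetical and are not part of the dataset or of any package; the values are the inc column from the excerpt above.

```python
# Hypothetical helper names; only the codes and the data come from the text.
# Codes that the codebook reserves for missing answers in `inc`.
INC_MISSING_CODES = {98: "unknown", 99: "not asked"}

# The `inc` column from the excerpt, exactly as stored (raw numeric codes).
inc_raw = [1, 4, 2, 98, 2, 99, 2, 2, 2, 1, 1, 1, 4, 7]


def recode_missing(values, missing_codes):
    """Return a copy of `values` with reserved codes replaced by None."""
    return [None if v in missing_codes else v for v in values]


inc_clean = recode_missing(inc_raw, INC_MISSING_CODES)

# Count the missing entries and keep only the codes that carry information.
n_missing = sum(v is None for v in inc_clean)
valid = [v for v in inc_clean if v is not None]

print("raw largest code:  ", max(inc_raw))
print("valid largest code:", max(valid))
print("missing entries:   ", n_missing)
```

Without the recoding step, the reserved codes would be treated as if they were very high income brackets, which is the kind of confusion the codebook is there to prevent.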
Taking into account the codebook, what kind of data is each variable? If the data have a natural order, but are not genuinely quantitative, say “ordinal.” You can ignore the “unknown” or “not asked” codes when giving your answer.
(a) Gestation categorical ordinal quantitative
(b) Race categorical ordinal quantitative
(c) Marital categorical ordinal quantitative
(d) Inc categorical ordinal quantitative
(e) Smoke categorical ordinal quantitative
(f) Number categorical ordinal quantitative
The disadvantage of storing categorical information as numbers is that it’s easy to get confused and mistake one level for another. Modern software makes it easy to use text strings to label the different levels of categorical variables. Still, you are likely to encounter data with categorical data stored numerically, so be alert.
A good modern practice is to code missing data in a consistent way that can be automatically recognized by software as meaning missing. Often, NA is used for this purpose. Notice that in the number variable, there is a clear order to the categories until one gets to level 9, which means “smoke but don’t know.” This is an unfortunate choice. It would be better to store number as a quantitative variable telling the number of cigarettes smoked per day. Another variable could be used to indicate whether missing data was “smoke but don’t know,” “unknown”, or “not asked.”

6 Since the computer has to represent numbers that are both very large and very small, scientific notation is often used. The number 7.23e4 means 7.23 × 10^4 = 72300. The number 1.37e-2 means 1.37 × 10^(-2) = 0.0137.
For each of the following numbers written in computer scientific notation, say what is the corresponding number.
a) 3e1
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
b) 1e3
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
c) 0.1e3
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
d) 0.3e-2
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
e) 10e3
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
f) 10e-3
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
g) 0.0003e3
.0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
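
The notation introduced at the start of this problem can be checked mechanically. The sketch below, in Python and offered only as an illustration, splits a string at the letter e into the number in front and the power of ten, and compares the product with what the language's built-in number parser returns. It uses only the two worked examples from the problem statement; the function name expand is hypothetical.

```python
import math


def expand(text):
    """Split computer scientific notation into its parts.

    Returns (mantissa, exponent, value), where value is what the
    built-in float parser gives for the whole string.
    """
    mantissa_text, exponent_text = text.lower().split("e")
    mantissa = float(mantissa_text)
    exponent = int(exponent_text)
    return mantissa, exponent, float(text)


# The two worked examples from the problem statement.
for text in ["7.23e4", "1.37e-2"]:
    mantissa, exponent, value = expand(text)
    # mantissa x 10^exponent, computed by hand, should agree with the
    # parsed value up to floating-point rounding.
    by_hand = mantissa * 10.0 ** exponent
    assert math.isclose(by_hand, value)
    print(text, "->", mantissa, "x 10^" + str(exponent), "=", value)
```

The comparison uses math.isclose rather than exact equality because multiplying by a power of ten in floating point can differ from the parsed value in the last digit.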