Problem Set 2
Math 243 – Summer 2012

1. Answer each of the following based on your reading of Chapter 2.

a) What are the two major kinds of variables?
b) How are variables and cases arranged in a data frame?
c) What is the relationship among these things: population, sampling frame, sample, census?
d) What's the difference between a longitudinal and a cross-sectional sample?
e) Describe some types of sampling that can lead to the sample being unrepresentative of the population.

2. This problem is based on the KidsFeet data set in the mosaic package. For all the parts of this problem, give both the R statements to find what is asked for and the output of the command. (Use copy and paste.)

a) The names of the variables.
b) The mean of the foot width variable.
c) Which of the cases are girls.
d) The mean foot width for the subset of data for the girls.

The following give some practice using logical operators.

e) The mean foot width for the subset of data for people whose bigger foot is left and dominant hand is also left.
f) The mean foot width for the subset of data for people who are either male or whose bigger foot matches the dominant hand.
g) The mean foot width for the subset of data for people whose bigger foot does NOT match the dominant hand.

3. Here is a small data frame about automobiles.

Make and model   Vehicle type   Trans. type   # of cyl.   City MPG   Hwy MPG
Kia Optima       compact        Man.          4           21         31
Kia Optima       compact        Auto.         6           20         28
Saab 9-7X AWD    SUV            Auto.         6           14         20
Saab 9-7X AWD    SUV            Auto.         8           12         16
Ford Focus       compact        Man.          4           24         35
Ford Focus       compact        Auto.         4           24         33
Ford F150 2WD    pickup         Auto.         8           13         17

a) What are the cases in the data frame?
   A  Individual car companies
   B  Individual makes and models of cars
   C  Individual configurations of cars
   D  Different sizes of cars

b) Identify which variables are quantitative and which are categorical. Note that some are not variables at all.
   • Kia Optima: not.a.variable categorical quantitative
   • City MPG: not.a.variable categorical quantitative
   • Vehicle type: not.a.variable categorical quantitative
   • SUV: not.a.variable categorical quantitative
   • Trans. type: not.a.variable categorical quantitative
   • # of cyl.: not.a.variable categorical quantitative

c) Why are some cars listed twice? Is this a mistake in the table?
   A  Yes, it's a mistake.
   B  A car brand might be listed more than once, but the cases have different attributes on other variables.
   C  Some cars are more in demand than others.

4. Here is a data set from an experiment about how reaction times change after drinking alcohol.[?] The measurements give how long it took for a person to catch a dropped ruler. One measurement was made before drinking any alcohol. Successive measurements were made after one standard drink, two standard drinks, and so on. Measurements are in seconds.

Before   After 1   After 2   After 3
0.68     0.73      0.80      1.38
0.54     0.64      0.92      1.44
0.71     0.66      0.83      1.46
0.82     0.92      0.97      1.51
0.58     0.68      0.70      1.49
0.80     0.87      0.92      1.54
and so on ...

(a) What are the rows in the above data set?
   A  Individual measurements of reaction time.
   B  An individual person.
   C  The number of drinks.

(b) How many variables are there?
   A  One: the reaction times.
   B  Two: the reaction times with and without alcohol.
   C  Four: the reaction times at four different levels of alcohol.

The format used for these data has several limitations:

• It leaves no room for multiple measurements of an individual at one level of alcohol, for example, two or three baseline measurements, or two or three measurements after one standard drink.
• It provides no flexibility for different levels of alcohol, for example 1.5 standard drinks, or for taking into account how long after the drink the measurement was made.

Another format, which would be better, is this:

SubjectID   ReactionTime   Drinks
S1          0.68           0
S1          0.73           1
S1          0.80           2
S1          1.38           3
S2          0.54           0
S2          0.64           1
S2          0.92           2
and so on ...

What are the cases in the reformatted data frame?
   A  Individual measurements of reaction time.
   B  An individual person.
   C  The number of drinks.

How many variables are there?
   A  The same as in the original version. It's the same data!
   B  Three: the subject, the reaction time, the alcohol level.
   C  Four: the reaction times at four different levels of alcohol.

The lack of flexibility in the original data format indicates a more profound problem. The response to alcohol is not just a matter of quantity, but of timing. Drinks spread out over time have less effect than drinks consumed rapidly, and the physiological response to a drink changes over time as the alcohol is first absorbed into the blood and then gradually removed by the liver. Nothing in this data set indicates how long after the drinks the measurements were taken. The small change in reaction time after a single drink might reflect that there was little time for the alcohol to be absorbed before the measurement was taken; the large change after three drinks might actually be the response to the first drink finally kicking in. Perhaps it would have been better to make a measurement of the blood alcohol level at each reaction-time trial.

It's important to think carefully about how to measure your variables effectively, and what you should measure in order to capture the effects you are interested in.

5. Sometimes categorical information is represented numerically. In the early days of computing, it was very common to represent everything with a number. For instance, the categorical variable for sex, with levels male or female, might be stored as 0 or 1. Even categorical variables like race or language, with many different levels, can be represented as a number. The codebook provides the interpretation of each number (hence the word "codebook").

Here is a very small part of a dataset from the 1960s used to study the influence of smoking and other factors on the weights of babies at birth.[?]

gestation.csv

gest.   wt    race   ed   wt.1   inc   smoke   number
284     120   8      5    100    1     0       0
282     113   0      5    135    4     0       0
279     128   0      2    115    2     1       1
244     138   7      2    178    98    0       0
245     132   7      1    140    2     0       0
351     140   0      5    120    99    3       2
282     144   0      2    124    2     1       1
279     141   0      1    128    2     1       1
281     110   8      5    99     2     1       2
273     114   7      2    154    1     0       0
285     115   7      2    130    1     0       0
255     92    4      7    125    1     1       5
261     115   3      2    125    4     1       5
261     144   0      2    170    7     0       0

At first glance, all of the data seems quantitative. But read the codebook:

gest. - length of gestation in days
wt - birth weight in ounces (999 unknown)
race - mother's race
   0-5=white 6=mex 7=black 8=asian 9=mixed 99=unknown
ed - mother's education
   0= less than 8th grade,
   1 = 8th -12th grade - did not graduate,
   2= HS graduate--no other schooling,
   3= HS+trade,
   4=HS+some college
   5= College graduate,
   6&7 Trade school HS unclear,
   9=unknown
marital 1=married, 2= legally separated, 3= divorced,
   4=widowed, 5=never married
inc - family yearly income in $2500 increments
   0 = under 2500, 1=2500-4999, ...,
   8= 12,500-14,999, 9=15000+,
   98=unknown, 99=not asked
smoke - does mother smoke? 0=never, 1= smokes now,
   2=until current pregnancy, 3=once did, not now, 9=unknown
number - number of cigarettes smoked per day
   0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19, 5=20-29,
   6=30-39, 7=40-60, 8=60+, 9=smoke but don't know,
   98=unknown, 99=not asked

Taking into account the codebook, what kind of data is each variable? If the data have a natural order, but are not genuinely quantitative, say "ordinal." You can ignore the "unknown" or "not asked" codes when giving your answer.

(a) Gestation: categorical ordinal quantitative
(b) Race: categorical ordinal quantitative
(c) Marital: categorical ordinal quantitative
(d) Inc: categorical ordinal quantitative
(e) Smoke: categorical ordinal quantitative
(f) Number: categorical ordinal quantitative

The disadvantage of storing categorical information as numbers is that it's easy to get confused and mistake one level for another. Modern software makes it easy to use text strings to label the different levels of categorical variables. Still, you are likely to encounter data with categorical data stored numerically, so be alert.

A good modern practice is to code missing data in a consistent way that can be automatically recognized by software as meaning missing. Often, NA is used for this purpose. Notice that in the number variable, there is a clear order to the categories until one gets to level 9, which means "smoke but don't know." This is an unfortunate choice. It would be better to store number as a quantitative variable telling the number of cigarettes smoked per day. Another variable could be used to indicate whether missing data was "smoke but don't know," "unknown," or "not asked."

6. Since the computer has to represent numbers that are both very large and very small, scientific notation is often used. The number 7.23e4 means 7.23 × 10^4 = 72300. The number 1.37e-2 means 1.37 × 10^-2 = 0.0137.

For each of the following numbers written in computer scientific notation, say what is the corresponding number.

a) 3e1
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
b) 1e3
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
c) 0.1e3
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
d) 0.3e-2
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
e) 10e3
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
f) 10e-3
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
g) 0.0003e3
   .0001 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3
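
The e-notation described in problem 6 can be checked directly at the R prompt. The following is a minimal sketch, assuming base R only; the variable names x and y are illustrative, and the values are the two worked examples from the text rather than the exercise items.

```r
# Assumption: base R only. x and y are illustrative names for the two
# worked examples given in the text.
x <- 7.23e4    # read as 7.23 * 10^4
y <- 1.37e-2   # read as 1.37 * 10^-2

# Ask for ordinary decimal form instead of scientific notation.
format(x, scientific = FALSE)
format(y, scientific = FALSE)

# The e-notation literal agrees with the explicit power-of-ten arithmetic
# (all.equal compares with a small numerical tolerance).
isTRUE(all.equal(x, 7.23 * 10^4))
isTRUE(all.equal(y, 1.37 * 10^-2))
```

Typing any e-notation literal at the prompt shows how R reads it, which is a quick way to check an answer after working it out by hand.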