Homework 5: ECO220Y

Required Exercises: Chapter 7: 3, 5 – 8, 13, 17, 19, 21, 23, 31, 33, 41, 43, 47, 59, 63

(Note: The first sets of exercises at the end of each chapter correspond to specific sections. These are the easiest exercises and are best done as you complete each section, i.e. before you continue and finish the reading.)

Extra Exercises (for those who want more practice): Chapter 7: 15, 29, 35

Required Problems:

(1) An important problem in economics is how to estimate demand. You’ll recall from introductory and intermediate microeconomics that when analyzing a market – regardless of the model of competition – we specify demand. For example, it could be P = 100 – Q, which is a linear demand curve. The more general linear form is P = a – bQ where a and b are parameters that we will need to estimate using data. We know that quantity demanded will of course be affected by price. In fact, one of the undisputed tenets of economics is the Law of Demand: quantity demanded goes down as price goes up and vice versa.

[Figure: diagram "Price → Quantity Demanded" and a demand curve D with P on the vertical axis and Q on the horizontal axis; prices $13.95 and $9.95 are marked]

(a) Give an example of time series data containing prices and quantities of some good. Give an example of cross-sectional data containing prices and quantities of some good.

(b) When considering collecting market data on prices and quantities, what kind of data will typically be available? (observational or experimental) Why would the other kind seldom be available?

Estimating demand from observational data containing only prices and quantities is impossible. This is because demand shifters affect both the price and the quantity sold. This makes price an endogenous variable and means that there will be a serious endogeneity bias in the estimate of the slope of demand (i.e. for a simple linear demand) obtained from a regression analysis (least squares method). To see that demand shifters – population size, tastes, income, prices of substitute goods, prices of complement goods, etc.
– affect both the market price and the market quantity, consider two simple models: perfect competition and monopoly.

[Figure: two panels, "Monopoly" and "Perfect Competition", each with P on the vertical axis and Q on the horizontal axis; labels include S, mc, D and D’, prices $13.95 and $9.95, and quantities 19 and 20]

(c) Redraw the diagram for the experimental drug trial data that we did as an example in lecture: that is, a diagram with boxes and arrows that illustrates the research question and the role of any other variables. Next draw such a diagram for the question of estimating demand where we ask how price affects the quantity demanded. Make sure to include the role of demand shifters in your diagram.

Henry L. Moore, “father of economic statistics,” conducted regressions for many industries in the early 20th century. In some regressions he found negative demand elasticities, but in pig iron, for example, he found a positive demand elasticity and concluded “he had discovered a new type of demand curve with positive slope.” Of course this is insane: I think we’d all like to start businesses selling products where the higher the price we charge – other things equal (including quality, service, etc.) – the more our customers want to buy! Moore’s regression analysis had a serious endogeneity bias and hence did not reflect the real slope of demand for pig iron at all. To illustrate, consider the following diagram.

[Figure: supply curves S1, S2, S3 and demand curves D1, D2, D3, with P on the vertical axis and Q on the horizontal axis]

(d) If the above diagram shows three different time periods, which kind of data would it generate on prices and quantities? (observational or experimental; time series, cross-sectional, or panel) How many variables? How many observations? Would the correlation be positive or negative? Does this mean demand is upward sloping?

(e) Suppose we had a random sample of 10,000 customers. At random each was offered the opportunity to buy a product at a price of $1, $1.50, $2, $2.50 or $3. We used their purchase decisions to construct data with five observations and two variables: price and quantity sold.
Could we use these data to estimate the slope of demand? Would that estimate suffer from an endogeneity bias? Is it a problem that the 10,000 customers have various different tastes, incomes, etc.?

(2) How does time spent sleeping the night before a test affect a student’s mark? As part of an empirical project a student surveys a random sample of 10 students in a large economics course. Right after the test each student is asked how much time they had slept the night before (in hours). The student then follows up with the sample a week later to ask about their mark on the test. There were 120 points possible. Here are the data. Use only a non-programmable handheld calculator for all parts of this problem.

mark  |  79   55   85   58   57   74   83   97   69   36
hours |   7  1.5  8.5    4  3.5    7    8    8  3.5  2.5

[Figure: scatter diagram of mark against hours]

(a) Compute the covariance and indicate the units of measurement.

(b) Using your answer from (a) and the fact that the s.d. of sleep is 2.60 hours and the s.d. of marks is 17.98, compute the coefficient of correlation and interpret it.

(c) Compute the coefficient of determination and interpret it.

(d) What kind of data are these: observational, experimental, or a natural experiment?

(e) Find the regression line equation and interpret the coefficients. Explain how you can or cannot use these results to answer the research question posed at the start of this problem.

(f) A team of students is called upon to replicate these results. They collect a larger random sample (n = 30) and obtain the following results. Presuming that both the original study and the replication study contained no nonsampling errors, what is the best explanation for each of the differences (i.e. the difference in the covariance, standard deviations, correlation, R2, intercept, and slope)?

[Figure: scatter diagram of mark against hours for the replication sample]

covariance = 21.75; s.d. hours = 1.76; s.d.
marks = 19.10; correlation = 0.65
OLS results: mark-hat = 39.9 + 7.0*hours, n = 30, R-squared = 0.42

(g) Consider again the results from Part (f). Suppose that marks are calculated as percentage marks (i.e. 0 – 100%) instead of raw marks out of 120 points. What would the new OLS results be? The new R-squared?

(h) Consider again the results from Part (f). Suppose that sleep time is recorded in minutes instead of hours. What would the new OLS results be? The new R-squared?

(3) An economist studies the relationship between a woman’s education (EDU_YRS) and the number of children that she has (NUM_KID). A random sample of 2,589 Canadian women is drawn. Using the sample of 2,589 Canadian women the following OLS line and coefficient of correlation are estimated:

NUM_KID-hat = 5.1 – 0.2*EDU_YRS; r = -0.3132; n = 2,589

[Figure: scatter diagram of NUM_KID and the fitted values NUM_KID_hat against EDU_YRS]

(a) What kind of data are these?

(b) How should you interpret these results? What is the meaning of 5.1? What is the meaning of -0.2? What is the meaning of -0.3132?

(c) Is the scatter diagram a good description of these data? Explain. Aside from using statistics such as the coefficient of correlation and least squares line, what is another way to summarize the relationship between these two variables?

(4) Consider these data: http://www.ers.usda.gov/media/521667/corndatatable.htm.

(a) How many observations are in these data? How many variables? Explain. [Answer with 1 – 2 sentences]

(b) What kind of data are these? (observational, experimental, time series, cross-sectional, panel) Explain. [Answer with 2 – 3 sentences]

For the next parts consider a regression analysis where production in millions of bushels is the dependent variable and the weighted-average farm price in dollars per bushel is the independent variable.
(c) Presuming for the moment that the underlying conditions are met (if you are not sure what these are, review pages 191 and 198 in our textbook), fully interpret the OLS line, which has an intercept of 1687.7 and a slope of 2141.2. Include in your interpretation a clear statement about whether or not the OLS line is the supply curve. (HINT (that usually would not be given on a test): Make sure you offer a full interpretation, which means it must address both the intercept and slope, be context-specific, specify the units of measurement, and be clear about causality.) [Answer with 2 – 3 sentences]

(d) Presuming for the moment that the underlying conditions are met, fully interpret the R2, which is 0.5848. [Answer with 1 sentence]

[Figure: scatter diagram, 1926 - 2012, of Quantity (millions bushels) against Price (w.a.$/bushel), with OLS line Y-hat = 1687.7 + 2141.2X]

(e) Based on this scatter diagram, are there any concerns about the underlying conditions being met? Explain. [Answer with 2 – 3 sentences]

(5) Consider the following excerpt from Daniel Kahneman’s biography from the Nobel prize website.

I had the most satisfying Eureka experience of my career while attempting to teach flight instructors that praise is more effective than punishment for promoting skill-learning. When I had finished my enthusiastic speech, one of the most seasoned instructors in the audience raised his hand and made his own short speech, which began by conceding that positive reinforcement might be good for the birds, but went on to deny that it was optimal for flight cadets. He said, “On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time.
So please don't tell us that reinforcement works and punishment does not, because the opposite is the case.” This was a joyous moment, in which I understood an important truth about the world: because we tend to reward others when they do well and punish them when they do badly, and because there is regression to the mean, it is part of the human condition that we are statistically punished for rewarding others and rewarded for punishing them. I immediately arranged a demonstration in which each participant tossed two coins at a target behind his back, without any feedback. We measured the distances from the target and could see that those who had done best the first time had mostly deteriorated on their second try, and vice versa. But I knew that this demonstration would not undo the effects of lifelong exposure to a perverse contingency.

http://www.nobelprize.org/nobel_prizes/economic-sciences/laureates/2002/kahneman-bio.html

Consider Prof. Murdock’s general explanation of regression to the mean:

Suppose observed performance is a function of some non-random factors (such as skill, effort, etc.) and a random component (pure chance). High draws of the random component boost performance and low draws detract from performance. However, if you get lucky with a good draw of the random component you should not expect to be lucky next time. You should expect an average draw, which if you had been lucky is worse than your lucky draw. Similarly, if you had a poor draw of the random component you should not expect to be unlucky next time: you should expect an average draw, which, of course, is better than poor. Hence there is regression towards the mean: people who got lucky tend to move down towards the mean and people who got unlucky tend to move up towards the mean. The more that performance is driven by random factors the bigger the regression to the mean effect will be.
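The explanation above can be sketched as a small simulation. This is only an illustration under assumptions that are not in the handout: performance is modelled as skill plus luck, both drawn from normal distributions with standard deviation 1, and the names `skill`, `first`, `second`, the sample size, and the seed are all hypothetical choices.

```python
import random

random.seed(220)  # arbitrary seed so the run is repeatable

N = 10_000       # hypothetical number of performers
SKILL_SD = 1.0   # assumed spread of the non-random component
LUCK_SD = 1.0    # assumed spread of the random component


def mean(values):
    return sum(values) / len(values)


# Skill is fixed per person; luck is drawn fresh on every attempt.
skill = [random.gauss(0, SKILL_SD) for _ in range(N)]
first = [s + random.gauss(0, LUCK_SD) for s in skill]
second = [s + random.gauss(0, LUCK_SD) for s in skill]

# Rank people by their first attempt and pick the bottom and top 10%.
order = sorted(range(N), key=lambda i: first[i])
bottom = order[: N // 10]
top = order[-(N // 10):]

top_first = mean([first[i] for i in top])
top_second = mean([second[i] for i in top])
bottom_first = mean([first[i] for i in bottom])
bottom_second = mean([second[i] for i in bottom])

print(f"top 10% on attempt 1:    {top_first:.2f} then {top_second:.2f}")
print(f"bottom 10% on attempt 1: {bottom_first:.2f} then {bottom_second:.2f}")
```

Nobody is praised or punished in this sketch, yet the best first-attempt group tends to fall back and the worst tends to improve. Raising `LUCK_SD` relative to `SKILL_SD` should make the effect larger, and setting `LUCK_SD = 0` should remove it.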
In the extreme case of Kahneman’s example of tossing a coin behind your back two times, not even knowing where the target is, there is no skill and all chance involved: this was a great way to illustrate regression to the mean because it is extremely powerful in this case. On the other hand, if there is no random component to performance then there will be no regression to the mean.

(a) In what way was the seasoned flight instructor mistaken? (Is it that his recollection is faulty? Is it that his inference based on his observations is faulty? Is it both? Is it something else entirely?)

(b) Is it important for Kahneman’s argument that some portion of a flight cadet’s performance is entirely random and beyond his/her control?

(c) What does the last line of Kahneman’s speech mean for you? (i.e. Is he optimistic that we will be able to take the message of the simulation we just did to heart?)

(6) Make sure to do Exercises 1 – 6 on pages 12 – 16 of “Logarithms in Regression Analysis with Asiaphoria” (http://homes.chass.utoronto.ca/~murdockj/eco220/logarithms_in_regression_analysis_with_Asiaphoria.pdf).

(7) Recall the Fortune 500 companies example from Lecture 5. Here are two graphs from that lecture:

[Figure, first panel: Fortune 500 Companies, 2013; Ln Revenues ($b) against Ln Assets ($b); Y-hat = 1.4 + 0.4X, R2 = 0.42]
[Figure, second panel: Fortune 500 Companies, FAKE DATA; Ln Revenues ($b) against Ln Assets ($b); Y-hat = 1.4 + 0.4X, R2 = 0.42]

What do these two additional graphs show? (Each lines up with the corresponding graph above it.)

[Figure: two panels, each plotting residual against Y-hat]

(8) Consider again the data in Section 3 and Figure 1 in “Logarithms in Regression Analysis.” Below is a graph of these data including the R2, SST, SSR, and SSE. This first graph uses the entire data set (entire cross-section of 156 countries).
[Figure: Mobile-cellular subscriptions per 100 inhabitants against HDI; 2012, n = 156 countries, R2 = 0.43; SST = 267006, SSR = 114678, SSE = 152329]

The next two graphs use only the 34 member nations of the OECD. The last graph excludes three OECD members that are a bit unusual: Turkey and Mexico, which have rather low HDIs, and Canada, which has the lowest mobile-cellular subscriptions per 100 inhabitants of all OECD members. Finally, the raw data for the OECD member nations are reproduced below. These data are observational, cross-sectional data.

[Figure: Mobile-cellular subscriptions per 100 inhabitants against HDI; 2012, n = 34 countries, R2 = 0.00; SST = 14981, SSR = 45, SSE = 14936]
[Figure: Mobile-cellular subscriptions per 100 inhabitants against HDI; 2012, n = 31 countries, R2 = 0.08; SST = 10738, SSR = 831, SSE = 9907]

country_name        | country_code | hdi_2012 | mobile_tel_subs_per_100_2012
--------------------+--------------+----------+-----------------------------
Australia           | AUS          | 0.938    | 106.2
Austria             | AUT          | 0.895    | 161.2
Belgium             | BEL          | 0.897    | 119.4
Canada              | CAN          | 0.911    | 75.7
Chile               | CHL          | 0.819    | 138.5
Czech Republic      | CZE          | 0.873    | 122.8
Denmark             | DNK          | 0.901    | 118
Estonia             | EST          | 0.846    | 154.5
Finland             | FIN          | 0.892    | 172.5
France              | FRA          | 0.893    | 98.1
Germany             | DEU          | 0.92     | 131.3
Greece              | GRC          | 0.86     | 116.9
Hungary             | HUN          | 0.831    | 116.4
Iceland             | ISL          | 0.906    | 105.4
Ireland             | IRL          | 0.916    | 107.1
Israel              | ISR          | 0.9      | 119.9
Italy               | ITA          | 0.881    | 159.5
Japan               | JPN          | 0.912    | 109.4
Korea (Republic of) | KOR          | 0.909    | 110.4
Luxembourg          | LUX          | 0.875    | 145.5
Mexico              | MEX          | 0.775    | 86.8
Netherlands         | NLD          | 0.921    | 117.5
New Zealand         | NZL          | 0.919    | 110.3
Norway              | NOR          | 0.955    | 115.5
Poland              | POL          | 0.821    | 132.7
Portugal            | PRT          | 0.816    | 115.1
Slovakia            | SVK          | 0.84     | 111.2
Slovenia            | SVN          | 0.892    | 110.1
Spain               | ESP          | 0.885    | 108.3
Sweden              | SWE          | 0.916    | 122.6
Switzerland         | CHE          | 0.913    | 135.3
Turkey              | TUR          | 0.722    | 90.8
United Kingdom      | GBR          | 0.875    | 130.8
United States       | USA          | 0.937    | 98.2

(a) Compare and contrast the R2 across the three scatter diagrams. What explains any differences?

(b) Compare and contrast the SST, SSR, and SSE across the three graphs. What explains any differences?
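As a starting point for parts (a) and (b), the sums of squares reported on the three graphs can be checked against the identities SST = SSR + SSE and R2 = SSR/SST. The sketch below only re-uses the numbers printed on the graphs; the dictionary keys and variable names are hypothetical labels, and the small gap in the first row is assumed to come from rounding in the reported values.

```python
# Sums of squares as reported on the three graphs: (SST, SSR, SSE).
reported = {
    "all countries (n = 156)": (267006, 114678, 152329),
    "OECD (n = 34)": (14981, 45, 14936),
    "OECD excluding three (n = 31)": (10738, 831, 9907),
}

r2 = {}
for label, (sst, ssr, sse) in reported.items():
    # R2 is the share of total variation (SST) captured by the OLS line (SSR).
    r2[label] = ssr / sst
    # Gap between the reported SST and SSR + SSE (zero up to rounding).
    gap = sst - (ssr + sse)
    print(f"{label}: R2 = {r2[label]:.2f}, SST - (SSR + SSE) = {gap}")
```

The code deliberately stops at the arithmetic: why SST and SSR change so much when the sample is restricted is the question the problem asks.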
(9) [Note: This is an important question that reviews many important concepts about descriptive statistics that we have covered so far. Next week we shift towards probability theory.] Each year the U.S. Department of Energy releases a fuel economy guide to inform consumers about the fuel economy and greenhouse gas emissions associated with vehicles (cars, vans, etc.) released that year. Further, on the website (www.fueleconomy.gov), they release detailed data for each make and model of vehicle each year. Consider the most recent data on 1,250 makes and models in 2015 (e.g. Ford Focus with automatic transmission, Honda Civic with manual transmission). These data include: the fuel economy (FE) in city driving in miles per gallon (MPG), the FE in highway driving in MPG, and a greenhouse gas (GHG) emissions rating on a scale from 1 to 10 where 1 is the worst and 10 is the best.

(a) What kind of data are these? How many observations? How many variables?

(b) Consider this tabulation of the GHG rating and the fact that the sample mean is 5.3 and the s.d. is 1.6. What is the approximate shape of the distribution? Suppose a vehicle had a GHG rating of 3: what would the standardized rating be? How would you interpret it?

 GHG Rating |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         27        2.16        2.16
          2 |         23        1.84        4.00
          3 |         83        6.64       10.64
          4 |        244       19.52       30.16
          5 |        359       28.72       58.88
          6 |        206       16.48       75.36
          7 |        199       15.92       91.28
          8 |         86        6.88       98.16
          9 |         18        1.44       99.60
         10 |          5        0.40      100.00
------------+-----------------------------------
      Total |      1,250      100.00

(c) Consider this variance-covariance matrix for city and highway fuel economy. Fully interpret all numbers (which will include computing other numbers using these).

             |  city_fe   hwy_fe
-------------+------------------
     city_fe |  30.4814
      hwy_fe |  31.1006  38.2749

(d) Consider these OLS regression results. Fully interpret all numbers.
city_fe-hat = -2.53 + 0.81*hwy_fe; R-squared = 0.83; n = 1,250

(e) Consider these OLS regression results (noticing the natural log transformation). Fully interpret all numbers.

ghg_rating-hat = -12.59 + 6.02*ln(city_fe); R-squared = 0.94; n = 1,250

(10) An interesting use of regression analysis in descriptive statistics is The Economist’s calculation of the adjusted Big Mac Index to compare currencies: http://www.economist.com/content/big-mac-index. Visit the site, read the description and make sure you understand how the adjusted index is computed. (Of course, you can start with the raw index, which does not involve any regression.) Explore the interactive site and pick a country (any country except for the U.S.) to investigate in detail.

Extra Problems (for those who want more practice):

(11) Suppose that the coefficient of correlation between x and y is 0.5 and the s.d. of x is $10 and the s.d. of y is $10. A move from x = $10 to x = $12 tends to be associated with how much of a change in y?

(12) Consider this figure taken from the journal article: Kraay, Aart, and David McKenzie. 2014. "Do Poverty Traps Exist? Assessing the Evidence." Journal of Economic Perspectives, 28(3): 127-48. It is available at http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.3.127.

(a) Are the raw data underlying this graph time series, panel, or cross-sectional? Why?

(b) Is the line in the graph the OLS line? How do you know?

(c) Why have the authors taken the logarithm of both the y and x variables? What is the meaning of the scale of the axes? Why are 5 and 10 as far apart as 24 and 50?

(d) The title of this figure is “Absolute Income Stagnation is Rare.” Explain how the figure is consistent with its title.

(13) Consider the graph below showing real data on students’ grades and the number of classes that they missed. The title shows the OLS line. Suppose the professor posted this graph on the first day of class and told students “BEWARE!!
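For every class that you miss you should expect that on average your grade will decrease by 2.9 percentage points. COME TO CLASS AT ALL COSTS!!” Explain whether this professor is behaving ethically (with respect to statistics).

[Figure: scatter diagram of Grade against # of classes missed, titled Y-hat = 66.5 + -2.9X]

(14) What does the derivation below show?

\[
\begin{aligned}
R^2 &= \frac{SSR}{SST}
     = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{SST}
     = \frac{\sum_{i=1}^{n} (a + b x_i - \bar{y})^2}{SST}
     = \frac{\sum_{i=1}^{n} (\bar{y} - b\bar{x} + b x_i - \bar{y})^2}{SST} \\
    &= \frac{\sum_{i=1}^{n} (b x_i - b\bar{x})^2}{SST}
     = \frac{b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{SST}
     = b^2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
     = b^2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n-1)} \\
    &= \left( \frac{s_{xy}}{s_x^2} \right)^2 \frac{s_x^2}{s_y^2}
     = \frac{s_{xy}^2}{s_x^2 \, s_y^2}
     = \left( \frac{s_{xy}}{s_x s_y} \right)^2
\end{aligned}
\]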
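One way to build intuition for the derivation in (14) is to run its steps on numbers. The sketch below is only a check under assumptions: the six (x, y) pairs are made up for illustration and are not from any problem in this handout, and the variable names simply mirror the symbols a, b, s_xy, SSR and SST used in the derivation.

```python
# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]
y = [2.1, 2.9, 5.2, 4.8, 8.1, 8.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance and variances (divide by n - 1).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# OLS slope and intercept: b = s_xy / s_x^2 and a = y_bar - b * x_bar.
b = s_xy / s_x2
a = y_bar - b * x_bar

# First expression in the derivation: SSR / SST.
y_hat = [a + b * xi for xi in x]
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ssr / sst

# Last expression in the derivation: (s_xy / (s_x * s_y)) squared.
r = s_xy / (s_x2 ** 0.5 * s_y2 ** 0.5)

print(f"SSR/SST = {r_squared:.6f}")
print(f"(s_xy / (s_x s_y))^2 = {r ** 2:.6f}")
```

The two printed numbers should agree up to floating-point error for any data set in which x and y both vary, which is what the chain of equalities asserts.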