Exercise 1 SLR Analytics and Assessment

This exercise is designed to give you practice running simple linear regression (SLR) models
(econometric models with a single explanatory, or independent, variable). In the first SLR
analysis, you’ll be looking at the relationship between Critical Reading and overall SAT scores.
In the subsequent SLR analyses you’ll be looking at the relationship between a) weekly Spotify
streams and iTunes sales in the US, b) annual %wins and runs scored and allowed in Korean
baseball (2002-16) and c) average shot distance and success rates for NBA players in the 2015-16 season.
Predicting SAT Scores
In the first SLR analysis, you’ll be using SAT Critical Reading (CR) scores to predict combined
SAT scores. The data in this exercise are built from summary statistics for the 2014 SAT results,
for college-bound students. Links to the data sources are at the end of this Exercise (many of the
data resources have also been posted to the EC228 course website).
Let’s get started! Time to generate your own datasets.
1. Go to http://www.cmaxxsports.com/sat.php to download your first dataset of five
observations (see footnote 1). You will be using these data to estimate the relationship between CR and SAT
scores with a sample of just five observations. Feel free to refresh often.
2. For this sample of five scores, and using Excel, let’s look at the relationship between CR and
SAT scores:
a. Plot the data points using an XY scatter plot, putting CR scores on the horizontal axis and
SAT scores on the vertical axis.
b. Use Excel's "add trendline" feature (right click on a data point in the scatterplot to see
this option) to add a linear trendline to your XY scatter plot. Be sure to specify (under
"Options") that the equation and $R^2$ be displayed along with the trendline. Record the
trendline slope, intercept and $R^2$ on the Answer Sheet, and attach a snapshot of the
scatterplot and trendline to your Answers.
1
Five college-bound seniors taking the 2014 SAT are selected at random, and their CR and SAT scores are reported.
OLS SLR Analytics and Assessment
c. In class we derived the OLS formulas for the SLR intercept and slope estimates:
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.
Go back to your sample dataset and work through these calculations (see footnote 2).
i. Compute the means: $\bar{x}$ and $\bar{y}$
ii. Demean the data: the $(x_i - \bar{x})$'s and $(y_i - \bar{y})$'s
iii. Compute the products of the demeaned data and sum them: $\sum (x_i - \bar{x})(y_i - \bar{y})$
iv. Square the demeaned x's and sum them: $\sum (x_i - \bar{x})^2$
v. Use the formulas above to estimate the slope and intercept (for your given sample).
Record your slope and intercept results on the Answer Sheet. Do you get the same
estimates of the slope and intercept that you saw with the trendline equation?
Using the parameters you've just derived, calculate predicted SAT scores ($\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$)
and the associated residuals (actuals – predicteds: $\hat{u}_i = y_i - \hat{y}_i$) for each of your five data
points.
d. Derive the SSR (Sum Squared Residuals) and calculate the MSE (Mean Squared Error:
$MSE = \frac{SSR}{n-2} = \frac{SSR}{3}$). Take the square root of this to calculate the RootMSE (RMSE). Record
your answers on the Answer Sheet.
e. Calculate the variances of your predicted and actual values, and use the ratio of these to
calculate $R^2$. And verify that $R^2 = \frac{Var(predicteds)}{Var(actuals)} = 1 - \frac{SSR}{SST}$, where SST is the sum
squared deviations of the actual values from their mean, and SSR was derived above.
Record your answers on the Answer Sheet.
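If you would like an independent check on your spreadsheet arithmetic for steps c. through e., the sequence can be sketched in a few lines of Python. This is only a cross-check under stated assumptions, not a substitute for the Excel work: the five (CR, SAT) pairs below are made up for illustration, and your own sample will give different numbers.

```python
import math
import statistics

# Hypothetical sample: five made-up (CR, SAT) pairs, for illustration only.
cr = [420, 480, 500, 560, 640]
sat = [1290, 1400, 1510, 1650, 1950]
n = len(cr)

# Step c.i: means
x_bar = sum(cr) / n
y_bar = sum(sat) / n

# Step c.ii: demeaned data
dx = [x - x_bar for x in cr]
dy = [y - y_bar for y in sat]

# Steps c.iii and c.iv: sum of cross products, sum of squared x-deviations
s_xy = sum(a * b for a, b in zip(dx, dy))
s_xx = sum(a * a for a in dx)

# Step c.v: OLS slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Predicted values and residuals (actuals minus predicteds)
sat_hat = [b0 + b1 * x for x in cr]
u_hat = [y - yh for y, yh in zip(sat, sat_hat)]

# Step d: SSR, MSE with n - 2 degrees of freedom, and RMSE
ssr = sum(u * u for u in u_hat)
mse = ssr / (n - 2)
rmse = math.sqrt(mse)

# Step e: R^2 computed two ways
r2_var = statistics.variance(sat_hat) / statistics.variance(sat)
sst = sum(d * d for d in dy)
r2_ssr = 1 - ssr / sst

print(f"slope={b1:.4f}  intercept={b0:.4f}")
print(f"SSR={ssr:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"R2 (variance ratio)={r2_var:.6f}  R2 (1 - SSR/SST)={r2_ssr:.6f}")
```

The two $R^2$ figures should agree up to rounding, and the residuals should sum to (essentially) zero, which is a quick way to spot a mistyped cell.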
3. Repeat 2. using Excel’s Regression Analysis capability to run the regression of SAT scores
on CR scores for these five data points. Record your answers on the Answer Sheet. You
should get the same parameter estimates, $R^2$, MSE and RMSE that you calculated in 2.
above. Do you?
4. Repeat 2. using Stata. Import your data into Stata (use the File\Import command, or just
copy and paste your data from Excel into Stata) and use Stata to run the regression of SAT
scores on CR scores for these five data points. Record your answers on the Answer Sheet.
You should get the same parameter estimates, $R^2$, MSE and RMSE that you calculated in
Excel. Do you?
5. Continue working in Stata, and with the CR regression:
a. After running your SLR model, use Stata's predict command to generate predicted values
(the $\widehat{sat}_i$'s, the $\hat{y}_i$'s) and residuals ($sat_i - \widehat{sat}_i$, the $\hat{u}_i$'s).
2
Hint: You’ll find that this (and what follows) is easier if you insert five or six lines in the Excel worksheet above
your data, and use those new cells to do the various required calculations.
b. Verify that the sample correlation of the actuals (the $sat_i$'s, the $y_i$'s) with the explanatory
variable (cr, the $x_i$'s) is the same as the sample correlation of the predicteds (the $\widehat{sat}_i$'s, the
$\hat{y}_i$'s) with the actuals: $corr(sat, cr) = corr(sat, \widehat{sat}) = \hat{\rho}$.
c. Further, show that the sample correlation of the predicteds (the $\widehat{sat}_i$'s, the $\hat{y}_i$'s) and the
residuals ($sat_i - \widehat{sat}_i$, the $\hat{u}_i$'s) is zero: $corr(\widehat{sat}, \hat{u}) = 0$.
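The two correlation facts in b. and c. can be checked outside Stata as well. Here is a minimal Python sketch, again using five made-up (CR, SAT) pairs and a small helper function, `corr`, written just for this illustration. Note that the equality in b. relies on a positive slope estimate; with a negative slope the two correlations would have opposite signs.

```python
import math

def corr(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    s_aa = sum((p - a_bar) ** 2 for p in a)
    s_bb = sum((q - b_bar) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)

# Hypothetical sample: five made-up (CR, SAT) pairs, for illustration only.
cr = [420, 480, 500, 560, 640]
sat = [1290, 1400, 1510, 1650, 1950]

# OLS fit of sat on cr
n = len(cr)
x_bar, y_bar = sum(cr) / n, sum(sat) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cr, sat))
      / sum((x - x_bar) ** 2 for x in cr))
b0 = y_bar - b1 * x_bar

sat_hat = [b0 + b1 * x for x in cr]               # predicteds
u_hat = [y - yh for y, yh in zip(sat, sat_hat)]   # residuals

rho_x = corr(sat, cr)           # actuals with the explanatory variable
rho_hat = corr(sat, sat_hat)    # actuals with the predicteds
rho_resid = corr(sat_hat, u_hat)  # predicteds with the residuals

print(f"corr(sat, cr)       = {rho_x:.6f}")
print(f"corr(sat, sat_hat)  = {rho_hat:.6f}")
print(f"corr(sat_hat, u_hat) = {rho_resid:.2e}")
```

The last figure will typically print as a very small number rather than an exact zero, because of floating-point rounding.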
Elasticities and Meaningfulness (Economic Significance)
6. Continue working in Stata, and the CR regression:
a. Calculate the mean cr score and use the SLR parameter estimates to predict the sat score
at the mean of cr scores (evaluate the SRF at the mean of cr). Your answer should agree
with the mean sat score, since $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. Does it?
b. Calculate the point elasticity of predicted SAT scores wrt (with respect to) changes in the
cr score, evaluated at the means. Record your figure on the Answer Sheet.
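(Recall that if the SRF is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, then $\frac{d\hat{y}}{dx} = \hat{\beta}_1$ and elasticity $= \frac{x}{y} \frac{d\hat{y}}{dx} = \hat{\beta}_1 \frac{x}{y}$.
When this is evaluated at the means, it is $\hat{\beta}_1 \frac{\bar{x}}{\bar{y}}$.)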
c. Would you say that this suggests a meaningful (non-trivial) relationship? Explain why
you say what you say.
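For a sense of how the pieces in 6.a and 6.b fit together, here is a short Python sketch. It assumes the same five made-up (CR, SAT) pairs used purely for illustration; the elasticity you report should come from your own sample.

```python
# Hypothetical sample: five made-up (CR, SAT) pairs, for illustration only.
cr = [420, 480, 500, 560, 640]
sat = [1290, 1400, 1510, 1650, 1950]

n = len(cr)
x_bar, y_bar = sum(cr) / n, sum(sat) / n

# OLS slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cr, sat))
      / sum((x - x_bar) ** 2 for x in cr))
b0 = y_bar - b1 * x_bar

# 6.a: evaluate the SRF at the mean of cr; this should reproduce the mean of sat
pred_at_mean = b0 + b1 * x_bar

# 6.b: point elasticity at the means = slope * (x_bar / y_bar)
elasticity = b1 * x_bar / y_bar

print(f"mean sat = {y_bar:.2f}, SRF at mean cr = {pred_at_mean:.2f}")
print(f"elasticity at the means = {elasticity:.4f}")
```

An elasticity near 1 would say that a 1% higher CR score goes with roughly a 1% higher predicted SAT score, which is one way to frame the answer to 6.c.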
7. Return to your SAT v. CR Excel spreadsheet computations. Working in Excel, and for each
observation, compute the square of the x-distances from the mean, and show that the OLS
slope estimate is indeed a weighted average of the slopes of the lines joining each datapoint
to the sample means point, where the weights are proportional to the square of the
x-distances from the x-mean. Attach a snapshot of your computations to your Answers.
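The weighted-average claim can be sketched numerically as follows. The data are five made-up (CR, SAT) pairs, chosen so that no CR score equals the CR mean (otherwise that point's slope to the means point is undefined, though its weight would be zero anyway).

```python
# Hypothetical sample: five made-up (CR, SAT) pairs, for illustration only.
cr = [420, 480, 500, 560, 640]
sat = [1290, 1400, 1510, 1650, 1950]

n = len(cr)
x_bar, y_bar = sum(cr) / n, sum(sat) / n

# Slope of the line joining each data point to the sample means point
slopes = [(y - y_bar) / (x - x_bar) for x, y in zip(cr, sat)]

# Weights proportional to the squared x-distance from the x-mean
sq_dist = [(x - x_bar) ** 2 for x in cr]
weights = [s / sum(sq_dist) for s in sq_dist]

weighted_avg_slope = sum(w * s for w, s in zip(weights, slopes))

# OLS slope from the usual formula, for comparison
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(cr, sat)) / sum(sq_dist)

print(f"weighted average of slopes = {weighted_avg_slope:.6f}")
print(f"OLS slope                  = {b1:.6f}")
```

The identity follows because each term $w_i s_i$ equals $(x_i - \bar{x})(y_i - \bar{y}) / \sum (x_j - \bar{x})^2$, so the sum is exactly the OLS slope formula.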
For the remainder of this Exercise, repeat 2.-4. above (so working through the Excel and
Stata analyses) for a) the iTunes v. Spotify SLR analysis, b) the Korean baseball SLR
analysis, and c) the NBA Sharpshooters SLR analysis… and record your Answers on the
Answer Sheet.
You’ll need to spend some time building your final datasets in the SLR analyses below. At some
point we’ll review in class how to merge datasets in Excel and in Stata… which will be the most
challenging part of constructing your final datasets.
iTunes v. Spotify
From Luis Aguiar and Joel Waldfogel (2015), Streaming Reaches Flood Stage: Does Spotify
Stimulate Or Depress Music Sales?, NBER Working Paper 21653, October 2015:
Streaming music services have exploded in popularity in the past few years, variously raising
optimism and concern about their impacts on recorded music revenue. On the one hand,
streaming services allow sellers to engage in bundling with the promise of increasing revenues,
profits, and consumer surplus. Successful bundling would indeed translate some of the interest in
music not generating revenue through individual track sales - unpaid consumption and
deadweight loss - into willingness to pay for the bundled offering. On the other hand, streaming
may displace traditional individual track sales. Even if they displace sales, streams may however
still raise overall revenue if the streaming payment is large enough in relation to the extent of
sales displacement. …
We find that Spotify use displaces permanent downloads. In particular, 137 Spotify streams
appear to reduce track sales by 1 unit. …
Given the current industry’s revenue from track sales ($0.82 per sale) and the average payment
received per stream ($0.007 per stream), our sales displacement estimates show that the losses
from displaced sales are roughly outweighed by the gains in streaming revenue. In other words,
our analysis shows that interactive streaming appears to be revenue-neutral for the recorded
music industry.
As you can see, Aguiar and Waldfogel found that interactive streaming appears to be revenue
neutral. Their analysis focused on weekly music sales data (digital and physical) from Nielsen
and Billboard, and weekly streaming data from Spotify, from late April 2013 until March of
2015, and for more than 20 countries. You’ll be doing something similar, with a slightly
different dataset.
The posted dataset, Spotify iTunes.xlsx (see footnote 3), contains US weekly Top 50 data for Spotify streams
(5/5/2013 – 1/19/2017) and iTunes sales (9/22/2013 – 1/12/2017) (see footnote 4).
3
The full datasets have been posted to the EC228 course website: http://www.cmaxxsports.com/ec228/other.html
4
The data were gleaned from https://spotifycharts.com/regional and http://www.kworb.net/.
To estimate the displacement effect, you'll be estimating the slope and intercept parameters for
an SLR model: $iTunes = \beta_0 + \beta_1 Spotify$. For this model, the y variable is US iTunes unit sales
(in 1,000s), and the x variable is US Spotify streams. Multiply the iTunes units by 1,000 to make
the units comparable to Spotify streams.
You’ll need to merge the two datasets by date to build your final dataset. To do that in Excel use
the VLOOKUP() function… and in Stata, use the merge command. I’ll review these in class.
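To see what merging by date amounts to, independent of Excel or Stata, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the dates, the stream and sales figures, and the variable names are all made up, and the real worksheets will have their own layout. The logic is the same as a VLOOKUP on the date column: for each Spotify week, look up the iTunes figure for that same week and keep the row only if a match exists.

```python
# Hypothetical weekly rows keyed by date; every value here is made up.
spotify_streams = {
    "2013-09-22": 61_000_000,
    "2013-09-29": 63_500_000,
    "2013-10-06": 62_200_000,
}
itunes_sales_thousands = {
    "2013-09-29": 5_400,
    "2013-10-06": 5_150,
    "2013-10-13": 5_300,
}

merged = []
for date in sorted(spotify_streams):
    # Keep only weeks that appear in both sources (a matched merge)
    if date in itunes_sales_thousands:
        merged.append({
            "date": date,
            "spotify": spotify_streams[date],
            # Convert iTunes sales from 1,000s to units, comparable to streams
            "itunes": itunes_sales_thousands[date] * 1000,
        })

for row in merged:
    print(row)
```

Weeks that appear in only one source drop out, which is worth keeping in mind here since the two series start and end on different dates.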
Focusing on this SLR model, repeat 2. – 4. above and enter your answers on the Answer Sheet.
Aguiar and Waldfogel find a displacement effect of 137 Spotify streams to one track sale (137:1).
Looking at your SLR results, how many incremental streams are associated with a predicted
incremental drop of one iTunes sale? Enter your answer on the Answer Sheet.
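One way to read the displacement ratio off the slope estimate, assuming both variables are measured in single units (iTunes sales already multiplied by 1,000) and the estimated slope is negative:

```latex
\Delta \widehat{iTunes} = \hat{\beta}_1 \, \Delta Spotify
\quad \Longrightarrow \quad
\Delta Spotify \,\Big|_{\Delta \widehat{iTunes} = -1} = -\frac{1}{\hat{\beta}_1}
```

As a point of reference, a 137:1 displacement effect corresponds to a slope of $-1/137 \approx -0.0073$. If your slope estimate turns out to be positive, there is no predicted displacement in your sample, and this ratio does not apply.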
Moneyball Meets the Korean Baseball Organization (KBO)
두산 - Doosan Bears, 2016 Korean Baseball Organization (KBO) Champions
Developed by Bill James and popularized by Michael Lewis in Moneyball, the Pythagorean
Theorem in Baseball estimates a team’s annual winning percent as
$\%wins = \frac{W}{W+L} = \frac{W}{G} \approx \frac{RS^2}{RS^2 + RA^2}$, where RS is runs scored and RA is runs allowed. We will
work with that model later in the semester, but for now, let’s focus on a simpler model:
$\%wins = \frac{W}{W+L} = \frac{W}{G} \approx .5 \frac{RS}{RA}$. You’ll be estimating the slope and intercept parameters for the
more general model, $\%wins = \beta_0 + \beta_1 \frac{RS}{RA}$, using posted annual data from the Korean Baseball
Organization, 2002-2016 (KBO data v2.xlsx) (see footnote 5). For this model, the y variable is %wins, and the
x variable is $\frac{RS}{RA}$.
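It may help to see how the simpler model relates to the Pythagorean formula. Writing $r = RS/RA$, the Pythagorean formula is $r^2/(r^2+1)$, which equals .5 at $r = 1$ and has slope .5 there, so $.5\,r$ is its first-order (tangent line) approximation around evenly matched teams. The Python sketch below compares the two over a handful of illustrative ratios; the ratio values are arbitrary choices, not KBO data.

```python
def pythagorean(ratio):
    """Pythagorean %wins written in terms of ratio = RS / RA."""
    return ratio ** 2 / (ratio ** 2 + 1)

def simple_model(ratio):
    """The simpler model: %wins = .5 * RS / RA."""
    return 0.5 * ratio

# Arbitrary illustrative RS/RA ratios around 1 (not taken from the KBO data)
for ratio in (0.8, 0.9, 1.0, 1.1, 1.2):
    p = pythagorean(ratio)
    s = simple_model(ratio)
    print(f"RS/RA={ratio:.1f}  Pythagorean={p:.4f}  simple={s:.4f}  gap={s - p:.4f}")
```

The gap between the two models is small near RS/RA = 1 and widens as teams become more lopsided.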
Focusing on this SLR model, repeat 2. – 4. above and enter your answers on the Answer Sheet.
Note that you’ll have to merge several datasets to build the final dataset used in the analysis.
There are different ways to do this; I suggest merging by TeamYear, which uniquely identifies
records.
When you have finished answering 2. and 3.:
How close are the estimated intercept and slope parameters to the simple model above, which has
$\beta_0 = 0$ and $\beta_1 = .5$?
The residuals from the model will tell you which teams are better or worse than average at
converting runs (scored and allowed) into wins. Teams with positive residuals have above
average %wins given their RS/RA, and accordingly are better at scoring runs (and preventing
opponents from doing so) when that is needed to win games. Teams with negative residuals run
up the score, but have a harder time scoring runs when they need them to win close games.
On the Answer Sheet, list the two best and two worst 2016 KBO teams in terms of converting
runs into wins according to your analysis, which controls for RS/RA.
NBA Sharpshooters: 2015-16… and Not!
[Figure: 2015-16 shot charts (axes: left, top) for two players.
Stephen Curry: Shots 2015-16 (dist: 17.2785; %: .4947257; shots: 1,896)
DeAndre Jordan: Shots 2015-16 (dist: 1.2537; %: .70; shots: 540)]
Stephen Curry had a remarkable 2015-16 NBA season, shooting with astounding success over
the course of the season. But so did other NBA players, such as BC’s own Jared Dudley, J.J.
Redick and Kyle Korver. Here are 2015-16 average shot distances and success rates for those
four players:
5
The data were gleaned from http://www.koreabaseball.com/Default.aspx . I initially considered
http://www.baseball-reference.com/register/league.cgi?id=dffe1d5e but did not find their data to be reliable.
Player          avgDist   %made   nShots
Stephen Curry   17.28     49.5%   1,896
Jared Dudley    18.73     47.8%     483
J.J. Redick     19.33     47.4%     949
Kyle Korver     20.75     43.7%     679
As you move down through the table, average distances increase and success rates decline. So
how can you compare these and other NBA players?
SLR analytics to the rescue!
To estimate the average relationship between average shot distances and success rates for NBA
players in the 2015-16 season, you’ll be estimating the slope and intercept parameters for an SLR
model: $swish = \beta_0 + \beta_1 dist$, using data posted to NBAshots16.xlsx (see footnote 6). For this model, the y
variable, swish, is a player’s success rate, and the x variable, dist, is the average shot distance
(in ft.) for that player.
[Figure: histogram; horizontal axis: (count) dist (0 to 2000), vertical axis: Frequency]
NBA players in the dataset took as few as just one shot in
the entire season, and as many as 1,896 (guess who?) shots. And they averaged 460 shots
(median = 369). Since the bottom 25% of the NBA 2015-16 players took fewer than 131 shots,
limit your analysis to players with nshots > 131. I repeat: Limit your analysis to players with
nshots > 131.
Focusing on this SLR model, repeat 2. – 4. above and enter your answers on the Answer Sheet.
When you have finished that: The residuals from the SLR model will tell you the extent to
which players are above or below average shooters, controlling for the average distance of their
shots… and in that sense will enable us to compare Messrs. Curry, Dudley, Redick and Korver.
Working in Stata, generate the residuals and merge players’ names onto the dataset using the
merge command… now you know who’s who!
On the Answer Sheet, and looking at the residuals, list the five best and five worst NBA shooters
according to your analysis, which controls for average shot distance (remember to exclude the
players in the nshots bottom 25th %tile, with 131 or fewer shots). And how would you rank
Messrs. Curry, Dudley, Redick and Korver?
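The filter-fit-rank sequence for this analysis can be sketched in Python as follows. All of the players and numbers below are made up for illustration (the names "Player A" through "Player G" are placeholders, not rows from NBAshots16.xlsx), so treat this as a sketch of the logic rather than a preview of the results.

```python
# (name, average shot distance in ft, success rate, number of shots) -- all made up
players = [
    ("Player A", 17.3, 0.49, 1800),
    ("Player B", 3.1, 0.62, 540),
    ("Player C", 12.4, 0.44, 700),
    ("Player D", 20.8, 0.41, 650),
    ("Player E", 8.9, 0.47, 131),   # dropped: 131 is not strictly above 131
    ("Player F", 15.0, 0.39, 90),   # dropped: too few shots
    ("Player G", 6.0, 0.58, 400),
]

# Limit the analysis to players with nshots > 131
sample = [p for p in players if p[3] > 131]

dist = [p[1] for p in sample]
swish = [p[2] for p in sample]
n = len(sample)

# OLS fit of swish on dist
x_bar, y_bar = sum(dist) / n, sum(swish) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(dist, swish))
      / sum((x - x_bar) ** 2 for x in dist))
b0 = y_bar - b1 * x_bar

# Residual = actual success rate minus the rate predicted at that shot distance
residuals = [y - (b0 + b1 * x) for x, y in zip(dist, swish)]

# Rank from most above-average to most below-average, controlling for distance
ranked = sorted(zip(residuals, (p[0] for p in sample)),
                key=lambda pair: pair[0], reverse=True)

for resid, name in ranked:
    print(f"{name}: residual = {resid:+.4f}")
```

A positive residual means the player makes shots at a higher rate than the fitted line predicts for his average distance; the top and bottom of the sorted list are the candidates for best and worst shooters.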
6
The data were gleaned from game-by-game shots charts posted to basketball-reference.com. Here’s an example:
http://www.basketball-reference.com/boxscores/shot-chart/201606190GSW.html.