Economic Statistics: Stata Assignment 2

The correlation coefficient can be used in several ways. One is for testing
whether a “meaningful relationship” exists between two variables. Another is for
calculating the predicted relationship between two variables. Suppose we
wanted to calculate how a vehicle’s weight affects its fuel efficiency. We might
set up the equation as:
Estimated MPG = a + b * weight
We can do this for any pair of variables. The predicted relationship between
some outcome y and some explanatory variable x is:
Estimated y = a + b * x
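There are two formulas that we use to calculate the values of a and b:
b = r_xy ⋅ (s_y / s_x)
a = ȳ − b ⋅ x̄
Here r_xy is the correlation between x and y, s_x and s_y are their standard
deviations, and x̄ and ȳ are their means.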
Once we calculate these values, we have the relationship between the variables.
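As a rough illustration of the two formulas, here is one way to compute a and b by hand in Stata. This is only a sketch: it assumes the auto dataset that ships with Stata (loaded with sysuse auto) as a stand-in for the fuel efficiency data, and the scalar names r_xy, ybar, xbar, s_y, and s_x are my own choices, not anything Stata requires.

```stata
* Assumption: auto.dta (bundled with Stata) stands in for the course data
sysuse auto, clear

* Correlation between mpg (y) and weight (x); correlate stores it in r(rho)
quietly correlate mpg weight
scalar r_xy = r(rho)

* Mean and standard deviation of y
quietly summarize mpg
scalar ybar = r(mean)
scalar s_y  = r(sd)

* Mean and standard deviation of x
quietly summarize weight
scalar xbar = r(mean)
scalar s_x  = r(sd)

* Slope and intercept from the two formulas
scalar b = r_xy * s_y / s_x
scalar a = ybar - scalar(b) * xbar

* scalar() makes sure Stata reads a and b as scalars, not as abbreviated
* variable names (for example, "a" could otherwise be read as "age")
display "b = " scalar(b) "   a = " scalar(a)
```

The quietly prefix just hides the tables; drop it if you want to see the means and standard deviations that go into the calculation.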
Back to the issue of “meaningful relationships” or “statistically significant”
relationships. Stata has two commands that will give you the correlation between
variables. On their own, pwcorr x y and corr x y give the same output. (The
first command stands for “pairwise correlation”.) However, they accept different
options. I frequently find myself using these two variations on the basic
commands: corr x y, cov to give the covariance between variables, and
pwcorr x y, sig to give the “significance level” of a correlation; that is, the
probability of observing a correlation at least this strong purely by chance if
the two variables were actually unrelated. Formally, this probability is known as
a “p-value”, and a low p-value is strong evidence of a true relationship. If you
want some rather arbitrary guidelines for interpretation,
here they are:
p-value                  Interpretation
0.00 ≤ p-value < 0.01    Extremely convincing evidence of a relationship
0.01 ≤ p-value < 0.05    Convincing evidence of a relationship
0.05 ≤ p-value < 0.10    Suggestive of a relationship
0.10 ≤ p-value ≤ 1.00    Probably no relationship
If economists were jurists, p-values less than 0.05 would constitute “beyond a
reasonable doubt”.
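To see how the correlation commands differ in practice, you could try something like the following. It is a sketch that assumes Stata's bundled auto dataset rather than any of the course files; mpg and weight are variables in that dataset.

```stata
* Assumption: auto.dta (bundled with Stata) is used purely for illustration
sysuse auto, clear

* Plain correlation between the two variables
corr mpg weight

* Same command with the cov option: variances and the covariance instead
corr mpg weight, cov

* Pairwise correlation with the p-value printed beneath the coefficient
pwcorr mpg weight, sig
```

The number printed under the correlation in the pwcorr output is the p-value to compare against the guidelines above.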
When we graph the relationship between two variables in Stata, we usually use
the command graph twoway followed by additional instructions about the type
of graph and the variables we use. Typically, the basic command looks like:
graph twoway (graphtype ya yb yc x)
We replace the word “graphtype” with the actual type of graph we’re using:
scatter for a scatterplot, line for a line graph (like a time series), and so forth. You
can read “help graph twoway” for more suggestions. The type of graph is always
followed by a list of variable names. The last one is the independent variable on
the horizontal axis. The others are all measured on the vertical axis. There can be
one or several of these. For example,
graph twoway (line le_male year)
would produce a time series graph showing le_male over time, while
graph twoway (line le_male le_female year)
would produce a time series graph tracking both the variables le_male and
le_female over time.
We can also combine multiple graphs into the same image. Each graph is listed
within a set of parentheses. Another way to show the time-series graphs for the
variables le_male and le_female would be:
graph twoway (line le_male year) (line le_female year)
The advantage of this command is that we can combine different types of graphs.
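For instance, one series could be drawn as a line and the other as points. The sketch below assumes that the variables le_male, le_female, and year come from Stata's bundled uslifeexp dataset; the names match, but that is my guess about the data source.

```stata
* Assumption: the life expectancy variables come from uslifeexp.dta,
* which ships with Stata
sysuse uslifeexp, clear

* One image, two graph types: a line for men and a scatterplot for women
graph twoway (line le_male year) (scatter le_female year)
```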
When I create a scatterplot, I usually like to add a graph of the predicted
relationship between the two variables to the picture. In Stata, we can get the
predicted relationship by using “lfit” (linear fit) as the graphtype. You could
see this line on its own by typing:
graph twoway (lfit y x)
To show it on top of the scatterplot, you would do:
graph twoway (scatter y x) (lfit y x)
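A slightly more polished version, with axis titles and a saved copy for printing, might look like the sketch below. It assumes Stata's bundled auto dataset, and the file name mpg_weight.png is only a placeholder. The /// line continuations work in a do-file but not when typing in the command window, and which export formats are available can depend on your installation.

```stata
* Assumption: auto.dta (bundled with Stata) stands in for the course data
sysuse auto, clear

* Scatterplot with the linear fit drawn on top, plus readable labels
graph twoway (scatter mpg weight) (lfit mpg weight), ///
    ytitle("Miles per gallon") xtitle("Vehicle weight (lbs.)") ///
    legend(order(1 "Observed" 2 "Predicted"))

* Save the graph that is currently displayed (hypothetical file name)
graph export "mpg_weight.png", replace
```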
For each of the problems, do the following:
a. Create a scatterplot with a line showing the relationship between the two
variables. (Print this for me.)
b. Find the correlation between the two variables and its significance level.
Does this suggest that a relationship is likely? (Write these answers on
your assignment.)
c. Find the mean and standard deviation of each variable, plus the
covariance between the two variables. Use these to calculate the values of
a and b in the prediction line y = a + b ⋅ x. (Write all of these numbers in
your answer.)
d. Calculate the prediction line using regression in Stata. Show me your
regression output, and then write out the sentences: “The predicted
relationship is y = 3.14 + 1.59 ⋅ x. In other words, when x increases by one,
then y goes up by 1.59 on average.” (Of course, replace the values in the
equation with the values that you calculate. Also replace the variable
names, and units if necessary.)
1. Use the CPS data on workers and their earnings: the outcome (y variable) is
salary, and the explanatory variable (x variable) is years of schooling.
2. Use the data on fuel efficiency: the outcome (y variable) is mpg, and the
explanatory variable (x variable) is vehicle weight.
3. Use the data on Econ 400 students: the outcome (y variable) is height, and the
explanatory variable (x variable) is age.
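To give a sense of how steps a through d fit together, here is a sketch of a short do-file. It assumes Stata's bundled auto dataset in place of the course data, so the variable names (mpg, weight) will need to be swapped for the ones in each problem's dataset.

```stata
* Assumption: auto.dta (bundled with Stata) stands in for the course data
sysuse auto, clear

* a. Scatterplot with the fitted line on top
graph twoway (scatter mpg weight) (lfit mpg weight)

* b. Correlation and its significance level
pwcorr mpg weight, sig

* c. Means, standard deviations, and the covariance
summarize mpg weight
corr mpg weight, cov

* d. Regression: the coefficient on weight is b, and _cons is a
regress mpg weight
display "Estimated mpg = " _b[_cons] " + " _b[weight] " * weight"
```

The values of a and b from the regress output should agree with the ones calculated by hand in step c, apart from rounding.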