Class 10

(And some of 9)
Line fitting
OLS regression
Discussion of time series and panel models
Project check in
Statistical Significance
• Research and null hypotheses
– A hypothesis states the relationship between two variables.
– The null hypothesis states that there is NO (or only a random) relationship between two variables.
• Example
– H: Democracies trade more with each other than with non-democracies.
– H0: Status as a democracy is not related to trade volume.
• You are testing to reject H0, not to accept H.
Types of Error
                            State of Nature
Decision based on Sample    H0 true                      H0 untrue
Reject H0                   Type 1 error (false alarm)   Correct
Do not reject H0            Correct                      Type 2 error
Alpha level
= .05: if H0 is true, there is a 5% chance of committing a Type 1 error (rejecting H0 anyway) and a 95% chance of correctly not rejecting it.
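One way to see what the alpha level means is to simulate it. The sketch below is only an illustration under stated assumptions: the data are random draws from a population in which H0 (mean = 0) is true by construction, the standard deviation is treated as known, and the function name `type1_error_rate` is made up for this example. The share of samples in which H0 is rejected should come out close to .05.

```python
import math
import random


def type1_error_rate(trials=2000, n=30, z_crit=1.96, seed=0):
    """Share of samples in which a true H0 (mean = 0) is rejected.

    Assumption: observations are standard normal, so H0 holds by construction
    and every rejection is a Type 1 error (false alarm).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic for H0: mean = 0, with the standard deviation known to be 1
        z = (sum(sample) / n) * math.sqrt(n)
        if abs(z) > z_crit:  # 1.96 is the two-sided cutoff for alpha = .05
            rejections += 1
    return rejections / trials


rate = type1_error_rate()
print(f"Share of false alarms: {rate:.3f}")
```

Changing `z_crit` to a larger value lowers the share of false alarms, at the cost of making Type 2 errors more likely.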
Causality
• In establishing causality there is a
dependent variable, which you are
trying to explain, and one or more
independent variables that are
assumed to be factors in the variation
of the dependent variable.
• You need a logical model to “explain”
this relationship or causality
Thinking in Models (again)
• What is a model?
– Explains which elements relate to each
other and how.
– Describing Relationships in a model
• Covariation – variables change together
– Direct or Positive
– Inverse or Negative
– Nonlinear
• False or spurious
– Control (confounding) variables
• Are you looking for the best model or
testing someone else’s?
Developing models
• Where does a model come from?
– From your own assessment and observation
of the problem, or from talking to others.
– From the literature.
• Elements others include or consider important
• Definitions of these elements
• Descriptions of the “expected” relationships
among variables
• Results and explanations
• Sources and strategies for data
• Suggestions of models or variations to be tested
in the future
Types of Models
1. Schematic
Capital → Econ Growth ← Labor
2. Symbolic
a) Economic growth is a function of
changes to the amount of capital (K)
and changes to the amount of Labor (L).
b) G=f(K,L)
The basic linear model (equation)
You can express many relationships as the linear equation:
y = a + bx, where
• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal to b
A perfectly linear relationship is one where each change in x results in exactly the same change in y, e.g. a strict ad valorem tariff.
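The tariff example can be written out as y = a + bx. This is a minimal sketch: the 10% rate, the zero constant, and the function name `tariff_due` are assumptions chosen for illustration, not figures from any real tariff schedule.

```python
import math


def tariff_due(import_value, rate=0.10, fixed_fee=0.0):
    """y = a + b*x, with a = fixed_fee, b = rate, x = import_value (all hypothetical)."""
    return fixed_fee + rate * import_value


# Each extra 100 of import value adds the same amount of tariff: the slope b is constant.
for value in (100, 200, 300):
    print(value, tariff_due(value))

step_1 = tariff_due(200) - tariff_due(100)
step_2 = tariff_due(300) - tariff_due(200)
print("Same change each time:", math.isclose(step_1, step_2))
```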
Line Fitting
Other relationships may not be so exact.
Weight is only to some degree a function of height.
If you take a sample of actual heights and weights, you might see something like the graph.
[Figure: scatter plot of Weight (axis 100 to 220) against Height (axis 60 to 75)]
Source: http://www.tennessee.gov/tacir/Fiscal%20Capacity/Workshop/Regression%20Analysis%20Handout%20(Methodology%20Part%201).ppt
Line Fitting (cont.)
The line is the “average” relationship described by the equation:
y = a + bx + e
[Figure: the Weight against Height scatter plot with the fitted line]
The difference between the line and any individual observation is the error (e).
The observations that contributed to this analysis were all for heights between 5’ and 6’4”. You cannot extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
Regression
Regression is the method by which we find the line that best fits
the observations, i.e. has the lowest error.
Since the line describes the mean of the effects of the
independent variables, by definition, the sum of the actual
errors will be zero.
If you add up all of the values of the dependent variable and you
add up all the values predicted by the model, the sum is the
same and the sum of the negative errors (for points below the
line) will exactly offset the sum of the positive errors (for
points above the line).
Therefore, summing the errors always gives zero and says nothing about how well the line fits. So, instead, regression must find another way to measure the scale of the error. An Ordinary Least Squares (OLS) regression finds the line that results in the lowest sum of squared errors.
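Both points can be checked on a small example. The sketch below uses the standard closed-form OLS formulas for a single independent variable. The height and weight figures are invented for illustration (they are not the data behind the graph), and the function names are hypothetical.

```python
def ols_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # constant: the line passes through the means
    return a, b


def errors(xs, ys, a, b):
    """Observed y minus the value predicted by the line, for each observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up sample, for illustration only.
heights = [60, 62, 64, 66, 68, 70, 72, 74]
weights = [115, 120, 138, 142, 160, 158, 181, 190]

a, b = ols_line(heights, weights)
resid = errors(heights, weights, a, b)
sse = sum(e ** 2 for e in resid)

# Any other line, here one with a steeper slope, has a larger sum of squared errors.
sse_other = sum(e ** 2 for e in errors(heights, weights, a, b + 0.5))

print("Sum of errors (about zero):", round(sum(resid), 9))
print("Sum of squared errors, OLS line:", round(sse, 2))
print("Sum of squared errors, other line:", round(sse_other, 2))
```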
Multiple Regression
• What if we have multiple factors
contributing to a result or a
prediction?
– For example basic economic theory
suggests that capital and labor contribute
to economic growth.
– Hard to “see” how these two factors
contribute to growth.
The multiple regression equation
Each of these factors has a separate relationship with economic growth. The equation that describes a multiple regression relationship is:
y = a + b1L + b2K + e
This equation separates each individual independent variable from the rest, allowing each to have its own coefficient describing its relationship to the dependent variable. If Labor and Capital have the same coefficient, then a one-unit change in either is associated with the same change in economic growth (the comparison is only meaningful if the two are measured in comparable units).
In a statistics software program you will enter your dependent
variable first and then your independent variables.
You will need to make sure the data and the variables conform
to the assumptions of the model
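To show what the software does with those inputs, here is a minimal sketch that estimates y = a + b1L + b2K by solving the least-squares normal equations in plain Python. The labor, capital, and growth figures are invented for illustration, and `solve` and `ols` are hypothetical helper names; a statistics package would normally do this for you and report standard errors as well.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def ols(columns, y):
    """columns: one list per independent variable. Returns [a, b1, b2, ...]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: dependent variable first, then the independent variables.
growth = [2.1, 2.4, 3.9, 4.0, 5.8, 5.9]
labor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
capital = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

a, b1, b2 = ols([labor, capital], growth)
residuals = [g - (a + b1 * l + b2 * k) for g, l, k in zip(growth, labor, capital)]
print(f"a = {a:.3f}, b1 (Labor) = {b1:.3f}, b2 (Capital) = {b2:.3f}")
```

The residuals are the e term in the equation: as in the one-variable case, they sum to zero.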
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.803316
R Square             0.645316
Adjusted R Square    0.639305
Standard Error       1.788119
Observations         121

ANOVA
              df     SS          MS          F          Significance F
Regression    2      686.446     343.223     107.3455   2.76E-27
Residual      118    377.2895    3.197368
Total         120    1063.735

                     Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept            -15.0184       2.064116         -7.27595   4.12E-11   -19.1059    -10.9309    -19.1059      -10.9309
ln pop2001           0.656415       0.106162         6.183167   9.31E-09   0.446186    0.866643    0.446186      0.866643
ln GDP per capita    1.490166       0.104154         14.30738   1.53E-27   1.283913    1.696418    1.283913      1.696418
How good is the model?
• The R2 value tells you what proportion of differences is explained by the model. An R2 of .68, for example, means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% of those differences remains unexplained in the error term.
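The R Square and Adjusted R Square figures in the Summary Output can be reproduced from its ANOVA sums of squares. This is a short sketch using the standard definitions; the variable names are chosen here for readability and are not Excel's.

```python
# Figures copied from the ANOVA block of the Summary Output.
ss_regression = 686.446   # variation explained by the model
ss_residual = 377.2895    # variation left in the error term
ss_total = 1063.735
n_obs = 121
n_predictors = 2          # ln pop2001 and ln GDP per capita

# R2 is the explained share of the total variation.
r_squared = ss_regression / ss_total

# Adjusted R2 penalizes the model for the number of independent variables used.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(f"R Square: {r_squared:.6f}")
print(f"Adjusted R Square: {adj_r_squared:.6f}")
print(f"Share left unexplained: {ss_residual / ss_total:.6f}")
```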
Returning to the model of economic
growth… Is explaining 50% of the
causes good enough?
How much should you explain?
• Random error need not be a problem.
– There is always error; a larger R-square is not a goal in and of itself.
• Some error is due to latent variables that cannot be observed.
– There may be additional variables that can be logically assumed to measure these causes of variation indirectly in some way.
– But even if they empirically appear to “explain” the variation within the regression model, variables should not be added unless there is a logical way in which they might explain variation in the dependent variable.
Statistical Significance
• Each independent variable has a “p-value” or significance level in the results. Sometimes it is given explicitly; sometimes only the test statistic from which significance can be derived is reported.
• The p-value is a probability. It tells you how likely it is that a coefficient at least this far from zero would emerge by chance if the independent variable had no real relationship with the dependent variable (i.e. if H0 were true). Rejecting H0 when it is actually true is a type I error.
• A p-value of .05 means that, if there were no real relationship, a coefficient this large would still emerge randomly 5% of the time. It is not, strictly, a 95% chance that the relationship is real.
• It is generally accepted practice to consider variables with a
p-value of less than .1 as significant, though the only basis
for this cutoff is convention.
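When only the coefficient and its standard error are reported, the test statistic can be derived as coefficient divided by standard error. The sketch below does this for the rows of the Summary Output and rebuilds the 95% intervals. One assumption to flag: `T_CRIT` is an approximate two-sided 5% critical value of Student's t for 118 degrees of freedom, typed in by hand rather than computed.

```python
# (coefficient, standard error) pairs copied from the Summary Output.
rows = {
    "Intercept": (-15.0184, 2.064116),
    "ln pop2001": (0.656415, 0.106162),
    "ln GDP per capita": (1.490166, 0.104154),
}

# Assumption: approximate critical t value for a two-sided 5% test with 118 df.
T_CRIT = 1.9803


def t_stat(coef, std_err):
    """How many standard errors the coefficient lies from zero."""
    return coef / std_err


def conf_interval(coef, std_err, t_crit=T_CRIT):
    """Lower and upper bound of the 95% confidence interval."""
    return coef - t_crit * std_err, coef + t_crit * std_err


for name, (coef, std_err) in rows.items():
    low, high = conf_interval(coef, std_err)
    significant = abs(t_stat(coef, std_err)) > T_CRIT
    print(f"{name}: t = {t_stat(coef, std_err):.3f}, "
          f"95% interval = ({low:.4f}, {high:.4f}), significant at .05: {significant}")
```

A coefficient whose 95% interval excludes zero is the same thing as a p-value below .05.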
Direction and Size
• Look at the signs of the b coefficients. Do they have the expected signs?
– Your model and hypothesis should give you an expectation of the direction of each independent variable’s influence.
• Is the effect large or small?
– Even if it is significant and in the right direction, does a change in the independent variable yield a large or small change in the dependent variable?
F-Test
• There is also a significance level for the model as a whole.
• The F-test or “Significance F” value in Excel reports how likely it is that a model fitting this well would emerge at random if none of the independent variables were really related to the dependent variable.
• As with the p-value, the lower the Significance F value, the stronger the evidence that the relationships in the model are real.
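The F statistic behind “Significance F” is the ratio of the explained to the unexplained mean square. As a quick check, this sketch recomputes it from the ANOVA figures in the Summary Output; the variable names are chosen for this example only.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA block.
ss_regression, df_regression = 686.446, 2
ss_residual, df_residual = 377.2895, 118

# Mean square = sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F compares variation explained per predictor with variation left per residual df.
f_stat = ms_regression / ms_residual

# The "Standard Error" in the Regression Statistics is the root of the residual mean square.
standard_error = math.sqrt(ms_residual)

print(f"F = {f_stat:.4f}")
print(f"Standard error of the regression = {standard_error:.6f}")
```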
Other Errors or Problems
• Multicollinearity
• Omitted Variables
• Endogeneity
• Other
Presenting Regression Results