
The "missing variables" problem

1. The basic regression model
Consider the basic regression model
(1)
Yi = a + b1X1i + b2X2i + b3X3i + b4X4i + …
where i subscripts denote individual observations (e.g., people). (Thus, Yi refers to the value of Y for
individual i and likewise for the X's.) If we can't measure X2, X3, X4, etc., then we may as well treat the
terms involving these variables as a single variable U, referring to omitted variables: thus, for simplicity,
we may now write
(2)
Yi = a + bXi + Ui
where b is shorthand for b1, X is shorthand for X1, and U is shorthand for the "omitted variables," i.e., for
the term b2X2 + b3X3 + b4X4 + … in equation (1) above. (Note that, by definition, an increase in U must
increase the dependent variable, Y.)
Now consider the derivation of the OLS estimate of the parameters of model (2). First, average
the data using (2), and note that, by (2), we must have
(3)
Ȳ = a + bX̄ + Ū
where bars over the variables denote means. (For example, to get Ȳ, add up the individual values of Y
and divide by the total number of individuals.) Next, let lower-case letters denote the difference between
the value of a variable and its mean, e.g., yi = Yi – Ȳ. (This is often called the "mean deviation" of the
variable.) We can get this by subtracting (3) from (2), i.e.,
     Yi = a + bXi + Ui
– ( Ȳ = a + bX̄ + Ū )
____________________
(4)
yi = bxi + ui
Now for the tricky part. Take equation (4) and multiply each yi deviation by its corresponding xi
deviation, to get
(5)
yixi = bxi2 + uixi
Next, add up the individual values in (5), to obtain
(6)
Σyixi = b Σxi2 + Σuixi
Next, divide both sides of (6) by Σxi2, to obtain
(7)
Σyixi / Σxi2 = b + Σuixi / Σxi2
We can certainly measure Σyixi and Σxi2 – the values of these two terms can be obtained directly from the
data on Y and X. However, seemingly, we have no way to figure out what b is, unless we can say
something about the numerator of the second term on the right-hand side of (7), i.e., about Σuixi . But
what can we say about this?
The key insight that we need here is that if U and X are uncorrelated, then Σuixi will equal zero in
expected value. Remember that ui = Ui – Ū, and that xi = Xi – X̄. Thus, if Ui and Xi are uncorrelated,
then, whenever xi > 0 (i.e., if Xi is "above average," in the sense that xi = Xi – X̄ > 0), ui will have no
particular tendency to be either positive or negative (so that uixi will be expected to equal zero).
Likewise, whenever xi < 0 (i.e., if Xi is "below average," in the sense that xi = Xi – X̄ < 0), ui will once
again have no particular tendency to be either positive or negative (so that, once again, uixi will be
expected to equal zero).
So, if U and X are uncorrelated, then (7) tells us that
(8)
b̂ = Σyixi / Σxi2
where the "hat" over the b in equation (8) means that it is an estimate of the true value of b, derived on
the assumption that U and X are uncorrelated. In fact, (8) is the expression for the conventional ordinary
least squares (OLS) estimate of the slope parameter, in the regression of Y on X as given by equation
(2). Equivalently, from (7) and (8) we may write
(9)
b̂ = b + Σuixi / Σxi2
so that, if Σxiui equals zero in expected value, then the expected value of the estimate of b, b̂, will be
equal to the true (or actual) value, b.
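This derivation can be checked numerically. Below is a minimal simulation sketch (the variable names and parameter values are illustrative choices, not from the text): data are generated from model (2) with U uncorrelated with X, and the estimate of b is computed directly from the mean-deviation formula in equation (8).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a, b = 2.0, 3.0                  # true intercept and slope (illustrative)

X = rng.normal(size=n)
U = rng.normal(size=n)           # "omitted variables" term, uncorrelated with X
Y = a + b * X + U                # equation (2)

# Mean deviations, as in equation (4)
x = X - X.mean()
y = Y - Y.mean()

# Equation (8): b_hat = sum(y_i * x_i) / sum(x_i^2)
b_hat = (y * x).sum() / (x * x).sum()
print(b_hat)
```

With U and X uncorrelated, b_hat lands very close to the true b = 3, consistent with the unbiasedness claim.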
2. The "missing variables" problem
The derivation in Part 1 shows that, even if there are missing variables, the OLS estimate of the
parameter b may nevertheless be unbiased (i.e., its expected value will equal the true value). Thus, the
"missing variables" problem may not be a problem at all! There is a key proviso to this conclusion,
however: in order for the OLS estimate of b, b̂, to equal the true value, b, in expected value, it must be
Σxiui equals zero in expected value. What happens, however, if this condition doesn't hold?
First, suppose that u and x are positively correlated. In this case, whenever ui is "above average"
(i.e., ui = Ui – Ū > 0), xi will also tend to be "above average" (i.e., xi = Xi – X̄ > 0). This means that, on
average, each xiui term will tend to be positive in expected value. It follows that, in this case, Σxiui will
also be positive in expected value, and so, by (9), the OLS estimate of b, b̂, will be upward-biased: in
other words, by (9), b̂ will equal b plus a number that is positive in expected value – so that the expected
value of b̂ will be greater than the true value, b.
Likewise, if u and x are negatively correlated, then whenever ui is "above average" (i.e., ui = Ui –
Ū > 0), xi will tend to be "below average" (i.e., xi = Xi – X̄ < 0). This means that, on average, each xiui
term will tend to be negative in expected value. It follows that, in this case, Σxiui will also be negative in
expected value, and so, by (9), the OLS estimate of b, b̂, will be downward-biased: in other words, by
(9), b̂ will equal b (the true value) plus a number that is negative in expected value – so that the expected
value of b̂ will be smaller than the true value, b.
Thus, the crucial question about the "missing-variables" problem has to do with whether the
missing-variables term (the U of equation (2)) is or is not correlated with the included variable, X:

• If U and X are uncorrelated, then the OLS estimate of the coefficient on X – in other words, b̂
  – will be unbiased.
• If U and X are positively correlated, then the OLS estimate of b will be upward-biased.
• If U and X are negatively correlated, then the OLS estimate of b will be downward-biased.
For example: suppose we're analyzing pay (Y), and that only two variables affect pay: education
(E) and ability (A). Suppose we can measure education, E, but that we don't have a measure of ability, A.
Thus, A is an "omitted variable," and we will have to run the regression of Y on the single independent
variable we are able to measure, education (E). Our regression will be
(10)
Y = a + bE + U
where U here denotes ability (= A), the "missing variable." (Remember that, given the way (10) is
written, an increase in U must increase Y. By assumption, however, U simply refers to ability, and the
notion that higher ability is associated with higher pay, ceteris paribus, would probably not be terribly
controversial.)
As just noted, the mere fact that A isn't measured does not, in and of itself, mean that the
OLS estimate of the effect of education on pay, b̂, will be biased. However, what if the people who have
more education are typically the more able persons in the population – in other words, what if E and U
(or, equivalently, E and A) are positively correlated? In this case, as we just saw, the OLS estimate of b
will be upward biased: in other words, ability (U) raises pay, but – both because we can't measure ability
and because education is positively correlated with ability – when we see pay rising, we give all of the
"credit" to having more education, even though some of the credit should go to the greater ability that
more-educated people tend to have.
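Under the assumptions of this example, the upward bias can be demonstrated with a short simulation (the coefficients and noise terms below are illustrative choices of mine): education E is generated to be positively correlated with unmeasured ability A, and the regression of pay Y on E alone over-credits education.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Ability A is the omitted variable; education E is positively
# correlated with A (more able people tend to get more schooling).
A = rng.normal(size=n)
E = 12 + 2 * A + rng.normal(size=n)              # corr(E, A) > 0
Y = 5 + 1.0 * E + 2.0 * A + rng.normal(size=n)   # true effect of E is 1.0

# OLS of Y on E alone, via equation (8)
e = E - E.mean()
y = Y - Y.mean()
b_hat = (y * e).sum() / (e * e).sum()
print(b_hat)
```

The estimate comes out well above the true value of 1.0: some of ability's effect on pay is wrongly credited to education, exactly as the text describes.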
One final comment: the missing variables argument gets somewhat trickier when we have
multiple measured variables that are included in the regression. In this case, our general propositions
about whether the regression parameter for a particular variable X will be unbiased or biased go through
as before, provided we modify them to add, "with all other included variables held constant." Thus, for
example, suppose we have the regression equation
(11)
Yi = a + b1X1i + b2X2i + b3X3i + b4X4i + Ui
where the variables X1 through X4 are measured, and where U refers to omitted variables. Suppose we
are interested in the estimate of the coefficient b1. Then …
• If U and X1 are uncorrelated (with all other included variables – X2, X3, and X4 – held constant),
  then the OLS estimate of the coefficient on X1 – in other words, of b1 – will be unbiased.
• If U and X1 are positively correlated (with all other included variables – X2, X3, and X4 – held
  constant), then the OLS estimate of b1 will be upward-biased.
• If U and X1 are negatively correlated (with all other included variables – X2, X3, and X4 – held
  constant), then the OLS estimate of b1 will be downward-biased.
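A sketch of this "held constant" qualification (the data-generating numbers are my own, chosen for illustration): here U is correlated with X1 only through a second included variable, X2. The short regression of Y on X1 alone is then biased, while the regression that also includes X2, thereby holding X2 constant, recovers b1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
b1, b2 = 1.0, 0.5

X2 = rng.normal(size=n)
X1 = X2 + rng.normal(size=n)   # X1 correlated with X2
U  = X2 + rng.normal(size=n)   # U correlated with X1, but only via X2

Y = 1.0 + b1 * X1 + b2 * X2 + U

# Regression of Y on X1 alone: U and X1 are correlated, so b1 is biased upward.
x1 = X1 - X1.mean()
b1_short = ((Y - Y.mean()) * x1).sum() / (x1 * x1).sum()

# Regression of Y on X1 AND X2: with X2 held constant, U and X1 are
# uncorrelated, so the estimate of b1 is (approximately) unbiased.
Z = np.column_stack([np.ones(n), X1, X2])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
b1_long = coef[1]

print(b1_short, b1_long)
```

The short regression's estimate lands well above the true b1 = 1.0, while the estimate from the regression that includes X2 lands essentially on it.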