The Assumptions of the Classical Regression Model

Assumptions of the Regression Model
These assumptions are broken down into parts to allow discussion case-by-case. The first
assumption, model produces data, is made by all statistical models. That's what a statistical
model is, by definition: it is a producer of data. It is an assumption that your data are generated
by a probabilistic process. The rest of the assumptions characterize the process more
specifically.
0. Model Produces Data Assumption
Have you ever heard the phrase "model produces data"? Well, it’s true. Statistical models,
including regression models, are statements about how the data are produced, about how they
will appear. They are quantifications of your subject matter theory. If you are writing a research
paper, you will typically explain all this theory in words, and then you will write your model,
which is a concise summary of all that you have just said with your words. Then you will state
your research hypotheses, which are also statements about how your data will appear, and are
also quantified in terms of your data-producing model.
Usually, you don't see any "model produces data" assumption explicitly stated in research
articles or other texts. Instead, the assumption is implicit, often stated in model form such as
"Y = 0 + 1x + , where  is random variation."
The fact that "random variation" is specified for  means that the model assumes random
generation of the data. In other words, it is an assumption that the data are produced from a
probabilistic model. This assumption is necessary because the data Y are not a deterministic
function of X. If your relationships are in fact deterministic, then drop this class immediately!
You should be studying differential equations instead.
Here is the specific assumption:
Model produces data assumption: For every X (boldface indicates a vector, or list of values; X
is a (1×k) vector in multiple regression and (1×1) scalar in simple regression), the value of Y is
produced at random from a probability distribution. This distribution is allowed to depend on the
specific value(s) x of the random vector X: Y|X=x ~ p(y|x). In other words, the regression
model states that for a given X=x, the value of Y is produced by the model p(y|x).
Recall that p(y|x) is called the conditional distribution of Y given X = x. This entire course is
about specific models for these conditional distributions p(y|x).
(Note: It is possible that Y is independent of X, in which case Y|X=x ~ p(y), a distribution that is
the same no matter what x is, and this violates no assumption of the regression model.)
Example: The number of bottles of wine purchased by a customer is modeled by the Poisson
distribution, with a mean that depends on X=time in store. Here
p(y|x) = e^(−λ(x)) λ(x)^y / y! .
The dependence of the distribution p(y|x) on X=x may be expressed by
λ(x) = exp(β0 + β1x).
The parameters β0, β1 are unknown and can be estimated using data. (Recall: Model has
unknown parameters. Data reduce the uncertainty about the unknown parameters.)
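Here is a small sketch (in Python) of how this model "produces data": draw Y from a Poisson distribution whose mean λ(x) depends on time in store. The parameter values and sample size are invented for illustration only.

```python
# Minimal sketch of the wine-purchase example: the model produces the data.
# The values of beta0, beta1 and the range of x are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = -1.0, 0.05            # hypothetical "true" parameters
x = rng.uniform(5, 60, size=200)     # time in store, in minutes

lam = np.exp(beta0 + beta1 * x)      # lambda(x) = exp(beta0 + beta1*x)
y = rng.poisson(lam)                 # Y | X = x ~ Poisson(lambda(x))

print(y[:10])                        # counts of bottles purchased
print(lam[:10].round(2))             # the conditional means that produced them
```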
Note that the random generation assumption (0.) by itself makes no assumption about
distributions, Poisson, normal or otherwise, and makes no specific assumptions about the
functional relationships between Y and X (linear, quadratic, logarithmic, etc.). As such, the
model is fairly generic, and therefore quite benign. Statistical data really do look as if generated
from distributions, simply because randomly produced data exhibit variability, and because
variability is real. The "model produces (random) data" assumption is therefore quite realistic, in
that the data produced by such a model look like the data you actually see. If there is no
variability in the data that you see, then you should drop this class immediately and take a course
in differential equations instead! (Is there an echo in here?)
The following assumptions make more specific statements about distributions and functional
forms of relationships. They define the classic regression model. When you use the usual output
from any standard regression software, you are making all these assumptions. These assumptions
are very restrictive, though, and much of the course will be about alternative models that are
more realistic. Much of the course is also about identifying when you can use the more
restrictive models, despite their being wrong. Recall: All models are wrong but some are useful.
(- George Box)
1. Correct Functional Specification Assumption
The means of the conditional distributions p(y|x) (a different mean for every x) fall exactly on a
function that is in the family f(x; β) that you specify, for some vector β of fixed, unknown
parameters. (Note 1: Model has unknown parameters. Note 2: Some of the values in the vector
β can be 0's without violating the assumption.)
Examples:
• Linear Regression: you assume f(x; β) = β0 + β1x. (Note: flat lines are in this family and
hence do not violate the assumption.)
• Quadratic Regression: you assume f(x; β) = β0 + β1x + β2x². (Note: straight lines and flat
lines are in this family and hence do not violate the assumption.)
• Another Example: you assume f(x; β) = β0 + β1(1/x).
• Multiple regression with interaction: you assume f(x; β) = β0 + β1x1 + β2x2 + β3x1x2.
• ... whatever form you assume for f(x; β) is the assumption you make under Assumption
1. See the Poisson regression example above where the assumption was that the means
fall on the function f(x; β) = exp(β0 + β1x).
Note: In quantile regression, this assumption refers to the relationship of a quantile of the
distribution of Y to X = x. For example, the median of the conditional distribution of Y|X = x
might be assumed to be a linear function of x. Also, in other applications such as logistic
regression, this assumption refers to the relationship of a transformation of the mean of Y (called
a link function), to the X vector.
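To make the idea of a "family" concrete, here is a small sketch (all parameter values invented) of a few of the mean-function families listed above. Assumption 1 says the true conditional means lie exactly on one such function for some value of β.

```python
# Sketch of candidate mean-function families f(x; beta); numbers are invented.
import numpy as np

def f_linear(x, b0, b1):
    # flat lines are the special case b1 = 0
    return b0 + b1 * x

def f_quadratic(x, b0, b1, b2):
    # straight and flat lines are the special cases b2 = 0 (and b1 = 0)
    return b0 + b1 * x + b2 * x**2

def f_reciprocal(x, b0, b1):
    return b0 + b1 * (1.0 / x)

def f_interaction(x1, x2, b0, b1, b2, b3):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

x = np.linspace(1, 10, 5)
print(f_linear(x, 1.0, 0.5))
print(f_quadratic(x, 1.0, 0.5, -0.02))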
2. Constant Variance, or Homoscedasticity Assumption
The variances of the conditional distributions p(y|x) are constant (i.e., they are all the same
number, σ²) for all specific values X=x.
3. Uncorrelated Errors (or Conditional Independence) Assumption
Unlike the previous assumptions, which do not refer to any specific data set, this assumption
refers to the set of DATA that you can collect, with observations i =1, 2, …, n.
The residual i = Yi – f (xi; ) is uncorrelated with the residual j = Yj – f (xj; ), for all sample
pairs (i,j), 1 ≤ i, j ≤ n.
Alternatively, the components of Y(n×1) are independent, given X(n×k )= x(n×k ). This alternative
form is used in maximum likelihood models such as logistic regression.
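One common way this assumption fails is with data collected over time, where one error carries over into the next. Here is a rough sketch (an assumed setup, not from the text) comparing independent errors with autocorrelated errors via the lag-1 correlation of the residuals.

```python
# Independent errors vs. errors that carry over from one observation to the next.
import numpy as np

rng = np.random.default_rng(2)
n = 300

eps_indep = rng.normal(0, 1, n)          # uncorrelated errors
eps_ar1 = np.zeros(n)                    # autocorrelated (AR(1)) errors
for i in range(1, n):
    eps_ar1[i] = 0.8 * eps_ar1[i - 1] + rng.normal(0, 1)

for name, eps in [("independent", eps_indep), ("autocorrelated", eps_ar1)]:
    lag1 = np.corrcoef(eps[:-1], eps[1:])[0, 1]
    print(name, round(lag1, 2))          # near 0 for independent errors
```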
4. Normality Assumption
The conditional probability distribution function p(y|x) is a normal distribution for every X=x
(i.e., for every X=x, the distribution of Y is normal, as opposed to Bernoulli, Poisson,
multinomial, uniform, etc.). Specifically, the normality assumption states that for a given X=x, Y
is produced by a model having the functional form of the normal distribution:
p(y|x) = [1 / (√(2π) σ(x))] exp( −{y − μ(x)}² / (2σ²(x)) )
This model allows nonlinearity, where μ(x) is a curved function, and also allows
heteroscedasticity, where σ(x) is non-constant.
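A sketch of Assumption 4 in isolation (all numbers invented): Y|X=x is normal, but with a curved μ(x) and a non-constant σ(x), so Assumptions 1 (for a linear family) and 2 can still fail.

```python
# Normal conditional distributions with a curved mean and growing spread.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=1000)

mu = np.sin(x) + 0.1 * x**2      # curved mean function mu(x)
sigma = 0.5 + 0.2 * x            # standard deviation sigma(x) grows with x

y = rng.normal(mu, sigma)        # Y | X = x ~ N(mu(x), sigma(x)^2)
print(y[:5].round(2))
```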
Putting Them All Together: The Classical Linear Regression Model
The assumptions 1. – 4. can be all true, all false, or some true and others false. They are not
connected. But when they are all true, and when the function f(x; β) is linear in the β values so
that f(x; β) = β0 + β1x1 + β2x2 + … + βkxk, you have the classical regression model:
Yi | Xi = xi ~ independent N(β0 + β1xi1 + β2xi2 + … + βkxik, σ²), for i = 1, 2, …, n.
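To close the loop, here is a minimal end-to-end sketch: generate data from the classical model with two predictors (parameter values invented), then estimate β by ordinary least squares. Standard regression software assumes exactly this data-producing model.

```python
# Simulate from the classical linear regression model, then fit it by OLS.
import numpy as np

rng = np.random.default_rng(4)
n = 500
beta = np.array([1.0, 2.0, -0.5])    # hypothetical beta0, beta1, beta2
sigma = 1.5                          # constant error standard deviation

# Design matrix: intercept column plus two predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ beta + rng.normal(0, sigma, size=n)   # assumptions 1-4 all hold

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.round(2))             # should be close to the true beta
```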