Let’s quickly review the theoretical simple linear regression model including all
assumptions and then discuss how we estimate the parameters in the model.
Our theoretical model for the i-th individual can be written as:
• Y_i = Beta_0 + Beta_1(X_i) + Epsilon_i
The values in this theoretical model represent those for the entire population.
• Y_i = the value of the outcome for the i-th individual
• X_i = the value of the predictor for the i-th individual
• Beta_0 = the intercept in the entire population – this is a PARAMETER we wish to
estimate
• Beta_1 = the slope in the entire population – this is a PARAMETER we wish to
estimate and interpret
• Epsilon_i = the true error of the i-th individual
These errors (epsilon_i) are also parameters: there is one for each member of the
population. Their estimates will be used to validate our assumptions about the error term,
which are that the errors are
• Normally distributed with a mean of zero and constant variance (denoted by sigma-squared)
• Statistically independent of each other
We will be able to check normality and constant variance through our estimates of these
parameters.
Checking statistical independence is difficult and will generally be assumed to be satisfied
for all datasets we use in examples or assignments.
The estimated errors (the residuals) will automatically have a mean of zero by design of the
estimation process we are about to briefly discuss.
The notation at the top states, in statistical notation, that the epsilon_i values are
independent and identically distributed normal random variables with mean zero and
variance sigma-squared. It is a short-hand for the assumptions in simple linear
regression, and you are welcome to use it or ignore it.
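Written out with the symbols defined above, this short-hand is commonly expressed as follows. This is a sketch in LaTeX; the exact form shown on the slide may differ.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
```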
We wish to estimate beta_0 and beta_1 using a random sample (the dataset at hand)
from the population.
One method for finding the “best line” through the data is ordinary least
squares, or OLS.
Using this method we find estimates of beta_0 and beta_1 that minimize:
• Q = the sum of the squared vertical differences between the data points and the
line.
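In symbols, the quantity being minimized can be sketched as below. Here b_0 and b_1 are labels assumed for this illustration (they are not from the slides) and stand for any candidate intercept and slope, and n is the sample size. The OLS estimates are the pair of values that make Q as small as possible.

```latex
Q(b_0, b_1) = \sum_{i=1}^{n} \bigl( Y_i - (b_0 + b_1 X_i) \bigr)^2
```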
Here is an illustration of the worst case, where we estimate all Y-values using the
overall average Y value regardless of the X-value. This would only be appropriate if
there is no relationship between X and Y.
And here is the best case: the line using the estimates of beta_0 and beta_1 which
minimize the sum of squared vertical differences.
We won’t work out the mathematics of PROVING which values minimize Q but it
can be done with relatively simple calculus.
Here is an image illustrating how much each observation is WEIGHTED in the sum
of squared vertical differences. Since the values are squared, the image is more
like that on the right than that of our original model on the left.
Outliers can have a large impact on the estimates because we have minimized the
sum of the squared differences.
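To see the effect of a single outlier, here is a minimal sketch in Python. The data are made up for illustration, and NumPy's `np.polyfit` is used here only as a convenient least-squares line fitter; it is not necessarily the software used in this course.

```python
import numpy as np

# Made-up data lying exactly on the line y = 1 + 2x (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x

# np.polyfit with degree 1 performs a least-squares line fit and
# returns the coefficients as [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Shift the last response far above the line to create one outlier
y_outlier = y.copy()
y_outlier[-1] += 10.0
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

print(f"without outlier: slope = {slope_clean:.2f}, intercept = {intercept_clean:.2f}")
print(f"with outlier:    slope = {slope_out:.2f}, intercept = {intercept_out:.2f}")
```

In this toy example, moving just one of the five points changes the fitted slope from 2 to 4 and the intercept from 1 to -3.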
These equations for beta_1-hat and beta_0-hat give the values which minimize the
sum of squared vertical differences for a given sample. They are our STATISTICS or
PARAMETER ESTIMATES: they are calculated from our data and will be used to
estimate the unknown population parameters Beta_1 and Beta_0.
We won’t calculate these by hand and won’t need to use these equations ourselves.
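For reference only, here is a minimal sketch of what the software does, assuming the slide shows the usual closed-form least-squares expressions (slope = sum of cross-products of deviations divided by sum of squared X deviations; intercept = mean of Y minus slope times mean of X). The data and variable names are made up for illustration.

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Sum of cross-products and sum of squares of the deviations from the means
s_xy = np.sum((x - x_bar) * (y - y_bar))
s_xx = np.sum((x - x_bar) ** 2)

# Closed-form OLS estimates of the slope and intercept
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check against NumPy's least-squares line fit ([slope, intercept])
slope_check, intercept_check = np.polyfit(x, y, 1)

print(f"beta1_hat = {beta1_hat:.3f}, beta0_hat = {beta0_hat:.3f}")
```

For this sample the estimates work out to a slope of 0.6 and an intercept of 2.2, matching the library fit.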
If you compare the equation for the slope to the calculation of the
correlation coefficient, you will see the similarities. In fact they are directly related:
for simple linear regression, the correlation coefficient is the unit-less version
of the slope, sometimes called the standardized coefficient.
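The relationship can be checked numerically. The sketch below uses made-up data and relies on the standard identity that, in simple linear regression, the correlation equals the slope multiplied by the ratio of the standard deviation of X to the standard deviation of Y.

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares slope (np.polyfit returns [slope, intercept])
slope = np.polyfit(x, y, 1)[0]

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Rescale the slope by the sample standard deviations to remove the units
slope_standardized = slope * np.std(x, ddof=1) / np.std(y, ddof=1)

print(f"r = {r:.4f}, standardized slope = {slope_standardized:.4f}")
```

The two printed values agree, which is why the correlation is sometimes described as the standardized slope.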
Once we have the parameter estimates, we can write the estimated model for our
data as
• Y_i-hat = beta_0-hat + beta_1-hat(X_i) (with our beta-hat values substituted)
Clearly we are using the “hat” notation to denote our estimates of Y_i, beta_0, and
beta_1.
We will see an example soon.
It is important to remember that Y_i-hat represents the ESTIMATED or PREDICTED
MEAN value of Y for individuals who have X = X_i. When we plug in an X-value,
we get the predicted mean response for that particular X-value, which we will
generally denote simply as Y-hat. Notice that Y-hat is our estimate of E(Y|X) from
our earlier development of the theoretical model.
The residuals can be calculated for each observation by subtracting the predicted
value FROM the observed value. So we have the residual as observed minus
predicted or Y_i minus Y_i-hat.
These are our estimates of the epsilon_i parameters in the original model. Again,
we will use them to check the assumptions of normality and constant variance that
we made earlier about our true error term, as well as the assumption of linearity made
about the mean response.
Although you should be able to calculate predicted values and residuals by hand for
one or two observations, we won’t be doing this process by hand for large samples.
Instead we will learn how to get the software to provide the values for us as needed.
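As an illustration of the calculation, the following sketch computes predicted values and residuals for a small made-up sample. NumPy is an assumption here, standing in for whatever software is used in the course.

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares estimates (np.polyfit returns [slope, intercept])
beta1_hat, beta0_hat = np.polyfit(x, y, 1)

# Predicted mean response at each observed X value
y_hat = beta0_hat + beta1_hat * x

# Residual = observed minus predicted
residuals = y - y_hat

for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    print(f"X = {xi:.0f}  Y = {yi:.1f}  Y-hat = {yh:.1f}  residual = {e:+.1f}")

print(f"sum of residuals = {residuals.sum():.6f}")
```

The residuals sum to zero (up to rounding error), which is a consequence of fitting a least-squares line that includes an intercept.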
In minimizing the sum of squared errors, regression is based upon understanding
and accounting for the variation in the response.
We define the total, model, and residual (or error) sum of squares.
The total SS gives the total variation of Y_i around the overall mean Y. (left side of
equation at the bottom of the slide)
The model SS gives the variation that can be accounted for by the model, the
difference between Predicted Y and the overall mean Y (last term in equation at
bottom).
The residual SS gives the variation that cannot be accounted for by the model, the
difference between Y_i and the Predicted Y.
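Based on the descriptions above, the three sums of squares can be written out as follows, with Y-bar denoting the overall mean of Y. This is a sketch in LaTeX; the layout on the slide may differ.

```latex
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{\text{total SS}}
= \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{\text{residual SS}}
+ \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{\text{model SS}}
```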
Here is a visual representation of the breakdown of the contribution of a particular
point to the sums of squares for the total, the model (EXPLAINED), and the error
(UNEXPLAINED).
This may help some of you to see and understand what is being measured with
these sums of squares.
In these illustrations, the picture on the left represents the total sum of squares via
the distances between the points and the overall mean.
The picture on the right represents the regression line and the distances between
the points and the regression line contribute to the residual sum of squares.
The model sum of squares is not represented here but would need the deviations
between the regression line and the overall mean.
The residual and model sums of squares add to the total sum of squares. These
values are presented in the standard ANOVA table at the beginning of the
regression. There are degrees of freedom associated with this process.
On the left side for the total sum of squares – we only need to estimate one quantity
from the data (the overall mean Y) so we only lose one degree of freedom and have
n-1.
For the residual sum of squares, we need to estimate both the slope and intercept
in order to estimate Y-hat and thus we have lost two degrees of freedom and have
n-2 for this term.
The model sum of squares only has one degree of freedom – by subtraction. Notice
that the model degrees of freedom can be considered to be = 1 because we only
have one variable in our model, X.
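The pieces of the ANOVA table can be reproduced in a few lines. This is a minimal sketch with made-up data; the variable names are chosen for this illustration, and the table printed by statistical software will be laid out differently.

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(y)

# Fit the least-squares line (np.polyfit returns [slope, intercept])
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
y_hat = beta0_hat + beta1_hat * x
y_bar = y.mean()

# Sums of squares
ss_total = np.sum((y - y_bar) ** 2)      # variation of Y around its mean
ss_model = np.sum((y_hat - y_bar) ** 2)  # variation accounted for by the model
ss_resid = np.sum((y - y_hat) ** 2)      # variation left unexplained

# Degrees of freedom for simple linear regression
df_total = n - 1
df_resid = n - 2
df_model = df_total - df_resid           # equals 1, the single predictor X

print(f"Model:    SS = {ss_model:.2f}, df = {df_model}")
print(f"Residual: SS = {ss_resid:.2f}, df = {df_resid}")
print(f"Total:    SS = {ss_total:.2f}, df = {df_total}")
```

For this sample the model and residual sums of squares are 3.6 and 2.4, which add to the total of 6.0, and the degrees of freedom 1 and 3 add to 4.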
This is just our introduction to these concepts. We will have the opportunity to
become more comfortable with these ideas throughout the semester. It is almost
impossible to learn everything at once!
To review – we have a linear model which precisely defines each observation in our
population (of which we will obtain a sample).
• That model can be written theoretically to define individual responses or average
(expected) responses.
• We have an intercept, a slope, and a random error component.
• By creating this model,
• We have assumed that linearity is reasonable – we can check with a scatterplot.
• We have explicitly assumed that the error term is normally distributed, with mean
zero (which is automatic), constant standard deviation and that the errors are
independent of each other.
The last will not be easy to check as it is more of a result of the sampling or design process.
We will assume all of our samples are reasonably random samples from the populations of
interest. In practice, carefully consider if there are any dependencies in your subjects that
should be considered (all from same household, genetically related, similarity in location or
time or ideas, time trends, etc.)
We will be estimating the parameters from data, from which we can provide predicted values
of the mean of the outcome variable. We can also calculate residuals for each observation,
which estimate our error term.