Regression Analysis


A scatter plot gives a compact illustration of
the distribution of two variables and of the
association (correlation) between them.
When one variable causes the other, or is
hypothesized to, we want to find the
relationship between the two variables.



Given two variables, say X (height) and
Y (weight), if X is the cause leading to
changes in Y (the effect), then our interest
is to determine a mathematical relationship
(model) between X and Y and use it to predict
Y for a given X.
Here X is called the independent variable and
Y is called the dependent variable.
SIMPLE LINEAR REGRESSION



Regression analysis is a statistical procedure
used to find relationships among a set of
variables.
In simple linear regression, we find a linear
relationship between one independent variable (X)
and one dependent variable (Y).
The model is called the simple linear regression
model.
Some assumptions


The values of the independent variable X are
fixed and measured without error, or with
negligible error. (This is not a rigid assumption:
the values need not be fixed in the sense of
being nonrandom. They could also be the values
realised by a random variable.)
For each value of X there is a subpopulation
of Y values.
For statistical inference to be valid, each
subpopulation must be normally distributed, and
all subpopulations must have equal variances.
The means of all the subpopulations fall on
the same straight line. This is the assumption
of linearity.
The Y values are statistically independent.




Based on these assumptions we formulate
the following statistical model, called the
simple linear regression model:
y = α + βx + e
Here y is a typical value from one of the
subpopulations of Y, α and β are the regression
coefficients, and e is called the error term.
The error term is normally distributed with
mean zero and the same variance for every x.
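One way to get a feel for this model is to simulate from it. The sketch below is only an illustration: the parameter values are invented, and `sigma` (the standard deviation of the error term) is a name introduced here, not one used in the text. Drawing many values of y at a single fixed x gives a sample from the "subpopulation" at that x, and its mean should land close to the line.

```python
import random

def simulate_y(alpha, beta, sigma, xs, seed=0):
    """Draw one y per x from y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = 10.0, 2.0, 1.5

# Many draws at one fixed x form a sample from the subpopulation of Y at that x.
x_fixed = 5.0
subpop = simulate_y(ALPHA, BETA, SIGMA, [x_fixed] * 10_000)
mean_y = sum(subpop) / len(subpop)

print(f"value on the line at x = {x_fixed}: {ALPHA + BETA * x_fixed:.2f}")
print(f"mean of the simulated subpopulation: {mean_y:.2f}")
```

Setting `sigma` to zero removes the error term, so every simulated point falls exactly on the line.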
Given sample data on X and Y, we estimate
the parameters (regression coefficients) of
the model and construct the sample regression
equation.
The sample regression equation is a linear
equation:
y = a + bx
where a and b are the estimates of α and β.
y = a + bx
◦ y is the dependent variable
◦ x is the independent variable
◦ a is a constant (called the intercept)
◦ b is the slope of the line
◦ For every increase of one unit in x, y changes by an
amount equal to b
◦ Some relationships are perfectly linear and fit this
equation exactly.


Using a sample of n pairs of (x, y) values,
the coefficients a and b are given by the
following formulas:
$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$
After getting b, a can be determined from
the following formula:
$$ a = \bar{y} - b\bar{x} $$
where $\bar{x}$ and $\bar{y}$ are the sample means of
x and y respectively.
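The two formulas translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `fit_line` and the height and weight values are made up for illustration and do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = sxy / sxx
    a = y_bar - b * x_bar  # intercept follows from the slope and the two means
    return a, b

# Hypothetical heights and weights, invented for illustration.
heights = [60, 62, 65, 68, 70, 72, 75]
weights = [115, 130, 150, 160, 185, 190, 210]
a, b = fit_line(heights, weights)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Because a is computed as the mean of y minus b times the mean of x, the fitted line always passes through the point of means.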
Weight, for instance, is to some degree a function
of height, but there are variations that height
does not explain.
On average, you might have an equation like:
Weight = -222 + 5.7*Height
If you take a sample of actual heights and
weights, you might see something like the graph
to the right.
[Figure: scatter plot of Weight (vertical axis, 100 to 220) against Height (horizontal axis, 60 to 75)]
[Figure: scatter plot of Weight (vertical axis, 100 to 220) against Height (horizontal axis, 60 to 75) with the fitted line drawn through the points]
The line in the graph shows the
average relationship described by
the equation. Often, none of the
actual observations lie on the
line. The difference between the
line and any individual
observation is the error.
The new equation is:
Weight = -222 + 5.7*Height + e
This equation does not mean that
people who are short enough will
have a negative weight. The
observations that contributed to
this analysis were all for heights
between 5’ and 6’4”. The model
will likely provide a reasonable
estimate for anyone in this height
range. You cannot, however,
extrapolate the results to heights
outside of those observed. The
regression results are only valid
for the range of actual
observations.
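The restriction to the observed range can be built into a prediction helper. This sketch uses the coefficients from the example equation and assumes that Height is measured in inches, so that 5' and 6'4" become 60 and 76; the function name and the decision to raise an error are illustrative choices, not part of the original example.

```python
INTERCEPT = -222.0  # from the example equation Weight = -222 + 5.7*Height
SLOPE = 5.7
MIN_HEIGHT = 60.0   # 5'0" expressed in inches (assumed unit)
MAX_HEIGHT = 76.0   # 6'4" expressed in inches (assumed unit)

def predict_weight(height):
    """Predict weight from height, refusing to extrapolate."""
    if not MIN_HEIGHT <= height <= MAX_HEIGHT:
        raise ValueError(
            f"height {height} is outside the observed range "
            f"{MIN_HEIGHT} to {MAX_HEIGHT}; the model is not valid there"
        )
    return INTERCEPT + SLOPE * height

print(f"{predict_weight(70):.1f}")

# A height of 30 would give a negative weight, so the helper rejects it.
try:
    predict_weight(30)
except ValueError as err:
    print("refused:", err)
```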
Regression finds the line that best fits the observations. It
does this by finding the line that results in the lowest sum
of squared errors. This principle is called the Principle of
Least Squares.
Because the least-squares line includes an intercept,
the sum of the actual errors (residuals) is always
zero. If you add up all of the values of the
dependent variable and you add up all the values predicted
by the model, the sum is the same.
That is, the sum of the negative errors (for points below
the line) will exactly offset the sum of the positive errors
(for points above the line). Summing just the errors
wouldn’t be useful because the sum is always zero. So,
instead, regression uses the sum of the squares of the
errors.
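Both claims can be checked numerically. The sketch below fits a line to a small invented data set, confirms that the residuals sum to zero (up to floating-point rounding), and shows that nudging the intercept or the slope away from the fitted values increases the sum of squared errors. The data and function names are illustrative only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared differences between observed y and a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration.
xs = [60, 62, 65, 68, 70, 72, 75]
ys = [115, 130, 150, 160, 185, 190, 210]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
best_sse = sum_squared_errors(xs, ys, a, b)

print(f"sum of residuals: {sum(residuals):.6f}")  # zero up to rounding
print(f"sum of squared errors at the fitted line: {best_sse:.2f}")

# Any other line does worse: nudge the intercept or the slope and compare.
for da, db in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    other_sse = sum_squared_errors(xs, ys, a + da, b + db)
    print(f"intercept {da:+.1f}, slope {db:+.1f}: SSE = {other_sse:.2f}")
```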
One of the measures of how well the model
explains the data is the R² value. Variation
in the observations that is not explained
by the model remains in the error term. The R²
value tells you what fraction of that
variation is explained by the model. An R²
of 0.68 means that 68% of the variance in the
observed values of the dependent variable is
explained by the model, and 32% of that
variance remains unexplained in the error
term.
Some precautions
Regression and correlation are two powerful
statistical tools when properly used. Incorrect
usage can lead to meaningless results.
Here are some precautionary notes.
◦ Go through the assumptions underlying
correlation and regression before collecting
data. It may be difficult to assess every
assumption and the gap between the
assumptions and the data. First judge the
adequacy of the approach and, when you are
confident, use it for analysing your data.

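The following sums of squares (SS) are
defined:
$$ SS_{tot} = \sum (y - \bar{y})^2 $$
$$ SS_{reg} = \sum (\hat{y} - \bar{y})^2 $$
$$ SS_{res} = \sum (y - \hat{y})^2 $$
where $\hat{y}$ is the predicted value of y, $\bar{y}$ is the sample mean of y, and
$$ SS_{tot} = SS_{reg} + SS_{res} $$
Then $R^2$ is given by
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = \frac{SS_{reg}}{SS_{tot}} $$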
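As a rough check of these definitions, the sketch below computes the three sums of squares for a least-squares line and evaluates R² both ways. The data are invented and the function name `sums_of_squares` is hypothetical; the two R² values should agree because the total sum of squares splits into the regression and residual parts.

```python
def sums_of_squares(xs, ys):
    """Fit y = a + b*x by least squares; return (ss_tot, ss_reg, ss_res)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    y_hat = [a + b * x for x in xs]  # predicted values
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return ss_tot, ss_reg, ss_res

# Hypothetical data, invented for illustration.
xs = [60, 62, 65, 68, 70, 72, 75]
ys = [115, 130, 150, 160, 185, 190, 210]

ss_tot, ss_reg, ss_res = sums_of_squares(xs, ys)
r2_from_res = 1 - ss_res / ss_tot
r2_from_reg = ss_reg / ss_tot

print(f"SS_tot = {ss_tot:.2f}, SS_reg + SS_res = {ss_reg + ss_res:.2f}")
print(f"R^2 = {r2_from_res:.4f} (from SS_res), {r2_from_reg:.4f} (from SS_reg)")
```

The decomposition holds only for a least-squares line with an intercept; for an arbitrary line the two expressions for R² can differ.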

A significant correlation between two
variables, say X and Y, may mean one of
several things:
◦ X causes Y
◦ Y causes X
◦ The sample is so large that even a negligible
correlation tests as significant, while in fact
X and Y are practically unrelated
◦ Some third factor directly or indirectly induces
correlation between X and Y
A Checklist…



Identify the model: decide whether only correlation
analysis is relevant or regression analysis has to be
carried out.
Review the assumptions, since the validity of the conclusions
depends on the choice of the model and the validity of
the model assumptions.
If regression analysis is carried out, validate the model
using proper diagnostic checks, such as the R² value and ANOVA
(for testing the significance of R² and for testing the
significance of the regression coefficients). If the model
is to be used for prediction, it is better to
validate the model with an independent sample. Do not
use the model to predict Y for values of X beyond the
observed range.
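Validation with an independent sample might look like the sketch below. Everything here is assumed for illustration: the two data sets are invented, root mean squared error (RMSE) is just one reasonable measure of prediction error, and the function names are hypothetical. Validation points whose X lies outside the range used for fitting are skipped, in line with the warning against extrapolation.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def rmse(xs, ys, a, b):
    """Root mean squared prediction error of the line y = a + b*x."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical data: one sample to fit the model, an independent one to check it.
fit_x = [60, 62, 65, 68, 70, 72, 75]
fit_y = [115, 130, 150, 160, 185, 190, 210]
check_x = [58, 61, 64, 69, 73, 78]
check_y = [110, 128, 140, 172, 195, 230]

a, b = fit_line(fit_x, fit_y)

# Keep only validation points whose x lies inside the range used for fitting.
lo, hi = min(fit_x), max(fit_x)
usable = [(x, y) for x, y in zip(check_x, check_y) if lo <= x <= hi]
use_x = [x for x, _ in usable]
use_y = [y for _, y in usable]

print(f"RMSE on the fitting sample:    {rmse(fit_x, fit_y, a, b):.2f}")
print(f"RMSE on the validation sample: {rmse(use_x, use_y, a, b):.2f}")
print(f"validation points skipped (outside {lo} to {hi}): {len(check_x) - len(usable)}")
```

A validation error much larger than the fitting error would suggest that the model does not generalise well beyond the sample it was built on.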

After validating the model, it can be used for
◦ Predicting the value of Y for a given value of X
◦ Estimating the mean of the subpopulation of Y for a
given value of X


