Chapter 7 Regression and Correlation

LINEAR REGRESSION &
CORRELATION
CHAPTER 7
BCT 2053
CONTENT




7.1 Simple Linear Regression Analysis and
Correlation
7.2 Relationship Test and Prediction
in Simple Linear Regression Analysis
7.3 Multiple Linear Regression Analysis and
Correlation
7.4 Model Selection
7.1: Simple Linear
Regression Analysis and
Correlation
OBJECTIVE:
1. Find a mathematical equation that relates a
dependent variable y to an independent variable x.
2. Plot a scatter diagram and graph the regression line.
3. Calculate the strength of the linear relationship
between x and y using the correlation coefficient.
INTRODUCTORY CONCEPTS

Suppose you wish to investigate the
relationship between a dependent variable (y)
and independent variable (x)



Independent variable (x) – the variable that is
controlled
Dependent variable (y) – the response variable
In other words, the value of y depends on the value
of x.
Example A
Suppose you wish to investigate the relationship between
the number of hours students spent studying for an
examination and the marks they achieved.
Student               A   B   C   D   E   F   G   H
Number of hours (x)   5   8   9  10  10  12  13  15
Final mark (y)       49  60  55  72  65  80  82  85
The number of hours a student spent studying for
an examination (x, the independent variable)
affects the mark they achieved
(y, the dependent variable).
Other Examples

The weight at the end of a spring (x) and the
length of the spring (y)

A student’s mark in Statistics test (x) and the
mark in a Programming test (y)

The diameter of the stem of a plant (x) and the
average length of leaf of the plant (y)
SCATTER DIAGRAM

When pairs of values are plotted (x vs y), a scatter diagram is
produced.

It shows what the data look like and how the two variables relate to each other.
[Figure: scatter plot of yield of hay (y) versus amount of water (x)]

Exercise: Plot a scatter diagram for Example A
LINEAR CORRELATION AND
SIMPLE LINEAR LINE

Linear correlation

If the points on the scatter diagram appear to lie near a
straight line (the simple regression line), we say that there is
a linear correlation (linear relationship) between x and y.

Sometimes a scatter plot shows a curvilinear (polynomial
curve) or nonlinear (log, exp, etc…) relationship between data.

Exercise: From the scatter diagram for Example A, is there
any correlation between x and y?
[Figure: three scatter plots showing positive linear correlation, negative linear correlation, and no correlation (no relationship between x and y)]
INFERENCES IN CORRELATION

The Pearson product moment correlation coefficient, r, is a
numerical value between -1 and 1 inclusive which indicates the
degree of linear scatter:

\[ -1 \le r \le 1 \]

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

where

\[ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \]
\[ S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} \]
\[ S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \]

r = 1 indicates perfect positive linear correlation
r = -1 indicates perfect negative linear correlation
r = 0 indicates no linear correlation
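
The definitions of S_xx, S_yy, S_xy and r can be checked numerically. The following is a minimal sketch using the Example A data; the variable names (`s_xx`, `s_yy`, `s_xy`) are chosen here for illustration and do not come from the slides.

```python
from math import sqrt

# Example A: hours studied (x) and final marks (y)
x = [5, 8, 9, 10, 10, 12, 13, 15]
y = [49, 60, 55, 72, 65, 80, 82, 85]
n = len(x)

# Corrected sums of squares and products
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# Pearson product moment correlation coefficient
r = s_xy / sqrt(s_xx * s_yy)

print(f"Sxx = {s_xx}, Syy = {s_yy}, Sxy = {s_xy}")
print(f"r = {r:.4f}")
```

A value of r this close to 1 would be read as a strong positive linear correlation between hours studied and final mark.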
INFERENCES IN CORRELATION

The nearer the value of r is to 1 or -1, the closer the
points on the scatter diagram are to the regression line



Nearer to 1 is strong positive linear correlation/relationship
Nearer to -1 is strong negative linear correlation/relationship
Exercise: Calculate the correlation coefficient r for
Example A and interpret the result.
TIPS: USE YOUR CALCULATOR
MODE REG LIN x,y M+
shift 1 OR shift 2
LINE OF BEST FIT

A mathematical way of fitting the regression line (the least
squares method).

The line of best fit must pass through the means of both sets
of data, i.e. the point \( (\bar{x}, \bar{y}) \).
[Figure: scatter plot of yield of hay (y) versus amount of water (x), with the least squares regression line of y on x passing through the point \( (\bar{x}, \bar{y}) \)]

Exercise: Find and draw the regression line for Example A
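
As a sketch of how the least squares line can be computed, the code below assumes the usual estimates for the line y = a + bx, namely b = S_xy / S_xx and a = ȳ − b·x̄. These formulas and the names `a` and `b` are assumptions, since the slide's own equation did not survive extraction. The data are those of Example A.

```python
# Example A: hours studied (x) and final marks (y)
x = [5, 8, 9, 10, 10, 12, 13, 15]
y = [49, 60, 55, 72, 65, 80, 82, 85]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx          # slope of the regression line of y on x
a = y_bar - b * x_bar    # intercept, so the line passes through (x_bar, y_bar)

print(f"y_hat = {a:.2f} + {b:.2f}x")
```

The rounded coefficients agree with the equation quoted alongside the Excel output for simple linear regression later in the chapter.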
7.2: Relationship Test and
Prediction in Simple Linear
Regression Analysis
OBJECTIVE:
1. Test the significance of regression slope.
2. Predict and estimate the new y value from
the regression equation.
RELATIONSHIP TESTS AND
PREDICTION IN SIMPLE LINEAR
REGRESSION ANALYSIS
1. HYPOTHESIS TESTING FOR THE
SLOPE OF REGRESSION LINE
2. ESTIMATION AND PREDICTION
1: HYPOTHESIS TESTING FOR THE
SLOPE OF REGRESSION LINE

To test the linear relationship between x and y:

x and y have a linear relationship if the slope \( \beta \ne 0 \).

Test the hypotheses \( H_0: \beta = 0 \) and \( H_1: \beta \ne 0 \)
with the test statistic

\[ t = \frac{\hat{\beta}}{\sqrt{\mathrm{Var}(\hat{\beta})}} \sim t_{n-2} \]

where \( \hat{\beta} \) is the estimated slope and

\[ \mathrm{Var}(\hat{\beta}) = \frac{\left( S_{yy} - \hat{\beta} S_{xy} \right) / (n-2)}{S_{xx}} \]

If H0 is rejected, x and y have a linear relationship.

Exercise: Test the linearity between x and y for Example A at \( \alpha = 0.05 \).
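
The slope test can be sketched in a few lines for the Example A data. The estimated slope is taken to be S_xy / S_xx, and the critical value 2.447 is the two-tailed t-table value for 6 degrees of freedom at α = 0.05. Both are assumptions brought in here rather than values stated on the slides.

```python
from math import sqrt

# Example A: hours studied (x) and final marks (y)
x = [5, 8, 9, 10, 10, 12, 13, 15]
y = [49, 60, 55, 72, 65, 80, 82, 85]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

beta_hat = s_xy / s_xx                                   # estimated slope
var_beta = ((s_yy - beta_hat * s_xy) / (n - 2)) / s_xx   # Var(beta_hat)
t_stat = beta_hat / sqrt(var_beta)

t_critical = 2.447  # t-table value, df = n - 2 = 6, two-tailed, alpha = 0.05
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Because |t| exceeds the critical value, H0 is rejected, which is read as evidence of a linear relationship between x and y.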
2. ESTIMATION AND PREDICTION

The regression line of y on x is used:

when x is the independent variable and you want to
  - estimate y for a given value of x
  - estimate x for a given value of y

when neither variable is controlled and you want to
estimate y for a given value of x.

The regression line of y on x is used to make predictions
only when there is a linear correlation between x and y.
Guideline for using regression
equation

If there is no linear correlation, don’t use the regression
equation to make predictions.

When using the regression equation for predictions, stay
within the scope of the available sample data.

A regression equation based on old data is not necessarily
valid now.

Don’t make predictions about a population that is different
from the population from which the sample data were drawn.
Exercise

Use Example A to find:

the estimate of y when x = 10 hours

the estimate of x when y = 75 marks
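
A minimal sketch of both estimates, assuming the least squares line of y on x for Example A with slope S_xy / S_xx and intercept ȳ − b·x̄. The helper names `estimate_y` and `estimate_x` are hypothetical. Since x is the controlled variable, x is estimated by inverting the same line.

```python
# Example A: hours studied (x) and final marks (y)
x = [5, 8, 9, 10, 10, 12, 13, 15]
y = [49, 60, 55, 72, 65, 80, 82, 85]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
b = s_xy / s_xx
a = sum(y) / n - b * sum(x) / n


def estimate_y(x_value):
    """Estimate y from the regression line of y on x."""
    return a + b * x_value


def estimate_x(y_value):
    """Estimate x by solving the same line for x."""
    return (y_value - a) / b


print(f"x = 10 hours -> y = {estimate_y(10):.2f} marks")
print(f"y = 75 marks -> x = {estimate_x(75):.2f} hours")
```

Both inputs lie inside the range of the sample data (5 to 15 hours, 49 to 85 marks), in line with the guideline about staying within the scope of the data.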
EXAMPLE B
A study is done to see whether there is a relationship between a
mother’s age and the number of children she has. The data are
shown here.
Mother’s age (x)      18  22  29  20  27  32  33  36
No. of children (y)    2   1   3   1   2   4   3   5

1. Plot a scatter diagram to illustrate the data.
2. Compute the value of the correlation coefficient, r, and comment on
the relationship between the value of r and the scatter plot.
3. Find the equation of the regression line of y on x. Then predict the
number of children of a mother whose age is 34.
4. Test the linearity between x and y when α = 0.05.
SOLVE SIMPLE LINEAR
REGRESSION BY EXCEL

Excel – key in data

Tools – Data Analysis – Regression – enter the data
range (y & x) – ok
Computer Output - Excel
[Figure: Excel regression output, annotated as follows]
Strong positive linear correlation
\( \hat{y} = 26.89 + 4.06x \)
x and y have a linear relationship (P-value < 0.05)
7.3 Multiple Linear
Regression Analysis
and Correlation
OBJECTIVE:
1. To describe linear relationships involving more than
two variables.
2. To interpret the computer output for multiple linear
regression analysis and make predictions.
MULTIPLE LINEAR REGRESSION
EQUATION

A multiple regression equation is used to describe linear
relationships involving more than two variables.

A multiple linear regression equation expresses a linear
relationship between a response variable y and two or more
predictor variables (x1, x2, …, xk). The general form of a multiple
regression equation is

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]

A multiple linear regression equation identifies the plane that
gives the best fit to the data.
Notation: Multiple regression equation

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]

where
\( n \) : sample size
\( k \) : number of independent variables
\( \hat{y} \) : predicted value of y
\( x_1, x_2, \ldots, x_k \) : independent variables
\( b_0 \) : estimated value of the y-intercept
\( b_1, b_2, \ldots, b_k \) : estimated coefficients of the independent variables
Examples of real situations

1) A manufacturer of jams wants to know where to direct its marketing efforts when introducing a new flavour. Regression analysis can be used to help determine the profile of heavy users of jams. For instance, a company might predict the number of flavours of jam a household might have at any one time on the basis of a number of independent variables such as number of children living at home, age of children, gender of children, income and time spent on shopping.

2) Many companies use regression to study market segments to determine which variables seem to have an impact on market share, purchase frequency, product ownership, and product and brand loyalty, as well as many other areas.

3) Personnel directors explore the relationships of employee salary levels to geographic location, unemployment rates, industry growth, union membership, industry type, or competitive salaries.

4) Financial analysts look for causes of high stock prices by analysing dividend yields, earnings per share, stock splits, consumer expectations of interest rates, savings levels and inflation rates.

5) Medical researchers use regression analysis to seek links between blood pressure and independent variables such as age, social class, weight, smoking habits and race.

6) Doctors explore the impact of communications, number of contacts, and age of patient on patient satisfaction with service.
Computing the Multiple Linear
Regression Equation

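By using the least squares method, the multiple linear
regression equation is given by

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]

where the estimated regression coefficients are

\[ b = \left( X^T X \right)^{-1} X^T Y \]

with

\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,k}
\end{bmatrix}, \quad
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}
\]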
EXAMPLE C

Assume that a sales manager of Tackey Toys
needs to predict sales of Tackey products in
selected market areas. He believes that
advertising expenditures and the population in
each market area can be used to predict sales.
He gathered a sample of toy sales, advertising
expenditures and population, as shown below.
Find the multiple linear regression equation
that best fits the data.
Example C, cont…

Market   Advertising expenditures       Population        Toy sales
area     (thousands of dollars), x1     (thousands), x2   (thousands of dollars), y
A         1.0                           200               100
B         5.0                           700               300
C         8.0                           800               400
D         6.0                           400               200
E         3.0                           100               100
F        10.0                           600               400
Solution

Since we have 2 independent variables, the
multiple regression equation is given by:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 \]

where
\( n = 6 \) and \( k = 2 \)
\( \hat{y} \) = predicted value of y
\( x_1 \) = the value of the first independent variable
\( x_2 \) = the value of the second independent variable
\( b_0 \) = estimated value of the y-intercept
\( b_1 \) = slope associated with \( x_1 \)
(the change in \( \hat{y} \) if \( x_2 \) is held constant and \( x_1 \) varies by 1 unit)
\( b_2 \) = slope associated with \( x_2 \)
(the change in \( \hat{y} \) if \( x_1 \) is held constant and \( x_2 \) varies by 1 unit)
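
To make the matrix formula b = (XᵀX)⁻¹XᵀY concrete, here is a minimal sketch that builds the normal equations (XᵀX)b = XᵀY for the Example C data and solves them with a small Gauss-Jordan routine. The helper `solve` is written for this illustration only. In practice a spreadsheet or a linear algebra library would normally be used instead.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    size = len(matrix)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented matrix
    for col in range(size):
        # Move the row with the largest entry in this column into the pivot position
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


# Example C: advertising (x1), population (x2), toy sales (y), all in thousands
x1 = [1.0, 5.0, 8.0, 6.0, 3.0, 10.0]
x2 = [200.0, 700.0, 800.0, 400.0, 100.0, 600.0]
y = [100.0, 300.0, 400.0, 200.0, 100.0, 400.0]

# Design matrix X with a leading column of ones for the intercept
X = [[1.0, u, v] for u, v in zip(x1, x2)]
cols = len(X[0])

# Normal equations: (X^T X) b = X^T Y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)] for i in range(cols)]
xty = [sum(row[i] * value for row, value in zip(X, y)) for i in range(cols)]

b0, b1, b2 = solve(xtx, xty)
print(f"y_hat = {b0:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

Solving the normal equations avoids forming the inverse explicitly, which gives the same coefficients with less arithmetic.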
SOLVE MULTIPLE LINEAR
REGRESSION BY EXCEL

Excel – key in data

Tools – Data Analysis – Regression – enter the data
range (y & x) – ok
Computer Output – Microsoft Excel
\( \hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2 \)
Interpreting the Values in the Equation

\( \hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2 \)

b0 = 6.3972
The value of the estimated y when x1 and x2 are both zero is given by b0.
When the population and the advertising expenditures are both zero,
the estimated toy sales are 6.3972 thousand dollars.

b1 = 20.4921
When x2 is held constant, the estimated y increases by b1 for each unit increase in x1.
When the population is held constant, the estimated toy sales
increase by 20.4921 thousand dollars for each additional 1000 dollars of
advertising expenditure.

b2 = 0.2805
When x1 is held constant, the estimated y increases by b2 for each unit increase in x2.
When the advertising expenditure is held constant, the
estimated toy sales increase by 0.2805 thousand dollars for each additional 1000
people in the population.
Making preliminary predictions with the
multiple regression equation

Assume that the sales manager needs a sales
forecast for a market area in which Tackey Toys has
recently spent $4,000 on advertising and
which has a population of 500,000 people. The
point estimate of toy sales is given by:

\[ \hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2 \]
\[ \hat{y} = 6.3972 + 20.4921(4) + 0.2805(500) \approx 228.616 \]

that is, about $228,616.
The Coefficient of Multiple Determination

Measures the percentage of variation in the y variable associated with the
use of the set of x variables.
A percentage that shows the variation in the y variable that is explained by its
relation to the combination of x1 and x2.

\[ R^2 = \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1 \]

where

\[ SSR = b^T X^T Y - \frac{1}{n} Y^T \mathbf{1}\mathbf{1}^T Y \]

and

\[ SST = Y^T Y - \frac{1}{n} Y^T \mathbf{1}\mathbf{1}^T Y \]

\( R^2 = 0 \) when all \( b_t = 0 \), \( t = 1, \ldots, k \).

\( R^2 = 1 \) when all observations fall directly
on the fitted response surface, i.e. when
\( Y_t = \hat{Y}_t \) for all t
(the regression equation is good).
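
As a rough numerical check of R² = SSR / SST, the sketch below evaluates the two sums of squares for the Example C data. It uses the facts that bᵀXᵀY = b0·ΣY + b1·Σx1Y + b2·Σx2Y and that (1/n)·Yᵀ11ᵀY = (ΣY)²/n. The coefficients are the rounded values from the Excel output, so the result is only accurate to about three decimal places.

```python
# Example C: advertising (x1), population (x2), toy sales (y), all in thousands
x1 = [1.0, 5.0, 8.0, 6.0, 3.0, 10.0]
x2 = [200.0, 700.0, 800.0, 400.0, 100.0, 600.0]
y = [100.0, 300.0, 400.0, 200.0, 100.0, 400.0]
n = len(y)

# Rounded coefficients taken from the Excel output
b = [6.3972, 20.4921, 0.2805]

# X^T Y has one entry per column of X: (sum Y, sum x1*Y, sum x2*Y)
xty = [
    sum(y),
    sum(u * v for u, v in zip(x1, y)),
    sum(u * v for u, v in zip(x2, y)),
]

correction = sum(y) ** 2 / n                             # (1/n) Y^T 1 1^T Y
ssr = sum(bi * ti for bi, ti in zip(b, xty)) - correction
sst = sum(v * v for v in y) - correction                 # Y^T Y minus correction

r_squared = ssr / sst
print(f"SSR = {ssr:.1f}, SST = {sst:.1f}, R^2 = {r_squared:.3f}")
```

The result rounds to 0.974, matching the 97.4% figure quoted with the Excel output.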
Computer Output – Microsoft Excel
97.4% of the variation in Tackey Toys sales
is explained by advertising
expenditures and population size
\( \hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2 \)
Multiple R and Adjusted R²
The coefficient of multiple correlation R is the positive square root of R².

\[ R = \sqrt{R^2} \]

The value of R can range from 0 to +1. The closer to +1, the stronger the
relationship. The closer to 0, the weaker the relationship.

The adjusted coefficient of determination is the multiple coefficient of
determination R² modified to account for the number of variables and
the sample size.

\[ \text{adjusted } R^2 = 1 - \frac{n-1}{n-(k+1)} \left( 1 - R^2 \right) \]

When we compare a multiple regression equation to others, it is better
to use the adjusted R².
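
A short sketch of both quantities, using the R² of 0.974 reported for the Tackey Toys example with n = 6 observations and k = 2 predictors. The function name `adjusted_r_squared` is chosen here for illustration.

```python
from math import sqrt


def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (n - 1) / (n - (k + 1)) * (1 - r_squared)


r_squared = 0.974          # reported for the Tackey Toys example
multiple_r = sqrt(r_squared)
adj = adjusted_r_squared(r_squared, n=6, k=2)

print(f"Multiple R = {multiple_r:.3f}")
print(f"Adjusted R^2 = {adj:.3f}")
```

With so few observations the adjustment is noticeable: the adjusted value is lower than R² because two predictors are fitted to only six data points.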
Computer Output – Microsoft Excel
A strong relationship exists between sales and both advertising
expenditures and population size (Multiple R is close to 1)
Standard error and ANOVA test

Standard error




Measures the extent of the scatter, or dispersion, of the sample data
points about the multiple regression plane.
Used to compare two or more regression equations: the equation with
the smallest standard error has the least dispersion about its
regression plane.
ANOVA test




H0: Neither of the independent variables is related to the dependent
variable (b1 = b2 = 0)
H1: At least one of the independent variables is related to the
dependent variable (b1 ≠ 0 or b2 ≠ 0 or both)
Reject H0 if significance F < α
significance F = P value
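
The decision rule can be illustrated for the Tackey Toys data. This sketch assumes the standard ANOVA statistic F = (SSR / k) / (SSE / (n − k − 1)), which is not written out on the slides. It also relies on the closed form P(F > f) = (1 + 2f / m)^(−m/2), which holds only when the numerator has 2 degrees of freedom, as it does with two predictors. The coefficients are the rounded Excel values, so SSR is obtained approximately as SST − SSE.

```python
# Example C: advertising (x1), population (x2), toy sales (y), all in thousands
x1 = [1.0, 5.0, 8.0, 6.0, 3.0, 10.0]
x2 = [200.0, 700.0, 800.0, 400.0, 100.0, 600.0]
y = [100.0, 300.0, 400.0, 200.0, 100.0, 400.0]
b0, b1, b2 = 6.3972, 20.4921, 0.2805   # rounded coefficients from the Excel output

n, k = len(y), 2
y_bar = sum(y) / n
fitted = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]

sst = sum((v - y_bar) ** 2 for v in y)               # total variation
sse = sum((v - f) ** 2 for v, f in zip(y, fitted))   # unexplained variation
ssr = sst - sse                                      # explained variation (approximate)

df_resid = n - k - 1
f_stat = (ssr / k) / (sse / df_resid)

# Upper-tail probability of F(2, m); valid only for 2 numerator degrees of freedom
significance_f = (1 + 2 * f_stat / df_resid) ** (-df_resid / 2)

alpha = 0.05
reject_h0 = significance_f < alpha
print(f"F = {f_stat:.2f}, significance F = {significance_f:.4f}, reject H0: {reject_h0}")
```

Since significance F is well below 0.05, H0 is rejected: at least one of the two predictors is related to toy sales.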
Computer Output – Microsoft Excel
Significance F < 0.05, so reject H0
7.4: Model Selection
OBJECTIVE:
1. Select the best multiple linear regression
model for any given data set.
TIPS: Model selection in a simple way

Use common sense and practical considerations to include or
exclude variables.

Consider the P-value (the measure of the overall significance
of the multiple regression equation, i.e. the significance F value)
displayed in the computer output. The smaller the better.

Consider equations with high values of adjusted R² and try to
include only a few variables.

Find the linear correlation coefficient r for each pair of
variables being considered. If 2 predictor variables have a very
high r, there is no need to include them both. Exclude the
variable with the lower value of r with the response variable.
EXAMPLE D

The following table summarizes the multiple regression analysis in which the
response variable (y) is weight (in pounds) and the predictor (x) variables
are H (height in inches), W (waist circumference in cm), and C
(cholesterol in mg).

x          P-value   R²      Adjusted R²   Regression equation
H, W, C    0.0000    0.880   0.870         -199 + 2.55H + 2.18W - 0.005C
H, W       0.0000    0.877   0.870         -206 + 2.66H + 2.15W
H, C       0.0002    0.277   0.238         -148 + 4.65H + 0.006C
W, C       0.0000    0.804   0.793         -42.8 + 2.41W + 0.01C
H          0.0001    0.273   0.254         -139 + 4.55H
W          0.0000    0.790   0.785         -44.1 + 2.37W
C          0.874     0.001   0.000         173 - 0.003C
Example D, cont…
1. If only one predictor variable is used to predict
weight, which single variable is best? Why?
2. If exactly two predictor variables are used to predict
weight, which two variables should be chosen?
Why?
3. Which regression equation is best for predicting
weight? Why?
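
One way to apply the model selection tips to the Example D table is sketched below. The rule used here (highest adjusted R², with ties broken in favour of fewer predictors) is an assumption that follows the tips only loosely, and the helper `best_model` is hypothetical. It is a starting point for discussion, not a substitute for common sense and practical considerations.

```python
# Rows of the Example D table: predictors, P-value, R^2, adjusted R^2
models = [
    {"x": ("H", "W", "C"), "p": 0.0000, "r2": 0.880, "adj_r2": 0.870},
    {"x": ("H", "W"),      "p": 0.0000, "r2": 0.877, "adj_r2": 0.870},
    {"x": ("H", "C"),      "p": 0.0002, "r2": 0.277, "adj_r2": 0.238},
    {"x": ("W", "C"),      "p": 0.0000, "r2": 0.804, "adj_r2": 0.793},
    {"x": ("H",),          "p": 0.0001, "r2": 0.273, "adj_r2": 0.254},
    {"x": ("W",),          "p": 0.0000, "r2": 0.790, "adj_r2": 0.785},
    {"x": ("C",),          "p": 0.874,  "r2": 0.001, "adj_r2": 0.000},
]


def best_model(candidates):
    """Pick the highest adjusted R^2; on a tie, prefer fewer predictors."""
    return max(candidates, key=lambda m: (m["adj_r2"], -len(m["x"])))


best_single = best_model([m for m in models if len(m["x"]) == 1])
best_pair = best_model([m for m in models if len(m["x"]) == 2])
best_overall = best_model(models)

print("Best single predictor:", best_single["x"])
print("Best pair of predictors:", best_pair["x"])
print("Best overall:", best_overall["x"])
```

Under this rule the three-variable model loses the tie to the two-variable one, because adding C does not raise the adjusted R².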
CONCLUSION

This chapter introduces important methods
(regression) for making inferences about a
relationship between two or more variables and
describing such a relationship with an equation that
can be used for predicting the value of one variable given
the values of the other variables.
Thank You