download

:
Pertemua 19
Regresi Linier
1
Outline Materi :
 Koefisien korelasi dan determinasi
 Persamaan regresi
 Regresi dan peramalan
2
Simple Correllation and Linear
Regression
• Types of Regression Models
• Determining the Simple Linear Regression
Equation
• Measures of Variation
• Assumptions of Regression and
Correlation
• Residual Analysis
• Measuring Autocorrelation
• Inferences about the Slope
3
Simple Correlation and…
• Correlation - Measuring the Strength (continued)
of the
Association
• Estimation of Mean Values and Prediction
of Individual Values
• Pitfalls in Regression and Ethical Issues
4
Purpose of Regression
Analysis
• Regression Analysis is Used Primarily to
Model Causality and Provide Prediction
– Predict the values of a dependent (response)
variable based on values of at least one
independent (explanatory) variable
– Explain the effect of the independent variables
on the dependent variable
5
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
6
Simple Linear Regression
Model
• Relationship between Variables is
Described by a Linear Function
• The Change of One Variable Causes the
Other Variable to Change
• A Dependency of One Variable on the
Other
7
Simple Linear Regression
Model
(continued)
Population regression line is a straight line that
describes the dependence of the average value
(conditional mean) of one variable on the other
Population
Slope
Coefficient
Population
Y Intercept
Dependent
(Response)
Variable
Random
Error
Yi      X i   i
Population
Regression
Y |X
Line
(Conditional Mean)

Independent
(Explanatory)
Variable
8
Simple Linear Regression
Model
(continued)
Y
(Observed Value of Y) = Yi
     X i   i
 i = Random Error

Y | X      X i

(Conditional Mean)
X
Observed Value of Y
9
Linear Regression Equation
Sample regression line provides an estimate of
the population regression line as well as a
predicted value of Y
Sample
Y Intercept
Yi  b0  b1 X i  ei
Ŷ  b0  b1 X
Sample
Slope
Coefficient
Residual
Simple Regression Equation
(Fitted Regression Line, Predicted Value)
10
Linear Regression Equation
•
(continued)
b0 and b1 are obtained by finding the values
of b0 and b1 that minimize the sum of the
squared residuals

n
i 1
•
•
b0
Yi  Yˆi
  e
2
n
i 1
2
i
provides an estimate of
b1 provides an estimate of


11
Linear Regression Equation
(continued)
Yi  b0  b1 X i  ei
Y
ei
Yi      X i   i
b1
i

Y | X      X i

b0
Observed Value
Yˆi  b0  b1 X i
X
12
Interpretation of the Slope
and Intercept
•   E Y | X  0 is the average value of Y
when the value of X is zero
change in E Y | X 
1 
change in X
•
measures the
change in the average value of Y as a
result of a one-unit change in X
13
Interpretation of the Slope
and Intercept
(continued)
• b  Eˆ Y | X  0  is the estimated average
value of Y when the value of X is zero
change in Eˆ Y | X 
• b1 
is the estimated
change in X
change in the average value of Y as a result
of a one-unit change in X
14
Simple Linear Regression:
Example
You wish to examine
the linear dependency
of the annual sales of
produce stores on their
sizes in square footage.
Sample data for 7
stores were obtained.
Find the equation of
the straight line that
fits the data best.
Store
Square
Feet
Annual
Sales
($1000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
15
Scatter Diagram: Example
Annua l Sa le s ($000)
12000
10000
8000
6000
4000
2000
0
0
1000
2000
3000
4000
5000
6000
S q u a re F e e t
Excel Output
16
Simple Linear Regression
Equation: Example
Yˆi  b0  b1 X i
 1636.415  1.487 X i
From Excel Printout:
C o e ffi c i e n ts
I n te r c e p t
1 6 3 6 .4 1 4 7 2 6
X V a ria b le 1 1 .4 8 6 6 3 3 6 5 7
17
Annua l Sa le s ($000)
Graph of the Simple Linear
Regression Equation: Example
12000
10000
8000
6000
4000
2000
0
0
1000
2000
3000
4000
5000
6000
S q u a re F e e t
18
Interpretation of Results:
Example
Yˆi  1636.415  1.487 X i
The slope of 1.487 means that for each increase of
one unit in X, we predict the average of Y to
increase by an estimated 1.487 units.
The equation estimates that for each increase of 1
square foot in the size of the store, the expected
annual sales are predicted to increase by $1487.
19
Simple Linear Regression
in PHStat
• In Excel, use PHStat | Regression | Simple
Linear Regression …
• Excel Spreadsheet of Regression Sales
on Footage
20
Measures of Variation:
The Sum of Squares
(continued)
• SST = Total Sum of Squares
– Measures the variation of the Yi values
around their mean,
Y
• SSR = Regression Sum of Squares
– Explained variation attributable to the
relationship between X and Y
• SSE = Error Sum of Squares
– Variation attributable to factors other than the
relationship between X and Y
21
The Coefficient of Determination
•
SSR Regression Sum of Squares
r 

SST
Total Sum of Squares
2
• Measures the proportion of variation in Y
that is explained by the independent
variable X in the regression model
22
Venn Diagrams and
Explanatory Power of
Regression
r 
2
Sales
Sizes
SSR

SSR  SSE
23
Coefficients of Determination (r 2)
and Correlation (r)
Y r2 = 1, r = +1
^
Y =b
i
0
Y r2 = 1, r = -1
^=b +b X
Y
i
0
1 i
+ b1 X i
X
Y r2 = .81,r = +0.9
X
Y
^=b +b X
Y
i
0
1 i
X
r2 = 0, r = 0
^=b +b X
Y
i
0
1 i
X
24
Standard Error of Estimate

n
•
SYX
SSE


n2
i 1
Y  Yˆi

2
n2
• Measures the standard deviation
(variation) of the Y values around the
regression equation
25
Measures of Variation:
Produce Store Example
Excel Output for Produce Stores
R e g r e ssi o n S ta ti sti c s
M u lt ip le R
R S q u a re
0 .9 4 1 9 8 1 2 9
A d ju s t e d R S q u a re
0 .9 3 0 3 7 7 5 4
S t a n d a rd E rro r
6 1 1 .7 5 1 5 1 7
O b s e r va t i o n s
r2 = .94
0 .9 7 0 5 5 7 2
n
7
94% of the variation in annual sales can be
explained by the variability in the size of the
store as measured by square footage.
Syx
26
Linear Regression
Assumptions
• Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
• Homoscedasticity (Constant Variance)
• Independence of Errors
27
Consequences of Violation
of the Assumptions
• Violation of the Assumptions
– Non-normality (error not normally distributed)
– Heteroscedasticity (variance not constant)
• Usually happens in cross-sectional data
– Autocorrelation (errors are not independent)
• Usually happens in time-series data
• Consequences of Any Violation of the Assumptions
– Predictions and estimations obtained from the sample
regression line will not be accurate
– Hypothesis testing results will not be reliable
• It is Important to Verify the Assumptions
28
Variation of Errors Around
the Regression Line
f(e)
• Y values are normally distributed
around the regression line.
• For each X value, the “spread” or
variance around the regression line is
the same.
Y
X2
X1
X
Sample Regression Line
29
Purpose of Correlation
Analysis
(continued)
• Sample Correlation Coefficient r is an
Estimate of  and is Used to Measure the
Strength of the Linear Relationship in the
Sample Observations
n
r
 X
i 1
n
 X
i 1
i
i
 X Yi  Y 
X
2
n
 Y  Y 
2
i
i 1
30
Features of  and r
• Unit Free
• Range between -1 and 1
• The Closer to -1, the Stronger the
Negative Linear Relationship
• The Closer to 1, the Stronger the Positive
Linear Relationship
• The Closer to 0, the Weaker the Linear
Relationship
31
Pitfalls of Regression Analysis
• Lacking an Awareness of the Assumptions
Underlining Least-Squares Regression
• Not Knowing How to Evaluate the
Assumptions
• Not Knowing What the Alternatives to
Least-Squares Regression are if a
Particular Assumption is Violated
• Using a Regression Model Without
Knowledge of the Subject Matter
32
Strategy for Avoiding the
Pitfalls of Regression
• Start with a scatter plot of X on Y to
observe possible relationship
• Perform residual analysis to check the
assumptions
• Use a histogram, stem-and-leaf display,
box-and-whisker plot, or normal
probability plot of the residuals to uncover
possible non-normality
33
Strategy for Avoiding the
Pitfalls of Regression
(continued)
• If there is violation of any assumption, use
alternative methods (e.g., least absolute
deviation regression or least median of squares
regression) to least-squares regression or
alternative least-squares models (e.g.,
curvilinear or multiple regression)
• If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients and construct confidence intervals
and prediction intervals
34
Chapter Summary
• Introduced Types of Regression Models
• Discussed Determining the Simple Linear
Regression Equation
• Described Measures of Variation
• Addressed Assumptions of Regression
and Correlation
• Discussed Residual Analysis
• Addressed Measuring Autocorrelation
35
Chapter Summary
(continued)
• Described Inference about the Slope
• Discussed Correlation - Measuring the
Strength of the Association
• Addressed Estimation of Mean Values and
Prediction of Individual Values
• Discussed Pitfalls in Regression and
Ethical Issues
36