Linear Regression

Antonio Stasi
1. Regression: a general set-up
2. Linear regression
3. Linear regression by OLS
4. What about more variables?
5. Beyond the lines
6. Measures of fit
7. Variables selection
8. How to run a linear regression
Regression: a general set-up
• You have a set of data on two
variables, X and Y, represented in a
scatter plot
• You wish to find a simple,
convenient mathematical function
that comes close to most of the
points, thereby describing succinctly
the relationship between X and Y
Linear Regression
• The straight line is a particularly
simple function.
• When we fit a straight line to data,
we are performing linear
regression analysis.
Linear Regression
• The Goal:
• Find the “best fitting” straight line for a set
of data. Since every straight line fits the
equation:
Y = bX + c
• with slope b and Y-intercept c, it follows
that our task is to find a b and c that
produce the best fit.
• There are many ways of operationalizing
the notion of a “best fitting straight line.” A
very popular choice is the “Least Squares
Criterion.”
Linear Regression
• For every data point (X, Y), define the
“predicted score” to be the value
of the straight line evaluated at X.
The “regression residual,” or “error
of prediction,” is the distance from
the straight line to the data point in
the up-down direction.
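A minimal sketch of predicted scores and residuals, assuming a small made-up data set and an arbitrary candidate line (the numbers and the name `predict` are illustrative, not from the slides). The Least Squares Criterion scores a line by the sum of its squared residuals: the smaller the sum, the better the fit.

```python
# Hypothetical data and an arbitrary candidate line (not yet the best one).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 6.0, 9.0]
b, c = 2.0, 1.0  # slope and Y-intercept of the candidate line Y = bX + c


def predict(x, b, c):
    """Predicted score: the value of the straight line evaluated at x."""
    return b * x + c


# One predicted score per data point.
predicted = [predict(x, b, c) for x in xs]

# Residual: vertical (up-down) distance from the line to the data point.
residuals = [y - p for y, p in zip(ys, predicted)]

# Least Squares Criterion: the quantity that the best-fitting line minimizes.
sum_of_squares = sum(r * r for r in residuals)

print(predicted, residuals, sum_of_squares)
```

Trying other values of b and c and comparing `sum_of_squares` shows which candidate line fits better.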
Linear Regression by OLS
[Figure: scatter plot of Y against X]
Linear Regression by OLS
[Figure: scatter plot of Y against X with the predicted-Y line; for one point, the observation, its prediction, and the error or “residual” between them are marked]
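A minimal sketch of ordinary least squares (OLS) for one X variable, using the standard closed-form solution: the slope is the covariation of X and Y divided by the variation of X, and the line passes through the point of means. The data and the function name `ols_fit` are assumptions made for illustration.

```python
def ols_fit(xs, ys):
    """Return (b, c) of the least-squares line Y = bX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariation of X and Y, and variation of X, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    c = mean_y - b * mean_x  # intercept: the line passes through the means
    return b, c


# Hypothetical data, roughly in the range of the scatter plot axes.
xs = [0.0, 5.0, 10.0, 15.0, 20.0]
ys = [2.0, 11.0, 22.0, 29.0, 41.0]

b, c = ols_fit(xs, ys)
residuals = [y - (b * x + c) for x, y in zip(xs, ys)]
print(b, c)
```

With an intercept in the model, the OLS residuals always sum to zero (up to rounding), which is a quick check that a fit was computed correctly.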
What about more variables?
[Figure: plot of Y against X and Temperature]
What about more variables?
Same story, same procedure
[Figure: plot of Y against X and Temperature]
Y = c + b1X + b2 Temperature
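A minimal sketch of the same procedure with two explanatory variables, assuming NumPy is available. The data are made up and generated without noise from known coefficients, so the fit should recover them; `np.linalg.lstsq` solves the least-squares problem for all parameters at once.

```python
import numpy as np

# Hypothetical, noise-free data generated from Y = 1 + 2*X + 0.5*Temperature.
x = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 5.0])
temperature = np.array([20.0, 22.0, 24.0, 26.0, 21.0, 25.0])
y = 1.0 + 2.0 * x + 0.5 * temperature

# Design matrix: a column of ones for the intercept c, then one column per variable.
design = np.column_stack([np.ones_like(x), x, temperature])

# Least-squares solution for (c, b1, b2).
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
c, b1, b2 = coef
print(c, b1, b2)
```

Adding a further variable only means adding a further column to the design matrix.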
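Beyond the lines
• Everything is the same with
Y = c + b1 X + b2 X^2
• The fitted function is a curve in X, but the
model is still linear in the parameters c, b1 and b2
[Figure: plot of Y against X]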
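A minimal sketch of fitting a quadratic in X by the same least-squares procedure, assuming NumPy is available. The squared term is simply treated as one more variable; the data are made up and noise-free so that the known coefficients are recovered.

```python
import numpy as np

# Hypothetical, noise-free data generated from Y = 3 + 0.5*X + 0.1*X^2.
x = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
y = 3.0 + 0.5 * x + 0.1 * x**2

# X^2 enters the design matrix as an extra column, exactly like a new variable.
design = np.column_stack([np.ones_like(x), x, x**2])

coef, *_ = np.linalg.lstsq(design, y, rcond=None)
c, b1, b2 = coef
print(c, b1, b2)
```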
Beyond the lines
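• Use of LOGARITHM
– Logarithm on both sides of the equality sign: b is an
elasticity (the % change in Y for a 1% change in X)
– Logarithm only on the left-hand side: b is the
proportional change in Y for a one-unit change in X
(about 100·b percent)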
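A minimal sketch of both uses of the logarithm, on made-up noise-free data. In the first case Y = 3·X^1.5, so regressing log Y on log X should return a slope of 1.5 (the elasticity). In the second case Y = 10·exp(0.05·X), so regressing log Y on X should return 0.05 (about 5% more Y per unit of X). The helper `ols_fit` is an illustrative name, not from the slides.

```python
import math


def ols_fit(xs, ys):
    """Return (b, c) of the least-squares line Y = bX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


x = [1.0, 2.0, 4.0, 8.0, 16.0]

# Logarithm on both sides: log Y = c + b log X, where b is an elasticity.
y_power = [3.0 * v**1.5 for v in x]
b_elasticity, _ = ols_fit([math.log(v) for v in x], [math.log(v) for v in y_power])

# Logarithm on the left-hand side only: log Y = c + b X,
# where b is a proportional change in Y per unit of X.
y_growth = [10.0 * math.exp(0.05 * v) for v in x]
b_growth, _ = ols_fit(x, [math.log(v) for v in y_growth])

print(b_elasticity, b_growth)
```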
Measures of fit
• R-square
– Measures the share of the variability of Y
explained by the linear regression
• Standard error
– Measures the dispersion (variability) of
each parameter estimate
• T-test and P-value
– Tests whether the parameter is significantly
different from zero, i.e. whether its variable
helps predict Y. The P-value is the probability of
seeing an estimate at least this extreme if the true
parameter were zero; a small P-value means little
risk of a mistake when I say that the parameter is
significant
• F-test and P-value
– Tests whether the model as a whole is significant,
i.e. whether the variables jointly help predict Y
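A minimal sketch of how these measures are computed for a regression with one X variable, on made-up data. The function name `fit_summary` and the dictionary keys are illustrative. P-values are left out because they require the t and F distributions, which the Python standard library does not provide.

```python
import math


def fit_summary(xs, ys):
    """Fit Y = bX + c by least squares and return common measures of fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    c = mean_y - b * mean_x

    ss_res = sum((y - (b * x + c)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                   # total
    df_res = n - 2  # observations minus the two estimated parameters

    sigma2 = ss_res / df_res        # estimated error variance
    se_b = math.sqrt(sigma2 / sxx)  # standard error of the slope
    return {
        "b": b,
        "c": c,
        "ss_res": ss_res,
        "r_square": 1.0 - ss_res / ss_tot,
        "se_b": se_b,
        "t_b": b / se_b,                      # t statistic for "b = 0"
        "f_stat": (ss_tot - ss_res) / sigma2,  # F statistic, one slope
    }


# Hypothetical data.
xs = [0.0, 5.0, 10.0, 15.0, 20.0]
ys = [2.0, 11.0, 22.0, 29.0, 41.0]
summary = fit_summary(xs, ys)
print(summary)
```

With a single X variable the F statistic equals the square of the slope's t statistic, so the two tests agree; they differ once the model has several variables.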
Variables selection
• Structural variables should be
included in the model
– Variables from economic theory, e.g.
the education variable when I want to predict
income
For other variables
• One man, one rule: when the P-value
of the variable is < 0.15,
I keep the variable
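A minimal sketch of this selection rule, assuming the P-values have already been obtained from a fitted regression. The variable names and P-values below are invented purely for illustration, and `select_variables` is a hypothetical helper.

```python
def select_variables(p_values, structural, threshold=0.15):
    """Keep structural variables always; keep the others if P-value < threshold."""
    kept = []
    for name, p in p_values.items():
        if name in structural or p < threshold:
            kept.append(name)
    return kept


# Invented P-values for an income regression (not real results).
p_values = {"education": 0.40, "age": 0.03, "shoe_size": 0.62, "region": 0.12}

# "education" comes from economic theory, so it stays regardless of its P-value.
kept = select_variables(p_values, structural={"education"})
print(kept)
```

The 0.15 threshold is the author's personal rule; passing a different `threshold` applies a stricter or looser one.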
How to run a linear regression
• Download PopTools from the internet
• http://www.cse.csiro.au/poptools/download.htm
• It is an add-in, so you must install it
alongside your Excel software
• Go to Extra stats
• Select Regression
• Select the X variables matrix, select the Y vector
and run
Boosting Adult System Education In Agriculture –
AGRI BASE