
What is regression?
(source: Roger Hadgraft)


Fitting models to data sets
Linear regression is most common

“Line of best fit”
Example
Look at the data
[Figure: "Fuel efficiency" scatter plot of fuel consumption (L) against mass (kg)]
Chart | Add trendline
[Figure: "Fuel efficiency" scatter plot of fuel consumption (L) against mass (kg), with fitted trendline]
y = 0.0126x - 0.8763
R² = 0.3427
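As a quick illustration of how the trendline equation is used, the sketch below plugs a vehicle mass into y = 0.0126x - 0.8763. Only the two coefficients come from the chart; the function name `predict_fuel` and the 1200 kg input are assumptions made for this example.

```python
# Trendline coefficients as reported on the chart.
SLOPE = 0.0126       # litres per kg of vehicle mass
INTERCEPT = -0.8763  # litres


def predict_fuel(mass_kg: float) -> float:
    """Fuel consumption (L) predicted by the trendline for a given mass (kg)."""
    return SLOPE * mass_kg + INTERCEPT


# Hypothetical input: a 1200 kg vehicle, inside the 800-1600 kg range of the chart.
predicted = predict_fuel(1200)
print(f"Predicted fuel consumption: {predicted:.1f} L")
```

For a 1200 kg vehicle the line predicts roughly 14.2 L. Predictions well outside the plotted mass range would be extrapolation and should be treated with caution.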
Some maths


Assume we have paired data [x(i), y(i)]
Simplest model is:





Y(i) = b1.x(i) + b0
where Y is the model and y is the original data
This notation matches Excel’s
Residual or error: e(i) = Y(i) - y(i)
Minimise the sum of e(i)^2: the least squares approach
Chart | Add trendline
[Figure: "Fuel efficiency" chart of fuel consumption (L) against mass (kg) with trendline y = 0.0126x - 0.8763, R² = 0.3427; one residual e(i) is marked]
Regression coefficients
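SSE = Sum of the Squares of the Errors

$$e_i^2 = (y_i - b_1 x_i - b_0)^2, \qquad SSE = \sum_{i=1}^{n} e_i^2$$

Choose b1 and b0 to minimise SSE:

$$\frac{\partial (SSE)}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0) = 0$$

$$\frac{\partial (SSE)}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)\, x_i = 0$$

Rearranging:

$$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

Thus:

$$b_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$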
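The closed-form expressions for b1 and b0 can be checked with a few lines of code. The sketch below is one possible hand calculation; the function name `fit_line` and the five data points are made up for illustration and are not the fuel data from the slides.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line Y = b1*x + b0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Slope: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Made-up illustrative data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = fit_line(xs, ys)
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
```

For these points the sums are Σx = 15, Σy = 30.5, Σxy = 111.1 and Σx² = 55, which gives b1 = 98/50 = 1.96 and b0 = 6.1 - 1.96 × 3 = 0.22.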
Data Analysis | Regression


We can do the calculations by hand, or we can use Excel’s Data Analysis ToolPak
Tools | Add-ins | Data Analysis (once only, to activate it)
Tools | Data Analysis | Regression
Demonstration
Example
Chart | Add trendline
[Figure: "Fuel efficiency" chart of fuel consumption (L) against mass (kg) with trendline y = 0.0126x - 0.8763, R² = 0.3427]
Tools | Data Analysis | Regression
R² = 0.3427 means that about 34% of the variance in fuel consumption is explained by vehicle mass. The remaining 66% belongs to other factors (e.g. driver behaviour).
Is the model any good?


R² = proportion of variance of y data explained by regression equation
R² = SSR/SST
SSR = explained variation; SSE = unexplained variation

Total Sum of Squares
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Error Sum of Squares
$$SSE = \sum_{i=1}^{n} e_i^2$$
Regression Sum of Squares
$$SSR = SST - SSE$$
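The three sums of squares are straightforward to compute once a line has been fitted. The sketch below is a minimal illustration, assuming made-up data (not the fuel data from the slides) and its least squares coefficients b0 = 0.22 and b1 = 1.96; the function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(xs, ys, b0, b1):
    """Return (SST, SSE, SSR) for the fitted line Y = b1*x + b0."""
    y_mean = sum(ys) / len(ys)
    sst = sum((y - y_mean) ** 2 for y in ys)                       # total
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))    # error
    ssr = sst - sse                                                # regression
    return sst, sse, ssr


# Made-up illustrative data and its least squares coefficients.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
sst, sse, ssr = sums_of_squares(xs, ys, b0=0.22, b1=1.96)

r_squared = ssr / sst
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}, R^2 = {r_squared:.4f}")
```

Here SST = 38.46 and SSE = 0.044, so nearly all of the variation is explained. The identity SSR = SST - SSE relies on b0 and b1 being the least squares values for a model with an intercept.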
Tools | Data Analysis | Regression
Multiple R: R = sqrt(R²)
Tools | Data Analysis | Regression
Adjusted R Square: compensates for different numbers of model parameters (in multiple linear regression).
Text page 587
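To make the adjustment concrete, the sketch below applies the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k), to the fuel example. It assumes this slide annotates Excel's "Adjusted R Square" output, that k = 2 parameters (slope and intercept), and that n = 20, which is inferred from the residual degrees of freedom (n - k = 18) quoted elsewhere in these slides. The function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n data points and k model parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


R2 = 0.3427   # from the trendline chart
N_POINTS = 20  # assumption: inferred from residual df = n - k = 18
K_PARAMS = 2   # slope and intercept

adj_r2 = adjusted_r_squared(R2, N_POINTS, K_PARAMS)
print(f"Adjusted R^2 = {adj_r2:.3f}")
```

Under these assumptions the adjusted value is about 0.306, slightly below the raw R² of 0.3427. The penalty grows as more parameters are added for the same number of data points.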
Tools | Data Analysis | Regression
Standard Error: standard deviation of the residuals (but divide by (n-2) rather than (n-1))
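The divide-by-(n-2) rule can be sketched as follows. This is an illustration only: the data are made up (not the fuel data from the slides), b0 = 0.22 and b1 = 1.96 are the least squares coefficients for that data, and the function name `standard_error` is hypothetical.

```python
import math


def standard_error(xs, ys, b0, b1):
    """Standard deviation of the residuals, dividing by (n - 2)."""
    n = len(xs)
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    # Two parameters (b0 and b1) were estimated, so two degrees of freedom are lost.
    return math.sqrt(sse / (n - 2))


# Made-up illustrative data and its least squares coefficients.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

se = standard_error(xs, ys, b0=0.22, b1=1.96)
print(f"Standard error = {se:.3f}")
```

With SSE = 0.044 and n = 5 this gives sqrt(0.044 / 3), roughly 0.121, in the same units as y.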
Questions?
Tools | Data Analysis | Regression
ANOVA = Analysis of Variance
Tools | Data Analysis | Regression
SS column: SSR (Regression), SSE (Residual) and SST (Total)
Tools | Data Analysis | Regression
Regression df = k-1
Total df = n-1
Residual df = (n-1)-(k-1) = (n-k)
k = number of model parameters (including the intercept)
n = number of data points
Tools | Data Analysis | Regression
Regression MS = SSR/df1
Residual MS = SSE/df2
Tools | Data Analysis | Regression
F = Regression MS / Residual MS
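The F ratio can be worked out from R² alone, because SSR = R²·SST and SSE = (1 - R²)·SST, so SST cancels. The sketch below uses R² = 0.3427 from the trendline chart and the degrees of freedom df1 = 1 and df2 = 18 quoted in these slides; the function name `f_statistic` is hypothetical, and the result is a hand check rather than a value read from the Excel output.

```python
def f_statistic(r2: float, df1: int, df2: int) -> float:
    """F = Regression MS / Residual MS, with both mean squares in units of SST."""
    ms_regression = r2 / df1          # SSR/df1 divided by SST
    ms_residual = (1.0 - r2) / df2    # SSE/df2 divided by SST
    return ms_regression / ms_residual


R2 = 0.3427        # from the trendline chart
DF_REGRESSION = 1  # k - 1
DF_RESIDUAL = 18   # n - k

f_value = f_statistic(R2, DF_REGRESSION, DF_RESIDUAL)
print(f"F = {f_value:.2f}")
```

This gives F of about 9.38. Turning F into a probability needs the F distribution with (1, 18) degrees of freedom, which is what Excel reports alongside it.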
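Tools | Data Analysis | Regression
Significance F: probability of an F statistic at least this large, given df1 = 1 and df2 = 18, if there were no relationship between x and y.
A small value is evidence against "no relationship".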
Other regressions


Multilinear regression
Non-linear equations

Transform the variables, e.g. logs, powers, etc.
Use multilinear regression to determine the coefficients
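As one example of the transform idea, a power law y = a·x^b becomes a straight line after taking logs: ln y = ln a + b·ln x. The sketch below fits that line with the ordinary least squares formulas. The data are synthetic, generated from a = 3 and b = 0.5 purely so the answer is known in advance, and the helper `fit_line` is a hypothetical name. With a single predictor, simple linear regression is enough; several transformed predictors would need multilinear regression.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Synthetic data generated from y = 3 * x**0.5 (no noise).
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Transform both variables, then fit a straight line in log space.
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]
intercept, slope = fit_line(log_x, log_y)

a = math.exp(intercept)  # back-transform the intercept
b = slope
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic data contain no noise, the fit recovers a = 3 and b = 0.5 to within rounding. With real data, least squares in log space minimises errors in ln y rather than in y, which is a design choice worth keeping in mind.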