Regression Analysis
Lecturer: Dr. Bo Yuan
E-mail: [email protected]
Regression
 To express the relationship between two or more variables by a mathematical formula.
 x: predictor (independent) variable
 y: response (dependent) variable
 Identify how y varies as a function of x.
 y is also considered a random variable.
 Real-World Example:
 Footwear impressions are commonly observed at crime scenes. While numerous forensic properties can be obtained from these impressions, one in particular is the shoe size. Detectives would like to estimate the height of the impression maker from the shoe size.
 The relationship between shoe sizes and heights
Shoe Size vs. Height
 What is the predictor?
 What is the response?
 Can the height be accurately estimated from the shoe size?
 If the shoe size is 11, what would you advise the police?
 What if the size is 7 or 12.5?
General Regression Model
 The systematic part m(x) is deterministic.
 The error ε(x) is a random variable.
 Measurement Error
 Natural Variations
 Additive
$$y(x) = m(x) + \varepsilon(x)$$
$$\varepsilon(x) = y(x) - m(x)$$
Example: Sine Function
$$y(x) = A \cdot \sin(\omega x + \varphi) + \varepsilon(x)$$
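The additive model can be simulated directly. The sketch below is only an illustration: the parameter values, the Gaussian form of the error, and the function name `noisy_sine` are assumptions that do not come from the slides.

```python
import numpy as np

def noisy_sine(x, A=2.0, omega=1.5, phi=0.3, sigma=0.2, seed=0):
    """Return (m, y): the systematic part m(x) and the noisy observations y(x).

    All default values are hypothetical; the error is assumed Gaussian.
    """
    rng = np.random.default_rng(seed)
    m = A * np.sin(omega * x + phi)               # deterministic part m(x)
    eps = rng.normal(0.0, sigma, size=x.shape)    # random error eps(x)
    return m, m + eps                             # y(x) = m(x) + eps(x)

x = np.linspace(0.0, 2.0 * np.pi, 200)
m, y = noisy_sine(x)
residual = y - m    # recovers eps(x) = y(x) - m(x)
print(residual.mean(), residual.std())
```

With these assumed settings the residual mean should be close to 0 and its standard deviation close to the chosen sigma.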
Standard Assumptions
Back to Shoes
Simple Linear Regression
$$m(x) = \beta_0 + \beta_1 x$$
Model Parameters
Derivation
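$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$
$$\frac{\partial R}{\partial \beta_0} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$
$$\frac{\partial R}{\partial \beta_1} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \;\Rightarrow\; \sum_{i=1}^{n} \left( x_i y_i - x_i \bar{y} + \beta_1 \bar{x} x_i - \beta_1 x_i^2 \right) = 0$$
$$\Rightarrow\; \beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$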
Standard Deviations
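$$\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} \varepsilon_i^2$$
$$\sigma_{\beta_0} = \sigma \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right]^{1/2}$$
$$\sigma_{\beta_1} = \sigma \left[ \frac{1}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right]^{1/2}$$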
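These standard deviations can be computed from the residuals of a fitted line. The sketch below assumes that the errors in the variance formula are the residuals of the least-squares fit, y_i minus (β0 + β1 x_i); the function name `coefficient_std` and the data are hypothetical.

```python
import numpy as np

def coefficient_std(x, y):
    """Return (sigma, sd_beta0, sd_beta1) for a simple linear least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum(x ** 2) - n * x_bar ** 2            # sum x_i^2 - n x_bar^2
    beta1 = (np.sum(x * y) - n * x_bar * y_bar) / sxx
    beta0 = y_bar - beta1 * x_bar
    resid = y - beta0 - beta1 * x                    # residuals of the fit
    sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))    # n - 2 degrees of freedom
    sd_beta0 = sigma * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
    sd_beta1 = sigma * np.sqrt(1.0 / sxx)
    return sigma, sd_beta0, sd_beta1

# Hypothetical noisy observations around a straight line
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
sigma, sd_beta0, sd_beta1 = coefficient_std(x, y)
print(sigma, sd_beta0, sd_beta1)
```

The divisor n - 2 reflects the two estimated parameters, so the function needs at least three data points.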
Polynomial Terms
 Modeling the data as a line is not always adequate.
 Polynomial Regression
$$m(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p = \sum_{k=0}^{p} \beta_k x^k$$
 This is still a linear model!
 m(x) is a linear combination of β.
 Danger of Overfitting
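Both points on this slide can be illustrated in a few lines: the polynomial is fitted with an ordinary linear least-squares solver because it is linear in β, and the training error keeps shrinking as the degree p grows even when the extra terms only chase noise. This is a sketch under assumptions; the data-generating curve, sample sizes, seed, degrees, and helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(x, p):
    """Columns x^0, x^1, ..., x^p, so that m(x) = design_matrix @ beta."""
    return np.vander(x, p + 1, increasing=True)

def fit_poly(x, y, p):
    """Least-squares coefficients beta_0..beta_p (lowest power first)."""
    beta, *_ = np.linalg.lstsq(design_matrix(x, p), y, rcond=None)
    return beta

def sse(x, y, beta):
    """Sum of squared errors of the polynomial with coefficients beta."""
    return float(np.sum((y - design_matrix(x, len(beta) - 1) @ beta) ** 2))

# Hypothetical data: quadratic trend plus Gaussian noise
x_train = np.linspace(0.0, 5.0, 12)
y_train = x_train ** 2 + rng.normal(0.0, 1.0, x_train.size)
x_test = np.linspace(0.2, 4.8, 50)
y_test = x_test ** 2 + rng.normal(0.0, 1.0, x_test.size)

results = {}
for p in (1, 2, 8):
    beta = fit_poly(x_train, y_train, p)
    results[p] = (sse(x_train, y_train, beta), sse(x_test, y_test, beta))
    print(p, results[p])   # (training SSE, held-out SSE)
```

The training SSE cannot increase with p because each model contains the previous one. The held-out SSE carries no such guarantee, which is why a high-degree fit that looks excellent on the training points may generalize poorly.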
Matrix Representation
$$y_i = \sum_{k=0}^{p} \beta_k x_i^k + \varepsilon_i$$
$$Y = X\beta + \varepsilon$$
Matrix Representation
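$$R(\beta) = (Y - X\beta)^T (Y - X\beta)$$
$$\frac{\partial R}{\partial \beta} = 0 \;\Rightarrow\; \frac{\partial \left( Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \right)}{\partial \beta} = 0$$
$$\Rightarrow\; X^T X \beta = X^T Y$$
$$\Rightarrow\; \beta = \left( X^T X \right)^{-1} X^T Y$$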
Model Comparison
$$\text{Sum of Squares Total:} \quad SST = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2$$
$$\text{Sum of Squares Error:} \quad SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
R²
$$R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$
$$R^2_{adj} = 1 - \frac{SSE / (n - (p + 1))}{SST / (n - 1)}$$
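Both measures follow directly from SST and SSE. In this sketch p is the polynomial degree, so the model has p + 1 coefficients; the function name `r_squared` and the small data set are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat, p):
    """Return (R^2, adjusted R^2) for predictions y_hat from a model with p + 1 coefficients."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    sst = np.sum((y - y.mean()) ** 2)     # sum of squares total
    sse = np.sum((y - y_hat) ** 2)        # sum of squares error
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - (p + 1))) / (sst / (n - 1))
    return r2, r2_adj

# Hypothetical observations and predictions from a straight-line model (p = 1)
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.9, 4.9]
r2, r2_adj = r_squared(y, y_hat, p=1)
print(r2, r2_adj)
```

Here SST = 10 and SSE = 0.08, so R² is 0.992 and the adjusted value is slightly lower because it penalizes the number of coefficients.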
Example
[Figure: data generated from Y = X² + N(0,1) for X between 0 and 5, shown with two fitted curves.]
 Linear fit: Y = -3.6029 + 4.8802X, R² = 0.9131
 Quadratic fit: Y = 0.7341 - 0.4303X + 1.0621X², R² = 0.9880
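An experiment of this kind is easy to rerun. The sketch below draws data from Y = X² + N(0,1) and compares a linear with a quadratic fit. The sample size, the seed, and the helper name `r2_of_fit` are assumptions, so the coefficients and R² values will differ from those shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed, not from the slides
x = np.linspace(0.0, 5.0, 51)            # assumed sample of 51 points on [0, 5]
y = x ** 2 + rng.normal(0.0, 1.0, x.size)

def r2_of_fit(x, y, degree):
    """Fit a polynomial of the given degree and return (coefficients, R^2)."""
    coeffs = np.polyfit(x, y, degree)    # coefficients, highest power first
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return coeffs, 1.0 - sse / sst

lin_coeffs, r2_lin = r2_of_fit(x, y, 1)
quad_coeffs, r2_quad = r2_of_fit(x, y, 2)
print("linear   :", lin_coeffs, r2_lin)
print("quadratic:", quad_coeffs, r2_quad)
```

The quadratic fit can never have a lower R² than the linear one on the same data, since the linear model is a special case of the quadratic. The question is whether the gain justifies the extra term.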
Tricky Relationship
[Figure: Fitness plotted against Exercise Time, with separate groups labeled Youth and Elderly.]
Violent Crime vs. Video Game
[Figure: yearly series for 2000 to 2012 plotted on two vertical axes (0 to 600 and 0 to 18): Violent Crime, Aggravated Assault, Robbery, Murder & Manslaughter, Forcible Rape, and Video Game Sales.]
Is This True?
Where Has the Time Gone?
Summary
 Regression is the oldest data mining technique.
 Probably the first thing that you want to try on a new data set.
 No need to do programming!
 Matlab, Excel …
 Quality of Regression
 R²
 Residual Plot
 Cross Validation
 What you should learn after class:
 The Influence of Outliers
 Confidence Interval
 Nonlinear Regression