Regression Analysis
Lecturer: Dr. Bo Yuan
E-mail: [email protected]

Regression
To express the relationship between two or more variables by a mathematical formula.
x: predictor (independent) variable
y: response (dependent) variable
Identify how y varies as a function of x.
y is also treated as a random variable.
Real-World Example:
Footwear impressions are commonly observed at crime scenes. Numerous forensic properties can be obtained from these impressions; one in particular is the shoe size.
The detectives would like to be able to estimate the height of the impression maker from the shoe size.
The relationship between shoe sizes and heights

Shoe Size vs. Height
[Figure: shoe size vs. height]

Shoe Size vs. Height
What is the predictor? What is the response?
Can the height be accurately estimated from the shoe size?
If a shoe size is 11, what would you advise the police?
What if the size is 7 or 12.5?

General Regression Model
$$y(x) = m(x) + \varepsilon(x)$$
The systematic part $m(x)$ is deterministic.
The error $\varepsilon(x)$ is a random variable.
    Measurement Error
    Natural Variations
    Additive
$$\varepsilon(x) = y(x) - m(x)$$

Example: Sine Function
$$y(x) = A \sin(x) + \varepsilon(x)$$

Standard Assumptions

Back to Shoes

Simple Linear Regression
$$m(x) = \beta_0 + \beta_1 x$$

Model Parameters

Derivation
$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$
$$\frac{\partial R}{\partial \beta_0} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$
$$\frac{\partial R}{\partial \beta_1} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0$$
$$\sum_{i=1}^{n} x_i \left( y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i \right) = 0$$
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$

Standard Deviations
$$\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} \varepsilon_i^2$$
$$\sigma_{\beta_0} = \sigma \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right)^{1/2}$$
$$\sigma_{\beta_1} = \sigma \left( \frac{1}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right)^{1/2}$$

Polynomial Terms
Modeling the data as a line is not always adequate.
Polynomial Regression
$$m(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p = \sum_{k=0}^{p} \beta_k x^k$$
This is still a linear model!
$m(x)$ is a linear combination of the coefficients $\beta_k$.
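The closed-form estimates from the Derivation and Standard Deviations slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecturer's code: the function name `simple_linear_fit` and the toy data are assumptions made up for this example, and NumPy is assumed to be available.

```python
import numpy as np

def simple_linear_fit(x, y):
    """Closed-form least-squares fit of m(x) = b0 + b1 * x.

    Returns (b0, b1, sigma, sd_b0, sd_b1), following the slide formulas.
    The function name is hypothetical, chosen for this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()

    # Denominator shared by the slope and both standard deviations:
    # sum(x_i^2) - n * x_bar^2
    sxx = np.sum(x ** 2) - n * x_bar ** 2

    b1 = (np.sum(x * y) - n * x_bar * y_bar) / sxx  # slope
    b0 = y_bar - b1 * x_bar                         # intercept

    # Residuals and the error standard deviation with n - 2 degrees of freedom
    resid = y - (b0 + b1 * x)
    sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

    sd_b0 = sigma * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
    sd_b1 = sigma * np.sqrt(1.0 / sxx)
    return b0, b1, sigma, sd_b0, sd_b1

# Made-up, noise-free data lying exactly on y = 2 + 3x (not from the lecture)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
b0, b1, sigma, sd_b0, sd_b1 = simple_linear_fit(x, y)
print(b0, b1, sigma)
```

Because the toy data contain no noise, the recovered intercept and slope should match 2 and 3 up to rounding, and the estimated error standard deviation should be essentially zero. With real measurements, `sd_b0` and `sd_b1` indicate how much the two estimates would vary from sample to sample.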
Danger of Overfitting

Matrix Representation
$$y_i = \sum_{k=0}^{p} \beta_k x_i^k + \varepsilon_i$$
$$Y = X\beta + \varepsilon$$

Matrix Representation
$$R(\beta) = (Y - X\beta)^T (Y - X\beta)$$
$$\frac{\partial R}{\partial \beta} = 0$$
$$\frac{\partial}{\partial \beta} \left( Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \right) = 0$$
$$-2 X^T Y + 2 X^T X \beta = 0$$
$$X^T X \beta = X^T Y$$
$$\beta = (X^T X)^{-1} X^T Y$$

Model Comparison
Sum of Squares Total:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Sum of Squares Error:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$R^2$
$$R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$
$$R^2_{adj} = 1 - \frac{SSE / (n - (p + 1))}{SST / (n - 1)}$$

Example
Data: $Y = X^2 + N(0, 1)$
Linear fit: $Y = -3.6029 + 4.8802X$, $R^2 = 0.9131$
Quadratic fit: $Y = 0.7341 - 0.4303X + 1.0621X^2$, $R^2 = 0.9880$
[Figure: Y vs. X for X from 0 to 5, with the two fits above]

Tricky Relationship
[Figure: Fitness and Exercise Time, with Youth and Elderly groups]

Violent Crime vs. Video Game
[Figure: Violent Crime, Aggravated Assault, Robbery, Murder & Manslaughter, Forcible Rape, and Video Game Sales by year, 2000 to 2012]

Is this true?

Where did the time go?

Summary
Regression is one of the oldest data mining techniques.
Probably the first thing that you want to try on a new data set.
No need to do programming!
    Matlab, Excel …
Quality of Regression
    $R^2$
    Residual Plot
    Cross Validation
What you should learn after class:
    The Influence of Outliers
    Confidence Interval
    Nonlinear Regression
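The Matrix Representation and Model Comparison slides can be tried out together with a short Python sketch. It is an illustration under stated assumptions, not the code behind the Example slide: the helper names (`design_matrix`, `fit_normal_equations`, `r_squared`), the sample size of 50, and the random seed are all made up here, and NumPy is assumed to be available.

```python
import numpy as np

def design_matrix(x, p):
    """Matrix X with columns x^0, x^1, ..., x^p, so that Y = X @ beta + eps."""
    return np.vander(np.asarray(x, dtype=float), N=p + 1, increasing=True)

def fit_normal_equations(x, y, p):
    """Solve the normal equations (X^T X) beta = X^T Y for a degree-p polynomial."""
    X = design_matrix(x, p)
    Y = np.asarray(y, dtype=float)
    # Solving the linear system avoids forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ Y)

def r_squared(x, y, beta):
    """Return (R^2, adjusted R^2) for the polynomial with coefficients beta."""
    y = np.asarray(y, dtype=float)
    n = y.size
    p = len(beta) - 1
    y_hat = design_matrix(x, p) @ beta
    sst = np.sum((y - y.mean()) ** 2)  # Sum of Squares Total
    sse = np.sum((y - y_hat) ** 2)     # Sum of Squares Error
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - (p + 1))) / (sst / (n - 1))
    return r2, r2_adj

# Synthetic data in the spirit of the Example slide: Y = X^2 + N(0, 1).
# The seed and sample size are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = x ** 2 + rng.normal(0.0, 1.0, x.size)

results = {}
for p in (1, 2):
    beta = fit_normal_equations(x, y, p)
    results[p] = (beta, *r_squared(x, y, beta))
    print(p, beta, results[p][1], results[p][2])
```

The printed coefficients and $R^2$ values will not match the numbers on the Example slide exactly, because the noise sample differs. The quadratic fit should still score a higher $R^2$ than the line. Note that plain $R^2$ can never decrease when a polynomial term is added, which is why the adjusted version, with its penalty for the number of parameters, is the safer quantity for comparing models.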