Multi Linear Regression Lab

Lab – Multiple Linear Regression
• Basic tasks you need to know:
• Task 1: Can you read in a CSV file?
• Task 2: Can you create multiple scatter plots and a correlation matrix?
• Task 3: Can you see from Task 2 whether it is worth creating a linear regression model?
• Task 4: Can you build a model using the correct function?
• Task 5: Can you print a summary and interpret the coefficients, i.e., have you added some unnecessary x variables?
• Task 6: Can you refine the model?
• Task 7: Can you plot the residuals and interpret them?
Example 1: Predicting life expectancy
• We need data to build our regression model (Task 1):
• > stateX77 <- read.csv( 'C:/state_data_77.csv', header=TRUE )
• This data set includes:
• "State": names of states
• "Popul": population estimate as of July 1, 1975
• "Income": per capita income (1974)
• "Illit": illiteracy (1970, percent of population)
• "LifeExp": life expectancy in years (1969–71)
• "Murder": murder and non-negligent manslaughter rate per 100,000 population (1976)
• "HSGrad": percent high-school graduates (1970)
• "Frost": mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
• "Area": land area in square miles
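If the CSV file is not to hand, the read-and-inspect step can still be rehearsed with a stand-in. The sketch below assumes the file mirrors R's built-in state.x77 matrix, whose help page lists the same variables; the temporary file and the renamed columns are for illustration only.

```r
# Stand-in for 'C:/state_data_77.csv' (assumption: same content as the
# built-in state.x77 matrix, with the column names used in this lab).
standin <- data.frame(State = rownames(state.x77), state.x77)
names(standin) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                    "Murder", "HSGrad", "Frost", "Area")

# Write the stand-in to a temporary CSV, then read it back as in Task 1.
csv_path <- tempfile(fileext = ".csv")
write.csv(standin, csv_path, row.names = FALSE)
stateX77 <- read.csv(csv_path, header = TRUE)

# Quick checks that the import worked.
dim(stateX77)      # number of rows and columns
head(stateX77, 3)  # first few rows
str(stateX77)      # type of each column
```

Checking dim() and str() straight after an import is a cheap way to catch a wrong separator or a column read in as text.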
Visualise
• Let's create multiple scatter plots (Task 2):
• > pairs(stateX77)
• There appear to be some negative and positive linear relationships between the variables. We can confirm this by looking at the correlation matrix (Task 2):
• > cor(stateX77)
• Error in cor(stateX77) : 'x' must be numeric
• One of the variables is not numeric.
• > str(stateX77)
• There are 9 variables and the first column (State) is not numeric. We can select only the numeric columns as follows:
• > cor(stateX77[,2:9])
• To make the output neater we can round the numbers:
• > round( cor(stateX77[,2:9]), 2)
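Selecting columns 2 to 9 by position works here, but it breaks if the column order changes. A minimal sketch of selecting by type instead is given below; it assumes the data match R's built-in state.x77 matrix, and the names is_num, cor_mat and life_cor are illustrative.

```r
# Stand-in for the lab data (assumption: same content as state.x77).
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

# Flag the numeric columns instead of hard-coding positions 2:9.
is_num <- sapply(stateX77, is.numeric)

# Correlation matrix of the numeric columns, rounded for readability.
cor_mat <- round(cor(stateX77[, is_num]), 2)
cor_mat

# Correlation of each predictor with LifeExp, strongest (by size) first.
life_cor <- cor_mat["LifeExp", colnames(cor_mat) != "LifeExp"]
life_cor[order(abs(life_cor), decreasing = TRUE)]
```

Sorting the LifeExp row by absolute value gives a quick first impression for Task 3 of which predictors are worth considering.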
Build Model
• Let’s build a model for life expectancy and see if we can predict it from the other variables. There are some weak to strong correlations between life expectancy and some of the other variables, and the plots indicate that the relationships may be linear (Task 3).
• In building the model we may have some theory that will suggest
what variables we should be interested in or need to include. For us,
we will start with all the variables and then remove variables which
aren’t contributing much to the model.
• For the present let’s not use State when we build the model (Task 4).
• > stateX77.fit1 <- lm( LifeExp ~ Popul + Income + Illit + Murder +
HSGrad + Frost + Area, data=stateX77)
• Now that we have built the model, let’s inspect it (Task 5):
• > summary( stateX77.fit1 )
Summary
Call:
lm(formula = LifeExp ~ Popul + Income + Illit + Murder + HSGrad +
    Frost + Area, data = stateX77)

Residuals:
     Min       1Q   Median       3Q      Max
-1.48895 -0.51232 -0.02747  0.57002  1.49447

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Popul        5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illit        3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HSGrad       4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7448 on 42 degrees of freedom
Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
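The printed summary can also be queried programmatically, which helps when picking out weak predictors. The following is a minimal sketch, assuming the data match R's built-in state.x77 matrix; the 0.1 cut-off and the names fit1, coef_tab and p_vals are illustrative choices rather than part of the lab.

```r
# Stand-in for the lab data (assumption: same content as state.x77).
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area,
           data = stateX77)

# The coefficient table is a matrix: one row per term, with columns
# Estimate, Std. Error, t value and Pr(>|t|).
coef_tab <- coef(summary(fit1))

# p-values of the predictors (first row, the intercept, dropped).
p_vals <- coef_tab[-1, "Pr(>|t|)"]

# Predictors that are not significant at the 0.1 level.
names(p_vals)[p_vals > 0.1]

# Adjusted R-squared, as printed at the bottom of the summary.
summary(fit1)$adj.r.squared
```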
Eject Insignificance
• Let’s look at how the coefficients contribute to the prediction. Roughly speaking, the higher the p-value, the less evidence there is that the variable contributes to the model. Let’s remove all of the variables that aren’t significant at the 0.1 level: Income, Illit and Area (Task 6).
• > stateX77.fit2 <- lm( LifeExp ~ Popul + Murder + HSGrad + Frost,
data=stateX77)
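Instead of retyping the formula, the reduced model can be derived from the full one with update(), and the two nested models can be compared with an F-test using anova(). This is a sketch under the assumption that the data match R's built-in state.x77 matrix; the names fit1 and fit2 are shortened for illustration.

```r
# Stand-in for the lab data (assumption: same content as state.x77).
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area,
           data = stateX77)

# Drop the three weak predictors; ". ~ ." means "keep everything else".
fit2 <- update(fit1, . ~ . - Income - Illit - Area)
formula(fit2)

# F-test of the reduced model against the full model. A large p-value
# suggests the dropped variables added little explanatory power.
anova(fit2, fit1)

# Compare adjusted R-squared before and after the refinement.
c(full = summary(fit1)$adj.r.squared, reduced = summary(fit2)$adj.r.squared)
```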
Can R Help?
• Deciding what should be kept or left out of a regression model has been investigated for many years and there are a number of standard approaches.
• There is even a function in R that does this for us:
• > final.stateX77 <- step( stateX77.fit1 )
• How does ‘final.stateX77’ compare to ‘stateX77.fit2’?
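One way to make that comparison is sketched below. By default step() works backwards from the supplied model and uses AIC to decide which terms to drop. The example assumes the data match R's built-in state.x77 matrix; trace = 0 only suppresses the step-by-step log, and the shortened object names are illustrative.

```r
# Stand-in for the lab data (assumption: same content as state.x77).
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area,
           data = stateX77)
fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost, data = stateX77)

# Backward elimination by AIC, starting from the full model.
final <- step(fit1, trace = 0)

# Which terms did each approach keep?
formula(final)
formula(fit2)

# AIC of the three models side by side (lower is better).
AIC(fit1, fit2, final)
```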
Residual Checking
• We still need to check the residuals of our model. Ideally, they should
follow a normal distribution. If they don’t then we need to examine the
source data more carefully and rebuild the model. At this stage we will only
check the residuals (Task 7).
• We want to see all the points along the line or as close to it as possible.
• > qqnorm( stateX77.fit2$residuals )
• > qqnorm( stateX77.fit2$residuals, ylab="Residuals" )
• > qqline( stateX77.fit2$residuals )
• What can you see? There is a problem here – the points do not seem to fall
along the line.
• Our conclusion is that the residuals show some departure from a normal distribution and we need to revise the model further.
• This is despite a relatively high adjusted R-squared value of 0.69, which indicates that the model explains about 69% of the variability of the response data around its mean.
• The residual plot suggests there is some bias in our model.
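Two further checks can complement the Q-Q plot: a plot of residuals against fitted values, and a Shapiro-Wilk normality test. Neither is part of the original lab; this is a minimal sketch assuming the data match R's built-in state.x77 matrix, with illustrative object names.

```r
# Stand-in for the lab data (assumption: same content as state.x77).
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost, data = stateX77)
res <- residuals(fit2)

# Residuals against fitted values: look for curvature or a funnel shape,
# either of which would point to a problem with the model.
plot(fitted(fit2), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero

# Shapiro-Wilk test: the null hypothesis is that the residuals are
# normally distributed, so a small p-value is evidence against normality.
sw <- shapiro.test(res)
sw
```

With only 50 observations the test has limited power, so it is best read alongside the plots rather than instead of them.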
Your Turn – Milk Production
• The Dairy Herd Improvement Cooperative (DHI) in Upstate New York
collects and analyzes data on milk production. One question of
interest here is how to develop a suitable model to predict current
milk production from a set of measured variables. The response
variable (current milk production in pounds) and the predictor
variables are given in Table 1.1. Samples are taken once a month
during milking. The period that a cow gives milk is called lactation.
Number of lactations is the number of times a cow has calved or
given milk. The recommended management practice is to have the
cow produce milk for about 305 days and then allow a 60-day rest
period before beginning the next lactation.
• http://eu.wiley.com/WileyCDA/WileyTitle/productCd0470905840.html
What can you discover?
• Given the dataset in ‘milk_production.csv’, build a multiple linear regression model.
• Show the multiple scatter plots and correlation matrix for the data.
(10 marks)
• Build the initial model for predicting milk production and show the
summary of this model. (20 marks)
• In your opinion can the model be improved further? If so show the
summary of the improved model. (10 marks)
• Plot the residuals and give a short interpretation of them. (10 marks)