Department of Statistics
STATS 330/762: Model answers to Assignment 4, 2013

Task 1. Read in the data and make a data frame. Do the usual checks for typographical errors. Print out the first 10 lines of the data file. [5 marks]

# read in data
# change next line as required to suit your system
infile = "C:\\Users\\alee044\\Documents\\Teaching\\760\\2013\\spam.txt"
# Note no headers in the data file
spam.df = read.table(infile, header=FALSE)

[2 marks]

# checks
# First check that all variables V1-V54 are in the range 0-100.
# We can use the apply function together with the function range.
> var.range = apply(spam.df[,1:54], 2, range)
> var.range
       V1    V2  V3    V4 V5   V6   V7    V8   V9   V10  V11  V12
[1,] 0.00  0.00 0.0  0.00  0 0.00 0.00  0.00 0.00  0.00 0.00 0.00
[2,] 4.54 14.28 5.1 42.81 10 5.88 7.27 11.11 5.26 18.18 2.61 9.67
      V13 V14  V15 V16  V17  V18   V19   V20   V21  V22  V23  V24
[1,] 0.00   0 0.00   0 0.00 0.00  0.00  0.00  0.00  0.0 0.00  0.0
[2,] 5.55  10 4.41  20 7.14 9.09 18.75 18.18 11.11 17.1 5.45 12.5
       V25   V26   V27  V28   V29  V30  V31  V32   V33  V34 V35
[1,]  0.00  0.00  0.00 0.00  0.00 0.00  0.0 0.00  0.00 0.00   0
[2,] 20.83 16.66 33.33 9.09 14.28 5.88 12.5 4.76 18.18 4.76  20
      V36  V37  V38   V39  V40  V41   V42  V43 V44   V45   V46
[1,] 0.00 0.00 0.00  0.00 0.00 0.00  0.00 0.00   0  0.00  0.00
[2,] 7.69 6.89 8.33 11.11 4.76 7.14 14.28 3.57  20 21.42 22.05
      V47 V48   V49   V50   V51    V52   V53    V54
[1,] 0.00   0 0.000 0.000 0.000  0.000 0.000  0.000
[2,] 2.17  10 4.385 9.752 4.081 32.478 6.003 19.829

This matrix has the minimum value of each variable in its top row and the maximum in its bottom row. We can then check that the minima are >= 0 and the maxima <= 100:

> all(var.range[1,]>=0)
[1] TRUE
> all(var.range[2,]<=100)
[1] TRUE
[1 mark]
Some of the values of V55, V56 and V57 are very large. But this could be because of a long message typed all in capital letters, so we can't be sure that they are errors. We can however check that V55 <= V56 <= V57 (see the definitions):

> all((spam.df$V55<=spam.df$V56)&(spam.df$V56<=spam.df$V57))
[1] TRUE
We can identify the records with large values of V55, V56 and V57:

> large = order(spam.df$V57, decreasing=TRUE)[1:3]
> spam.df[large, 55:57]
         V55  V56   V57
1489   2.383   21 15841
1754 239.571 9989 10062
905   12.332 1171  9163
We will check that these points don't have too much influence when we come to fit models. [1 mark]

Finally, V58:

> table(spam.df$V58)

   0    1 
2788 1813 
V58 is OK. [1 mark]

Task 2. Write an R function called get.error that takes the glm object representing the fitted logistic model as an argument and returns the in-sample prediction error. [5 marks]

Here is my version:

# get.error
get.error = function(glm.obj){
  # cross-tabulate the observed 0/1 response against the predicted class;
  # predict() returns the linear predictor, so predict(glm.obj) > 0 is
  # equivalent to a fitted probability greater than 0.5
  mytable = table(glm.obj$y, predict(glm.obj)>0)
  1 - sum(diag(mytable))/sum(mytable)
}

[5 marks, 1 mark deducted for each error]
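As a quick sanity check of get.error, we can apply it to a small simulated example (this toy data is illustrative only and is not part of the assignment data):

```r
# Toy check: with a pure-noise predictor the classifier can do no better
# than predicting the majority class, so the in-sample error should be
# close to the proportion of the smaller class (about 0.5 here).
set.seed(1)
y = rbinom(200, 1, 0.5)
x = rnorm(200)
toy.glm = glm(y ~ x, family = binomial)
get.error(toy.glm)   # close to 0.5, since x carries no information about y
```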
Task 3. Do an exploratory analysis to identify the variables that seem most closely related to the email status. You may use any suitable index here, such as the deviance of a one-variable model. [10 marks]

We will use two measures of association: the absolute difference of the mean frequencies between the spam and genuine groups, and the deviance when fitting a one-variable model. We need to be careful with V55-V57, since they are not on the same scale as V1-V54. First, the differences:

result = numeric(57)
for(i in 1:57){
  # standardise V55-V57, which are not percentages
  x = if(i < 55) spam[,i] else (spam[,i] - mean(spam[,i]))/sd(spam[,i])
  result[i] = abs(diff(tapply(x, spam[,58], mean)))
}
names(result) = dimnames(spam)[[2]][1:57]
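The tapply call above computes the two group means in a single step; a tiny illustration of how it works:

```r
# tapply splits its first argument by the grouping in its second argument
# and applies a function to each group:
x = c(1, 2, 3, 10, 11, 12)
g = c(0, 0, 0, 1, 1, 1)
tapply(x, g, mean)             # group means: 2 for g==0, 11 for g==1
abs(diff(tapply(x, g, mean)))  # 9, the absolute difference used above
```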
Note the use of the very useful function tapply. We will label the variables and sort them into descending order (the biggest differences are the most important variables):

> sort(result, decreasing=TRUE)
       V21        V23         V7        V53        V19        V16 
0.78419410 0.68505964 0.67959691 0.66222705 0.55996030 0.53860438 
       V17        V25        V57         V5        V52        V11 
0.53858246 0.52532054 0.50985333 0.49503090 0.49496527 0.47990669 
       V26         V6         V9        V24        V56         V8 
0.47671129 0.47596770 0.47381327 0.44221817 0.44218893 0.42318287 
       V18         V3        V15        V20        V27        V37 
0.41786192 0.40308762 0.40086637 0.38829969 0.37529136 0.36432641 
       V30        V28        V35        V46        V45        V10 
0.35010315 0.32494580 0.30535284 0.29903599 0.28731152 0.28435191 
       V42        V36        V43        V29        V13        V31 
0.27954988 0.27856529 0.27760371 0.27322249 0.27200197 0.25969463 
        V1        V39        V33        V32        V34        V55 
0.25825288 0.25134284 0.24540943 0.23371103 0.23072381 0.22508629 
       V41        V44        V22        V50        V48        V54 
0.19925373 0.19356353 0.18796918 0.18349174 0.17192594 0.13314312 
       V40        V51        V14        V49         V4        V47 
0.13259978 0.13241180 0.12283119 0.12201750 0.11739649 0.09142373 
       V38         V2        V12 
0.06350611 0.06184515 0.01583952 
We can repeat using the deviances. (The code below records the AIC of each one-variable model; since every one-variable model has the same number of parameters, the AIC and deviance orderings coincide.)

> result = numeric(57)
> for(i in 1:57){
+   x = spam[,i]
+   result[i] = glm(V58~x, data=spam, family=binomial)$aic
+ }
> names(result) = dimnames(spam)[[2]][1:57]
> sort(result)
     V53      V56       V7      V25      V23      V55      V52 
4847.149 5183.910 5209.846 5243.087 5276.169 5324.040 5367.302 
     V27      V21      V26      V16      V24      V17      V57 
5408.889 5411.282 5500.366 5557.045 5562.614 5752.493 5757.526 
     V20      V19      V29      V30       V5      V11      V35 
5800.516 5819.650 5837.397 5863.916 5864.294 5866.717 5869.286 
      V6       V8       V9      V46      V42      V31      V15 
5879.681 5887.116 5894.629 5911.216 5917.134 5927.454 5949.404 
     V37      V18      V28      V32       V3      V45      V34 
5959.255 5959.712 5971.049 5989.506 5992.864 6010.814 6015.228 
     V39      V33      V44      V41      V43      V36      V48 
6019.118 6024.452 6026.418 6031.122 6033.561 6034.592 6065.336 
     V10      V13       V1      V50       V4      V22      V51 
6073.669 6086.895 6098.960 6120.741 6126.981 6134.193 6135.145 
     V54      V49      V40      V14      V47      V38       V2 
6135.422 6150.519 6150.838 6157.191 6160.081 6167.345 6169.667 
     V12 
6173.878 
Somewhat different ordering. Note that a small deviance corresponds to a large association between variable and response. [10 marks for the list]

Task 4. Fit a logistic regression model (or models) to the data, carrying out the usual diagnostic checks. Your aim here is to find models that will be good predictors of email status. [15 marks]

First we fit a model to all the data, using all the variables:

> spam.glm = glm(V58~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14+V15+V16+V17+
+   V18+V19+V20+V21+V22+V23+V24+V25+V26+V27+V28+V29+V30+V31+V32+
+   V33+V34+V35+V36+V37+V38+V39+V40+V41+V42+V43+V44+V45+V46+V47+
+   V48+V49+V50+V51+V52+V53+V54+V55+V56+V57, data=spam, family=binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> spam.glm$aic
[1] 1931.765
> spam.glm$deviance
[1] 1815.765
Deleting the 3 emails with large values of V55-V57 makes no measurable difference to the AIC or deviance, so we use all the data from now on. We now proceed to fit two models using stepwise variable selection. Using backward elimination results in the selection of the full model, while the stepwise method results in a model retaining 33 variables.

model.step = step(model0, formula(spam.glm), direction = "both", trace = 0)
We can’t look at all possible regressions, there are too many. However, we can use the variable importance ordering in Question 3. For k = 1,2,…,57, we can fit a model to the first k variables in the ordering and record the AIC and in‐sample prediction error, using the function we wrote in Question 2. The code is ( for the deviance ordering) > indices = as.numeric(substr(names(sort(result)),2,3))
# this has taken the labels from the vector of results, stripped off
the V’s
# and converted the strings to numbers.
> indices
[1] 53 56 7 25 23 55 52 27 21 26 16 24 17 57 20 19 29 30 5 11 35
[22] 6 8 9 46 42 31 15 37 18 28 32 3 45 34 39 33 44 41 43 36 48
[43] 10 13 1 50 4 22 51 54 49 40 14 47 38 2 12
> for(i in 1:57){
+ X = as.matrix(spam[,indices[1:i]])
+ my.glm = glm(V58~X, data=spam, family=binomial)
+ result[i]=get.error(my.glm)
+ }
> plot(result, type="l")
This gives the following picture:

[Figure: in-sample prediction error plotted against the number of variables included]
We can see how the in-sample error declines steadily, suggesting the full model might be a good one. We need to check this by cross-validation (see below). The spike at index 51 was caused by the maximum likelihood routine not converging (a rare failure). Note also that there is a slight dip at index 30. For comparison purposes, we can also fit the model with the first 30 variables:

model30 = glm(V58~V53+V56+V7+V25+V23+V55+V52+V27+V21+
  V26+V16+V24+V17+V57+V20+V19+V29+V30+V5+V11+V35+V6+V8+
  V9+V46+V42+V31+V15+V37+V18, data=spam, family=binomial)
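The candidate models can be compared in-sample with the get.error function from Task 2; the approximate values quoted here are those reported in the table in Task 5:

```r
# In-sample prediction errors of the candidate models
get.error(spam.glm)     # full model, about 0.068
get.error(model.step)   # stepwise model, about 0.068
get.error(model30)      # first-30 model, about 0.075
```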
[10 marks for 3 or more models, 5 for some diagnostics. 4 marks deducted if only one model]

Task 5. Discuss the predictive ability of your model(s). With an email filter, we want the rate of misclassifying genuine emails as spam to be low (taking "genuine" as negative, this is the false positive rate): we don't want to lose any genuine emails. We will probably tolerate more spam that gets through the filter (false negatives) to achieve this. Discuss the tradeoff between false positives and false negatives, using a suitable graph. [15 marks]
We now have three models under consideration: the full model, the stepwise model, and the "first 30" model. We can work out a cross-validated estimate of the prediction error for each model, along with other information:

Model              Full       Step   first 30
No. of vars          57         33         30
Deviance       1815.765   1824.876          -
AIC            1931.765   1912.876   2109.775
In-sample error   0.068      0.068      0.075
CV correct        0.926      0.928      0.921
CV error          0.074      0.072      0.079

The cross-validated errors are obtained from

> cross.val(model30)
Mean Specificity = 0.9543973
Mean Sensitivity = 0.8707552
Mean Correctly classified = 0.9213804
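The function cross.val was supplied in the course and is not reproduced in this handout. A minimal sketch of the same calculation, K-fold cross-validation of the sensitivity, specificity and correct classification rate, might look like the following (the function and argument names here are my own, not the course code):

```r
cv.rates = function(glm.obj, data, K = 10){
  n = nrow(data)
  fold = sample(rep(1:K, length = n))   # random fold labels
  sens = spec = corr = numeric(K)
  for(k in 1:K){
    # refit on the training folds, then predict the held-out fold
    fit = glm(formula(glm.obj), data = data[fold != k, ], family = binomial)
    p = predict(fit, newdata = data[fold == k, ], type = "response")
    y = data$V58[fold == k]
    pred = as.numeric(p > 0.5)
    sens[k] = mean(pred[y == 1])        # true positive rate
    spec[k] = mean(1 - pred[y == 0])    # true negative rate
    corr[k] = mean(pred == y)           # overall correct classification
  }
  c(Specificity = mean(spec), Sensitivity = mean(sens),
    Correct = mean(corr))
}
```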
Seems like the stepwise model is OK: about the same as the full model but with fewer variables. The tradeoff between false positives and false negatives is expressed by the ROC curve:

> ROC.curve(model.step)
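ROC.curve is also a course-supplied function not reproduced here. A sketch of the underlying computation, sweeping the probability threshold and plotting the true positive rate against the false positive rate (the names below are illustrative, not the course code):

```r
roc.sketch = function(glm.obj){
  p = fitted(glm.obj)                 # fitted probabilities
  y = glm.obj$y                       # observed 0/1 response
  cuts = sort(unique(c(0, p, 1)))     # candidate thresholds
  tpr = sapply(cuts, function(cc) mean(p[y == 1] >= cc))
  fpr = sapply(cuts, function(cc) mean(p[y == 0] >= cc))
  plot(fpr, tpr, type = "l",
       xlab = "False positive rate", ylab = "True positive rate")
  abline(0, 1, lty = 2)               # reference line: a random classifier
}
roc.sketch(model.step)
```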
The predictor is good, with the area under the curve 0.9726. The false positive rate (the chance of classifying a genuine email as spam) is about 5%. If this is deemed too high, we will have to accept a lower true positive rate: from the graph, pushing the proportion of correctly classified genuine emails up to 98% (a FPR of 2%) would reduce the TPR to about 72%.

[5 marks for in-sample estimates of prediction error, 5 marks for crossval/bootstrap estimates of prediction error, 3 for ROC curve, 2 for discussion]

Total for assignment: 50 marks