R: 2/13/17
KNN Homework Problems
When you partition the data in VisMiner for all of the problems in this homework
assignment, select Fix random sequence.
Download the following file: Ch7hwStudent.xlsx. The file contains answer tables,
with some check figures. Record your answers in this file.
Problem 7.2 (UniversalBank)
Download the following data file: UniversalBank.csv
Read the first three paragraphs of the problem description for this problem on page
167 of the Shmueli textbook, but answer the questions below instead of the
questions shown in the textbook. In this problem, you will build a series of
classifiers, that is, algorithms that make categorical predictions.
This is a KNN classification problem (predicting a category), so use the KNN
modeler under Classification in VisMiner; do not use the KNN modeler under the
Regression option in VisMiner.
Data Description

Field Name           Field Description
ID                   Customer ID
ZIPCode              Home address ZIP code
Age                  Customer's age
Experience           Number of years of professional experience
Education            Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Family               Customer's family size
Income               Annual income of the customer ($000)
CCAvg                Avg. spending on credit cards from all banks combined per month ($000)
Mortgage             Value of house mortgage, if any ($000)
CreditCard           Does the customer use a credit card issued by UniversalBank? (1 = Yes, 0 = No)
yPersonal Loan       Did this customer accept the personal loan offered in the last campaign? (1 = Yes, 0 = No)
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank? (1 = Yes, 0 = No)
Online               Does the customer use internet banking facilities? (1 = Yes, 0 = No)

Data Preparation: Prepare the dataset for modeling by removing ID and ZIPCode.
Partition the data into 60% training and 40% validation. If there are collinear input
variables (IVs) where the correlation between the IVs is stronger than ± .75, retain
only the one most strongly correlated with the outcome variable.
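
The collinearity rule above can be sketched in plain Python for readers who want to check their reasoning outside VisMiner. This is only an illustration: the column names `x1`, `x2`, `x3`, the values, and the helper names `pearson` and `prune_collinear` are made up for the sketch and do not come from UniversalBank.csv.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def prune_collinear(inputs, outcome, threshold=0.75):
    """For each input pair with |r| above the threshold, drop the input
    that is less strongly correlated with the outcome."""
    kept = dict(inputs)
    names = list(inputs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(inputs[a], inputs[b])) > threshold:
                weaker = min((a, b), key=lambda name: abs(pearson(inputs[name], outcome)))
                del kept[weaker]
    return list(kept)

# Hypothetical inputs: x1 and x2 move together, x3 does not.
inputs = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
}
outcome = [1, 2, 3, 4, 6]

print(prune_collinear(inputs, outcome))  # ['x2', 'x3']
```

Here `x1` and `x2` are correlated above .75, and `x2` tracks the outcome slightly more closely, so `x1` is the one dropped.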
Where to put your answers: For each of the questions below, fill in the
classification-matrix cells and the AUC values highlighted in blue in the
Ch7hwStudent.xlsx spreadsheet. Note: The fastest way to get these results is to
mouse over the model.
a. This is a conceptual question. A classifier that follows the Naïve rule assigns
every observation to the most frequently occurring class. Which
PersonalLoan class would this default classifier pick? What would be the
error rate? Would the errors be made up of false positives, false negatives, or
both? Explain. (Note: If a classifier does not perform better than the
Naïve rule classifier, it is of no value.)
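
As a minimal sketch of how the Naïve rule works, the snippet below finds the majority class in a list of labels and computes the error rate of always predicting it. The labels "A" and "B" are invented for illustration and are not the PersonalLoan values; the helper name `naive_rule` is also hypothetical.

```python
from collections import Counter

def naive_rule(labels):
    """Return the majority class and the error rate of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    error_rate = 1 - count / len(labels)
    return majority, error_rate

# Made-up labels for illustration only.
labels = ["A", "A", "A", "B", "A", "B", "A", "A"]
majority, error_rate = naive_rule(labels)
print(majority, error_rate)  # A 0.25
```

Every misclassified record in this sketch belongs to the minority class, because the majority class is never predicted incorrectly.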
The table below summarizes the KNN algorithm settings for each of the following
KNN portions of this problem. Use these settings to create a KNN classifier for each
problem shown in the table. Additional directions are provided as needed.
KNN Algorithm Settings

Problem   Max K   Observation Weights    Attribute Weights
7.2.b     15      Equal                  None
7.2.c     15      Relative to distance   None
7.2.d     15      Relative to distance   Based on Correlation matrix
7.2.f     15      Relative to distance   Based on best predictor scores found in 7.2.e

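
To make the difference between the two Observation Weights settings concrete, here is a rough Python sketch of a KNN vote. It assumes that "Relative to distance" means closer neighbors count for more, implemented here as a vote of 1 / distance; VisMiner's exact formula may differ. The data points and the function name `knn_classify` are made up.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k, weighting="equal"):
    """Predict a class for `query` from its k nearest training points.

    train: list of (feature_tuple, label) pairs.
    weighting: "equal" gives each neighbor one vote; "distance" gives each
    neighbor a vote of 1 / (distance + epsilon), an assumed scheme.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        votes[label] += 1.0 if weighting == "equal" else 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Made-up, already-normalized points: one close neighbor of class 1,
# two distant neighbors of class 0.
train = [((0.0, 0.0), 1), ((1.0, 1.0), 0), ((1.0, 1.2), 0)]
query = (0.2, 0.1)

print(knn_classify(train, query, k=3, weighting="equal"))     # 0
print(knn_classify(train, query, k=3, weighting="distance"))  # 1
```

With equal weights the two distant neighbors outvote the close one; with distance-based weights the close neighbor dominates.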
b. Create the classifier. Based on the results, is the classifier more likely to
misclassify an actual positive (1) or actual negative (0)?
c. Create the classifier. Did setting the model to Relative to distance
substantially improve model performance?
d. Create a correlation matrix to evaluate inputs most correlated with the
response variable yPersonalLoan. Use your judgment to set attribute
weights in the KNN model specification according to information gleaned
from the correlation matrix. Insert a screen capture of your attribute
weights. Create the classifier. Did the weighting of attributes improve this
classifier’s performance over the previous two classifiers?
e. Build a classifier using the LogRegression modeler. (Note: Even though the
LogRegression modeler has not yet been introduced, you can create a
LogRegression model by dragging that modeler icon onto the partitioned
dataset.) How does the classification performance of the LogRegression
model compare to the previously built KNN models? Drag and drop the
model to a display and select "Predictor Scores." Predictor scores show how
influential variables were in helping a model predict outcomes. Insert a
screen capture of the predictor scores in this document.
f. Create a KNN model by using the predictor scores resulting from the
LogRegression model produced in (e) to set attribute weights. Insert a screen
capture of your attribute weights. Compare the performance of the resulting
classifier to the classifier in problem (d) that used correlations to set
attribute weights. Explain how they are different.
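
One way to picture what attribute weights do in parts (d) and (f) is a weighted Euclidean distance, where each squared difference is scaled by its attribute's weight. This is an assumed formulation for illustration; the source does not say how VisMiner applies the weights, and the points, weights, and function name below are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its attribute weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

# Made-up normalized records with two attributes.
query = (0.5, 0.5)
p = (0.6, 0.9)  # close on attribute 1, far on attribute 2
q = (0.8, 0.5)  # far on attribute 1, identical on attribute 2

unweighted = (1.0, 1.0)
favor_first = (1.0, 0.1)  # hypothetical weights that downplay attribute 2

# Is p nearer to the query than q?
print(weighted_distance(query, p, unweighted) < weighted_distance(query, q, unweighted))    # False
print(weighted_distance(query, p, favor_first) < weighted_distance(query, q, favor_first))  # True
```

Changing the weights changes which records count as nearest neighbors, which is why two weighting schemes can produce different classifiers from the same data.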
Problem 7.3 (BostonHousing)
Download the following data file: BostonHousing.csv
Ignore the problem description in the Shmueli textbook. Use this description and
these instructions instead.
The data file contains information on over 500 census tracts in Boston, where for
each tract multiple variable values are recorded. yMedVal is the outcome variable.
yMedValOver100 was derived from yMedVal, such that the value is 1 if yMedVal is
greater than 100, otherwise the value is zero.
Data Description

Field Name        Expected Min   Expected Max   Description
IDNum             1              506            Unique ID for each record
ByRiver           0              1              Charles River dummy variable (1 if tract bounds river; 0 otherwise)
CrimeRate         0              100            Per capita crime rate by town
DistToWork        1              15             Weighted distances to five Boston employment centres
HiwayAccess       1              25             Index of accessibility to radial highways
IndustrialPerc    5              30             Proportion of non-retail business acres per town
LargeLotsPerc     0              100            Proportion of residential land zoned for lots over 25,000 sq. ft.
LowIncomePerc     1              40             % lower income status of the population
NoxAirPerc        0              1              Nitric oxides concentration (parts per 10 million)
OldHomePerc       0              100            Proportion of owner-occupied units built prior to 1940
ProptyTaxRate     100            800            Full-value property-tax rate per $10,000
PupilsPerTeach    10             25             Pupil-teacher ratio by town
RoomsPerHome      2              10             Average number of rooms per dwelling
yMedVal           3              150            Median value of owner-occupied homes in $1000
yMedValOver100    0              1              Is median value of owner-occupied homes > $100K? (1 = Yes, 0 = No)

This is a KNN regression problem (predicting a continuous number value), so select
the KNN modeler under Regression in VisMiner. Do not use the KNN algorithm
under the classification option.
Data Preparation: Prepare the dataset for modeling by removing any unnecessary
input columns (be sure to remove yMedValOver100). Resolve collinearity problems
among input variables (IVs) where the correlation between the IVs exceeds ± .75.
Partition the data into 60% training and 40% validation.
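
For intuition about what a reproducible 60/40 partition involves, the following sketch shuffles rows with a fixed seed and then splits them. It is not VisMiner's procedure: the seed value, the toy rows, and the helper name `partition` are assumptions made for the example, and only the column names come from the data description.

```python
import random

def partition(rows, train_fraction=0.6, seed=1):
    """Shuffle with a fixed seed, then split into training and validation rows."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split every run
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records, not real BostonHousing.csv values.
rows = [{"RoomsPerHome": 4 + 0.3 * i,
         "yMedVal": 20 + 5 * i,
         "yMedValOver100": int(20 + 5 * i > 100)} for i in range(10)]

# Remove the derived column before modeling, as the instructions require.
prepared = [{k: v for k, v in row.items() if k != "yMedValOver100"} for row in rows]

train, valid = partition(prepared)
print(len(train), len(valid))  # 6 4
```

Because the seed is fixed, repeated runs produce the same training and validation rows, which is what makes results comparable across models.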
Where to put your answers: For each of the questions below, fill in the spreadsheet
cells (Best K, MAPE, RMSE, and R²) in the Ch7hwStudent.xlsx file. The fastest way to
get these results is to mouse over the model.
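
For reference, the three error measures named above can be computed as in the sketch below, using their common textbook definitions. VisMiner may define them slightly differently (for example, R² as a squared correlation), so treat this as an illustration; the actual and predicted values are invented.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAPE in percent, RMSE, R squared) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    return mape, rmse, r2

# Made-up validation values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mape, rmse, r2 = regression_metrics(actual, predicted)
print(round(mape, 3), round(rmse, 3), round(r2, 3))  # 11.875 2.55 0.948
```

Lower MAPE and RMSE indicate better predictions, while a higher R² indicates that more of the variation in the outcome is explained.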
The table below summarizes the KNN algorithm settings for specific problems.
Additional directions are provided as needed.
KNN Algorithm Settings

Problem   Max K   Observation Weights    Attribute Weights
7.3.b     15      Equal                  None
7.3.c     15      Relative to distance   None
7.3.d     15      Relative to distance   Based on Correlation matrix

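
The relationship between Max K and the reported Best K can be sketched as follows, assuming the modeler tries every K from 1 up to Max K and keeps the one with the lowest validation error. That search procedure is an assumption, not something the source states, and the data and function names are hypothetical. The sketch uses equal observation weights.

```python
import math

def knn_predict(train, query, k):
    """Average the outcomes of the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(y for _, y in nearest) / k

def best_k(train, valid, max_k):
    """Return the k in 1..max_k with the lowest validation RMSE."""
    def rmse(k):
        squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in valid]
        return math.sqrt(sum(squared) / len(squared))
    return min(range(1, max_k + 1), key=rmse)

# Made-up one-attribute data: the outcome is roughly 2 * x with alternating noise.
train = [((float(x),), 2.0 * x + (1 if x % 2 else -1)) for x in range(10)]
valid = [((2.5,), 5.0), ((6.5,), 13.0)]

print(best_k(train, valid, max_k=3))  # 2
```

In this toy data, averaging two neighbors cancels the alternating noise exactly, so K = 2 gives the lowest validation error of the three values tried.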
a. In addition to yMedValOver100, what other columns did you remove? Why?
For each problem below, build the model as directed and fill in the results in this
table. Also, answer the questions.
b. Build the KNN regression model as specified in the KNN Algorithm Settings
table.
c. Build a second KNN regression model as specified in the KNN Algorithm
Settings table.
d. Use the correlation matrix to evaluate inputs most correlated with the
response variable yMedVal. Build a third KNN regression with attribute
weights set according to information gleaned from the correlation matrix.
Save a screen capture of your attribute weights. Did the weighting of
attributes improve KNN regression performance over the previous two?
e. Build a multiple linear regression model using the same variables. Compare
its performance to that of the three KNN models just built.
f. Remove any predictors whose MLR coefficients are not statistically significant
at the .05 level and rerun the MLR model. Does the best MLR model perform as
well as the best KNN model on this problem?
g. If the purpose were to predict yMedVal for ten thousand new tracts,
what would be the disadvantage of using KNN? List the operations (steps)
the algorithm goes through in order to produce each prediction.