R: 2/13/17

KNN Homework Problems

When you partition the data in VisMiner for any problem in this homework assignment, select Fix random sequence.

Download the following file: Ch7hwStudent.xlsx. The file contains answer tables, with some check figures. Record your answers in this file.

Problem 7.2 (UniversalBank)

Download the following data file: UniversalBank.csv

Read the first three paragraphs of the problem description for this problem on page 167 of the Shmueli textbook, but answer the questions below instead of the questions shown in the textbook.

In this problem, you will build a series of classifiers, that is, algorithms that make categorical predictions. This is a KNN classification problem (predicting a category), so use the KNN modeler under Classification in VisMiner; do not use the KNN modeler under the Regression option in VisMiner.

Data Description

Field Name           Field Description
ID                   Customer ID
ZIPCode              Home Address ZIP code.
Age                  Customer's age
Experience           Number of years of professional experience
Education            Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Family               Customer's family size
Income               Annual income of the customer ($000)
CCAvg                Avg. spending on credit cards from all banks combined per month ($000)
Mortgage             Value of house mortgage if any. ($000)
CreditCard           Does the customer use a credit card issued by UniversalBank? (1 = Yes, 0 = No)
yPersonalLoan        Did this customer accept the personal loan offered in the last campaign? (1 = Yes, 0 = No)
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank? (1 = Yes, 0 = No)
Online               Does the customer use internet banking facilities? (1 = Yes, 0 = No)

Data Preparation: Prepare the dataset for modeling by removing ID and ZIPCode. Partition the data into 60% training and 40% validation.
If there are collinear input variables (IVs), that is, IVs whose correlation with each other is stronger than ±.75, retain only the one most strongly correlated with the outcome variable.

Where to put your answers: For each of the questions below, fill in the classification matrix cells and the AUC values highlighted in blue in the Ch7hwStudent.xlsx spreadsheet. Note: The fastest way to get these results is to mouse over the model.

a. This is a conceptual question. A classifier that follows the Naïve rule always predicts that every observation belongs to the most frequently occurring class. What yPersonalLoan class would this default classifier pick? What would be the error rate? Would the errors be made up of false positives, false negatives, or both? Explain. (Note: If a classifier does not perform better than the Naïve rule classifier, it is of no value.)

The table below summarizes the KNN algorithm settings for each of the KNN portions of this problem. Use these settings to create a KNN classifier for each problem shown in the table. Additional directions are provided as needed.

KNN Algorithm Settings

Problem   Max K   Observation Weights    Attribute Weights
7.2.b     15      Equal                  None
7.2.c     15      Relative to distance   None
7.2.d     15      Relative to distance   Based on Correlation matrix
7.2.f     15      Relative to distance   Based on best predictor scores found in 7.2.e

b. Create the classifier. Based on the results, is the classifier more likely to misclassify an actual positive (1) or an actual negative (0)?

c. Create the classifier. Did setting the model to Relative to distance substantially improve model performance?

d. Create a correlation matrix to evaluate the inputs most correlated with the response variable yPersonalLoan. Use your judgment to set attribute weights in the KNN model specification according to information gleaned from the correlation matrix. Insert a screen capture of your attribute weights. Create the classifier.
Did the weighting of attributes improve this classifier's performance over the previous two classifiers?

e. Build a classifier using the LogRegression modeler. (Note: Even though the LogRegression modeler has not yet been introduced, you can create a LogRegression model by dragging that modeler icon onto the partitioned dataset.) How does the classification performance of the LogRegression model compare to that of the previously built KNN models? Drag and drop the model to a display and select "Predictor Scores." Predictor scores show how influential each variable was in helping the model predict outcomes. Insert a screen capture of the predictor scores in this document.

f. Create a KNN model, using the predictor scores from the LogRegression model produced in (e) to set attribute weights. Insert a screen capture of your attribute weights. Compare the performance of the resulting classifier to the classifier in problem (d), which used correlations to set attribute weights. Explain how they are different.

Problem 7.3 (BostonHousing)

Download the following data file: BostonHousing.csv

Ignore the problem description in the Shmueli textbook. Use this description and these instructions instead.

The data file contains information on over 500 census tracts in Boston; multiple variable values are recorded for each tract. yMedVal is the outcome variable. yMedValOver100 was derived from yMedVal, such that the value is 1 if yMedVal is greater than 100; otherwise the value is zero.
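The derivation of yMedValOver100 described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (1 if yMedVal is greater than 100, otherwise 0); the function name `over_100_flag` and the sample values are hypothetical and are not taken from BostonHousing.csv or from VisMiner.

```python
def over_100_flag(med_val):
    """Return 1 if the median value (in $1000) is strictly greater than 100, else 0."""
    return 1 if med_val > 100 else 0


# Hypothetical yMedVal values, chosen only to show the boundary behavior.
sample_values = [72.0, 100.0, 100.5, 150.0]
flags = [over_100_flag(v) for v in sample_values]
print(flags)  # [0, 0, 1, 1]
```

Note that a tract with yMedVal exactly equal to 100 is flagged 0, because the rule uses "greater than." Since the flag is computed directly from yMedVal, it is a function of the outcome variable itself.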
Data Description

Field Name        Expected Min   Expected Max   Description
IDNum             1              506            Unique ID for each record
ByRiver           0              1              Charles River dummy variable (1 if tract bounds river; 0 otherwise)
CrimeRate         0              100            Per capita crime rate by town
DistToWork        1              15             Weighted distances to five Boston employment centres
HiwayAccess       1              25             Index of accessibility to radial highways
IndustrialPerc    5              30             Proportion of non-retail business acres per town.
LargeLotsPerc     0              100            Proportion of residential land zoned for lots over 25,000 sq. ft.
LowIncomePerc     1              40             % lower income status of the population
NoxAirPerc        0              1              Nitric oxides concentration (parts per 10 million)
OldHomePerc       0              100            Proportion of owner-occupied units built prior to 1940
ProptyTaxRate     100            800            Full-value property-tax rate per $10,000
PupilsPerTeach    10             25             Pupil-teacher ratio by town
RoomsPerHome      2              10             Average number of rooms per dwelling
yMedVal           3              150            Median value of owner-occupied homes in $1000
yMedValOver100    0              1              Is median value of owner-occupied homes > $100K? (1 = Yes, 0 = No)

This is a KNN regression problem (predicting a continuous numeric value), so select the KNN modeler under Regression in VisMiner. Do not use the KNN modeler under the Classification option.

Data Preparation: Prepare the dataset for modeling by removing any unnecessary input columns (be sure to remove yMedValOver100). Resolve collinearity problems among input variables (IVs) where the correlation between the IVs is stronger than ±.75. Partition the data into 60% training and 40% validation.

Where to put your answers: For each of the questions below, fill in the spreadsheet cells (Best K, MAPE, RMSE, and R²) in the Ch7hwStudent.xlsx file. The fastest way to get these results is to mouse over the model.

The table below summarizes the KNN algorithm settings for specific problems. Additional directions are provided as needed.
KNN Algorithm Settings

Problem   Max K   Observation Weights    Attribute Weights
7.3.b     15      Equal                  None
7.3.c     15      Relative to distance   None
7.3.d     15      Relative to distance   Based on Correlation matrix

a. In addition to yMedValOver100, what other columns did you remove? Why?

For each problem below, build the model as directed and fill in the results in the answer table. Also, answer the questions.

b. Build the KNN regression model as specified in the KNN Algorithm Settings table.

c. Build a second KNN regression model as specified in the KNN Algorithm Settings table.

d. Use the correlation matrix to evaluate the inputs most correlated with the response variable yMedVal. Build a third KNN regression model with attribute weights set according to information gleaned from the correlation matrix. Save a screen capture of your attribute weights. Did the weighting of attributes improve KNN regression performance over the previous two models?

e. Build a multiple linear regression (MLR) model using the same variables. Compare its performance to that of the three KNN models just built.

f. Remove any MLR predictors whose coefficients are not statistically significant at the .05 level and rerun the MLR model. Does the best MLR model perform as well as the best KNN model on this problem?

g. If the purpose were to predict yMedVal for ten thousand new tracts, what would be the disadvantage of using KNN? List the operations (steps) the algorithm goes through in order to produce each prediction.
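The three error measures recorded for Problem 7.3 (MAPE, RMSE, and R²) can be sketched as follows. This is a minimal illustration using the common textbook definitions; VisMiner may compute them slightly differently (for example, R² is sometimes reported as a squared correlation), so treat the formulas as an assumption. The function names and the sample numbers are hypothetical and are not results from BostonHousing.csv.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the outcome variable."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation-set values, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]

print(round(mape(actual, predicted), 2))       # 7.5
print(round(rmse(actual, predicted), 2))       # 19.36
print(round(r_squared(actual, predicted), 2))  # 0.97
```

Lower MAPE and RMSE indicate better predictions, while a higher R² indicates that the model explains more of the variation in the outcome. When comparing the KNN and MLR models, compare the measures computed on the same validation partition.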