Cortana Intelligence Suite Workshop – Azure Machine Learning Student Lab
Last updated: Buck Woody, 4/7/2016

In this lab, you'll use the Azure ML platform to create and test a predictive model for customer churn. This is a canonical example that you can apply to many solutions, and you'll learn the general process of working with Azure ML: ingesting, cleaning, and conforming data, and creating and testing a model.

Lab 1 – Create Experiment, Load Source Data

Lab Steps

If you do not already have an Azure ML account, navigate to http://studio.azureml.net and sign up for either a free or paid environment.

1. Download the source data from KDD:
   a. http://kdd.org/cupfiles/KDDCupData/2009/orange_small_train.data.zip
   b. Unzip, and save the main file (not the checksum file) to your disk as orange_small_train_churn_data
2. Download the source labels from KDD:
   a. http://kdd.org/cupfiles/KDDCupData/2009/orange_small_train_churn.labels
   b. Save as orange_small_train_churn_labels
3. Read more about this data: http://kdd.org/kdd-cup/view/kdd-cup-2009
4. Read the data into your experiment:
   a. In Azure ML Studio, import those files as generic TSV (the data file with headers, the labels file without headers) into two new datasets, using the same names as the files
5. Visualize the data:
   a. Find the Descriptive Statistics module, and connect the main data file (not the labels) as its input
   b. Save and run the experiment with the name "Customer Churn"
   c. Visualize the results, and save the output locally to your system

Lab 2 – Clean the Data

Lab Steps

1. Project column indices 1-190 from the output of the orange_small_train_churn_data dataset (Hint: use the Project Columns module)
2. From that result, project only the following columns (excluding all others) to remove sparse data:
   a. Var6, Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55, Var79, Var141, Var167, Var175, Var185
3. You want to be able to distinguish values that are actually 0 from missing values, which will be replaced with 0 in the next step. To do this, connect the Project Columns output to an Apply Math Operation module:
   a. Category = Operations
   b. Basic Operation = Add
   c. Operation argument type = Constant
   d. Constant operation argument = 1
   e. Column set = Column Type Numeric, all
   f. Output mode = ResultOnly
4. Now add a Clean Missing Data module connected to the output you just created:
   a. Selected Columns = All columns
   b. Minimum missing value ratio = 0
   c. Maximum missing value ratio = 1
   d. Cleaning Mode = Custom substitution value
   e. Replacement value = 0
   f. Connect the cleaned dataset output
5. Return to the orange_small_train_churn_data dataset, and project column indices 191-230 from its output
6. From that result, add a Clean Missing Data module:
   a. Selected Columns = All columns
   b. Minimum missing value ratio = 0
   c. Maximum missing value ratio = 1
   d. Cleaning Mode = Custom substitution value
   e. Replacement value = 0
7. Connect the cleaned dataset output to a new Metadata Editor:
   a. Select all columns and features
   b. Don't change the data type
   c. Make the data Categorical
   d. Do not change the fields
8. Save and run your experiment, then right-click the output and visualize the results

Figure 1: Visual output from Lab 2
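The lab itself is built entirely from Azure ML Studio modules. If you would like to see the same cleaning logic expressed in code, the following is a minimal pandas sketch of Labs 1 and 2. The file names and column lists come from the lab; everything else (for example, that the data file's header names the columns Var1 through Var230) is an assumption rather than part of the lab.

```python
# Illustrative pandas equivalent of the Lab 1-2 steps; not part of the
# Azure ML Studio experiment. Assumes the two KDD files were downloaded
# and unzipped locally as described in Lab 1.
import pandas as pd

# Lab 1: the data file is tab-separated with a header row; the labels file
# is a single unnamed column of -1/+1 churn labels.
data = pd.read_csv("orange_small_train_churn_data", sep="\t")
labels = pd.read_csv("orange_small_train_churn_labels",
                     header=None, names=["ChurnLabel"])

# Lab 2, steps 1-2: keep only the non-sparse numeric variables listed in
# the lab (column indices 1-190 are the numeric ones).
numeric_keep = ["Var6", "Var8", "Var15", "Var20", "Var31", "Var32", "Var39",
                "Var42", "Var48", "Var52", "Var55", "Var79", "Var141",
                "Var167", "Var175", "Var185"]
numeric = data[numeric_keep]

# Lab 2, steps 3-4: add 1 to every numeric value so a genuine 0 can be told
# apart from a missing value, then substitute 0 for the missing values.
numeric = (numeric + 1).fillna(0)

# Lab 2, steps 5-7: take the categorical variables (column indices 191-230),
# substitute "0" for missing values, and mark the columns as categorical.
categorical = data.iloc[:, 190:230].fillna("0").astype("category")
```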
Lab 3 – Quantize and Combine Data

Lab Steps

1. The next step is to bin the values so the algorithms operate more effectively. To do this, connect the left-hand side of the data output you just created to a Quantize Data module:
   a. Binning mode = EqualWidth
   b. Number of bins = 50
   c. Quantile normalization = Percent
   d. Columns to bin = Column type Numeric, all
   e. Output mode = In place
   f. Tag columns as categorical = checked
2. Next, combine the data columns you have cleaned into one dataset:
   a. Connect the left output of the Quantize Data module to the left input of an Add Columns module
   b. Connect the output from the right-hand side of your Experiment's Metadata Editor module to the right input of that same Add Columns module
3. It's time to bring in the labels for the Experiment. Add the orange_small_train_churn_labels dataset from your Saved Datasets to the right-hand side of your Experiment
4. Connect the output of the orange_small_train_churn_labels dataset to the right input of a new Add Columns module
5. Connect the output of the left-hand Add Columns module to the left input of this new Add Columns module
6. From the Add Columns module you just created, connect the output to a new Metadata Editor module:
   a. Column names = Col1
   b. Data type = Unchanged
   c. Categorical = Unchanged
   d. Fields = Unchanged
   e. New column names = ChurnLabel

Figure 2: Output after Lab 3

Lab 4 – Create Models, Train

Lab Steps

1. Add a Two-Class Boosted Decision Tree module:
   a. Create trainer mode = Single Parameter
   b. Maximum number of leaves per tree = 20
   c. Minimum number of samples per leaf node = 50
   d. Learning rate = 0.2
   e. Number of trees constructed = 500
   f. Random number seed = 5678
   g. Allow unknown categorical levels = checked
2. At this point we need to train the model. We'll send 70% of the data to the training system and 30% to the scoring system:
   a. Connect the last Metadata Editor module output to the input of a new Split Data module:
      i. Splitting mode = Split Rows
      ii. Fraction of rows in the first output dataset = 0.7
      iii. Randomized split = checked
      iv. Random seed = 789
      v. Stratified split = False
   b. Now connect the output of the Two-Class Boosted Decision Tree module to the left input of a new Train Model module
   c. Connect the left output of the Split Data module to the right input of that Train Model module:
      i. Label column: Column names = ChurnLabel
3. Move to the right-hand side of your Experiment to create the next model:
   a. Add a Two-Class Decision Forest module
   b. Resampling method = Bagging
   c. Create trainer mode = Single Parameter
   d. Number of decision trees = 500
   e. Maximum depth of the decision trees = 20
   f. Number of random splits per node = 100
   g. Minimum number of samples per leaf node = 5
   h. Allow unknown values for categorical features = checked
4. Train this new model:
   a. Connect the output of the Two-Class Decision Forest module to the left input of a new Train Model module
   b. Connect the left output of the Split Data module to the right input of the new Train Model module:
      i. Label column: Column names = ChurnLabel
5. Save and run your experiment

Figure 3: Output after Lab 4
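For readers following along in code rather than in the Studio designer, the sketch below continues from the earlier pandas sketch and approximates Labs 3 and 4 with scikit-learn. The bin count, split fraction, and tree settings mirror the lab wherever scikit-learn exposes a comparable parameter; the estimator classes themselves are stand-ins for the Studio modules, not exact equivalents.

```python
# Illustrative scikit-learn equivalent of Labs 3-4; not part of the Azure ML
# Studio experiment. Assumes `numeric`, `categorical`, and `labels` were built
# as in the earlier pandas sketch.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Lab 3: bin each numeric column into 50 equal-width bins, treat the bin index
# as categorical, then combine the features and the churn labels.
binned = numeric.apply(lambda col: pd.cut(col, bins=50, labels=False))
features = pd.concat([binned.astype("category"), categorical], axis=1)
X = pd.get_dummies(features)          # one-hot encode for scikit-learn
y = labels["ChurnLabel"]

# Lab 4, step 2: 70% of the rows train the models; 30% are held out for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=789)

# Lab 4, step 1: boosted trees with settings similar to the
# Two-Class Boosted Decision Tree module.
boosted = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.2, max_leaf_nodes=20,
    min_samples_leaf=50, random_state=5678).fit(X_train, y_train)

# Lab 4, step 3: a bagged forest with settings similar to the
# Two-Class Decision Forest module.
forest = RandomForestClassifier(
    n_estimators=500, max_depth=20, min_samples_leaf=5,
    random_state=5678).fit(X_train, y_train)
```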
Lab 5 – Evaluate the Models in Your Experiment

Lab Steps

1. The next step is to score the Two-Class Boosted Decision Tree:
   a. Connect the output of the Train Model module you just created to the left input of a Score Model module
   b. Connect the right output of the Split Data module to the right input of the Score Model module:
      i. Append score columns to output = checked
2. Now score the Two-Class Decision Forest model:
   a. Connect the output of the Train Model module you created to the left input of a Score Model module
   b. Connect the right output of the Split Data module to the right input of the Score Model module:
      i. Append score columns to output = checked
3. Now, evaluate the models:
   a. Add a new Evaluate Model module to the bottom of your Experiment
   b. Connect the output of the left-hand Score Model module to the left input of the Evaluate Model module
   c. Connect the output of the right-hand Score Model module to the right input of the Evaluate Model module
4. Save and run your experiment

Figure 4: Final output after Lab 5

1. Do these models work well? How do you know?
2. What changes can you make to these modules to make them perform better?
3. Should you use other models? Which ones? Why?
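As a starting point for the first question above: in the Studio designer, the Evaluate Model module reports ROC/AUC, precision, recall, and accuracy for both scored datasets. The sketch below, continuing from the earlier scikit-learn sketch, computes roughly comparable numbers in code; it assumes the `boosted`, `forest`, `X_test`, and `y_test` objects defined there.

```python
# Illustrative evaluation step, roughly analogous to Lab 5's Score Model and
# Evaluate Model modules; not part of the Azure ML Studio experiment.
from sklearn.metrics import roc_auc_score, accuracy_score

for name, model in [("Boosted Decision Tree", boosted),
                    ("Decision Forest", forest)]:
    scores = model.predict_proba(X_test)[:, 1]   # scored probability of churn
    preds = model.predict(X_test)                # scored label
    print(name,
          "AUC:", round(roc_auc_score(y_test, scores), 3),
          "Accuracy:", round(accuracy_score(y_test, preds), 3))
```

Because the churn labels in this dataset are heavily imbalanced, accuracy alone can look deceptively good; AUC, or precision and recall at a chosen threshold, is a better guide to whether the models actually work well.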