CISW-AML-Lab

Cortana Intelligence Suite Workshop –
Azure Machine Learning
Student Lab
Last Updated: Buck Woody, 4/7/2016
In this lab, you'll use the Azure ML platform to create and test a predictive model for
customer churn. This is a canonical example that you can apply in many solutions, and
you'll learn the general process of working with Azure ML: ingesting, cleaning, and
conforming data, and creating and testing a model.
Lab 1 – Create Experiment, Load Source Data
Lab Steps
If you do not already have an Azure ML Account, navigate to http://studio.azureml.net and sign
up for either a free or paid environment.
1. Download the source data from KDD:
a. http://kdd.org/cupfiles/KDDCupData/2009/orange_small_train.data.zip
b. Unzip, save out the main file (not the Checksum file) to your disk as
orange_small_train_churn_data
2. Download the source labels from KDD:
a. http://kdd.org/cupfiles/KDDCupData/2009/orange_small_train_churn.labels
b. Save as orange_small_train_churn_labels
3. Read more about this data: http://kdd.org/kdd-cup/view/kdd-cup-2009
4. Read the data into your experiment
a. In Azure ML Studio, import those files as Generic TSV (the data file has headers, the labels
file does not) into two new datasets, using the same names as the files
5. Visualize the data
a. Find the Descriptive Statistics module, and connect the main data file as input (not the
labels)
b. Save and run the experiment with the name "Customer Churn"
c. Visualize the results, and save the output locally to your system; a short pandas sketch for
previewing the same data outside of Studio follows these steps
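If you want to preview the same data outside of Azure ML Studio, here is a minimal pandas sketch. It assumes the two files were saved with the names used above and that the data file is tab-separated with a header row, as it is imported in this lab:

```python
import pandas as pd

# Load the training data and the churn labels saved in steps 1 and 2.
data = pd.read_csv("orange_small_train_churn_data", sep="\t")
labels = pd.read_csv("orange_small_train_churn_labels",
                     header=None, names=["ChurnLabel"])

print(data.shape)                           # rows x columns
print(data.describe())                      # rough stand-in for Descriptive Statistics
print(labels["ChurnLabel"].value_counts())  # label distribution (-1 / 1 in the KDD files)
```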
Lab 2 – Clean the Data
Lab Steps
1. Project out indices 1-190 from the output of the orange_small_train_data dataset (Hint:
use the Project Columns module).
2. From that result, project out all columns with the exception of the following columns, to remove
sparse data:
a. Var6, Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55, Var79,
Var141, Var167, Var175, Var185
3. You want to distinguish between missing values (which you will later replace with 0) and columns
whose actual value is 0. To do this, connect the output of Project Columns to an Apply Math
Operation module (this and the following step are sketched in pandas below):
a. Category = Operations
b. Basic Operation = Add
c. Operation argument type = Constant
d. Constant operation argument = 1
e. Column set = Column Type Numeric, all
f. Output mode = ResultOnly
4. Now add a Clean Missing Data module connected to the output you just created:
a. Selected Columns = All columns
b. Minimum missing value ratio = 0
c. Maximum missing value ratio = 1
d. Cleaning Mode = Custom substitution value
e. Replacement value = 0
f. Connect the cleaned dataset output
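Outside of Studio, the effect of steps 2 through 4 on the numeric columns can be sketched in pandas roughly like this; it is an illustration only, not the Studio implementation, with the column names from step 2 and the file name from Lab 1:

```python
import pandas as pd

data = pd.read_csv("orange_small_train_churn_data", sep="\t")

# Step 2: keep only the non-sparse numeric columns.
keep = ["Var6", "Var8", "Var15", "Var20", "Var31", "Var32", "Var39", "Var42",
        "Var48", "Var52", "Var55", "Var79", "Var141", "Var167", "Var175", "Var185"]
numeric = data[keep]

# Step 3: add 1 to every numeric value, so real zeros become 1.
numeric = numeric + 1

# Step 4: substitute 0 for missing values; they now stay distinct from real zeros.
numeric = numeric.fillna(0)
```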
5. Return to the orange_small_train_data dataset and project out indices 191-230 from the
output of the dataset (again using the Project Columns module).
6. From that result, add a Clean Missing Data module:
a. Selected Columns = All columns
b. Minimum missing value ratio = 0
c. Maximum missing value ratio = 1
d. Cleaning Mode = Custom substitution value
e. Replacement value = 0
7. Connect the cleaned dataset output to a new Metadata Editor module:
a. Select all columns and features
b. Don't change the datatype
c. Make the data Categorical
d. Do not change the fields
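The categorical half (steps 5 through 7) can be sketched the same way; here, `astype("category")` stands in for marking the columns categorical in the Metadata Editor:

```python
import pandas as pd

data = pd.read_csv("orange_small_train_churn_data", sep="\t")

# Step 5: columns 191-230 of the original file hold the categorical variables
# (0-based slice 190:230 in pandas).
categorical = data.iloc[:, 190:230]

# Step 6: substitute 0 for missing values.
categorical = categorical.fillna("0")

# Step 7: mark the columns as categorical without changing their values.
categorical = categorical.astype("category")
```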
8. Save and run your experiment, right-click and visualize the results.
Figure 1: Visual output from Lab 2
Lab 3 – Quantize and Combine Data
Lab Steps
1. The next step is to bin the values to make the algorithm operate more effectively. To do
this, connect the left side of the data output you just created to a Quantize Data module:
a. Binning mode = EqualWidth
b. Number of bins = 50
c. Quantile normalization = Percent
d. Columns to bin = Column type Numeric, all
e. Output mode = In place
f. Tag columns as categorical = checked
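Equal-width binning with 50 bins can be illustrated with pandas on a single cleaned column. This is a sketch of the idea only; the Quantize Data module bins every numeric column and applies percent quantile normalization, which is omitted here:

```python
import pandas as pd

data = pd.read_csv("orange_small_train_churn_data", sep="\t")
col = (data["Var6"] + 1).fillna(0)   # one cleaned numeric column from Lab 2

# Equal-width binning: divide the value range into 50 same-sized intervals
# and replace each value with its bin index, treated as a categorical code.
binned = pd.cut(col, bins=50, labels=False).astype("category")
print(binned.value_counts().head())
```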
2. Next, combine the data columns you have cleaned into one dataset:
a. Connect the left output of the Quantize Data module to the left input of an Add
Columns module
b. Connect the output from the right-hand side of your Experiment's Metadata Editor
module to the right-hand side of that same Add Columns module
3. It's time to bring in the labels for the Experiment. Add the orange_small_train_churn_labels
Dataset from your Saved Datasets to the right-hand side of your Experiment.
4. Connect the output of the orange_small_train_churn_labels Dataset to the right-hand
input of a new Add Columns module.
5. Connect the output of the left-hand Add Columns module to the left-input of this new Add
Columns module.
6. From the Add Columns module you just created, connect the output to a new Metadata Editor
module:
a. Column names = Col1
b. Data type = Unchanged
c. Categorical = Unchanged
d. Fields = Unchanged
e. New column names = ChurnLabel
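The combining and renaming in this lab amount to column-wise concatenation plus a rename. A hedged pandas sketch, with small stand-ins for the two cleaned branches from Lab 2, might look like this:

```python
import pandas as pd

data = pd.read_csv("orange_small_train_churn_data", sep="\t")
labels = pd.read_csv("orange_small_train_churn_labels", header=None, names=["Col1"])

# Stand-ins for the cleaned numeric and categorical branches from Lab 2
# (only a few columns each, to keep the sketch small).
numeric = (data[["Var6", "Var8", "Var15"]] + 1).fillna(0)
categorical = data.iloc[:, 190:195].fillna("0").astype("category")

# Add Columns twice: numeric + categorical, then + the label column;
# the final Metadata Editor renames the label column to ChurnLabel.
combined = pd.concat([numeric, categorical, labels], axis=1)
combined = combined.rename(columns={"Col1": "ChurnLabel"})
print(combined.columns[-3:])
```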
Figure 2: Output after Lab 3
Lab 4 – Create Models, Train
Lab Steps
1. Add a Two-Class Boosted Decision Tree module:
a. Create trainer mode = Single Parameter
b. Maximum number of leaves per tree = 20
c. Minimum number of samples per leaf node = 50
d. Learning rate = 0.2
e. Number of trees constructed = 500
f. Random number seed = 5678
g. Allow unknown categorical levels = checked
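Azure ML's Two-Class Boosted Decision Tree is its own implementation, but a loosely analogous scikit-learn learner, with the module's settings mapped approximately onto its parameters, might look like this:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Approximate mapping of the module settings above onto scikit-learn;
# the two implementations differ, so treat this as an illustration only.
boosted = GradientBoostingClassifier(
    n_estimators=500,      # Number of trees constructed = 500
    learning_rate=0.2,     # Learning rate = 0.2
    max_leaf_nodes=20,     # Maximum number of leaves per tree = 20
    min_samples_leaf=50,   # Minimum number of samples per leaf node = 50
    random_state=5678,     # Random number seed = 5678
)
```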
2. At this point we need to train the model. We'll send 70% of the data to the
training system, and 30% to the scoring system (a scikit-learn sketch of this split and training
follows this step).
a. Connect the last Metadata Editor module output to the input of a new Split Data
module
i. Splitting mode = Split Rows
ii. Fraction of rows in the first output dataset = 0.7
iii. Randomized split = checked
iv. Random seed = 789
v. Stratified split = False
b. Now connect the output of the Two-Class Boosted Decision Tree module to the left-input of a new Train Model module
c. Connect the left-output of the Split Data module to the right-input of that Train Model
module
i. Label column
1. Column names = ChurnLabel
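The 70/30 split and the first training step can be sketched with scikit-learn as follows, continuing the `combined` frame from the Lab 3 sketch and the `boosted` learner above. The one-hot encoding is a simplification; Studio's modules handle categorical columns natively:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns so the scikit-learn learner can use them.
X = pd.get_dummies(combined.drop(columns=["ChurnLabel"]))
y = combined["ChurnLabel"]

# Split Data: 70% of the rows for training, 30% held out for scoring, fixed seed.
X_train, X_score, y_train, y_score = train_test_split(
    X, y, train_size=0.7, random_state=789)

# Train Model with ChurnLabel as the label column.
boosted.fit(X_train, y_train)
```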
3. Move to the right-hand side of your Experiment to create the next model
a. Add a Two-Class Decision Forest module.
b. Resampling method = Bagging
c. Create trainer mode = Single parameter
d. Number of decision trees = 500
e. Maximum depth of the decision trees = 20
f. Number of random splits per node = 100
g. Minimum number of samples per leaf node = 5
h. Allow unknown values for categorical features = checked
4. Train this new model
a. Connect the output of the Two-Class Decision Forest to the left-input of a new Train
Model module
b. Connect the left output of the Split Data module to the right-input of the new Train
Model module
i. Label column
1. Column names = ChurnLabel
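Similarly, a rough scikit-learn stand-in for the Two-Class Decision Forest, trained on the same split as the previous sketch, could look like this; the "Number of random splits per node" setting has no direct equivalent here and is omitted:

```python
from sklearn.ensemble import RandomForestClassifier

# Loose analogue of the Two-Class Decision Forest settings; a random forest
# uses bagging by default, matching Resampling method = Bagging.
forest = RandomForestClassifier(
    n_estimators=500,     # Number of decision trees = 500
    max_depth=20,         # Maximum depth of the decision trees = 20
    min_samples_leaf=5,   # Minimum number of samples per leaf node = 5
)

# Second Train Model module, using the same ChurnLabel label and training rows.
forest.fit(X_train, y_train)
```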
5. Save and run your experiment
Figure 3: Output after Lab 4
Lab 5 – Evaluate the Models in Your Experiment
Lab Steps
1. The next step is to score the Two-Class Boosted Decision Tree:
a. Connect the output of the Train Model module attached to the Two-Class Boosted Decision
Tree to the left-input of a Score Model module
b. Connect the right-output of the Split Data module to the right-input of the Score Model
module
i. Append score columns to output = checked
2. Now score the Two-Class Decision Forest Model
a. Connect the output of the Train Model module attached to the Two-Class Decision Forest to
the left-input of a Score Model module
b. Connect the right-output of the Split Data module to the right-input of the Score Model
module
i. Append score columns to output = checked
3. Now, evaluate the models
a. Add a new Evaluate Model module to the bottom of your Experiment
b. Connect the output of the left-hand Score Model module to the left-input of the
Evaluate Model module
c. Connect the output of the right-hand Score Model module to the right-input of the
Evaluate Model module
4. Save and Run your experiment
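Continuing the earlier sketches, the scoring and evaluation steps correspond roughly to predicting on the held-out 30% and comparing the two learners. AUC is one of the measures the Evaluate Model module reports:

```python
from sklearn.metrics import roc_auc_score

# Score Model: probability of the positive (churn) class on the held-out rows.
boosted_scores = boosted.predict_proba(X_score)[:, 1]
forest_scores = forest.predict_proba(X_score)[:, 1]

# Evaluate Model: compare both learners on the same scoring data.
print("Boosted tree AUC:   ", roc_auc_score(y_score, boosted_scores))
print("Decision forest AUC:", roc_auc_score(y_score, forest_scores))
```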
Figure 4: Final output after Lab 5
1. Do these models work well? How do you know?
2. What changes can you make to these modules to make them
perform better?
3. Should you use other Models? Which ones? Why?