How to Create an Entry Using Your Favorite R Environment and
Submit to Cortana Intelligence “Women’s Health Risk Assessment” Competition
In this tutorial, you will build a solution in your favorite R environment outside of Azure ML, and then
create a valid entry to enter the Women’s Health Risk Assessment Competition. You should also feel free
to engineer other features based on your own understanding of the task and the data. Please keep in
mind, when you deploy your solution in Azure ML, the pipeline of data processing, feature engineering,
etc., that you develop in your own R environment has to be implemented in Azure ML too, so that it
works properly.
1. Download the training data (optional)
Download the training dataset to your local machine. In this tutorial, it is saved on your PC as
“C:\WHRA_Onprem\WomenHealth_Training.csv”.
This step is optional, since R can read the data directly from the URL into an R data frame; that is
how we read the data in this tutorial.
2. Copy the Starter Experiment
Enter the competition by following steps 1 and 2 in tutorial “15 Minutes to Build Your First Solutions
for the Microsoft Cortana Intelligence Competition: Women’s Health Risk Assessment”.
The important step here is that you make a copy of the Starter Experiment into your Azure ML
workspace. You will need to modify this Starter Experiment to build the web service API from the
machine learning model you train on your own machine.
In this tutorial, you will learn how to write R scripts to implement the steps of cleaning missing data
and training machine learning models, and to bring the trained model into Azure ML for submission.
3. Download the R script files
From the Tutorial section of the Competition page, download an R script file that will help you get
started, WHRA_Onprem_Solution.R, to a local directory. In this example, it is “E:\WHRA_Onprem”.
4. Open the R script file in R Tools for Visual Studio or RStudio
Open the R script file WHRA_Onprem_Solution.R in an IDE of your choice for R development. This
tutorial uses R Tools for Visual Studio (RTVS), which allows you to develop and run R scripts in Visual
Studio. But you can use RStudio or other tools, and you can go bare-knuckle R as well if you’d like.
After opening it in RTVS, you will see the following UI:
5. Create a new project in your R environment and add an existing item, script file
WHRA_Onprem_Solution.R
Before running the R script file WHRA_Onprem_Solution.R, make sure that you have the nnet package
installed. If it is missing, use the following line to install it:
install.packages("nnet")
This R script file WHRA_Onprem_Solution.R is the file to run in this tutorial to build an on-premises
solution for this competition. After the script completes, the trained multiclass logistic regression model
is saved to the file specified in the variable model_rda_file. In this example, the path to the file is
C:/WHRA_Onprem/logitmodel.rda. If you want to save your model somewhere else, update the
variable model_rda_file in the first line of the file.
model_rda_file <- "C:/WHRA_Onprem/logitmodel.rda"
This R script file has the following blocks in sequence:
5.1 Read the data from a URL into a data frame
When reading the data from the URL, we specify the types of all columns as integer, except the 36th
column, which is read as character. After the data is read into a data frame named dataset1, use
summary(dataset1) to inspect the basic statistics of each variable.
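As a minimal sketch of this colClasses pattern, the snippet below reads a toy three-column CSV (the real training file has many more columns, with only the religion column read as character; its URL is given on the competition page):

```r
# Toy CSV standing in for the training file; column names other than religion
# are made up for illustration
csv_text <- "INTNR,age,religion\n1,23,Hindu\n2,31,NA\n3,45,Buddhist"
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# colClasses pins each column's type up front, as the tutorial script does
col_types <- c("integer", "integer", "character")
dataset1 <- read.csv(tmp, colClasses = col_types)
summary(dataset1)
```

Specifying colClasses avoids R silently guessing column types, which matters when a mostly-numeric column should stay character.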
5.2 Concatenate columns geo, segment, and subgroup into a single column
Then, we concatenate columns geo, segment, and subgroup into a single column combined_label so
that in the later stage, we can build a single multiclass classification model to predict the segment
and subgroup of a subject simultaneously. Here, although column geo is given for each subject and
is not a column to be predicted, we still concatenate it with segment and subgroup, since the same
segment and subgroup can have different meanings in different geolocations.
We convert the column combined_label to a factor variable for this multiclass classification task.
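A minimal sketch of this label construction, assuming single-digit geo, segment, and subgroup codes as in the competition data (the toy values are made up):

```r
# Toy frame with single-digit codes
df <- data.frame(geo = c(1, 2), segment = c(3, 1), subgroup = c(2, 4))

# Combine the three codes into one label, then make it a factor
# for multiclass classification
df$combined_label <- factor(paste0(df$geo, df$segment, df$subgroup))
levels(df$combined_label)
```

Because each code is a single digit, the combined label can later be split back into its three components character by character.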
5.3 Clean missing data
The original data has missing values in many of the columns. We replace the missing values in the
numeric columns with 0, and in the character column (religion) with “0”. After that, we convert
column religion into a factor variable.
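The cleaning step can be sketched as follows on a toy data frame (the column name religion follows the tutorial; column x is made up for illustration):

```r
# Toy frame: religion is the character column; x stands in for a numeric column
df <- data.frame(x = c(1L, NA, 3L),
                 religion = c("Hindu", NA, "Buddhist"),
                 stringsAsFactors = FALSE)

# Character column: replace NA with "0", then convert to a factor
religion <- df$religion
religion[is.na(religion)] <- "0"
df$religion <- factor(religion)

# Remaining (numeric) columns: replace NA with 0
df[is.na(df)] <- 0
```

Replacing the character NAs before the blanket numeric replacement keeps the factor levels clean.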
5.4 Split data into training and validation data
The entire dataset is randomly split into training (75%) and validation (25%) sets. In later steps, we
use the training data to train a model, and evaluate it on the validation data.
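A minimal sketch of the random 75/25 split, using the built-in iris data as a stand-in for the competition data (the seed is an arbitrary choice for reproducibility):

```r
set.seed(123)    # arbitrary seed, for reproducibility
n <- nrow(iris)  # iris stands in for the competition data frame

# Draw 75% of the row indices for training; the rest are validation
train_index <- sample(n, size = floor(0.75 * n))
train <- iris[train_index, ]
validation <- iris[-train_index, ]
```

Negative indexing (`iris[-train_index, ]`) guarantees the two sets are disjoint and together cover every row.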
5.5 Remove columns segment, subgroup, and INTNR from feature set
Since the new label column combined_label is created from the three columns geo, segment, and
subgroup, we exclude columns segment and subgroup from the model training step. We keep
column geo in the feature set since this column is given, and we hope that keeping it in the model
helps predict the segment and subgroup within the given geolocation.
Variable INTNR is also removed from the feature set since it is an internal patient number, unique
for each patient in each geolocation.
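The column removal can be sketched with setdiff() on a toy frame that borrows the tutorial's column names (the age column is made up for illustration):

```r
# Toy frame borrowing the tutorial's column names
df <- data.frame(INTNR = 1:3, geo = c(1, 2, 1), segment = c(1, 1, 2),
                 subgroup = c(2, 3, 1), age = c(20, 30, 40))

# Keep every column except the label components and the internal patient number
feature_cols <- setdiff(names(df), c("INTNR", "segment", "subgroup"))
features <- df[, feature_cols]
names(features)  # geo stays in the feature set
```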
5.6 Train a multiclass logistic regression model on the training data
Now we are ready to train a multiclass logistic regression model using the multinom() function in
the nnet library. We need to explicitly specify the model formula, with the following line:
model_formula <- formula(paste('combined_label ~ ',
                               paste(col_names[feature_index], collapse=' + '), sep=''))
Then, train a multiclass logistic regression model:
glmmodel <- multinom(model_formula, data = train, MaxNWts=3000, maxit = 500)
In the multinom() function, we have to specify MaxNWts and maxit since the default values are
1000 and 100, respectively. Not setting MaxNWts to an adequately large number (like 3000 in
this example) results in an error complaining that the number of weights needed is larger than the
default (1000). The parameter maxit controls the maximum number of training iterations.
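As a runnable illustration of these parameters, here is multinom() fit on the built-in iris data as a stand-in for the competition data (MaxNWts = 3000 mirrors the tutorial's setting and is far more than iris actually needs):

```r
library(nnet)

# iris stands in for the competition data; Species plays the role of combined_label
set.seed(1)
fit <- multinom(Species ~ ., data = iris,
                MaxNWts = 3000,  # generous weight budget, as in the tutorial
                maxit = 500,     # allow more iterations than the default 100
                trace = FALSE)

pred <- predict(fit, iris)
mean(pred == iris$Species)  # training-set accuracy
```

On wide data such as the competition's, the weight count grows with both the number of features and the number of classes, which is why MaxNWts must be raised there.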
To know how the model performs on the holdout validation data, apply the model glmmodel to the
validation data and calculate the accuracy. Please note that accuracy is the performance metric
used to rank entries in this competition. This tutorial script yields a decent accuracy value; your job
is to figure out creative ways to improve this number without overfitting.
predicted_labels <- predict(glmmodel, validation)
accuracy <- round(sum(predicted_labels == validation$combined_label) / nrow(validation) * 100, 6)
print(paste("The accuracy on validation data is ", accuracy, "%", sep=""))
Assuming that we are satisfied with the performance on the validation data, we use the entire data
to train a model.
glmmodel <- multinom(model_formula, data = data.set, MaxNWts=3000, maxit = 500)
6. Save the model object to a local .rda file
Save the logistic regression model as a local .rda file.
save(glmmodel, file = model_rda_file)
7. Zip the .rda file into a .zip file
Go to the directory where you stored the .rda file, in this example “C:\WHRA_Onprem”, and zip
logitmodel.rda into a new zip file named logitmodel.zip.
8. Upload the file into Azure ML
8.1 Go to the workspace in Azure ML Studio where you copied the Starter Experiment, and
click the “+ NEW” button at the bottom-left corner of the page.
Then, select DATASET and FROM LOCAL FILE.
8.2 Upload the zip file logitmodel.zip from your local directory:
The dialog box will automatically infer the data file type from the file extension, which in this
case is ZIP. You can also give the file a new name if you want. In this tutorial, we name this
zip file whra logistic model.
9. Build a predictive experiment in Azure ML to operationalize the model
In this competition, testing data is not shared with you. Instead, you will need to create a web
service API from the model and R script you bring into Azure ML, and let the evaluation process
invoke it to make predictions on the testing data. The web service API is created and deployed
from a predictive experiment. Therefore, you will need to create a predictive experiment that
generates the same set of features from the test data as were generated from the training data
during training, and then calls the model to make predictions based on those features.
9.1 Open the Starter Experiment you copied to your workspace when you entered the competition
in Step 2. Keep the Reader module, and delete all other modules. Save this experiment using a
different name.
Please note that you should not build your predictive experiment from scratch via +New >
Experiment, because an experiment created that way does not carry the metadata for this
competition and therefore cannot be used to generate a valid entry.
9.2 Add an Execute R Script module, and add the whra logistic model dataset that we just uploaded
in Step 8 to the experiment from the Saved Datasets, My Datasets section in the toolbox. Also,
add a Web service input module and a Web service output module to the experiment. Connect
them as follows:
9.3 Replace the R script in the Execute R Script module with the following script. Please note that
the logitmodel.zip file is automatically unzipped, and logitmodel.rda is dropped into the src
folder of the sandbox R runtime in Azure ML, which is why you can load it directly with
load('src/logitmodel.rda'). See this article for more information on how to work with R in Azure
ML.
The script in the Execute R Script module also implements the step of replacing missing values
with 0 for numeric variables and “0” for the character variable (religion). After the R script makes
predictions, the predicted labels are split into three columns: Geo_Pred, Segment_Pred, and
Subgroup_Pred. Column patientID, together with these three columns, is output from the
Execute R Script module. These four columns are the required output schema of this
competition.
# Import library nnet
library(nnet)

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1)  # class: data.frame

# Replace missing values with 0 ("0" for the character variable religion)
religion <- dataset1$religion
religion[is.na(religion)] <- "0"
dataset1$religion <- factor(religion)
dataset1[is.na(dataset1)] <- 0

# Load the logistic regression model from the zip file
load('src/logitmodel.rda')  # loads the model object glmmodel from the .rda file
predicted_labels <- predict(glmmodel, dataset1)  # make predictions on dataset1

# Extract the geo, segment, and subgroup digits from each predicted label
nrows <- length(predicted_labels)
predictions <- matrix(rep(0, nrows * 3), nrow = nrows, ncol = 3)
for (i in 1:nrows) {
    x <- as.character(predicted_labels[i])
    predictions[i, ] <- as.numeric(unlist(strsplit(x, "")))
}
predictions <- as.data.frame(predictions)

# Output patientID plus the three prediction columns
data.set <- data.frame(dataset1[, 1], predictions)
colnames(data.set) <- c("patientID", "Geo_Pred", "Segment_Pred", "Subgroup_Pred")
maml.mapOutputPort("data.set")
9.4 Run the experiment
Click the Run button at the bottom of the studio, and the experiment will start running. It should
take around a minute to complete.
10. Deploy web service, and submit for evaluation
After the experiment completes successfully, click “DEPLOY WEB SERVICE,” and a web service API
will be created from this predictive experiment.
Click the SUBMIT COMPETITION ENTRY button of the web service API page, and an entry submission
wizard will be launched and walk you through the steps to submit.
One quick tip here is to name your entry properly. This competition allows you to submit multiple
entries. The name, once submitted, cannot be changed, and it is visible only to you, so you might
consider an easily recognizable name for your own reference.
Also, you will likely see the following warning upon validation in the wizard. Simply ignore it. The
reason for the warning is that the wizard cannot detect a Trained Model module in the graph. This is
fine, since we created our trained model in R and saved it to the .rda file; there is no Trained Model
module produced by a training experiment.
11. Improve your model and resubmit a new entry
After you successfully submit your first entry, you can go back to Step 5 and refactor your R script
to achieve higher accuracy. You can then repackage the model and upload it into Azure ML. You can
overwrite the same .zip file when uploading, but please make sure you remove the old one from the
experiment before re-adding the updated one. This is because Azure ML has a versioning capability:
it remembers old versions of uploaded assets until you physically remove them from the graph. Then
you can re-run your experiment, re-deploy (essentially update) your web service, and submit a new
entry.
Good luck, and we will see you on the leaderboard!