Machine Learning - Course Exercises

Task Description:
In this exercise you will work with a data set that contains petrographical (https://en.wikipedia.org/wiki/Petrography) and core plug analysis results (https://en.wikipedia.org/wiki/Core_sample) from the Rotliegend Formation in the Netherlands¹. The Rotliegend sandstones (https://en.wikipedia.org/wiki/Rotliegend) form the reservoir rock of the Groningen natural gas field, which is operated by the Nederlandse Aardolie Maatschappij BV (NAM), a joint venture between Shell and ExxonMobil.

Accurately estimating the permeability of the reservoir rock from which the hydrocarbons are produced is an important task when characterizing reservoirs. Permeability can be measured in the laboratory with specialized equipment; however, obtaining those measurements takes time and may incur additional cost. It is therefore useful to obtain reasonably accurate estimates of permeability before the actual laboratory work has been conducted. Since permeability is strongly related to the pore geometry, which can be determined quite easily by microscopy, in this exercise we will build models that express permeability in terms of those pore properties.

¹ Amthor, J.E. and J. Okkerman, 1998, Influence of early diagenesis on reservoir quality of Rotliegende Sandstones, Northern Netherlands: American Association of Petroleum Geologists Bulletin.

Figure 1: Core sample analysis performed in a laboratory (Source: Wikipedia)

Overall Learning Objectives:
- Understand the basic syntax of R
- Import (csv) data into R
- Perform basic exploratory data analysis and visualization
- Build a simple linear model and interpret its results
- Understand the necessity of out-of-sample testing for estimating the generalization error of a model
- Implement a basic n-fold cross-validation scheme
- Use the MLR package to set up a tuning and benchmarking experiment that compares the performance of several algorithms
- Work with a data set from the hydrocarbon industry that illustrates some of the challenges the industry is facing

Pre-Work:

Learning objectives:
- Understand the basic syntax of R
- Understand some of the fundamentals of how core samples are analysed

Work items:
1. If you are unfamiliar with the R language, start working through a tutorial that explains the basic syntax and data types, e.g. http://www.r-tutor.com/r-introduction. For more details, have a look at http://www.cyclismo.org/tutorial/R/.
2. We will be using the MLR package for model tuning and benchmarking. Consider reading http://mlr-org.github.io/mlr-tutorial/release/html/ for an introduction and general overview.
3. Have a look at the following articles to get a better understanding of the data we will be handling:
   a. https://en.wikipedia.org/wiki/Core_sample
   b. https://en.wikipedia.org/wiki/Petrography
   c. https://en.wikipedia.org/wiki/Rotliegend
   d. http://petrowiki.org/Permeability_determination

Exercise Group 1:

Learning objectives:
- Import (csv) data into R
- Perform basic exploratory data analysis and visualization
- Build a simple linear model and interpret its results

Work items:
1. Start RStudio. Open the exerciseRockStudents.R file via File -> Open File. Set the R working directory to the location of the exerciseRockStudents.R file, for instance via Session -> Set Working Directory -> To Source File Location.
2. Run the code up to line 31 to import and tabulate the data.
3. Run the code up to line 46 to create histogram plots of the individual columns of the data set.
4. Write code to visualize the histograms of the columns that are not yet covered by the existing code.
5. Once you plot Plug.Permeability.md, what do you observe? Which problems could arise if we try to predict a property with such a distribution? What could be done to mitigate this potential problem?
6. Create a scatter plot matrix of the different properties using the pairs command. We are most interested in relationships between Log.Plug.Permeability.md and the other properties. What do you observe? Which properties seem to be highly correlated with it?
7. To quantify your visual observations, use the cor function to compute the Pearson correlation between the numerical columns and Log.Plug.Permeability.md. Report the covariates in descending order of the absolute value of the correlation coefficient. Are your numerical findings in agreement with the visual inspection?
8. Run the given code to create a linear model that predicts Log.Plug.Permeability.md from the given numerical columns. Read the documentation on linear models and, specifically, understand the definition of the R² error measure. Is the model we have just built a "good" model? On what do you base your judgement? Note that the error statistics are reported in sample. How is the observed in-sample error related to the expected generalization error of the model?

Exercise Group 2:

Learning objectives:
- Understand the necessity of out-of-sample testing for estimating the generalization error of a model
- Implement a basic n-fold cross-validation scheme

Work items:
1. Implement n-fold cross-validation without using any external packages and test the out-of-sample performance of the linear model you built at the end of Exercise Group 1 with 20-fold cross-validation. As an error metric, report at least the R² performance of the model. Commands that you may find useful are nrow, sample, rSQ, and predict. You can use set.seed to make your results reproducible.
2. How does the out-of-sample performance compare with the in-sample performance?
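The 20-fold cross-validation described in work item 1 above can be sketched in base R roughly as follows. This is a minimal sketch, not the exercise solution: the column names and the rSQ helper are assumptions based on the exercise text, and a small synthetic data frame stands in for the imported csv.

```r
# Synthetic stand-in for the imported exercise data (column names assumed):
set.seed(42)
n <- 200
rock <- data.frame(
  Porosity        = runif(n, 5, 25),
  Mean.Grain.Size = runif(n, 0.1, 0.5)
)
rock$Log.Plug.Permeability.md <-
  0.15 * rock$Porosity + 4 * rock$Mean.Grain.Size + rnorm(n, sd = 0.3)

# Out-of-sample R-squared; the exercise script's rSQ is assumed to be similar.
rSQ <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

k <- 20
# Randomly assign each row to one of k folds.
fold <- sample(rep(seq_len(k), length.out = nrow(rock)))

r2 <- numeric(k)
for (i in seq_len(k)) {
  train <- rock[fold != i, ]           # all folds except fold i
  test  <- rock[fold == i, ]           # held-out fold i
  fit   <- lm(Log.Plug.Permeability.md ~ ., data = train)
  r2[i] <- rSQ(test$Log.Plug.Permeability.md, predict(fit, newdata = test))
}
mean(r2)   # average out-of-sample R-squared across the k folds
```

Note that each fold is used exactly once as the held-out test set, so every observation contributes one out-of-sample prediction.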
3. How would you expect linear models with additional degrees of freedom (for instance, obtained by using additional or derived covariates) to behave in sample and out of sample?
4. Do you think it is a good idea to implement cross-validation schemes yourself? Think about scenarios in which you combine cross-validation with imputation (https://en.wikipedia.org/wiki/Imputation_(statistics)) and parameter tuning.

Exercise Group 3:

Learning objectives:
- Use the MLR package to set up a tuning and benchmarking experiment that compares the performance of several machine learning algorithms against some baseline measures

Work items:
1. Execute the given code for the MLR experiment and try to understand the different components of the setup.
2. Why is it sensible to include a naïve baseline, such as predicting the mean, in the benchmarking experiment?
3. Why is the R² value for predicting the mean (sometimes) negative? What does this imply?
4. What differences in model performance do you observe between the different techniques?
5. How do the performance metrics change for the different techniques when you rerun the model, and how is this related to the reported standard error?
6. Use listLearners(reg.task) to find out which other methods are supported by the MLR package. Pick at least two additional learning algorithms, make them available within the benchmarking workflow, and re-run the benchmarking exercise.
7. Run the example code that shows how to define a learner that is automatically tuned on a subsample of the training set.
8. Set up a tuning experiment in which you tune the regr.nnet learner, which is currently not performing too well.

Exercise Group 4:

Learning objectives:
- Explore the capabilities of R/MLR
- Work with a data set from the hydrocarbon industry that illustrates some of the challenges the industry is facing

Work items:
1. Try to further improve your model(s).
For instance, include categorical covariates, try additional or other input data transformations, or choose different ML algorithms and tune them properly. What is the best model that you can build?

Remark: In practice one needs to trade off model performance against model complexity, since acquiring additional data may incur additional cost while adding only marginal value.
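Returning to Exercise Group 1, items 7-8, ranking covariates by absolute Pearson correlation and fitting the linear model can be sketched in base R as follows. The column names are assumptions based on the exercise text, and a synthetic data frame stands in for the imported csv.

```r
# Synthetic stand-in for the imported exercise data (column names assumed):
set.seed(1)
n <- 100
rock <- data.frame(
  Porosity        = runif(n, 5, 25),
  Mean.Grain.Size = runif(n, 0.1, 0.5),
  Clay.Content    = runif(n, 0, 15)
)
rock$Log.Plug.Permeability.md <-
  0.15 * rock$Porosity + 4 * rock$Mean.Grain.Size -
  0.05 * rock$Clay.Content + rnorm(n, sd = 0.3)

# Item 7: Pearson correlation of each covariate with the response,
# ordered by decreasing absolute value.
covars <- setdiff(names(rock), "Log.Plug.Permeability.md")
corr   <- sapply(rock[covars], cor, y = rock$Log.Plug.Permeability.md)
print(sort(abs(corr), decreasing = TRUE))

# Item 8: linear model on all numeric covariates and its in-sample R-squared.
fit <- lm(Log.Plug.Permeability.md ~ ., data = rock)
summary(fit)$r.squared
```

Keep in mind that summary(fit)$r.squared is an in-sample statistic; the cross-validation of Exercise Group 2 is what estimates the generalization error.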
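For Exercise Group 3, a minimal mlr benchmarking setup with a naïve mean-predicting baseline might look as sketched below. This assumes the mlr package (and the learners' backing packages) are installed, that `rock` is the imported exercise data frame, and that the task/learner names match the exercise script; treat it as a sketch, not the script itself.

```r
library(mlr)

# Regression task: predict log permeability from the remaining columns.
reg.task <- makeRegrTask(data = rock, target = "Log.Plug.Permeability.md")

learners <- list(
  makeLearner("regr.featureless"),  # naive baseline: predicts the mean
  makeLearner("regr.lm"),           # linear model
  makeLearner("regr.rpart")         # decision tree
)

# 20-fold cross-validation, reporting R-squared per learner.
rdesc <- makeResampleDesc("CV", iters = 20)
bmr   <- benchmark(learners, reg.task, rdesc, measures = list(rsq))
getBMRAggrPerformances(bmr, as.df = TRUE)
```

The featureless baseline makes the R² numbers interpretable: any learner worth keeping should beat predicting the mean, and on individual folds the baseline's out-of-sample R² can indeed be negative.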
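Likewise, wrapping regr.nnet in an automatic tuning step (Exercise Group 3, item 8) might be sketched with mlr as follows. The parameter names size and decay come from the nnet package; the value grids here are illustrative guesses, not recommended settings.

```r
library(mlr)

# Hypothetical grid over the nnet hyperparameters (illustrative values only).
ps <- makeParamSet(
  makeDiscreteParam("size",  values = c(2, 4, 8)),
  makeDiscreteParam("decay", values = c(0, 0.01, 0.1))
)
ctrl  <- makeTuneControlGrid()
inner <- makeResampleDesc("CV", iters = 5)  # inner resampling on training data

# A learner that tunes itself on each training split before fitting.
tuned.nnet <- makeTuneWrapper(
  makeLearner("regr.nnet"),
  resampling = inner, par.set = ps, control = ctrl, measures = list(rsq)
)
# tuned.nnet can now be used like any other mlr learner, e.g. inside benchmark().
```

Because the tuning happens inside an inner resampling loop on the training data only, the outer cross-validation estimate remains an honest measure of generalization error.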