Machine Learning - Course Exercises

Task Description:
In this exercise you will work with a data set that contains petrographical (https://en.wikipedia.org/wiki/Petrography) and core plug analysis results (https://en.wikipedia.org/wiki/Core_sample) from the Rotliegend Formation in the Netherlands¹. The Rotliegend sandstones (https://en.wikipedia.org/wiki/Rotliegend) form the reservoir rock of the Groningen natural gas field, which is operated by the Nederlandse Aardolie Maatschappij BV (NAM), a joint venture between Shell and ExxonMobil.

Accurately estimating the permeability of the reservoir rock from which the hydrocarbons are produced is an important task when characterizing reservoirs. Permeability can be measured in the laboratory with specialized equipment; however, obtaining those measurements takes time and may incur additional cost. It is therefore useful to obtain reasonably accurate estimates of permeability before the actual laboratory work has been conducted. Since permeability is strongly related to the pore geometry, which can be determined quite easily by microscopy, in this exercise we will build models that express permeability in terms of those pore properties.

¹ Amthor, J.E. and J. Okkerman, 1998, Influence of early diagenesis on reservoir quality of Rotliegende Sandstones, Northern Netherlands: American Association of Petroleum Geologists Bulletin.

Figure 1: Core sample analysis performed in a laboratory (Source: Wikipedia)

Overall Learning Objectives:
- Understand the basic syntax of R
- Import (csv) data into R
- Perform basic exploratory data analysis and visualization
- Build a simple linear model and interpret its results
- Understand the necessity of out-of-sample testing for estimating the generalization error of a model
- Implement a basic n-fold cross-validation scheme
- Use the MLR package to set up a tuning and benchmarking experiment that compares the performance of several algorithms
- Work with a data set from the hydrocarbon industry that illustrates some of the challenges the industry is facing

Pre-Work:

Learning objectives:
- Understand the basic syntax of R
- Understand some of the fundamentals of how core samples are analysed

Work items:
1. If you are unfamiliar with the R language, start working through a tutorial that explains the basic syntax and data types, e.g. http://www.r-tutor.com/r-introduction. For more details, have a look at http://www.cyclismo.org/tutorial/R/.
2. We will be using the MLR package for model tuning and benchmarking. Consider reading http://mlr-org.github.io/mlr-tutorial/release/html/ for an introduction and general overview.
3. Have a look at the following articles to get a better understanding of the data we will be handling:
   a. https://en.wikipedia.org/wiki/Core_sample
   b. https://en.wikipedia.org/wiki/Petrography
   c. https://en.wikipedia.org/wiki/Rotliegend
   d. http://petrowiki.org/Permeability_determination

Exercise Group 1:

Learning objectives:
- Import (csv) data into R
- Perform basic exploratory data analysis and visualization
- Build a simple linear model and interpret its results

Work items:
1. Start RStudio. Open the exerciseRockStudents.R file via File -> Open File. Set the R working directory to the location of the exerciseRockStudents.R file, for instance via Session -> Set Working Directory -> To Source File Location.
2. Run the code up to line 31 to import and tabulate the data.
3. Run the code up to line 46 to create histogram plots of the individual columns of the data set.
4. Write code to visualize the histograms of the columns that are not yet covered by the existing code.
5. Once you plot Plug.Permeability.md, what do you observe? Which problems could arise if we try to predict a property with such a distribution? What could be done to mitigate this potential problem?
6. Create a scatter plot matrix of the different properties using the pairs command. We are most interested in relationships between Log.Plug.Permeability.md and the other properties. What do you observe? Which properties seem to be highly correlated with it?
7. To quantify your visual observations, use the cor function to compute the Pearson correlation between the numerical columns and Log.Plug.Permeability.md. Report the covariates in descending order of the absolute value of the correlation coefficient. Are your numerical findings in agreement with the visual inspection?
8. Run the given code to create a linear model that predicts Log.Plug.Permeability.md from the given numerical columns. Read the documentation on linear models and, specifically, understand the definition of the R² error measure. Is the model we have just built a "good" model? On what do you base your judgement? Note that the error statistics are reported in sample. How is the observed in-sample error related to the expected generalization error of the model?

Exercise Group 2:

Learning objectives:
- Understand the necessity of out-of-sample testing for estimating the generalization error of a model
- Implement a basic n-fold cross-validation scheme

Work items:
1. Implement n-fold cross-validation without using any external packages and test the out-of-sample performance of the linear model you built at the end of Exercise Group 1 with 20-fold cross-validation. As an error metric, report at least the R² performance of the model. Commands that you may find useful are nrow, sample, rSQ, and predict. You can use set.seed to make your results reproducible.
2. How does the out-of-sample performance compare with the in-sample performance?
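The 20-fold cross-validation described in work item 1 above can be sketched in base R roughly as follows. This is a minimal sketch, not the exercise solution: the column names and the rSQ helper are assumptions based on the exercise text, and a small synthetic data frame stands in for the imported csv.

```r
# Synthetic stand-in for the imported exercise data (column names assumed):
set.seed(42)
n <- 200
rock <- data.frame(
  Porosity        = runif(n, 5, 25),
  Mean.Grain.Size = runif(n, 0.1, 0.5)
)
rock$Log.Plug.Permeability.md <-
  0.15 * rock$Porosity + 4 * rock$Mean.Grain.Size + rnorm(n, sd = 0.3)

# Out-of-sample R-squared; the exercise script's rSQ is assumed to be similar.
rSQ <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

k <- 20
# Randomly assign each row to one of k folds.
fold <- sample(rep(seq_len(k), length.out = nrow(rock)))

r2 <- numeric(k)
for (i in seq_len(k)) {
  train <- rock[fold != i, ]           # all folds except fold i
  test  <- rock[fold == i, ]           # held-out fold i
  fit   <- lm(Log.Plug.Permeability.md ~ ., data = train)
  r2[i] <- rSQ(test$Log.Plug.Permeability.md, predict(fit, newdata = test))
}
mean(r2)   # average out-of-sample R-squared across the k folds
```

Note that each fold is used exactly once as the held-out test set, so every observation contributes one out-of-sample prediction.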
3. How would you expect linear models with additional degrees of freedom (for instance, obtained by using additional or derived covariates) to behave in sample and out of sample?
4. Do you think it is a good idea to implement cross-validation schemes yourself? Think about scenarios in which you combine cross-validation with imputation (https://en.wikipedia.org/wiki/Imputation_(statistics)) and parameter tuning.

Exercise Group 3:

Learning objectives:
- Use the MLR package to set up a tuning and benchmarking experiment that compares the performance of several machine learning algorithms against some baseline measures

Work items:
1. Execute the given code for the MLR experiment and try to understand the different components of the setup.
2. Why is it sensible to include a naïve baseline, such as predicting the mean, in the benchmarking experiment?
3. Why is the R² value for predicting the mean (sometimes) negative? What does this imply?
4. What differences in model performance do you observe between the different techniques?
5. How do the performance metrics change for the different techniques when you rerun the model, and how is this related to the reported standard error?
6. Use listLearners(reg.task) to find out which other methods are supported by the MLR package. Pick at least two additional learning algorithms, make them available within the benchmarking workflow, and re-run the benchmarking exercise.
7. Run the example code that shows how to define a learner that is automatically tuned on a subsample of the training set.
8. Set up a tuning experiment in which you tune the regr.nnet learner, which is currently not performing too well.

Exercise Group 4:

Learning objectives:
- Explore the capabilities of R/MLR
- Work with a data set from the hydrocarbon industry that illustrates some of the challenges the industry is facing

Work items:
1. Try to further improve your model(s).
For instance, include categorical covariates, try additional or other input data transformations, or choose different ML algorithms and tune them properly. What is the best model that you can build?

Remark: In practice one needs to trade off model performance against model complexity, since acquiring additional data may incur additional cost while adding only marginal value.
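Returning to Exercise Group 1, items 7-8, ranking covariates by absolute Pearson correlation and fitting the linear model can be sketched in base R as follows. The column names are assumptions based on the exercise text, and a synthetic data frame stands in for the imported csv.

```r
# Synthetic stand-in for the imported exercise data (column names assumed):
set.seed(1)
n <- 100
rock <- data.frame(
  Porosity        = runif(n, 5, 25),
  Mean.Grain.Size = runif(n, 0.1, 0.5),
  Clay.Content    = runif(n, 0, 15)
)
rock$Log.Plug.Permeability.md <-
  0.15 * rock$Porosity + 4 * rock$Mean.Grain.Size -
  0.05 * rock$Clay.Content + rnorm(n, sd = 0.3)

# Item 7: Pearson correlation of each covariate with the response,
# ordered by decreasing absolute value.
covars <- setdiff(names(rock), "Log.Plug.Permeability.md")
corr   <- sapply(rock[covars], cor, y = rock$Log.Plug.Permeability.md)
print(sort(abs(corr), decreasing = TRUE))

# Item 8: linear model on all numeric covariates and its in-sample R-squared.
fit <- lm(Log.Plug.Permeability.md ~ ., data = rock)
summary(fit)$r.squared
```

Keep in mind that summary(fit)$r.squared is an in-sample statistic; the cross-validation of Exercise Group 2 is what estimates the generalization error.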
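For Exercise Group 3, a minimal mlr benchmarking setup with a naïve mean-predicting baseline might look as sketched below. This assumes the mlr package (and the learners' backing packages) are installed, that `rock` is the imported exercise data frame, and that the task/learner names match the exercise script; treat it as a sketch, not the script itself.

```r
library(mlr)

# Regression task: predict log permeability from the remaining columns.
reg.task <- makeRegrTask(data = rock, target = "Log.Plug.Permeability.md")

learners <- list(
  makeLearner("regr.featureless"),  # naive baseline: predicts the mean
  makeLearner("regr.lm"),           # linear model
  makeLearner("regr.rpart")         # decision tree
)

# 20-fold cross-validation, reporting R-squared per learner.
rdesc <- makeResampleDesc("CV", iters = 20)
bmr   <- benchmark(learners, reg.task, rdesc, measures = list(rsq))
getBMRAggrPerformances(bmr, as.df = TRUE)
```

The featureless baseline makes the R² numbers interpretable: any learner worth keeping should beat predicting the mean, and on individual folds the baseline's out-of-sample R² can indeed be negative.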
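Likewise, wrapping regr.nnet in an automatic tuning step (Exercise Group 3, item 8) might be sketched with mlr as follows. The parameter names size and decay come from the nnet package; the value grids here are illustrative guesses, not recommended settings.

```r
library(mlr)

# Hypothetical grid over the nnet hyperparameters (illustrative values only).
ps <- makeParamSet(
  makeDiscreteParam("size",  values = c(2, 4, 8)),
  makeDiscreteParam("decay", values = c(0, 0.01, 0.1))
)
ctrl  <- makeTuneControlGrid()
inner <- makeResampleDesc("CV", iters = 5)  # inner resampling on training data

# A learner that tunes itself on each training split before fitting.
tuned.nnet <- makeTuneWrapper(
  makeLearner("regr.nnet"),
  resampling = inner, par.set = ps, control = ctrl, measures = list(rsq)
)
# tuned.nnet can now be used like any other mlr learner, e.g. inside benchmark().
```

Because the tuning happens inside an inner resampling loop on the training data only, the outer cross-validation estimate remains an honest measure of generalization error.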