Machine Learning Algorithms for soil mapping
Edited by: T. Hengl
Part of: ISRIC Spring School
This tutorial looks at some common machine learning algorithms (MLA's) that are potentially of interest to soil mapping projects, i.e. for generating spatial predictions. We put special focus on tree-based algorithms such as random forest and gradient boosting. For a more in-depth overview of machine learning algorithms used in statistics refer to the CRAN Task View on Machine Learning & Statistical Learning. Some other examples of how MLA's can be used to fit Pedo-Transfer Functions can be found here.
The code used in this tutorial can be downloaded from here.
Loading the packages and data
We start by loading all required packages:
> library(plotKML)
plotKML version 0.5-5 (2015-12-24)
URL: http://plotkml.r-forge.r-project.org/
> library(sp)
> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
> library(nnet)
> library(e1071)
> library(GSIF)
> library(plyr)
> library(raster)
> library(caret)
> library(Cubist)
> library(xgboost)
Next, we load the Ebergötzen data set, which consists of soil augers and a stack of rasters containing all covariates:
> data(eberg)
> data(eberg_grid)
> coordinates(eberg) <- ~X+Y
> proj4string(eberg) <- CRS("+init=epsg:31467")
> gridded(eberg_grid) <- ~x+y
> proj4string(eberg_grid) <- CRS("+init=epsg:31467")
The covariates can be further converted to principal components:
> eberg_spc <- spc(eberg_grid, ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6)
Converting PRMGEO6 to indicators...
Converting covariates to principal components...
> eberg_grid@data <- cbind(eberg_grid@data, eberg_spc@predicted@data)
All further analysis is run using the regression matrix (produced using overlay of points and grids),
which contains values of the target variable and all covariates for all training points:
> ov <- over(eberg, eberg_grid)
> m <- cbind(ov, eberg@data)
> dim(m)
[1] 3670   44
In this case the regression matrix consists of 3670 observations and has 44 columns.
Spatial prediction of soil classes using MLA's
In the first example we focus on mapping soil types using the auger data. First, we need to filter out classes that do not occur frequently enough to allow statistical modelling. As a rule of thumb, a class should have at least 5 observations:
> xg = summary(m$TAXGRSC, maxsum=length(levels(m$TAXGRSC)))
> str(xg)
Named int [1:13] 790 704 487 376 252 215 186 86 71 23 ...
 - attr(*, "names")= chr [1:13] "Braunerde" "Parabraunerde" "Pseudogley" "Regosol" ...
> selg.levs = attr(xg, "names")[xg > 5]
> m$soiltype <- m$TAXGRSC
> m$soiltype[which(!m$TAXGRSC %in% selg.levs)] <- NA
> m$soiltype <- droplevels(m$soiltype)
> str(summary(m$soiltype, maxsum=length(levels(m$soiltype))))
Named int [1:11] 790 704 487 376 252 215 186 86 71 43 ...
 - attr(*, "names")= chr [1:11] "Braunerde" "Parabraunerde" "Pseudogley" "Regosol" ...
In this case 2 classes had fewer than 5 observations and have been removed from further analysis. We
should also remove all points that contain missing values for any combination of covariates and target
variable:
> m <- m[complete.cases(m[,1:(ncol(eberg_grid)+2)]),]
> m$soiltype <- as.factor(m$soiltype)
We can now test fitting an MLA, i.e. a random forest model, using the principal components derived from the four covariate layers (parent material map, elevation, TWI and ASTER thermal band):
> s <- sample.int(nrow(m), 500)
> TAXGRSC.rf <- randomForest(x=m[-s,paste0("PC",1:10)], y=m$soiltype[-s],
xtest=m[s,paste0("PC",1:10)], ytest=m$soiltype[s])
> TAXGRSC.rf$test$confusion[,"class.error"]
    Auenboden     Braunerde          Gley
    0.9000000     0.4817518     0.7692308
   Kolluvisol Parabraunerde  Pararendzina
    0.6923077     0.5454545     0.6818182
      Pelosol    Pseudogley        Ranker
    0.5882353     0.6842105     1.0000000
      Regosol      Rendzina
    0.6833333     0.6666667
By specifying xtest and ytest we run both model fitting and cross-validation with 500 excluded points. The results show a relatively high prediction error of about 60%, i.e. a relative classification accuracy of about 40%.
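As a quick check (a minimal sketch, not part of the original code), the overall accuracy can also be computed directly from the cross-validation confusion matrix returned above:
> ## drop the class.error column and compute the proportion of the 500
> ## excluded points that were classified correctly:
> cm <- TAXGRSC.rf$test$confusion
> cm <- cm[, colnames(cm) != "class.error"]
> sum(diag(cm))/sum(cm)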
We can also test some other MLA's that are suited for this data — multinom from the nnet package,
and svm (Support Vector Machine) from the e1071 package:
> TAXGRSC.rf <- randomForest(x=m[,paste0("PC",1:10)], y=m$soiltype)
> fm = as.formula(paste("soiltype~", paste(paste0("PC",1:10),
collapse="+")))
> fm
soiltype ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
PC10
> TAXGRSC.mn <- multinom(fm, m)
# weights: 132 (110 variable)
initial value 6119.428736
iter 10 value 4161.338634
iter 20 value 4118.296050
iter 30 value 4054.454486
iter 40 value 4020.653949
iter 50 value 3995.113270
iter 60 value 3980.172669
iter 70 value 3975.188371
iter 80 value 3973.743572
iter 90 value 3973.073564
iter 100 value 3973.064186
final value 3973.064186
stopped after 100 iterations
> TAXGRSC.svm <- svm(fm, m, probability=TRUE, cross=5)
> TAXGRSC.svm$tot.accuracy
[1] 40.12539
This shows about the same accuracy levels as for random forest. Because all three methods produce
comparable accuracy, we can also merge predictions by calculating a simple average (an alternative
would be to use the caretEnsemble package to optimize merging of multiple models):
> probs1 <- predict(TAXGRSC.mn, eberg_grid@data, type="probs", na.action = na.pass)
> probs2 <- predict(TAXGRSC.rf, eberg_grid@data, type="prob", na.action = na.pass)
> probs3 <- attr(predict(TAXGRSC.svm, eberg_grid@data, probability=TRUE,
    na.action = na.pass), "probabilities")
Derive an average:
> leg <- levels(m$soiltype)
> lt <- list(probs1[,leg], probs2[,leg], probs3[,leg])
> probs <- Reduce("+", lt) / length(lt)
> eberg_soiltype = eberg_grid
> eberg_soiltype@data <- data.frame(probs)
Check that all predictions sum up to 100%:
> ch <- rowSums(eberg_soiltype@data)
> summary(ch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       1       1       1       1       1
To plot the result we can use the raster package:
> plot(raster::stack(eberg_soiltype), col=SAGA_pal[[1]], zlim=c(0,1))
Using the produced predictions we can further derive the Confusion Index (to map thematic uncertainty) and see if some classes could be aggregated. We can also generate a factor-type map by selecting, for each pixel, the most probable class, e.g.:
> eberg_soiltype$cl <- apply(eberg_soiltype@data,1,which.max)
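For the Confusion Index, a minimal sketch using one common formulation (one minus the difference between the two largest class probabilities per pixel; an assumption, not the original code) would be:
> ## Confusion Index: values close to 1 indicate high thematic uncertainty,
> ## values close to 0 indicate that a single class clearly dominates:
> eberg_soiltype$CI <- apply(eberg_soiltype@data[,leg], 1, function(x){
+   p <- sort(x, decreasing=TRUE)
+   1 - (p[1] - p[2])
+ })
> plot(raster(eberg_soiltype["CI"]), col=SAGA_pal[[1]], zlim=c(0,1))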
Modelling numeric soil properties using h2o
Random forest is suited for both classification and regression problems (it is one of the most popular MLA's for soil mapping), hence we can also use it for modelling numeric soil properties, i.e. to fit models and generate predictions. However, because the randomForest package in R is not suited for large data sets, we can use a parallelized and more scalable implementation of random forest, such as the one in the h2o package (Richter et al. 2015). h2o is a Java-based implementation, hence installing the package requires Java libraries (the package is about 51 MB) and all computing is in principle run outside of R, i.e. within the JVM (Java Virtual Machine).
In the following example we look at mapping sand content for top horizons. To initiate h2o we run:
> library(h2o)
> localH2O = h2o.init()
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
Starting H2O JVM and connecting: .... Connection successful!
R is connected to the H2O cluster:
    H2O cluster uptime:        9 seconds 596 milliseconds
    H2O cluster version:       3.8.1.3
    H2O cluster total nodes:   1
    H2O cluster total memory:  1.71 GB
    H2O cluster total cores:   4
    H2O cluster allowed cores: 2
    H2O cluster healthy:       TRUE
    H2O Connection ip:         localhost
    H2O Connection port:       54321
    H2O Connection proxy:      NA
    R Version:                 R version 3.2.4 (2016-03-16)

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)
This shows that only 2 cores will be used for computing (to control the number of cores you can use the nthreads argument). Next, we need to prepare the regression matrix and prediction locations using the as.h2o function so that they are visible to h2o:
> eberg.hex = as.h2o(m, destination_frame = "eberg.hex")
|=====================================| 100%
> eberg.grid = as.h2o(eberg_grid@data, destination_frame = "eberg.grid")
|=====================================| 100%
We can now fit a random forest model by using all computing power we have:
> RF.m = h2o.randomForest(y = which(names(m)=="SNDMHT_A"), x =
which(names(m) %in% paste0("PC",1:10)), training_frame = eberg.hex, ntree =
50)
|=====================================| 100%
> RF.m
Model Details:
==============
H2ORegressionModel: drf
Model ID: DRF_model_R_1462738948867_1
Model Summary:
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth
1              50              594057        20        20   20.00000
  min_leaves max_leaves mean_leaves
1        972       1077  1023.34000
H2ORegressionMetrics: drf
** Reported on training data. **
Description: Metrics reported on Out-Of-Bag training samples
MSE: 223.0514
R2 : 0.5087279
Mean Residual Deviance :  223.0514
This shows that the model fitting R-square is about 50%. This is also visible from the predicted vs
observed plot:
> plot(m$SNDMHT_A, as.data.frame(h2o.predict(RF.m, eberg.hex,
na.action=na.pass))$predict, asp=1, xlab="measured", ylab="predicted")
|=====================================| 100%
To produce a map based on predictions we use:
> eberg_grid$RFx <- as.data.frame(h2o.predict(RF.m, eberg.grid,
na.action=na.pass))$predict
|=====================================| 100%
> plot(raster(eberg_grid["RFx"]), col=SAGA_pal[[1]], zlim=c(10,90))
> points(eberg, pch="+")
h2o has another MLA of interest to soil mapping called “deep learning” (a feed-forward multilayer
artificial neural network). Fitting the model is equivalent to using random forest:
> DL.m <- h2o.deeplearning(y = which(names(m)=="SNDMHT_A"), x =
which(names(m) %in% paste0("PC",1:10)), training_frame = eberg.hex)
|=====================================| 100%
> DL.m
Model Details:
==============
H2ORegressionModel: deeplearning
Model ID: DeepLearning_model_R_1462738948867_2
Status of Neuron Layers: predicting SNDMHT_A, regression, gaussian distribution,
Quadratic loss, 42,601 weights/biases, 508 KB, 25,520 training samples, mini-batch size 1
  layer units      type dropout       l1       l2 mean_rate rate_RMS momentum
1     1    10     Input  0.00 %
2     2   200 Rectifier  0.00 % 0.000000 0.000000  0.025641 0.017642 0.000000
3     3   200 Rectifier  0.00 % 0.000000 0.000000  0.227769 0.253055 0.000000
4     4     1    Linear         0.000000 0.000000  0.002558 0.001907 0.000000
  mean_weight weight_RMS mean_bias bias_RMS
1
2    0.005033   0.105783  0.355502 0.061916
3   -0.017964   0.071243  0.954336 0.020658
4    0.000362   0.053619  0.115043 0.000000

H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  225.3349
R2 :  0.5036984
Mean Residual Deviance :  225.3349
This shows a performance comparable to the random forest model. The output predictions (map), however, look somewhat different from the random forest predictions:
> eberg_grid$DLx <- as.data.frame(h2o.predict(DL.m, eberg.grid,
na.action=na.pass))$predict
|=====================================| 100%
> plot(raster(eberg_grid["DLx"]), col=SAGA_pal[[1]], zlim=c(10,90))
> points(eberg, pch="+")
Which of the two methods should we use? Since they both have comparable performance, the most logical option is to generate ensemble (merged) predictions, i.e. to produce a map that shows patterns common to both methods (note: many sophisticated MLA's such as random forest, neural nets, SVM and similar will often produce comparable results, i.e. they are often equally applicable and there is no clear winner). We can use a weighted average, with the R-square of each model as the weight, as a simple approach to produce merged predictions:
> rf.R2 = RF.m@model$training_metrics@metrics$r2
> dl.R2 = DL.m@model$training_metrics@metrics$r2
> eberg_grid$SNDMHT_A <- rowSums(cbind(eberg_grid$RFx*rf.R2,
eberg_grid$DLx*dl.R2), na.rm=TRUE)/(rf.R2+dl.R2)
> plot(raster(eberg_grid["SNDMHT_A"]), col=SAGA_pal[[1]], zlim=c(10,90))
> points(eberg, pch="+")
Indeed, the output map now shows patterns of both methods and is likely slightly more accurate than either of the individual MLA's (Sollich and Krogh, 1996).
Spatial prediction of 3D (numeric) variables
In the last exercise we look at another two ML-based packages that are also of interest for soil mapping projects: cubist (Kuhn et al., 2012) and xgboost (Chen and Guestrin, 2016). The objective is now to fit models and predict continuous soil properties in 3D. To fine-tune some of the models we will also use the caret package, which is highly recommended for optimizing model fitting and cross-validation.
We look at another soil mapping data set from Australia called "Edgeroi", which is described in detail in Malone et al. (2010). We can load the profile data and covariates by using:
> data(edgeroi)
> edgeroi.sp = edgeroi$sites
> coordinates(edgeroi.sp) <- ~ LONGDA94 + LATGDA94
> proj4string(edgeroi.sp) <- CRS("+proj=longlat +ellps=GRS80
+towgs84=0,0,0,0,0,0,0 +no_defs")
> edgeroi.sp <- spTransform(edgeroi.sp, CRS("+init=epsg:28355"))
> con <url("http://gsif.isric.org/lib/exe/fetch.php?media=edgeroi.grids.rda")
> load(con)
> gridded(edgeroi.grids) <- ~x+y
> proj4string(edgeroi.grids) <- CRS("+init=epsg:28355")
We are interested in modelling soil organic carbon content in g/kg for different depths. We again start
by producing the regression matrix:
> ov2 <- over(edgeroi.sp, edgeroi.grids)
> ov2$SOURCEID = edgeroi.sp$SOURCEID
Because we will run 3D modelling, we also need to add the depth of horizons. We use a small function to assign depths to the center of all horizons (as shown in the figure below). Because we know where the horizons start and stop, we can copy the values of target variables twice, so that the model knows at which depths the property values change.
## Convert soil horizon data to x,y,d regression matrix for 3D modeling:
hor2xyd = function(x, U="UHDICM", L="LHDICM", treshold.T=15){
x$DEPTH <- x[,U] + (x[,L] - x[,U])/2
x$THICK <- x[,L] - x[,U]
sel = x$THICK < treshold.T
## begin and end of the horizon:
x1 = x[!sel,]; x1$DEPTH = x1[,L]
x2 = x[!sel,]; x2$DEPTH = x1[,U]
y = do.call(rbind, list(x, x1, x2))
return(y)
}
> h2 = hor2xyd(edgeroi$horizons)
> ## regression matrix:
> m2 <- plyr::join_all(dfs = list(edgeroi$sites, h2, ov2))
Joining by: SOURCEID
Joining by: SOURCEID
Now we can fit the model of interest by using:
> formulaStringP2 = ORCDRC ~
DEMSRT5+TWISRT5+PMTGEO5+EV1MOD5+EV2MOD5+EV3MOD5+DEPTH
> mP2 <- m2[complete.cases(m2[,all.vars(formulaStringP2)]),]
Note that DEPTH is used as a covariate, which makes this model 3D, as one can predict at any location and depth in 3D space. To improve random forest modelling, we use the caret package, which also tries to find the optimal mtry parameter based on the cross-validation performance:
> ctrl <- trainControl(method="repeatedcv", number=5, repeats=1)
> sel <- sample.int(nrow(mP2), 500)
> tr.ORCDRC.rf <- train(formulaStringP2, data=mP2[sel,], method = "rf",
trControl = ctrl, tuneLength = 3)
> tr.ORCDRC.rf
Random Forest
500 samples
18 predictor
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 399, 401, 398, 401, 401
Resampling results across tuning parameters:
  mtry  RMSE      Rsquared   RMSE SD    Rsquared SD
   2    4.500389  0.5331524  0.8175690  0.10781622
   7    4.171750  0.5497456  0.8331998  0.09216059
  12    4.209349  0.5439468  0.8250320  0.09272870

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.
In this case mtry = 7 seems to achieve best performance. Note that we sub-set the initial matrix to
speed up fine-tuning of the parameters (otherwise the computing time could easily blow up). Next, we
can fit the final model using:
> ORCDRC.rf <- train(formulaStringP2, data=mP2, method = "rf",
http://gsif.isric.org/
Printed on 2017/07/23 20:31
2017/07/23 20:31
13/17
Machine Learning Algorithms for soil mapping
tuneGrid=data.frame(mtry=7), trControl=trainControl(method="none"))
The variable importance plot shows that DEPTH is clearly the most important predictor:
> varImpPlot(ORCDRC.rf$finalModel)
We can also try fitting models using the xgboost and Cubist packages. At the end of the statistical modelling process, we can merge all predictions by using the R-square estimated for each technique (see the sketch below).
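The cubist and xgboost fits themselves are not listed above; as a minimal sketch (an assumption, not the original code), the objects ORCDRC.cb and ORCDRC.gb and the weights w1, w2 and w3 used below could be obtained via caret, analogously to the random forest fit:
> ## Sketch (assumed settings): fit Cubist and xgboost via caret and take the
> ## cross-validated R-squares as the merging weights:
> ORCDRC.cb <- train(formulaStringP2, data=mP2, method="cubist", trControl=ctrl)
> ORCDRC.gb <- train(formulaStringP2, data=mP2, method="xgbTree", trControl=ctrl)
> w1 <- max(tr.ORCDRC.rf$results$Rsquared, na.rm=TRUE)  ## random forest R-square
> w2 <- max(ORCDRC.cb$results$Rsquared, na.rm=TRUE)     ## Cubist R-square
> w3 <- max(ORCDRC.gb$results$Rsquared, na.rm=TRUE)     ## xgboost R-square
The predictions at 2.5 cm depth can then be generated and merged: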
> edgeroi.grids$DEPTH = 2.5
> edgeroi.grids$Random_forest <- predict(ORCDRC.rf, edgeroi.grids@data,
na.action = na.pass)
> edgeroi.grids$Cubist <- predict(ORCDRC.cb, edgeroi.grids@data, na.action =
na.pass)
> edgeroi.grids$XGBoost <- predict(ORCDRC.gb, edgeroi.grids@data, na.action
= na.pass)
> edgeroi.grids$ORCDRC_5cm <- (edgeroi.grids$Random_forest*w1 +
    edgeroi.grids$Cubist*w2 + edgeroi.grids$XGBoost*w3)/(w1+w2+w3)
> plot(stack(edgeroi.grids[c("Random_forest","Cubist","XGBoost","ORCDRC_5cm")]),
    col=SAGA_pal[[1]], zlim=c(5,65))
The final plot shows that xgboost possibly overpredicts and that cubist possibly underpredicts values of ORCDRC, while random forest is somewhere in between the two. Again, merged predictions are probably the safest option considering that all three MLA's have similar performances.
We can quickly test the overall performance using a script on GitHub prepared for testing the performance of merged predictions:
> source_https <- function(url, ...) {
+   require(RCurl)
+   cat(getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL",
+     "cacert.pem", package = "RCurl")), file = basename(url))
+   source(basename(url))
+ }
> source_https("https://raw.githubusercontent.com/ISRICWorldSoil/SoilGrids250m/master/grids/cv/cv_functions.R")
We can then run 5-fold cross-validation:
> test.ORC <- cv_numeric(formulaStringP2, rmatrix=mP2, nfold=5,
idcol="SOURCEID", Log=TRUE)
Running 5-fold cross validation with model re-fitting method ranger ...
snowfall 1.84-6.1 initialized (using snow 0.4-1): parallel execution on 2
CPUs.
> str(test.ORC)
List of 2
 $ CV_residuals:'data.frame':   2076 obs. of  4 variables:
  ..$ Observed : num [1:2076] 10.75 4.8 7.95 1.6 0.4 ...
  ..$ Predicted: num [1:2076] 12.21 5.06 9.55 1.36 1.09 ...
  ..$ SOURCEID : Factor w/ 359 levels "199_CAN_CP111_1",..: 16 16 17 17 17 18 18 19 20 20 ...
  ..$ fold     : int [1:2076] 1 1 1 1 1 1 1 1 1 1 ...
 $ Summary     :'data.frame':   1 obs. of  6 variables:
  ..$ ME          : num 0.0407
  ..$ MAE         : num 2.2
  ..$ RMSE        : num 4.09
  ..$ R.squared   : num 0.651
  ..$ logRMSE     : num 0.413
  ..$ logR.squared: num 0.777
This shows that the R-squared based on cross-validation is about 65%, i.e. the average error of predicting soil organic carbon content using the ensemble method is about ±4 g/kg. The final observed vs. predicted plot shows that the model is unbiased and that the predictions generally match the cross-validation points:
> plt0 <- xyplot(test.ORC[[1]]$Predicted~test.ORC[[1]]$Observed, asp=1,
par.settings=list(plot.symbol = list(col=alpha("black", 0.6),
fill=alpha("red", 0.6), pch=21, cex=0.9)), scales=list(x=list(log=TRUE,
equispaced.log=FALSE), y=list(log=TRUE, equispaced.log=FALSE)),
xlab="measured", ylab="predicted (machine learning)")
> plt0
Summary points
In summary, MLA's are fairly attractive for soil mapping and soil modelling problems in general as
they often perform better than standard linear models (already recognized by Moran and Bui in 2002;
some recent comparisons can be found in Nussbaum et al. 2017). This is possible due to the following
three reasons:
1. Non-linear relationships between soil forming factors and soil properties can be efficiently
modeled using MLA's,
2. Tree-based MLA's (random forest, gradient boosting, cubist) are suitable for representing local
soil-landscape relationships, which is often important for accuracy of spatial prediction models,
3. In the case of MLA's, statistical properties such as multicollinearity and non-Gaussian distributions are dealt with inside the models, which simplifies the statistical modelling steps.
On the other hand, MLA's can be computationally very intensive and hence require careful planning, especially as the number of points goes beyond a few thousand and the number of covariates beyond a dozen. Note also that some MLA's, for example support vector machines (svm), are especially computationally intensive and are probably not suited for large data sets.
References
Chen, T. and Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. arXiv preprint
arXiv:1603.02754.
Sollich, P. and Krogh, A., 1996. Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems 8 (pp. 190-196).
Kuhn, M., Weston, S., Keefer, C. and Coulter, N., 2012. Cubist Models For Regression. R package vignette, 18 pp.
Kuhn, M., 2008. Building Predictive Models in R Using the caret Package. Journal of Statistical
Software, 28(5).
Malone, B.P., McBratney, A.B., Minasny, B. (2010) Mapping continuous depth functions of soil
carbon storage and available water capacity. Geoderma 154, 138-152.
Moran, C.J. and Bui, E.N., 2002. Spatial data mining for enhanced soil map modelling.
International Journal of Geographical Information Science, 16(6), pp.533-549.
Nussbaum, M., Spiess, K., Baltensweiler, A., Grob, U., Keller, A., Greiner, L., Schaepman, M.E., and Papritz, A., 2017. Evaluation of digital soil mapping approaches with large sets of environmental covariates. SOIL Discuss., in review.
Richter, A.N., Khoshgoftaar, T.M., Landset, S. and Hasanin, T., 2015. A Multi-Dimensional Comparison of Toolkits for Machine Learning with Big Data. In Information Reuse and Integration (IRI), 2015 IEEE International Conference on (pp. 1-8). IEEE.
Wright, M.N. and Ziegler, A., 2015. ranger: A Fast Implementation of Random Forests for High
Dimensional Data in C++ and R. arXiv preprint arXiv:1508.04409.