Package ‘ordinalForest’                                        April 13, 2017

Type             Package
Title            Ordinal Forests: Prediction and Class Width Inference with
                 Ordinal Target Variables
Version          1.0
Date             2017-04-13
Author           Roman Hornung
Maintainer       Roman Hornung <[email protected]>
Imports          ranger, combinat, ggplot2
Description      Ordinal forests (OF) are a method for ordinal regression with
                 high-dimensional and low-dimensional data that is able to predict
                 the values of the ordinal target variable for new observations and
                 at the same time estimate the relative widths of the classes of the
                 ordinal target variable. Using a (permutation-based) variable
                 importance measure it is moreover possible to rank the covariates
                 with respect to their importance. OF will be presented in an
                 upcoming technical report by Hornung et al. The main functions of
                 the package are: ordfor() (construction of OF), predict.ordfor()
                 (prediction of the target variable values of new observations), and
                 plot.ordfor() (visualization of the estimated relative widths of
                 the classes of the ordinal target variable).
License          GPL-2
Encoding         UTF-8
RoxygenNote      5.0.1
NeedsCompilation no
Repository       CRAN
Date/Publication 2017-04-13 21:00:25 UTC

R topics documented:
    ordinalForest-package
    hearth
    ordfor
    perfm
    plot.ordfor
    predict.ordfor
    Index
ordinalForest-package    Ordinal Forests: Prediction, Class Width Inference and
                         Variable Ranking with Ordinal Target Variables

Description

Ordinal forests (OF) are a method for ordinal regression with high-dimensional and low-dimensional
data that is able to predict the values of the ordinal target variable for new observations and at the
same time estimate the relative widths of the classes of the ordinal target variable. Using a
(permutation-based) variable importance measure it is moreover possible to rank the covariates with
respect to their importance. OF will be presented in an upcoming technical report by Hornung et al.

Details

For a detailed description of OF see: ordfor

The main functions are: ordfor (construction of OF), predict.ordfor (prediction of the target
variable values of new observations), and plot.ordfor (visualization of the estimated relative
widths of the classes of the ordinal target variable).

Examples

## Not run:
# Illustration of the key functionalities of the package:
##########################################################

# Load example dataset:
data(hearth)

# Inspect the data:
table(hearth$Class)
dim(hearth)
head(hearth)

# Split into training dataset and test dataset:
set.seed(123)
trainind <- sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(2/3))))
testind <- setdiff(1:nrow(hearth), trainind)
datatrain <- hearth[trainind,]
datatest <- hearth[testind,]

# Construct OF using the training dataset:
ordforres <- ordfor(depvar="Class", data=datatrain, ndiv=1000, ntreeperdiv=100,
  ntreefinal=5000, perfmeasure = "equal")
ordforres

# Study variable importance values:
sort(ordforres$varimp, decreasing=TRUE)

# Take a closer look at the top variables:
boxplot(datatrain$oldpeak ~ datatrain$Class, horizontal=TRUE)
fisher.test(table(datatrain$exang, datatrain$Class))

# Plot the optimized partition:
# using ggplot2:
plot(ordforres)
# without ggplot2:
plot(ordforres, useggplot=FALSE)

# Predict values of the ordinal
target variable in the test dataset:
preds <- predict(ordforres, newdata=datatest)
preds

# Compare predicted values with true values:
table(data.frame(true_values=datatest$Class, predictions=preds$ypred))

## End(Not run)

hearth    Data on Coronary Artery Disease

Description

This dataset includes 294 patients undergoing angiography at the Hungarian Institute of Cardiology
in Budapest between 1983 and 1987.

Format

A data frame with 294 observations, ten covariates and one ordinal target variable.

Details

The variables are as follows:

• age. numeric. Age in years
• sex. factor. Sex (1 = male; 0 = female)
• chest_pain. factor. Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal
  pain; 4 = asymptomatic)
• trestbps. numeric. Resting blood pressure (in mm Hg on admission to the hospital)
• chol. numeric. Serum cholesterol in mg/dl
• fbs. factor. Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
• restecg. factor. Resting electrocardiographic results (1 = having ST-T wave abnormality (T
  wave inversions and/or ST elevation or depression of > 0.05 mV); 0 = normal)
• thalach. numeric. Maximum heart rate achieved
• exang. factor. Exercise-induced angina (1 = yes; 0 = no)
• oldpeak. numeric. ST depression induced by exercise relative to rest
• Class. factor. Ordinal target variable - severity of coronary artery disease (determined using
  angiograms) (1 = no disease; 2 = degree 1; 3 = degree 2; 4 = degree 3; 5 = degree 4)

The original OpenML dataset was pre-processed in the following way:

1. The variables were re-named according to the description given on OpenML.
2. The missing values, which were coded as "-9", were replaced by NA values.
3. The variables slope, ca, and thal were excluded, because these featured too many missing
   values.
4. The categorical covariates were transformed into factors.
5. There were six restecg values of "2", which were replaced by "1".
6.
   The missing values were imputed: missing values of numerical covariates were replaced by the
   means of the corresponding non-missing values, and missing values of categorical covariates
   were replaced by the modes of the corresponding non-missing values.

Source

OpenML: data.name: heart-h, data.id: 1565, link: https://www.openml.org/d/1565/

References

Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., Guppy, K. H., Lee,
S., Froelicher, V. (1989) International application of a new probability algorithm for the diagnosis
of coronary artery disease. The American Journal of Cardiology, 64, 304–310.

Examples

data(hearth)
table(hearth$Class)
dim(hearth)
head(hearth)

ordfor    Ordinal forests

Description

Constructs ordinal forests (OF) as presented in Hornung et al. (in prep). The following tasks can
be performed using OF: 1) Prediction of the values of an ordinal target variable for new
observations based on covariate values (see predict.ordfor); 2) Estimation of the relative widths
of the classes of the ordinal target variable; 3) Ranking of the covariates with respect to their
importance for predicting the values of the ordinal target variable.

The default values of the following hyperparameters are appropriate for most applications: ndiv,
ntreeperdiv, ntreefinal, npermtrial, and nbest. Thus, these arguments do in general NOT have
to be specified by hand. For details on OF see the ’Details’ section below.

Usage

ordfor(depvar, data, ndiv = 1000, ntreeperdiv = 100, ntreefinal = 5000,
  perfmeasure = c("equal", "irrespective", "onecateg", "custom"), categimp,
  categweights, npermtrial = 500, nbest = 10, importance = "permutation")

Arguments

depvar       character. Name of the dependent variable in data.
data         data.frame. Data frame containing the covariates and a factor-valued
             ordinal target variable.
ndiv         integer. Number of random partitions of [0,1] to consider in the
             optimization.
ntreeperdiv  integer.
             Number of trees in the smaller regression forests constructed for the
             ndiv different partitions.
ntreefinal   integer. Number of trees in the larger regression forest constructed
             using the optimized partition (i.e., the OF).
perfmeasure  character. Performance measure. The default is "equal". See ’Details’,
             subsection ’Performance measures’ below, and perfm.
categimp     character. Class to prioritize if perfmeasure="onecateg".
categweights numeric. Needed if perfmeasure="custom": vector of length equal to the
             number of classes. Class weights - classes with higher weights are to be
             predicted more precisely than those with smaller weights.
npermtrial   integer. Number of permutations to try for each partition considered
             during the optimization.
nbest        integer. Number of forests considered when averaging the partitions
             corresponding to the nbest forests with the highest prediction
             performance.
importance   character. Variable importance mode, one of ’none’, ’impurity’,
             ’permutation’. The ’impurity’ measure is the variance of the responses
             for regression. Passed to the function ranger from the package of the
             same name.

Details

Introduction:
Ordinal forests (OF) are a method for ordinal regression with high-dimensional and
low-dimensional data that is able to predict the values of the ordinal target variable for new
observations and at the same time estimate the relative widths of the classes of the ordinal target
variable. Using a (permutation-based) variable importance measure it is moreover possible to rank
the covariates with respect to their importance. OF will be presented in an upcoming technical
report by Hornung et al.

Methods:
The concept of OF is based on the following assumption: There exists a (possibly latent) refined
continuous variable y* underlying the observed ordinal target variable y (y in {1,...,J}, J number
of classes), where y* determines the values of y.
The functional relationship between y* and y takes the form of a monotonically increasing step
function: depending on which of the J intervals ]a_1, a_2], ]a_2, a_3], ..., ]a_J, a_{J+1}[
contains the value of y*, the ordinal target variable y takes a different value.

In the context of conditional inference trees, for situations in which the intervals ]a_1, a_2],
..., ]a_J, a_{J+1}[ are available, Hothorn et al. (2006) suggest using the midpoints of these
intervals as a metric target variable. OF are, however, concerned with settings in which the
intervals ]a_1, a_2], ..., ]a_J, a_{J+1}[ are unknown.

The main idea of OF is to estimate the unknown interval borders by maximizing the out-of-bag
prediction performance (see section "Performance measures") of standard regression forests that
use as a target variable a quantity very similar to the midpoints of the intervals. More
precisely, we assume y* to have a standard normal distribution and consider the following score
values as the values of the target variable:

    phi^(-1)((phi(a_j) + phi(a_{j+1}))/2)  if y = j,

with a_1 = -infinity and a_{J+1} = +infinity, where phi denotes the cumulative distribution
function of the standard normal distribution. The assumption of normality is made simply because
normally distributed target variables can be expected to be well suited for standard regression
forests.

The estimation of the optimal partition is based on constructing many regression forests (b in
1,...,ndiv) of smaller size, where each of these uses as the values of the target variable a score
set obtained using random sampling:

    {phi^(-1)((phi(a_{b,j}) + phi(a_{b,j+1}))/2) : j in 1,...,J},

where phi(a_{b,1}) = 0 < phi(a_{b,2}) < ... < phi(a_{b,J}) < phi(a_{b,J+1}) = 1 is a random
partition of [0,1].

After estimating the intervals, a larger regression forest is built using as values of the target
variable the score values phi^(-1)((phi(a_{opt,j}) + phi(a_{opt,j+1}))/2), where the a_{opt,j}
values are the optimized borders.
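The score computation above can be sketched in a few lines of base R (a minimal illustration
written from this description, not code from the package; the partition here is an arbitrary
random example):

```r
# Minimal sketch (not package code): score values for a random partition of [0,1].
# d[j] plays the role of phi(a_j), with d[1] = 0 and d[J+1] = 1.
J <- 5
set.seed(1)
d <- c(0, sort(runif(J - 1)), 1)             # random partition of [0,1]
scores <- qnorm((d[-(J + 1)] + d[-1]) / 2)   # phi^(-1) of the interval midpoints
scores  # one score value per class 1,...,J; strictly increasing
```

In a regression forest these J score values would then replace the class labels 1,...,J as the
(metric) target variable.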
An important by-product of the construction of OFs are the estimated interval borders
a_{opt,1},...,a_{opt,J+1}. The optimized intervals ]phi(a_{opt,1}), phi(a_{opt,2})], ...,
]phi(a_{opt,J}), phi(a_{opt,J+1})[ can be interpreted vividly as the estimated relative widths of
the classes 1,...,J.

Further details are given in the sections "Hyperparameters" and "Performance measures".

Hyperparameters:
There are several hyperparameters, which do, however, not have to be optimized by the user in
general. Except in the case of nbest, the larger these values are, the better; the default values
should be large enough in most applications. These hyperparameters are described in the
following:

• ndiv Default value: 1000. The default value of the number of considered partitions in the
  estimation of the optimal partition is quite large. Such a large number of considered
  partitions is necessary to attain a high chance that some of the partitions are close enough
  to the optimal partition, that is, the partition that leads to the optimal performance with
  respect to the considered performance measure (provided with perfmeasure).
• ntreeperdiv Default value: 100. A smaller number of trees considered per tried partition
  might lead to a too strong variability in the assessments of the performances achieved for
  the individual partitions.
• ntreefinal Default value: 5000. The number of trees ntreefinal plays the same role as in
  conventional regression forests. For very high-dimensional datasets the default number 5000
  might be too small.
• npermtrial Default value: 500. As stated above, it is necessary to consider a large number
  of tried partitions ndiv in the optimization in order to increase the chance that the best of
  the considered partitions are close to the optimal partition. However, for a larger number of
  classes it is in addition necessary to use an efficient algorithm for the sampling of the
  considered partitions.
  This algorithm should ensure that the partitions are heterogeneous enough that the best of the
  tried partitions are close enough to the optimal partition. In OF we consider an algorithm
  that has the goal of making the rankings of the widths of the intervals corresponding to the
  respective classes as dissimilar as possible between successive partitions. Denote d_{b,j} :=
  phi(a_{b,j}). Then the b-th partition is generated as follows:
  If b = 1, then J-1 instances of a U(0,1)-distributed random variable are drawn and the values
  are sorted. The sorted values are designated as d_{1,2},...,d_{1,J}. Moreover, set d_{1,1} := 0
  and d_{1,J+1} := 1. The first considered partition is now { d_{1,1}, ..., d_{1,J+1} }.
  If b > 1, the following procedure is performed:
  1. Let dist_{b-1,j} = d_{b-1,j+1} - d_{b-1,j} (j in {1,...,J}) denote the width of the j-th
     interval of the (b-1)-th partition and r_{b-1,j} the rank of dist_{b-1,j} among
     dist_{b-1,1},...,dist_{b-1,J}. First, npermtrial random permutations of 1,...,J are drawn,
     where r*_{l,1},...,r*_{l,J} denotes the l-th drawn permutation. Subsequently, that
     permutation r*_{l*,1},...,r*_{l*,J} (l* in {1,...,npermtrial}) is determined that features
     the greatest quadratic distance to r_{b-1,1},...,r_{b-1,J}, that is,
     r*_{l*,1},...,r*_{l*,J} = argmax_{r*_{l,1},...,r*_{l,J}} sum_{j=1}^J (r*_{l,j} - r_{b-1,j})^2.
  2. A partition is generated in the same way as in the case of b = 1.
  3. The intervals of the partition generated in 2. are re-ordered in such a way that the width
     of the j-th interval (for j in {1,...,J}) has rank r*_{l*,j} among the widths of the
     intervals ]d_{b,1}, d_{b,2}], ..., ]d_{b,J}, d_{b,J+1}]. The b-th considered partition is
     now { d_{b,1}, ..., d_{b,J+1} }.
  The default value 500 for npermtrial should be large enough to ensure sufficiently
  heterogeneous partitions, independently of the number of classes J.
• nbest Default value: 10.
  After having constructed ndiv regression forests, each of these using a particular partition,
  the nbest partitions associated with the highest levels of prediction performance are
  considered: For each j in {1,...,J} the averages of the phi(a_j) values across these nbest
  partitions are taken. Then phi^(-1)(.) is applied to the obtained averages, resulting in the
  estimated optimal borders a_{opt,1},...,a_{opt,J+1}. The latter are then used to calculate the
  scores for the OF using the formula phi^(-1)((phi(a_{opt,j}) + phi(a_{opt,j+1}))/2) (see also
  above).
  It is probably important that the number nbest of best partitions considered is not strongly
  misspecified. A too large value of nbest could lead to averaging over a too heterogeneous set
  of partitions. Conversely, a too small value of nbest could lead to a too large variance in
  the estimation of the optimal partition. For larger numbers of tried partitions ndiv the
  results are less sensitive to the choice of nbest, because the number of tried partitions that
  are close to the optimal partition becomes larger the more partitions are considered. When
  setting the number of considered partitions ndiv to 1000, the value 10 for nbest specified as
  a default value in ordfor was seen to lead to a good trade-off between the heterogeneity of
  the averaged partitions and the variance in the estimation.

Performance measures:
As noted above, the different partitions tried during the estimation of the optimal partition are
assessed with respect to their (out-of-bag) prediction performance. The choice of the specific
performance measure used here determines the specific kind of performance the ordinal forest
should feature:

• perfmeasure="equal" This choice should be made if it is of interest to predict each class
  with the same precision irrespective of the numbers of observations representing the
  individual classes. Youden’s J statistic is calculated with respect to each class
  ("observation/prediction in class j" vs.
  "observation/prediction NOT in class j" (j=1,...,J)) and the simple average of the J results
  is taken.
• perfmeasure="irrespective" This choice should be made if it is of interest to predict all
  observations with the same precision independent of their classes. Youden’s J statistic is
  calculated with respect to each class and subsequently a weighted average of these values is
  taken, with weights proportional to the numbers of observations representing the respective
  classes in the training data.
• perfmeasure="onecateg" This choice should be made if it is merely relevant that one
  particular class of the ordinal target variable is predicted as precisely as possible. This
  particular class must be provided via the argument categimp. Youden’s J statistic is
  calculated with respect to class categimp ("observation/prediction in class categimp" vs.
  "observation/prediction NOT in class categimp").
• perfmeasure="custom" This choice should be made if there is a particular ranking of the
  classes with respect to their importance. Youden’s J statistic is calculated with respect to
  each class. Subsequently, a weighted average with user-specified weights (provided via the
  argument categweights) is taken. Classes with higher weights are to be predicted more
  precisely than those with smaller weights.

Value

ordfor returns an object of class ordfor. An object of class "ordfor" is a list containing the
following components:

forestfinal        object of class "ranger". Regression forest constructed using the
                   optimized partition of [0,1] (i.e., the OF). Used by predict.ordfor.
bordersbest        vector of length J+1. Average over the nbest best partitions of [0,1].
                   Used by predict.ordfor.
forests            list of length ndiv. The smaller forests constructed during optimization
                   of the partition of [0,1].
perfmeasurevalues  vector of length ndiv. Performance measure values of the ndiv smaller
                   forests stored in forests.
bordersb           matrix of dimension ndiv x (J+1). All ndiv partitions considered.
classes            character vector of length J. Classes of the target variable.
ndiv               integer. Number of random partitions considered.
ntreeperdiv        integer. Number of trees per partition.
ntreefinal         integer. Number of trees of the OF (constructed after optimizing the
                   partition).
perfmeasure        character. Performance measure.
categimp           character. If perfmeasure="onecateg": class to prioritize, NA else.
nbest              integer. Number of forests considered in the averaging of the partitions
                   corresponding to the nbest forests with the highest prediction
                   performance.
classfreq          table. Class frequencies.
varimp             vector of length p. Variable importance for each covariate.

References

Hothorn, T., Hornik, K., Zeileis, A. (2006) Unbiased recursive partitioning: a conditional
inference framework. Journal of Computational and Graphical Statistics, 15, 651–674.

Examples

data(hearth)
set.seed(123)
hearthsubset <- hearth[sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2)))),]
ordforres <- ordfor(depvar="Class", data=hearthsubset, ndiv=80, nbest=5, ntreeperdiv=100,
  ntreefinal=1000, perfmeasure = "equal")
# NOTE: ndiv=80 is not enough!! In practice, ndiv=1000 (default value) or a higher
# number should be used.
ordforres
sort(ordforres$varimp, decreasing=TRUE)

perfm    Performance measures based on Youden’s J statistic

Description

These performance measures based on Youden’s J statistic are used in ordfor to measure the
performance of the smaller regression forests constructed for the estimation of the optimal
partition. They may, however, also be used to measure the precision of predictions on new data or
the precision of OOB predictions.

Usage

perfm_equal(ytest, ytestpred, categ, categweights)

perfm_irrespective(ytest, ytestpred, categ, categweights)

perfm_onecateg(ytest, ytestpred, categ, categweights)

perfm_custom(ytest, ytestpred, categ, categweights)

Arguments

ytest        factor. True values of the target variable.
ytestpred    factor. Predicted values of the target variable.
categ        character. Needed in the case of perfm_onecateg: class to predict as
             precisely as possible.
categweights numeric. Needed in the case of perfm_custom: vector of length equal to the
             number of classes. Class weights - classes with higher weights are to be
             predicted more precisely than those with smaller weights.

Details

perfm_equal should be used if it is of interest to predict each class with the same precision
irrespective of the numbers of observations representing the individual classes. Youden’s J
statistic is calculated with respect to each class ("class j yes" vs. "class j no" (j=1,...,J))
and the simple average of the J results is taken.

perfm_irrespective should be used if it is of interest to predict all observations with the same
precision independent of their classes. Youden’s J statistic is calculated with respect to each
class and subsequently a weighted average of these values is taken, with weights proportional to
the numbers of observations representing the respective classes in the training data.

perfm_onecateg should be used if it is merely relevant that one particular class of the ordinal
target variable is predicted as precisely as possible. This particular class must be provided via
the argument categ. Youden’s J statistic is calculated with respect to class categ ("class categ
yes" vs. "class categ no").

perfm_custom should be used if there is a particular ranking of the classes with respect to their
importance. Youden’s J statistic is calculated with respect to each class. Subsequently, a
weighted average is taken using user-specified weights (provided via the argument categweights).
Classes with higher weights are to be predicted more precisely than those with smaller weights.
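As a rough illustration of how such a measure can be computed, here is a minimal base-R sketch of
the equal-weight variant, written from the description above (it is not taken from the package;
the function name and toy data are hypothetical). Per class, Youden’s J is sensitivity +
specificity - 1, and the J values are then averaged:

```r
# Minimal sketch (not package code): class-averaged Youden's J, following the
# description of perfm_equal above.
youden_equal_sketch <- function(ytest, ytestpred) {
  js <- sapply(levels(ytest), function(cl) {
    sens <- mean(ytestpred[ytest == cl] == cl)  # sensitivity for "class cl yes"
    spec <- mean(ytestpred[ytest != cl] != cl)  # specificity for "class cl no"
    sens + spec - 1                             # Youden's J for class cl
  })
  mean(js)  # simple average over the J classes
}

# Toy example (hypothetical data):
ytest <- factor(c("1", "1", "2", "2"))
ypred <- factor(c("1", "2", "2", "2"), levels = levels(ytest))
youden_equal_sketch(ytest, ypred)  # 0.5 in this toy example
```

The weighted variants differ only in replacing mean(js) by a weighted mean, with the weights
described above.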
Examples

data(hearth)
set.seed(123)
trainind <- sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2))))
testind <- setdiff(1:nrow(hearth), trainind)
datatrain <- hearth[trainind,]
datatest <- hearth[testind,]

ordforres <- ordfor(depvar="Class", data=datatrain, ndiv=80, nbest=5)
# NOTE: ndiv=80 is not enough!! In practice, ndiv=1000 (default value) or a higher
# number should be used.

preds <- predict(ordforres, newdata=datatest)

perfm_equal(ytest=datatest$Class, ytestpred=preds$ypred)
perfm_irrespective(ytest=datatest$Class, ytestpred=preds$ypred)

perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="1")
perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="2")
perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="3")
perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="4")
perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="5")

perfm_custom(ytest=datatest$Class, ytestpred=preds$ypred, categweights=c(1,2,1,1,1))

# perfm_equal(), perfm_irrespective(), and perfm_onecateg() are special cases of perfm_custom():
perfm_custom(ytest=datatest$Class, ytestpred=preds$ypred, categweights=c(1,1,1,1,1))
perfm_equal(ytest=datatest$Class, ytestpred=preds$ypred)

perfm_custom(ytest=datatest$Class, ytestpred=preds$ypred, categweights=table(datatest$Class))
perfm_irrespective(ytest=datatest$Class, ytestpred=preds$ypred)

perfm_custom(ytest=datatest$Class, ytestpred=preds$ypred, categweights=c(0,2,0,0,0))
perfm_onecateg(ytest=datatest$Class, ytestpred=preds$ypred, categ="2")

plot.ordfor    Visualization of optimized partition

Description

Visualization of the optimized partition of [0,1], interpretable as the relative widths of the
classes of the ordinal target variable.

Usage

## S3 method for class 'ordfor'
plot(x, useggplot = TRUE, ...)

Arguments

x            object of class ordfor. See function ordfor.
useggplot    logical. Should the R package ggplot2 be used to produce a nicer-looking
             plot?
...
             other parameters to be passed through to plotting functions if
             useggplot=FALSE.

Value

If useggplot = TRUE (default), a ggplot object is returned and the ggplot2 plot is also shown on
the current device. See ?ggplot for saving ggplot2 plots. If useggplot = FALSE, NULL is returned.

Examples

data(hearth)
set.seed(123)
hearthsubset <- hearth[sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2)))),]
ordforres <- ordfor(depvar="Class", data=hearthsubset, ndiv=80, nbest=5)
# NOTE: ndiv=80 is not enough!! In practice, ndiv=1000 (default value) or a higher
# number should be used.

plot(ordforres)
plot(ordforres, useggplot=FALSE)

predict.ordfor    Prediction using ordinal forest objects

Description

Prediction of test data using an ordinal forest.

Usage

## S3 method for class 'ordfor'
predict(object, newdata, ...)

Arguments

object       object of class ordfor. See function ordfor.
newdata      data.frame. Data frame containing new data.
...          further arguments passed to or from other methods.

Value

predict.ordfor returns an object of class ordforpred. An object of class "ordforpred" is a list
containing the following components:

ypred              vector of length nrow(newdata). Factor-valued test data predictions.
yforestpredmetric  vector of length nrow(newdata). Numeric test data predictions: result of
                   applying the regression forest forestfinal returned by ordfor.

Examples

data(hearth)
set.seed(123)
trainind <- sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2))))
testind <- setdiff(1:nrow(hearth), trainind)
datatrain <- hearth[trainind,]
datatest <- hearth[testind,]

ordforres <- ordfor(depvar="Class", data=datatrain, ndiv=80, nbest=5)
# NOTE: ndiv=80 is not enough!! In practice, ndiv=1000 (default value) or a higher
# number should be used.
preds <- predict(ordforres, newdata=datatest)
preds
table(data.frame(true_values=datatest$Class, predictions=preds$ypred))

par(mfrow=c(1,2))
plot(preds$yforestpredmetric, as.numeric(preds$ypred))
plot(pnorm(preds$yforestpredmetric), as.numeric(preds$ypred), xlim=c(0,1))
par(mfrow=c(1,1))

Index

hearth
ordfor
ordinalForest (ordinalForest-package)
ordinalForest-package
perfm
perfm_custom (perfm)
perfm_equal (perfm)
perfm_irrespective (perfm)
perfm_onecateg (perfm)
plot.ordfor
predict.ordfor