A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

Thanh-Tung Nguyen(1), He Zhao(2), Joshua Zhexue Huang(3), Thuy Thi Nguyen(4), and Mark Junjie Li(3)(B)

1 Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
[email protected]
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China
[email protected]
3 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
{zx.huang,jj.li}@szu.edu.cn
4 Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
[email protected]

Abstract. Random Forest (RF) models have been proven to perform well in both classification and regression. However, because of the randomizing mechanism in both the bagging of samples and the feature selection, the performance of RF can deteriorate when applied to high-dimensional data. In this paper, we propose a new feature sampling approach for RF to deal with high-dimensional data. We first apply a p-value assessment of feature importance to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using statistical measures. When sampling the feature subspace for learning RFs, features from all three groups are taken into account. The new subspace sampling method maintains the diversity and the randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem, for robustness towards outliers. The experimental results demonstrate that the proposed approach for learning random forests significantly reduces prediction errors and outperforms most existing random forests when dealing with high-dimensional data.

Keywords: Subspace feature selection · Regression · Classification · Random forests · Data mining · High-dimensional data

© Springer International Publishing Switzerland 2015. T. Cao et al. (Eds.): PAKDD 2015, Part II, LNAI 9078, pp. 459–470, 2015. DOI: 10.1007/978-3-319-18032-8_36

1 Introduction

High-dimensional data has become common in today's applications. State-of-the-art machine learning methods can work well on data sets of moderate size, but they suffer when scaled to high-dimensional data. It is well known that in a high-dimensional data set only a small portion of the predictor features are relevant to the response feature; the irrelevant features may even degrade the performance of the model. This calls for methods that select good subsets of features for learning efficient prediction models.

Random forests (RF) [1], [2], an ensemble learning machine composed of decision trees for prediction, is defined as follows. Given a training data set L = {(X_i, Y_i)}_{i=1}^{N}, X_i ∈ R^M, Y_i ∈ Y, the X_i are the features (also called predictor variables) and Y is the target (also called the response feature), with Y ∈ R^1 for a regression problem and Y ∈ {1, 2, ..., c} for a classification problem (c ≥ 2); N and M are the numbers of training samples and features, respectively. A standard RF independently and uniformly resamples observations from the training data L to draw a bootstrap data set L*_k, from which a decision tree T*_k is grown. Repeating this process K times produces a series of bootstrap data sets L*_k and corresponding decision trees T*_k (k = 1, 2, ..., K) that form a RF.
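To make this construction concrete, the following Python sketch (our illustration, not the authors' implementation) grows K trees on bootstrap samples using scikit-learn decision trees; the random feature subspace of size mtry is obtained through the max_features argument, and the regression case is shown.

```python
# Minimal sketch of the standard RF construction described above (our
# illustration, not the authors' code): K bootstrap samples, one decision
# tree per sample, with a random feature subspace of size mtry at each split.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def grow_forest(X, Y, K=200, mtry=None, nmin=5, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mtry = mtry if mtry is not None else int(np.log2(M)) + 1  # subspace size used in the paper
    trees, inbag = [], []
    for k in range(K):
        idx = rng.integers(0, N, size=N)          # bootstrap sample L*_k (with replacement)
        tree = DecisionTreeRegressor(max_features=mtry,
                                     min_samples_leaf=nmin,
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], Y[idx])                  # decision tree T*_k grown from L*_k
        trees.append(tree)
        inbag.append(idx)
    return trees, inbag
```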
Given an input X = x, the predicted value of the whole RF is obtained by aggregating the results of the individual trees. Let \hat{f}_k(x) denote the prediction of the unknown value y for an input x ∈ R^M by the kth tree. We have

\hat{f}(x) = \frac{1}{K} \sum_{k=1}^{K} \hat{f}_k(x)  for regression problems, and  (1)

\hat{f}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} I[\hat{f}_k(x) = y]  for classification problems,  (2)

where I(·) and \hat{f}(x) denote the indicator function and the RF prediction, respectively.

RFs have been shown to be a state-of-the-art tool in machine learning. An RF model can be used for both feature selection and prediction, and it performs well in both classification and regression problems. However, the performance of random forests suffers when applied to high-dimensional data, i.e., data with thousands to millions of features. The main cause is that, when growing a tree from the bagged sample data, the subspace of features randomly sampled from the thousands of features in the training data to split a node is often dominated by less important features. A tree grown from such a randomly sampled feature subspace has low prediction accuracy, which in turn degrades the final prediction of the random forest.

In this paper, we propose a new feature weighting subspace selection approach to improve the prediction accuracy of RF while maintaining the diversity and the randomness of the forest. Given a training data set L, we first use a feature permutation technique [3], [4] to measure the importance of features and produce raw feature importance scores. We then apply a p-value assessment to find the cut-off between informative and less informative features. For the informative features, the Spearman rank test is used in the regression problem and the χ^2 statistic in the classification problem to find the subset of highly informative features. This separation yields three subsets of features. When sampling the feature subspace for learning, features from these three groups of highly informative, informative and less informative features are all taken into account for splitting the data at a node. Since the subspace always contains highly informative features, it can guarantee a better split at a node, and therefore a better tree. This sampling method always provides enough highly informative features for the feature subspace at any level of the decision tree. By taking features from all three subsets into account, the diversity and the randomness of the forest in Breiman's framework [1] are maintained.

This feature subspace selection is used for building trees in our new random forest algorithm, called ssRF, which deals with both classification and regression problems. In the ssRF model, quantile regression is employed to produce both point and range predictions in regression problems. Our experimental results show that, with the proposed feature sampling method, the ssRF model outperforms existing random forests in reducing prediction errors, even though a small feature subspace size of log2(M) + 1 is used, and it performs especially well in range prediction on high-dimensional data.

2 Feature Weighting Subspace Selection

2.1 Importance Measure of Features from a Random Forest

The feature importance measure obtained from a random forest is described as follows [5], [6]. At each node t in a decision tree, a split on feature X_j is determined by the decrease in node impurity ΔR(X_j, t).
For a regression tree, the node impurity is R(t) = σ^2(t) p(t), where p(t) = N(t)/N is the probability that a sample drawn at random from the underlying distribution falls into node t, N(t) is the number of samples in node t, and σ^2(t) = \sum_{X_i \in t} (Y_i - \bar{Y}_t)^2 / N(t) is the sample variance of Y in node t. The decrease of impurity at node t after splitting it into t_L and t_R is

\Delta R(X_j, t) = R(t) - [R(t_L) + R(t_R)] = \sigma^2(t) p(t) - [\sigma^2(t_L) p_L + \sigma^2(t_R) p_R],  (3)

where p_L and p_R are the proportions of samples in t that go left and right, respectively.

For classification trees, the Gini index is used as the node impurity R(t). Suppose there are S class values in node t (s ∈ {1, ..., S}), and let π_t(s) be the proportion of samples from the sth class in node t. The node impurity is defined as R(t) = N(t) \sum_{s=1}^{S} \pi_t(s)[1 - \pi_t(s)].

The split on feature X_j chosen for a node t is the one that maximizes ΔR(X_j, t). Let IS_k(X_j) denote the importance score of feature X_j in a single decision tree T_k:

IS_k(X_j) = \sum_{t \in T_k} \Delta R(X_j, t).

The importance score IS_j of feature X_j is computed over all K trees in a random forest as

IS_j = \frac{1}{K} \sum_{k=1}^{K} IS_k(X_j).

It is worth noting that a random forest uses the in-bag samples (i.e., the bagged samples used to build the trees) to produce the importance scores IS_j. This is the main difference from an out-of-bag measure, whose OOB-permutation computation requires much more time [7], [3]. We normalize the scores into [0, 1] using min-max normalization:

VI_j = \frac{IS_j - \min_j(IS_j)}{\max_j(IS_j) - \min_j(IS_j)}.  (4)

Having the raw importance scores VI_j determined by Equation (4), we can evaluate the contributions of the features to predicting the response feature.

2.2 A New Feature Sampling Method for Subspace Selection

We first compute importance scores for all features according to Equation (4). Denote the feature set as L_X = {X_j}, j = 1, 2, ..., M. We randomly permute the values of each feature to obtain a corresponding shadow feature set, denoted L_A = {A_j}_{j=1}^{M}. The shadow features have no predictive power for the response feature. Following the feature permutation procedure recently presented in [3], we run RF R times on the extended data set {L_X ∪ L_A, Y} to obtain importance scores VI^r_{X_j} and VI^r_{A_j}, together with the comparison samples V*_r = max_j {VI^r_{A_j}}, r = 1, ..., R. The unequal-variance Welch two-sample t-test [8] is then used to compare the importance scores of each feature with the maximum importance scores of the generated shadows. This unequal-variance test is used because the importance scores across the replicates cannot be assumed to be normally distributed with equal variances. Having computed the t statistic, we compute a p-value for each feature and perform the hypothesis test VI_{X_j} > V*. This test confirms that a feature is important if it consistently scores higher than the shadow maximum over multiple permutations. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy shadow features is considered less important; otherwise, it is considered important.

The p-value of a feature indicates the importance of the feature for prediction. The smaller the p-value of a feature, the more strongly the predictor feature is correlated with the response feature, and the more powerful the feature is for prediction.
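The shadow-feature test can be sketched as follows (our illustration, not the authors' code; the rf_importance callback is a hypothetical helper that returns the normalized scores of Equation (4) for each column of its input, and SciPy's Welch test stands in for the test described above):

```python
# Sketch of the shadow-feature test described above (our illustration, not the
# authors' code). `rf_importance` is a hypothetical callback that returns the
# normalized importance scores of Equation (4) for each column of its input.
import numpy as np
from scipy import stats

def feature_pvalues(X, Y, rf_importance, R=30, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    real_scores = np.empty((R, M))                       # VI^r_{X_j}
    shadow_max = np.empty(R)                             # V*_r = max_j VI^r_{A_j}
    for r in range(R):
        A = np.apply_along_axis(rng.permutation, 0, X)   # shadow features A_j
        VI = rf_importance(np.hstack([X, A]), Y)         # scores on {L_X U L_A, Y}
        real_scores[r] = VI[:M]
        shadow_max[r] = VI[M:].max()
    # Welch's unequal-variance t-test of VI_{X_j} > V*, one p-value per feature.
    return np.array([
        stats.ttest_ind(real_scores[:, j], shadow_max,
                        equal_var=False, alternative="greater").pvalue
        for j in range(M)
    ])
```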
Given a statistical significance level, we can separate the informative features from the low-informative ones. Given the p-values of all features, we set the significance level as a threshold λ, for instance λ = 0.05. Any feature whose p-value is greater than λ is added to the low-informative feature subset, denoted X_l; otherwise, the direct relationship of the feature with the Y values is assessed further.

The non-parametric Spearman ρ test is used to measure the strength of the relationship between X_j and Y ∈ R^1 in regression problems. The value |ρ| ∈ [0, 1], where |ρ| = 1 means a perfect correlation and 0 means no correlation. The Spearman rank correlation coefficient performs well when the conditional distribution is not normal. Each pair (X_j, Y) is converted to ranks (R(x_i), R(y_i)), i = 1, ..., N, and ρ_j is taken as the absolute value of the rank correlation,

\rho_j = \left| \frac{\sum_{i=1}^{N} (R(x_i) - \bar{R}_x)(R(y_i) - \bar{R}_y)}{\sqrt{\sum_{i=1}^{N} (R(x_i) - \bar{R}_x)^2 \sum_{i=1}^{N} (R(y_i) - \bar{R}_y)^2}} \right|,  (5)

where \bar{R}_x and \bar{R}_y are the average ranks of feature X_j and of the response feature Y, respectively.

Given the ρ values of the remaining features {X \ X_l}, we take the mean of all ρ values as the threshold γ,

\gamma = \frac{1}{M_\lambda} \sum_{j=1}^{M_\lambda} \rho_j,  (6)

where M_λ is the number of numerical features in the important feature subset {X \ X_l}. Let X_h denote the subset of highly informative features; every feature X_j whose ρ_j value is greater than γ is added to X_h. The remaining features, including the categorical features, are added to the informative feature subset, denoted X_m.

For the classification problem, the χ^2(X_j, Y) statistic is used to test the association between the class label and each feature X_j. For this test of independence, a chi-squared probability of at most 0.05 is commonly interpreted as grounds for rejecting the hypothesis that the feature is independent of the response feature. All features X_j whose χ^2-test p-value is smaller than 0.05 are added to X_h; the remaining features are added to X_m.

Given X_h, X_m and X_l, at each node we randomly select mtry (mtry > 1) features drawn separately from the three groups. For a given subspace size, the proportions of highly informative, informative and less informative features are chosen according to the sizes of the three groups: mtry_high = mtry × (M_high / M), mtry_mid = mtry × (M_mid / M) and mtry_low = mtry − mtry_high − mtry_mid, where M_high and M_mid are the numbers of features in X_h and X_m, respectively. The three draws are merged to form the feature subspace used for splitting the node.
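For the regression case, the partition and the proportional subspace sampling can be sketched as follows (our illustration under the assumptions stated in the comments; pvals are the Welch-test p-values from the previous sketch, and the classification case would replace the Spearman step by the per-feature χ^2 test described above):

```python
# Sketch of the three-group partition and the proportional subspace sampling
# (our illustration for the regression case; `pvals` are the Welch-test
# p-values from the previous sketch).
import numpy as np
from scipy.stats import spearmanr

def partition_features(X, Y, pvals, lam=0.05):
    low = np.where(pvals > lam)[0]                       # X_l: low-informative
    informative = np.where(pvals <= lam)[0]
    # |Spearman rho| of each informative feature, compared to the mean threshold gamma (Eq. 6).
    rho = np.array([abs(spearmanr(X[:, j], Y)[0]) for j in informative])
    gamma = rho.mean()
    high = informative[rho > gamma]                      # X_h: highly informative
    mid = np.setdiff1d(informative, high)                # X_m: informative
    return high, mid, low

def sample_subspace(high, mid, low, mtry, rng):
    M = len(high) + len(mid) + len(low)
    m_h = int(mtry * len(high) / M)                      # mtry_high
    m_m = int(mtry * len(mid) / M)                       # mtry_mid
    m_l = mtry - m_h - m_m                               # mtry_low
    chosen = []
    for pool, m in ((high, m_h), (mid, m_m), (low, m_l)):
        m = min(m, len(pool))
        if m > 0:
            chosen.append(rng.choice(pool, size=m, replace=False))
    return np.concatenate(chosen) if chosen else np.array([], dtype=int)
```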
3 The Proposed ssRF Algorithm

The new feature subspace sampling method is now used to grow the decision trees of an RF. For the regression problem, we propose to use quantile regression to obtain both point and range predictions; this idea was introduced in [9]. Using the notation of [1], let θ_k be the random parameter vector that determines the growth of the kth tree and Θ = {θ_k}_{k=1}^{K} be the set of random parameter vectors of the forest generated from L. In each regression tree T_k grown from L_k, we compute a positive weight w_i(x, θ_k) for each case X_i ∈ L. Let l(x, θ_k, t) be the leaf node t of T_k into which x falls. The cases X_i ∈ l(x, θ_k, t) are all assigned the same weight w_i(x, θ_k) = 1/N(t), where N(t) is the number of cases in l(x, θ_k, t). In each classification tree, w_i(x, θ_k) = 1 if \sum_{n=1}^{N(t)} I(Y_n = Y_i) \ge \sum_{n=1}^{N(t)} I(Y_n = Y_j) for all Y_j ≠ Y_i. This means that the prediction for a regression problem is simply the average of the Y values in node t, and for a classification problem it is the category receiving the majority of votes among the Y values in node t. In this way, all cases in L_k are assigned positive weights and the cases not in L_k are assigned zero weight. For a single tree, given X = x, the prediction value is

\hat{Y}_k = \sum_{i=1}^{N} w_i(x, \theta_k) Y_i = \sum_{X_i \in l(x, \theta_k, t)} w_i(x, \theta_k) Y_i.  (7)

The new random forest algorithm ssRF is summarized as follows.

1. Given L, separate the highly informative and the informative features from the less informative ones to obtain the three feature subsets X_h, X_m and X_l, as described in Section 2.2.
2. Sample the training set L with replacement to generate the bagged samples L_k, k = 1, 2, ..., K.
3. For each L_k, grow a decision tree T_k as follows:
   (a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_l, X_m and X_h, and use these subspace features as candidates for splitting the node.
   (b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached. At each leaf node, all Y ∈ R^1 values of the samples in the leaf node are kept.
   (c) Compute the weights w_i(x, θ_k) of each X_i for the individual tree T_k using the out-of-bag samples.
4. Compute the weights w_i(x) assigned by the RF as the average of the weights over all trees:

   w_i(x) = \frac{1}{K} \sum_{k=1}^{K} w_i(x, \theta_k)  (8)

5. Given an input X = x, use Equation (2) to predict the new sample in the classification problem. For the regression problem, find the leaf nodes l_k(x, θ_k) of all trees into which x falls and the set of Y_i in these leaf nodes. Given all Y_i and the corresponding weights w_i(x), the conditional distribution function of Y given X is estimated as \hat{F}(y | X = x) = \sum_{i=1}^{N} w_i(x) I(Y_i \le y), where I(·) is the indicator function, equal to 1 if Y_i ≤ y and 0 otherwise. Given a probability α, the quantile Q_α(X) is estimated as \hat{Q}_\alpha(X = x) = \inf\{ y : \hat{F}(y | X = x) \ge \alpha \}. Given a probability τ and quantile levels α_l, α_h with α_h − α_l = τ, τ is the probability that the prediction of Y falls in the range [Q_{α_l}(X), Q_{α_h}(X)], where

   [Q_{\alpha_l}(X), Q_{\alpha_h}(X)] = [\inf\{ y : \hat{F}(y | X = x) \ge \alpha_l \}, \inf\{ y : \hat{F}(y | X = x) \ge \alpha_h \}].  (9)

   For point regression, the median \hat{Q}_{0.5}, which lies within this range, can be chosen as the prediction of Y given the input X = x.
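The weighted quantile estimation in step 5 can be sketched as follows (our illustration, assuming the forest weights w_i(x) of Equation (8) and the training responses Y_i are already available as NumPy arrays; the symmetric choice α_l = (1 − τ)/2 is our assumption, since only α_h − α_l = τ is required):

```python
# Sketch of the weighted quantile estimation in step 5 (our illustration):
# Y holds the training responses Y_i and w the forest weights w_i(x) of Eq. (8).
import numpy as np

def weighted_quantile(Y, w, alpha):
    order = np.argsort(Y)
    y_sorted, w_sorted = Y[order], w[order]
    cdf = np.cumsum(w_sorted)                  # F_hat(y|x) evaluated at the sorted Y_i
    idx = np.searchsorted(cdf, alpha)          # inf{ y : F_hat(y|x) >= alpha }
    return y_sorted[min(idx, len(Y) - 1)]

def range_prediction(Y, w, tau=0.90):
    # Symmetric interval alpha_l = (1 - tau)/2 is our choice; the paper only
    # requires alpha_h - alpha_l = tau.
    alpha_l, alpha_h = (1 - tau) / 2, (1 + tau) / 2
    point = weighted_quantile(Y, w, 0.5)       # median Q_0.5 as the point prediction
    return point, (weighted_quantile(Y, w, alpha_l),
                   weighted_quantile(Y, w, alpha_h))
```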
4 Experiments and Evaluation

4.1 Data Sets

We conducted experiments to test the proposed method on high-dimensional data sets for both classification and regression problems. Table 1 lists the real data sets used to evaluate the performance of the random forest models. The Fbis data set was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s data sets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The Rivers data set (http://www.usgs.gov) was used to predict the flow level of a river. It is based on a data set containing the discharge levels of 1,439 Californian rivers over a period of 12,054 days. This data set contains 48.6% missing values; all values were used to train the model. The level of the 1,440th river was predicted in our experiments, and the target values were rescaled from [0.062; 101,000] to [0; 1]. The LOG1P data set was used in [10]. The Stock data set, described in [11], was used for stock price prediction; it has about 8.35% missing values in the predictor features. The original Y values lie between 880 and 82,710; these target values were rescaled to [0; 1] by linear scaling. Each data set was split into separate training and test subsets, with the sizes given in Table 1.

Table 1. Description of the high-dimensional data sets, sorted by the number of features and grouped for the regression and classification problems.

Data set   #Train   #Test   #Features    #Classes
Stock       1,942     785         495         -
Rivers      8,345   3,709       1,439         -
LOG1P      16,087   3,308   4,272,227         -
Fbis        1,711     752       2,000        17
La2s        1,855     845      12,432         5
La1s        1,963     887      13,195         5

4.2 Experimental Setting

Evaluation Measure: We used Breiman's method of measurement as described in [1]. The prediction accuracy of the RF models was evaluated on the test set: for the regression problem the mean of squared residuals (MSR) was computed, and for the classification problem the test error was used.

The latest RF [12], QRF [13], cRF (cForest) [14] and GRRF [15] R packages on CRAN (http://cran.r-project.org/) were used in the R environment to conduct the experiments. For the GRRF model, we used a value of 0.1 for the coefficient γ because GRRF(0.1) has shown competitive prediction performance in [16]. The SRF model [17], which uses a stratified sampling method, is intended for the classification problem, while the QRF and eQRF [18] models were developed for regression problems only. The ssRF model with the new subspace sampling method is a new implementation, in which we call the corresponding R/C++ functions from the R environment.

From each training data set we built 10 random forest models and computed the averages of their MSRs and test errors; each RF model had 200 trees for regression and 500 trees for classification. The minimum node size n_min was 5 for regression and 1 for classification. The number of candidate features was set to the default mtry = log2(M) + 1. The parameters R, mtry and λ used in ssRF for the pre-computation of the feature partition were 30, √M and 0.05, respectively. To process the large-scale LOG1P data set, only 5% of the samples were used to train the eQRF and ssRF models for the feature partition and subspace selection, since the computational time required for all samples is too long.

To address missing values in the data sets, we separate all samples containing missing values and create an extra "missing" group for them; this "missing" class is then treated as a predictor of the response feature. If missing values occur in the response feature, those samples are omitted. After this separation, missing values are treated as if they had actually been observed.

All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon E5620 CPU at 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory. The ssRF and eQRF models were implemented as multi-thread processes, while the other models were run as single-thread processes.
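For reference, the two error measures reported in the next subsection reduce to the following (a small sketch of our own, not from the paper):

```python
# The two evaluation measures used in the next subsection (our own small
# sketch): mean of squared residuals (MSR) for regression, test error for
# classification.
import numpy as np

def msr(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def test_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```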
4.3 Results on Real Data Sets

The performance of the RF models is evaluated while varying the number of trees and the number of features, the two key parameters of an RF model. Figures 1(a) and (b) show the regression errors of the random forest models as the number of trees K varies, with mtry = log2(M) + 1. Figures 1(c) and (d) present the curves obtained when the number of random features mtry in the subspace increases while the number of trees is fixed (K = 200); the vertical line in each plot indicates the subspace size mtry = log2(M) + 1 suggested by Breiman [1] for applying RF to low-dimensional data sets. Table 2 shows the test errors on the classification data sets against the number of trees and the number of features.

The RF, QRF and eQRF models were unable to build models on the Stock and Rivers data sets, which contain missing values; the imputation function of the randomForest R package was used to recover the missing values in these two data sets. The eQRF model was not considered further in this experiment because its prediction accuracy ranked last on the imputed data sets. The cRF model handled the data sets containing missing values well, but it crashed when applied to the large data sets. The results of the RF models applied to the imputed data sets are denoted RF.i and QRF.i in the plots.

Fig. 1. The prediction performance of the regression random forest models against the number of trees and the number of features on real data sets: (a), (c) Stock data; (b), (d) Rivers data.

Table 2. The prediction test error of the RF models against the number of trees K and the number of features mtry on the classification data sets; the lowest error in each column is the best result.

Data set  Model    K=50    100     150     200     300     mtry=10  20      30      40      50
Fbis      RF      .2307   .2241   .2254   .2261   .2279    .2434   .2351   .2156   .2303   .2187
          GRRF    .2394   .2407   .2287   .2314   .2340    .2527   .2101   .1955   .1862   .1981
          SRF     .1689   .1649   .1622   .1569   .1618    .1569   .1702   .1636   .1715   .1715
          ssRF    .1676   .1676   .1543   .1689   .1569    .1822   .1556   .1503   .1503   .1522
La2s      RF      .2303   .2363   .2256   .2315   .2280    .2536   .1611   .1586   .1432   .1402
          GRRF    .2476   .2121   .2180   .2156   .2192    .2820   .1860   .1540   .1505   .1386
          SRF     .1327   .1517   .1493   .1445   .1410    .1244   .1315   .1374   .1386   .1434
          ssRF    .1078   .1066   .1102   .1185   .1090    .1149   .1102   .0995   .1002   .1014
La1s      RF      .6708   .6697   .6731   .6742   .6488    .6776   .6032   .4543   .3337   .2052
          GRRF    .1928   .1759   .2063   .1849   .1966    .1905   .1691   .1612   .1577   .1409
          SRF     .1308   .1353   .1330   .1353   .1488    .1330   .1375   .1387   .1364   .1398
          ssRF    .1354   .1321   .1322   .1321   .1264    .1477   .1432   .1443   .1319   .1387

We can see that ssRF consistently provided good results and achieved a lower prediction error in Figure 1 and Table 2 as K and mtry vary, on both kinds of data sets. In the few cases where the ssRF model did not obtain the best result, compared with SRF on the Fbis and La1s data sets, the differences from the best results were minor. These results suggest that, at the lower levels of a tree, the gain of a split is reduced by the effect of splits on other features at higher levels; the prediction errors of the other random forest models therefore increase, while the ssRF model continues to produce better results. This is because the selected feature subspace contains enough highly informative features at every level of the decision tree. The effect of the new sampling method is clearly demonstrated by this result. In Figures 1(c) and (d) and the right half of Table 2, the RF and QRF models require a larger number of features to reach a lower prediction error. This means that the RF and QRF models could achieve better prediction performance only if they were provided with a much larger feature subspace.
For the regression and the classification problem, the default subspace sizes in the RF and QRF R packages are mtry = M/3 and mtry = √M, respectively. With subspaces of this size, the computational time for building an RF is very high, especially for large high-dimensional data. These empirical results indicate that the ssRF model does not need many features in the subspace to achieve good prediction performance: when applied to high-dimensional data with a subspace of only mtry = log2(M) + 1 features, the results are already satisfactory. In general, when a feature subspace of the size suggested by Breiman is used, the ssRF model gives a lower prediction error with less computational time. We consider this one of the contributions of this work.

Figure 2 shows the point and 90% range prediction results of the eQRF and ssRF models on the large high-dimensional data set LOG1P. The green and red points mark predictions inside and outside the predicted ranges, respectively. Figure 2(a) shows the point and 90% range predictions of the eQRF model; its point predictions are more scattered than those of the ssRF model. A significant improvement by the ssRF model can be observed in Figure 2(b): the predicted points lie closer to the diagonal line, indicating that the predicted values are close to the true values, and there are fewer red points, indicating that a larger number of predictions fall within the predicted ranges. These results clearly demonstrate the advantage of the ssRF model over the recently proposed eQRF model.

Fig. 2. Comparison of range predictions by the regression eQRF and ssRF models on the large high-dimensional data set LOG1P: (a) range predictions by eQRF; (b) range predictions by ssRF.

5 Conclusions

We have presented a new approach to feature subspace selection for efficient node splitting when building decision trees in random forests. Based on it, a new random forest algorithm, ssRF, has been developed for predicting high-dimensional data. Quantile regression is employed to obtain predictions in the regression problem, which makes the RF more robust towards outliers. With the new subspace feature selection, the small subspace size mtry = log2(M) + 1 reported by Breiman can be used in our algorithm to obtain a lower prediction error. With ssRF, the performance on both classification and regression problems (point and range prediction) is preserved and improved. Experimental results have demonstrated the improvement of ssRF in reducing prediction errors in comparison with recently proposed random forests, including eQRF, GRRF and SRF, and it performs especially well on large high-dimensional data.

Acknowledgments. This research is supported in part by NSFC under Grant No. 61203294 and the Natural Science Foundation of SZU (Grant No. 201433). Joshua Huang was supported by the National Natural Science Foundation of China under Grant No. 61473194.

References

1. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
2. Breiman, L.: Manual on setting up, using, and understanding random forests v3.1 (2002), retrieved October 23, 2010
3. Nguyen, T.T., Huang, J., Nguyen, T.: Two-level quantile regression forests for bias correction in range prediction. Machine Learning, 1–19 (2014)
4. Tuv, E., Borisov, A., Runger, G., Torkkola, K.: Feature selection with ensembles, artificial variables, and redundancy elimination. The Journal of Machine Learning Research 10, 1341–1366 (2009)
5. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press (1984)
6. Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems, pp. 431–439 (2013)
7. Genuer, R., Poggi, J.M., Tuleau-Malot, C.: Variable selection using random forests. Pattern Recognition Letters 31(14), 2225–2236 (2010)
8. Welch, B.L.: The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34, 28–35 (1947)
9. Meinshausen, N.: Quantile regression forests. The Journal of Machine Learning Research 7, 983–999 (2006)
10. Ho, C.H., Lin, C.J.: Large-scale linear support vector regression. The Journal of Machine Learning Research 13(1), 3323–3348 (2012)
11. Cai, Z., Jermaine, C., Vagena, Z., Logothetis, D., Perez, L.L.: The pairwise Gaussian random field for high-dimensional data imputation. In: IEEE International Conference on Data Mining (ICDM), pp. 61–70. IEEE (2013)
12. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
13. Meinshausen, N.: quantregForest: Quantile regression forests. R package version 0.2-3 (2012)
14. Hothorn, T., Hornik, K., Zeileis, A.: party: A laboratory for recursive part(y)itioning. R package version 0.9-9999 (2011), http://cran.r-project.org/package=party (last accessed November 28, 2013)
15. Deng, H.: Guided random forest in the RRF package. arXiv preprint arXiv:1306.0237 (2013)
16. Deng, H., Runger, G.: Gene selection with guided regularized random forest. Pattern Recognition 46(12), 3483–3489 (2013)
17. Ye, Y., Wu, Q., Zhexue Huang, J., Ng, M.K., Li, X.: Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognition 46(3), 769–787 (2013)
18. Tung, N.T., Huang, J.Z., Khan, I., Li, M.J., Williams, G.: Extensions to quantile regression forests for very high-dimensional data. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014, Part II. LNCS, vol. 8444, pp. 247–258. Springer, Heidelberg (2014)