A streaming parallel decision tree algorithm
Yael Ben-Haim and Elad Yom-Tov
IBM Haifa Research Lab, 165 Aba Hushi st., Haifa 31905, Israel
{yaelbh,yomtov}@il.ibm.com
ABSTRACT
A new algorithm for building decision tree classifiers is proposed. The algorithm is executed in a distributed environment and is especially designed for classifying large datasets
and streaming data. It is empirically shown to be as accurate as standard decision tree classifiers, while being scalable
to infinite streaming data and multiple processors.
1. INTRODUCTION
We propose a new algorithm for building decision tree classifiers for classifying both large datasets and (possibly infinite)
streaming data. As recently noted [4], the challenge which
distinguishes large-scale learning from small-scale learning
is that training time is limited compared to the amount of
available data. Thus, in our algorithm both training and
testing are executed in a distributed environment. We refer to the new algorithm as the Streaming Parallel Decision
Tree (SPDT).
Decision trees are simple yet effective classification algorithms. One of their main advantages is that they provide human-readable rules of classification. Decision trees
have several drawbacks, especially when trained on large
data, where the need to sort all numerical attributes becomes costly in terms of both running time and memory
storage. The sorting is needed in order to decide where
to split a node. The various techniques for handling large
data can be roughly grouped into two approaches: Performing pre-sorting of the data (SLIQ [12] and its successors
SPRINT [17] and ScalParC [11]), or replacing sorting by
approximate representations of the data such as sampling
and/or histogram building (e.g. BOAT [7], CLOUDS [1],
and SPIES [10]). While pre-sorting techniques are more accurate, they can accommodate neither very large datasets nor infinite streaming data.
To meet the challenge of handling large data, a large body of work has been dedicated to parallel decision tree algorithms [17], [11], [13], [10], [19], [18], [8]. There are several ways to parallelize decision trees (described in detail in [2], [19], [13]): In horizontal parallelism, the data is partitioned such that different processors see different examples (we refer to processing nodes as processors, to avoid confusion with tree nodes). In vertical parallelism, different processors see different attributes. Task
parallelism involves distribution of the tree nodes among the
processors. Finally, hybrid parallelism combines horizontal
or vertical parallelism in the first stages of tree construction
with task parallelism towards the end.
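To make the distinction concrete, the following sketch (in Python with NumPy; the function names are ours and purely illustrative) shows what each processor receives under horizontal and vertical partitioning:

```python
import numpy as np

def horizontal_partition(X, y, num_workers):
    # Horizontal parallelism: each processor receives a subset of the examples
    # (rows of X) together with the corresponding labels.
    return list(zip(np.array_split(X, num_workers),
                    np.array_split(y, num_workers)))

def vertical_partition(X, num_workers):
    # Vertical parallelism: each processor receives a subset of the attributes
    # (columns of X) for all examples.
    return np.array_split(X, num_workers, axis=1)
```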
Like their serial counterparts, parallel decision trees overcome the sorting obstacle by applying pre-sorting, distributed
sorting, and approximations. Following our interest in infinite streaming data, we focus on approximate algorithms.
In streaming algorithms, the dominant approach is to read
a limited batch of data and use each such batch to split
tree nodes. We refer to processing each such batch as an
iteration of the algorithm.
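As a rough illustration of this batch-per-iteration scheme (a generic sketch, not taken from any of the cited systems), a possibly infinite stream can be consumed as follows:

```python
def batches(stream, batch_size):
    # Group a (possibly infinite) iterable of examples into fixed-size batches;
    # each batch drives one iteration, i.e. one round of node splitting.
    batch = []
    for example in stream:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder when the stream is finite
        yield batch
```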
The SPIES algorithm [10] is designed for streaming data,
but requires holding each batch in memory because it may
need several passes over each batch. pCLOUDS [18] relies on
assumptions on the behavior of the impurity function, which
are empirically justified but can be false for a particular
dataset. We note that none of the experiments reported in
previous works involved both a large number of examples
and a large number of attributes.
2. ALGORITHM DESCRIPTION
Our proposed algorithm builds the decision tree in a breadth-first mode, using horizontal parallelism. At the core of our
algorithm is an on-line method for building histograms from
streaming data at the processors. These histograms are then
used for making decisions on new tree nodes at the master
processor.
We empirically show that our proposed algorithm is as accurate as traditional, single-processor algorithms, while being
scalable to infinite streaming data and multiple processors.
2.1 Tree growing algorithm
We construct a decision tree based on a set of training examples {(x_1, y_1), ..., (x_n, y_n)}, where x_1, ..., x_n ∈ R^d are the feature vectors and y_1, ..., y_n ∈ {1, ..., c} are the labels. Every internal node in the tree possesses two ordered child nodes and a decision rule of the form x(i) < a, where x(i) is the ith feature and a is a real number. Feature vectors that satisfy the decision rule are directed to the node's left child node, and the other vectors are directed to the right child node. Every example x thus has a path from the root to one of the leaves, denoted l(x). Every leaf has a label t, so that an example x is assigned the label t(l(x)). The label is accompanied by a real number that represents the confidence in the label's correctness (note that since the number of different confidence levels is upper-bounded by the number of leaves, the decision tree does not provide continuous-valued outputs).
Initially, the tree consists only of one node. The tree is
grown iteratively, such that in each iteration a new level
of nodes is appended to the tree. We apply a distributed
architecture that consists of Nw processors. Each processor
can observe 1/Nw of the data, but has a view of the complete
classification tree built so far.
At each iteration, each processor uses the data points it observes to build a histogram for each class, terminal node (leaf), and feature. Each data point is classified to the correct leaf of the current tree and is used to update the relevant histograms. Section 2.2 provides a description of the histogram algorithms.
After observing a predefined number of data points (or, in
the case of finite data, after seeing the complete data) the
histograms are communicated to a master processor which
integrates these histograms and reaches a decision on how to
split the nodes, using the chosen split criterion (see e.g. [5,
16]): For each bin location in the histogram of each dimension, the (approximate) number of points from each class to
the left and to the right of this location is counted. This is
then used to compute the purity of the child nodes that would result if the leaf were split at this dimension and location. The feature i and location a for which
the child nodes’ purities are maximized will constitute the
decision rule x(i) < a. The leaf becomes an internal node
with the chosen decision rule, and two new nodes (its child
nodes) are created. If the node is already pure enough, the
splitting is stopped and the node is assigned a label and a
confidence level, both determined by the number of examples from each class that reached it.
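The following sketch illustrates the master's split selection under simplifying assumptions: each merged histogram is taken to be a plain list of (bin center, count) pairs, and points are counted as lying to the left of a candidate threshold whenever their bin center does (the names best_split and leaf_hists are ours). It conveys the structure of the computation rather than the exact procedure used in our implementation.

```python
from collections import defaultdict

def gini(counts):
    # Gini impurity of a class-count dictionary; 0 means a pure node.
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_split(leaf_hists, classes, features):
    # leaf_hists: (feature, class) -> list of (bin_center, count) pairs,
    # i.e. the histograms the master has merged for one leaf.
    best_feature, best_threshold, best_score = None, None, float("inf")
    for f in features:
        # Candidate thresholds: every bin center observed for this feature.
        candidates = sorted({p for c in classes
                               for p, _ in leaf_hists.get((f, c), [])})
        for a in candidates:
            left, right = defaultdict(float), defaultdict(float)
            for c in classes:
                for p, m in leaf_hists.get((f, c), []):
                    (left if p < a else right)[c] += m
            n_l, n_r = sum(left.values()), sum(right.values())
            if n_l == 0 or n_r == 0:
                continue  # degenerate split
            # Weighted impurity of the two would-be child nodes.
            score = (n_l * gini(left) + n_r * gini(right)) / (n_l + n_r)
            if score < best_score:
                best_feature, best_threshold, best_score = f, a, score
    return best_feature, best_threshold
```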
Decision trees are frequently pruned during or after training to obtain smaller trees and better generalization. We adapted the MDL-based pruning algorithm of [12]. This algorithm involves simple calculations during node splitting that reflect the node's purity. In a bottom-up pass over the complete tree, some subtrees are chosen to be pruned, based on estimates of the expected error rate before and after pruning.
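For intuition only, the following sketch shows the shape of such a bottom-up pass; it replaces the MDL code-length estimates of [12] with a generic per-node complexity penalty, so it should be read as an illustration of the control flow rather than of the actual cost computation.

```python
class Node:
    def __init__(self, errors_as_leaf, left=None, right=None):
        # errors_as_leaf: estimated errors if this node were collapsed into a
        # leaf labelled with its majority class (collected during splitting).
        self.errors_as_leaf = errors_as_leaf
        self.left, self.right = left, right

def prune(node, penalty=0.5):
    # Bottom-up pass: returns the (possibly pruned) node and its error estimate.
    if node.left is None:                     # already a leaf
        return node, node.errors_as_leaf
    node.left, err_left = prune(node.left, penalty)
    node.right, err_right = prune(node.right, penalty)
    subtree_error = err_left + err_right + penalty    # charge for extra structure
    if node.errors_as_leaf <= subtree_error:          # pruning does not hurt
        return Node(node.errors_as_leaf), node.errors_as_leaf
    return node, subtree_error
```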
2.2 On-line histogram building

A histogram is a set of r pairs (called bins) (p_1, m_1), ..., (p_r, m_r), where r is a preset constant integer, p_1, ..., p_r are real numbers, and m_1, ..., m_r are integers. The histogram is a compressed and approximate representation of a large set S of real numbers, so that |S| = Σ_{i=1}^r m_i, and m_i is the number of points of S in the surroundings of p_i. The histogram data structure supports two procedures, named update and merge. The histogram building algorithm is a slight adaptation of the on-line clustering algorithm developed by Guedalia et al. [9], with the addition of a procedure for merging histograms.

The update procedure:
Given a histogram (p_1, m_1), ..., (p_r, m_r), p_1 < ... < p_r, and a point p, the update procedure adds p to the set S represented by the histogram.

• If p = p_i for some i, then increment m_i by 1. Otherwise:
• Add the bin (p, 1) to the histogram, resulting in a histogram of r + 1 bins (q_1, k_1), ..., (q_{r+1}, k_{r+1}), q_1 < ... < q_{r+1}.
• Find a point q_i such that q_{i+1} − q_i is minimal.
• Replace the bins (q_i, k_i), (q_{i+1}, k_{i+1}) by the single bin
  ((q_i k_i + q_{i+1} k_{i+1}) / (k_i + k_{i+1}), k_i + k_{i+1}).

The merge procedure:
Given two histograms, the merge procedure creates a new histogram that represents the union S_1 ∪ S_2 of the sets S_1, S_2 represented by the two histograms. The algorithm is similar to the update algorithm. In the first step, the two histograms form a single histogram with many bins. In the second step, the bins that are closest are merged together to form a single bin, and the process repeats until the histogram has r bins.
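The two procedures can be sketched directly as follows (the class name and in-memory representation are ours; each bin is stored as a [position, count] pair kept sorted by position):

```python
import bisect

class StreamingHistogram:
    # Fixed-size histogram with the update and merge procedures described above.
    def __init__(self, r):
        self.r = r
        self.bins = []          # sorted list of [p, m] pairs

    def update(self, p):
        # Add a single point p to the set represented by the histogram.
        keys = [q for q, _ in self.bins]
        i = bisect.bisect_left(keys, p)
        if i < len(self.bins) and self.bins[i][0] == p:
            self.bins[i][1] += 1          # p coincides with an existing bin
        else:
            self.bins.insert(i, [p, 1])   # add the bin (p, 1)
            self._shrink()

    def merge(self, other):
        # Represent the union of the two sets, then shrink back to r bins.
        self.bins = sorted(self.bins + other.bins)
        self._shrink()

    def _shrink(self):
        # While there are too many bins, replace the two closest bins
        # (q_i, k_i), (q_{i+1}, k_{i+1}) by their weighted average.
        while len(self.bins) > self.r:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (q1, k1), (q2, k2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(q1 * k1 + q2 * k2) / (k1 + k2), k1 + k2]]
```

In the SPDT setting, each processor would maintain one such histogram per (leaf, feature, class) triple, and the master would call merge on the histograms it receives before selecting splits.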
3. EMPIRICAL RESULTS
We compared the error rate of the SPDT algorithm with
the error rate of a standard decision tree on seven medium-sized datasets taken from the UCI repository [3]: Adult,
Isolet, Letter recognition, Nursery, Page blocks, Pen
digits, and Spambase. The characteristics and error rates
of all datasets are summarized in Table 1. Ten-fold cross
validation was applied where there was no natural train/test
partition.
We used an 8-CPU Power5 machine with 16GB memory,
using a Linux operating system. Our algorithm was implemented within the IBM Parallel Machine Learning toolbox
[15], which runs using MPICH2.
The comparison shows that the approximations undertaken
by the SPDT algorithm do not necessarily have a detrimental effect on its error rate. The F_F statistic combined with Holm's procedure (see [6]), at a confidence level of 95%, shows that all classifiers except SPDT with eight processors exhibited performance that could not be distinguished as statistically significantly different. For relatively small data, using eight processors means that each processor sees little data, and thus the histograms suffer in accuracy. This may explain the degradation in performance when using eight processors.
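For reference, a sketch of the test statistic (following the formulas in [6]; the function name is ours, and Holm's step-down procedure on the pairwise comparisons is omitted):

```python
import numpy as np
from scipy import stats

def friedman_ff(error_table):
    # error_table: N datasets x k classifiers matrix of error rates.
    N, k = error_table.shape
    # Rank the classifiers on each dataset (rank 1 = lowest error, ties averaged).
    ranks = np.apply_along_axis(stats.rankdata, 1, error_table)
    R = ranks.mean(axis=0)                       # average rank of each classifier
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)   # Iman-Davenport correction
    p_value = stats.f.sf(ff, k - 1, (k - 1) * (N - 1))
    return ff, p_value
```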
Dataset      | Number of examples | Number of features | Standard tree | SPDT, 1 processor | SPDT, 2 processors | SPDT, 4 processors | SPDT, 8 processors
Adult        | 32561 (16281)      | 105                | 17.67         | 15.75             | 15.58*             | 16.16              | 16.50
Isolet       | 6238 (1559)        | 617                | 18.70         | 14.56*            | 17.90              | 19.69              | 19.31
Letter       | 20000              | 16                 | 7.48*         | 8.65              | 9.28               | 10.13              | 10.07
Nursery      | 12960              | 25                 | 1.01*         | 2.58              | 2.67               | 2.82               | 3.16
Page blocks  | 5473               | 10                 | 3.13          | 3.07*             | 3.18               | 3.51               | 3.44
Pen digits   | 7494 (3498)        | 16                 | 4.6*          | 5.37              | 5.43               | 5.20               | 5.83
Spambase     | 4601               | 57                 | 8.37*         | 10.52             | 11.11              | 11.29              | 11.61

Table 1: Error rates for medium-sized datasets. The number of examples in parentheses is the number of test examples (if a train/test partition exists). The lowest error rate for each dataset is marked with an asterisk.
Dataset      | Error rate before pruning | Tree size before pruning | Error rate after pruning | Tree size after pruning
Adult        | 16.50                     | 1645                     | 14.34                    | 409
Isolet       | 19.31                     | 221                      | 17.77                    | 141
Letter       | 10.07                     | 135                      | 9.26                     | 67
Nursery      | 3.16                      | 178                      | 3.21                     | 167
Page blocks  | 3.44                      | 55                       | 3.44                     | 36
Pen digits   | 5.83                      | 89                       | 5.83                     | 81
Spambase     | 11.61                     | 572                      | 11.45                    | 445

Table 2: Error rates and tree sizes (number of nodes) before and after pruning, with eight processors.
It is also interesting to study the effect of pruning on the error rate and tree size. Using the procedure described above,
we pruned the trees obtained by SPDT. Table 2 shows that
pruning usually improves the error rate (though not by a statistically significant margin, according to a sign test), while reducing
the tree size by 80% on average.
We tested SPDT for speedup and scalability on the alpha
and beta datasets from the Pascal Large Scale Learning
Challenge [14]. Both datasets have 500000 examples and
500 dimensions, out of which we extracted datasets of sizes
100, 1000, 10000, 100000, and 500000.
Figure 1 shows the speedup for different sized datasets. We
further tested speedup in five more datasets taken from the
Pascal challenge: delta, epsilon, fd, ocr, and dna. Referring to dataset size as the number of examples multiplied
by the number of dimensions, we found that dataset size
and speedup are highly correlated (Spearman correlation of
0.919). This fits the theoretical complexity analysis of the algorithm, which is dominated by the histogram building process.
For scalability, we checked the running time as a function of
the dataset size. On a logarithmic scale, we obtain approximate regression curves (average R^2 = 0.9982) with slopes improving from 1.1 for a single processor to 0.8 for
eight processors. Thus, our proposed algorithm is especially
suited for cases where large data is available and processing
can be shared between many processors.
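The speedup and scalability figures above reduce to two simple computations, sketched here (the function names are ours; the inputs are whatever dataset sizes, running times, and speedups one records):

```python
import numpy as np
from scipy import stats

def scaling_exponent(dataset_sizes, running_times):
    # Slope of the log-log regression of running time against dataset size;
    # a slope near 1 indicates roughly linear scaling, below 1 better than linear.
    slope, _intercept = np.polyfit(np.log10(dataset_sizes),
                                   np.log10(running_times), 1)
    return slope

def size_speedup_correlation(dataset_sizes, speedups):
    # Spearman rank correlation between dataset size and observed speedup.
    rho, p_value = stats.spearmanr(dataset_sizes, speedups)
    return rho, p_value
```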
4. REFERENCES
[1] K. Alsabti, S. Ranka, and V. Singh. CLOUDS:
Classification for large or out-of-core datasets. In
Conference on Knowledge Discovery and Data Mining,
August 1998.
[2] N. Amado, J. Gama, and F. Silva. Parallel
implementation of decision tree learning algorithms.
In The 10th Portuguese Conference on Artificial
Intelligence on Progress in Artificial Intelligence,
Knowledge Extraction, Multi-agent Systems, Logic
Programming and Constraint Solving, pages 6–13,
December 2001.
[3] C. L. Blake, E. J. Keogh, and C. J. Merz. UCI
repository of machine learning databases, 1998.
[4] L. Bottou and O. Bousquet. The tradeoffs of large
scale learning. In Advances in Neural Information
Processing Systems, volume 20. MIT Press,
Cambridge, MA, 2008. to appear.
[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone.
Classification and Regression Trees. Wadsworth,
Monterrey, CA, 1984.
[6] J. Demšar. Statistical comparisons of classifiers over
multiple data sets. Journal of Machine Learning
Research, 7:1–30, 2006.
[7] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Y.
Loh. BOAT — optimistic decision tree construction.
In ACM SIGMOD International Conference on
Management of Data, pages 169–180, June 1999.
[8] S. Goil and A. Choudhary. Efficient parallel
classification using dimensional aggregates. In
Workshop on Large-Scale Parallel KDD Systems,
SIGKDD, pages 197–210, August 1999.
[9] I. D. Guedalia, M. London, and M. Werman. An
on-line agglomerative clustering method for
nonstationary data. Neural Comp., 11(2):521–540,
1999.
[10] R. Jin and G. Agrawal. Communication and memory
efficient parallel decision tree construction. In The 3rd
SIAM International Conference on Data Mining, May
2003.

[Figure 1: Speedup of the SPDT algorithm for the alpha (left) and beta (right) datasets. The plots show speedup versus number of processors (1 to 8) for datasets of 100, 1000, 10000, 100000, and 500000 examples.]
[11] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A
new scalable and efficient parallel classification
algorithm for mining large datasets. In The 12th
International Parallel Processing Symposium, pages
573–579, March 1998.
[12] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast
scalable classifier for data mining. In The 5th
International Conference on Extending Database
Technology, pages 18–32, 1996.
[13] G. J. Narlikar. A parallel, multithreaded decision tree
builder. In Technical Report CMU-CS-98-184,
Carnegie Mellon University, 1998.
[14] Pascal Large Scale Learning Challenge, 2008.
[15] IBM Parallel Machine Learning Toolbox.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning.
Morgan Kaufmann, San Mateo, CA, 1993.
[17] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A
scalable parallel classifier for data mining. In The
22nd International Conference on Very Large
Databases, pages 544–555, September 1996.
[18] M. K. Sreevinas, K. Alsabti, and S. Ranka. Parallel
out-of-core divide-and-conquer techniques with
applications to classification trees. In The 13th
International Symposium on Parallel Processing and
the 10th Symposium on Parallel and Distributed
Processing, pages 555–562, 1999.
[19] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh.
Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery,
3(3):237–261, September 1999.