Algorithm Selection on Data Streams

Jan N. van Rijn (1), Geoffrey Holmes (2), Bernhard Pfahringer (2), Joaquin Vanschoren (3)
(1) Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
(2) Department of Computer Science, University of Waikato, Hamilton, New Zealand
(3) Department of Mathematics and Computer Science, TU Eindhoven, Eindhoven, The Netherlands

Introduction

Modern society produces vast amounts of data, coming, for instance, from sensor networks and from various text sources on the internet. Many machine learning algorithms are able to capture general trends in such data and make predictions for future observations with a reasonable success rate. Using meta-learning, our goal is to establish knowledge about what type of algorithm works well on which kind of data.

Discovery 1

Using basic meta-features, we can predict the best performing classifier from the following set of classifiers: {Naive Bayes, k-NN, Hoeffding Tree, SPegasos, Stochastic Gradient Descent}.

[Pie chart: which classifier performs best per interval: Hoeffding Tree 65%, Naive Bayes 17%, SPegasos 11%, SGD 5%, k-NN 2%.]

The pie chart shows the skewness of this dataset: Hoeffding Trees perform best on most intervals. Measured over all data streams, a meta-classifier was able to predict the best performing classifier in more than 80% of the cases. It improved upon the baseline in both meta-level accuracy and base-level accuracy. The results are competitive with state-of-the-art meta-classifiers.

Data Streams

Requirements (as stated by [1]):
• Process an example at a time, and inspect it only once
• Use a limited amount of memory
• Work in a limited amount of time
• Be ready to predict at any point

Interleaved test-then-train: each instance is first used to test the model, after which it can be used to train the model. Example: the stock market.
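This evaluation scheme is straightforward to implement. Below is a minimal sketch in Python; the majority-class learner, the synthetic stream and the window size of 1,000 instances are illustrative assumptions, standing in for the MOA classifiers and streams used in the actual study.

```python
from collections import Counter

class MajorityClass:
    """Trivial incremental classifier: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def train(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, learner, window=1000):
    """Interleaved test-then-train: every instance is first used to test the
    model and only afterwards to train it. Returns one accuracy value per
    window of `window` instances, i.e. one point per interval."""
    correct, seen, curve = 0, 0, []
    for x, y in stream:
        if learner.predict(x) == y:   # test first ...
            correct += 1
        learner.train(x, y)           # ... then train
        seen += 1
        if seen % window == 0:
            curve.append(correct / window)
            correct = 0
    return curve

# Toy usage: a synthetic two-class stream of 10,000 instances.
stream = [((i % 7,), int(i % 3 == 0)) for i in range(10_000)]
print(prequential_accuracy(stream, MajorityClass(), window=1000))
```

Plotting one accuracy value per window yields interval-based learning curves such as the ones discussed in the Meta-Learning section below.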
Meta-Learning

It is common for learning curves on data streams to cross multiple times, as illustrated by the plot below, describing the Electricity data stream.

[Figure: accuracy per interval on the Electricity data stream for Hoeffding Tree, Naive Bayes, SPegasos and k-NN; the four curves cross repeatedly over roughly 40 intervals.]

Discovery 2

Leveraging Bagging is a state-of-the-art meta-learning technique that has repeatedly been shown to outperform other algorithms. It is particularly successful when applied to Hoeffding Trees.

[Figure: accuracy(k-NN) - accuracy(LB-kNN), per data stream, over a large collection of streams (BNG generators, Random RBF, SEA, LED, Hyperplane and several real-world streams).]

This plot shows that applying Leveraging Bagging to k-NN actually decreases its performance! Breiman [4] reported similar behaviour in the classical (batch) setting.

Idea: what if we could determine, dynamically at any point in the stream, which algorithm to use?

Discovery 3

Naive Bayes works well when the meta-feature measuring the number of changes detected in the stream is high. Naive Bayes generally needs relatively few observations to achieve good accuracy, compared to more sophisticated algorithms such as Hoeffding Trees. Assuming that a high number of changes detected by this landmarker indicates that the concept of the stream is indeed changing quickly, this could explain why a classifier like Naive Bayes outperforms more sophisticated learning algorithms, which need more observations of the same concept to perform well.

Algorithm Selection

• Run a set of algorithms on a set of data streams, in a full factorial design (every algorithm on every stream).
• All algorithms as implemented in the MOA framework [1].
• We used real-world data streams, data generators and semi-generated data.
• Split each data stream into windows of n instances.
• Calculate a set of meta-features for each interval (SSIL); a sketch of this step is given below.
• We introduced novel stream-based meta-features.
• Store the evaluation scores of each algorithm on each interval; these serve as targets for the meta-classifier, sketched second below.
• All results are available on http://www.openml.org/ [3].
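The meta-features are computed per window. The sketch below (Python, illustrative only) computes a handful of simple ones; the changes_detected landmarker here is a naive mean-shift heuristic that merely stands in for a proper change detector, and its threshold of two standard deviations is an arbitrary assumption.

```python
import math
from typing import Dict, List, Sequence

def class_entropy(labels: Sequence) -> float:
    """Shannon entropy of the class distribution within one window."""
    n = len(labels)
    counts = Counter = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values() if k > 0)

def changes_detected(values: Sequence[float], threshold: float = 2.0) -> int:
    """Naive change counter for one numeric attribute: flags a change whenever
    a value deviates from the running mean by more than `threshold` standard
    deviations, then restarts. A toy stand-in for a real drift detector."""
    changes, seen = 0, []
    for v in values:
        if len(seen) >= 30:
            mean = sum(seen) / len(seen)
            sd = (sum((w - mean) ** 2 for w in seen) / len(seen)) ** 0.5 or 1.0
            if abs(v - mean) > threshold * sd:
                changes += 1
                seen = []
        seen.append(v)
    return changes

def window_meta_features(X: List[Sequence[float]], y: Sequence) -> Dict[str, float]:
    """A small, illustrative subset of meta-features for one window of instances."""
    return {
        "n_instances": len(X),
        "n_attributes": len(X[0]),
        "class_entropy": class_entropy(list(y)),
        "changes_detected": sum(changes_detected([row[j] for row in X])
                                for j in range(len(X[0]))),
    }
```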
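Once the per-window meta-features and the per-window scores of every base classifier are stored, algorithm selection reduces to ordinary supervised learning: the label of a window is the classifier that performed best on it. The sketch below assumes such a meta-dataset has already been assembled and uses a scikit-learn random forest purely for illustration; the actual study used the MOA implementations and published all results on OpenML [3].

```python
from sklearn.ensemble import RandomForestClassifier

# One row per (stream, window): the window's meta-features and the measured
# accuracy of every base classifier on that window (toy numbers only).
meta_rows = [
    {"features": [1000, 9, 0.97, 14],
     "scores": {"NaiveBayes": 0.81, "kNN": 0.78, "HoeffdingTree": 0.85}},
    {"features": [1000, 9, 0.40, 2],
     "scores": {"NaiveBayes": 0.70, "kNN": 0.83, "HoeffdingTree": 0.88}},
    # ... in practice: thousands of windows over many data streams
]

X = [row["features"] for row in meta_rows]
y = [max(row["scores"], key=row["scores"].get) for row in meta_rows]  # best classifier per window

meta_model = RandomForestClassifier(n_estimators=100, random_state=0)
meta_model.fit(X, y)

# At runtime: compute the meta-features of the current window and ask the
# meta-model which base classifier to use.
print(meta_model.predict([[1000, 9, 0.92, 11]]))
```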
References

[1] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research 11, pages 1601–1604, 2010.
[2] J. N. van Rijn, G. Holmes, B. Pfahringer, and J. Vanschoren. Algorithm Selection on Data Streams. In Discovery Science, 17th International Conference, pages 325–336, 2014.
[3] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked Science in Machine Learning. ACM SIGKDD Explorations Newsletter 15, pages 49–60, 2014.
[4] L. Breiman. Bagging Predictors. Machine Learning 24(2), pages 123–140, 1996.