Testing Neural Networks Using the Complete Round Robin Method
Boris Kovalerchuk, Clayton Todd, Dan Henderson
Dept. of Computer Science, Central Washington University, Ellensburg, WA 98926-7520

Problem
Common practice: patterns discovered by learning methods such as neural networks are tested on the data before they are used for their intended purpose.
Reliability of testing: the conclusion about the reliability of discovered patterns depends on the testing method used.
Problem: a testing method may not test critical situations.

Result
A new testing method that:
tests a much wider set of situations than the traditional round robin method;
speeds up the required computations.

Novelty and Implementation
Novelty: a mathematical mechanism based on the theory of monotone Boolean functions, plus multithreaded parallel processing.
Implementation: the method has been implemented for backpropagation neural networks and successfully tested on 1024 neural networks using SP500 data.

Approach and Method
Common approach:
1. Select subsets of the data set D: a subset Tr for training and a subset Tv for validating the discovered patterns.
2. Repeat step 1 several times for different subsets.
3. Compare: if the results are similar to each other, then a discovered regularity can be called reliable for data D.

Known Methods [Dietterich, 1997]
Random selection of subsets: bootstrap aggregation (bagging).
Selection of disjoint subsets: cross-validated committees.
Selection of subsets according to a probability distribution: boosting.

Problems of Sub-sampling
Different regularities arise for different sub-samples of Tr.
Rejecting and accepting regularities depends heavily on the specific splitting of Tr.
Example: a non-stationary financial time series with bear and bull market trends in different time intervals.

Splitting-Sensitive Regularities for Non-stationary Data
[Diagram: the training sample Tr contains a part A with regularity 1 (bear market) and a part B with regularity 2 (bull market); C is independent data. Regularity 1 does not work on A' = B, and regularity 2 does not work on B' = A.]

Round Robin Method
Eliminates arbitrary splitting by examining several groups of subsets of Tr.
The complete round robin method examines all groups of subsets of Tr.
Drawback: there are 2^n possible subsets, where n is the number of groups of objects in the data set. Learning 2^n neural networks is a computational challenge.

Implementation
The complete round robin method is applicable to both attribute-based and relational data mining methods.
The method is illustrated here for neural networks.

Data
Let M be a learning method and D a data set of N objects, each represented by m attributes:
D = {d_i}, i = 1,...,N, d_i = (d_i1, d_i2,..., d_im).
Method M is applied to data D for knowledge discovery.

Grouping Data
Example: a stock time series. The first 250 data objects (days) belong to 1980, the next 250 objects (days) belong to 1981, and so on.
Similarly, half years, quarters and other time intervals can be used.
Any of these subsets can be used as training data.

Testing Data
Example 1 (general data): training on the odd groups #1, #3, #5, #7, #9; testing on #2, #4, #6, #8, #10.
Example 2 (time series): training on #1, #2, #3, #4, #5; testing on #6, #7, #8, #9, #10.
A third option is to use completely independent later data C for testing.
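To make the grouping concrete, here is a minimal sketch, in C++ (matching the authors' Borland C++ Builder implementation language), of how a complete round robin run can enumerate all subsets of the year groups and assemble the corresponding training data. The type and function names (Object, Group, build_training_set, complete_round_robin) are illustrative assumptions, not the authors' code.

```cpp
// Sketch: enumerate all non-empty subsets of n year groups and build
// the training set for each subset, assuming group[k] already holds the
// objects of year 1980 + k.
#include <cstddef>
#include <vector>

using Object = std::vector<double>;   // one data object d_i = (d_i1,...,d_im)
using Group  = std::vector<Object>;   // all objects of one year group

// Build the training set for one binary vector v: bit k set => include group k.
std::vector<Object> build_training_set(const std::vector<Group>& groups, unsigned v) {
    std::vector<Object> training;
    for (std::size_t k = 0; k < groups.size(); ++k)
        if (v & (1u << k))
            training.insert(training.end(), groups[k].begin(), groups[k].end());
    return training;
}

// Complete round robin: walk through all 2^n - 1 non-empty subsets.
void complete_round_robin(const std::vector<Group>& groups) {
    const unsigned n = static_cast<unsigned>(groups.size());   // n = 10 here
    for (unsigned v = 1; v < (1u << n); ++v) {
        std::vector<Object> training = build_training_set(groups, v);
        // ... train a backpropagation network on `training`,
        //     then evaluate it on the independent 1990-92 data ...
    }
}
```

With n = 10 year groups this loop visits all 2^10 - 1 = 1023 non-empty subsets, which is exactly the computational burden that the monotonicity mechanism described below is used to reduce.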
Hypothesis of Monotonicity
D1 and D2 are training data sets, and Perform1 and Perform2 are the performance indicators of the models learned from D1 and D2.
Perform is a binary index: 1 stands for appropriate performance and 0 for inappropriate performance.
The hypothesis of monotonicity (HM):
If D1 ⊇ D2 then Perform1 ≥ Perform2.   (1)

Assumption
If data set D1 covers data set D2, then the performance of method M on D1 should be better than or equal to the performance of M on D2.
That is, extra data bring more useful information than noise for knowledge discovery.

Experimental Testing of the Hypothesis
The hypothesis P(M, Di) ≥ P(M, Dj) for performance is not always true in general.
We generated neural networks for 1024 subsets of the ten years and found 683 combinations such that Di ⊇ Dj.
Surprisingly, in this experiment monotonicity was observed for all 683 combinations of years.

Discussion
The experiment gives strong evidence for using monotonicity along with the complete round robin method.
The incomplete round robin method assumes some kind of independence of the training subsets used.
The discovered monotonicity shows that such independence should at least be tested.

Formal Notation
The error Er for data set D = {d_i}, i = 1,...,N, is the normalized error over all its components d_i:
Er = ( Σ_{i=1}^{N} (T(d_i) − J(d_i))^2 ) / ( Σ_{i=1}^{N} T(d_i) )
Here T(d_i) is the actual target value for d_i, and J(d_i) is the target value forecast delivered by the discovered model J, i.e., the trained neural network in our case.

Performance
Performance is measured against an error tolerance (threshold) Q0 for the error Er:
Perform = 1 if Er ≤ Q0, and Perform = 0 if Er > Q0.
(A small code sketch of Er and Perform appears after Table 1 below.)

Binary Hypothesis of Monotonicity
Combinations of years are coded as binary vectors v_i = (v_i1, v_i2,..., v_i10), such as 0000011111, with 10 components, from (0000000000) to (1111111111): in total 2^n = 1024 data subsets.
Hypothesis of monotonicity:
If v_i ≥ v_j then Perform_i ≥ Perform_j.   (2)
Here v_i ≥ v_j means v_ik ≥ v_jk for all k = 1,...,10.

Monotone Boolean Functions
Not every v_i and v_j are comparable with each other by the "≥" relation.
Perform is used as a quality indicator Q:
Q(M, D, Q0) = 1 ⟺ Perform = 1, where Q0 is some performance limit.   (3)
Monotonicity (2) can be rewritten as:
If v_i ≥ v_j then Q(M, D_i, Q0) ≥ Q(M, D_j, Q0).   (4)
Q(M, D, Q0) is a monotone Boolean function of D [Hansel, 1966; Kovalerchuk et al., 1996].

Use of Monotonicity
Given a method M and data sets D1 and D2 with D2 ⊆ D1: under monotonicity, if method M does not perform well on the data D1, then it will not perform well on the data D2 either.
Under this assumption, we do not need to test method M on D2.

Experiment
For SP500, this meant running method M only 250 times instead of the complete 1024 times (see the tables that follow).

Experiments with SP500 and Neural Networks
A backpropagation neural network predicting SP500 [Rao, Rao, 1993] achieved a 0.6% prediction error on the SP500 test data (50 weeks) with 200 weeks (about four years) of training data.
Is this result reliable? Will it be sustained for wider training and testing data?

Data for Testing Reliability
The same data source: training data are all trading weeks from 1980 to 1989, and independent testing data are all trading weeks from 1990 to 1992.

Complete Round Robin
We generated all 1024 subsets of the ten training years (1980-1989) and computed the corresponding 1024 backpropagation neural networks and their Perform values.
A 3% error tolerance threshold was used (Q0 = 0.03), higher than the 0.6% used for the smaller data set in [Rao, Rao, 1993].

Performance
Table 1. Performance of 1024 neural networks
Training   Testing   Number of neural networks   % of neural networks
0          0         289                         28.25
0          1         24                          2.35
1          0         24                          2.35
1          1         686                         67.05
A satisfactory performance is coded as 1 and a non-satisfactory performance as 0.
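For concreteness, here is a small sketch of the Er and Perform computations defined above (the quantities used to label the networks in Table 1), assuming Er is the sum of squared forecast errors normalized by the sum of targets. The function names normalized_error and perform are illustrative, not the authors' code.

```cpp
// Sketch: compute Er and the binary Perform index for one trained network,
// given its targets T(d_i) and forecasts J(d_i) on the test objects.
#include <cstddef>
#include <vector>

// Er = sum_i (T(d_i) - J(d_i))^2 / sum_i T(d_i)
double normalized_error(const std::vector<double>& T, const std::vector<double>& J) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < T.size(); ++i) {
        const double diff = T[i] - J[i];
        num += diff * diff;   // squared forecast error for object d_i
        den += T[i];          // normalization by the actual targets
    }
    return num / den;
}

// Perform = 1 if Er <= Q0, else 0, for an error tolerance Q0 (e.g. 0.03).
int perform(const std::vector<double>& T, const std::vector<double>& J, double Q0) {
    return normalized_error(T, J) <= Q0 ? 1 : 0;
}
```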
Analysis of Performance
Consistent performance (training and testing): 67.05% + 28.25% = 95.3% of the networks with the 3% error tolerance.
Sound performance on training data => sound performance on testing data (67.05%).
Unsound performance on training data => unsound performance on testing data (28.25%).
A random choice of training data from the ten-year SP500 data will not produce a regularity in 32.95% of cases, although regularities useful for forecasting do exist.

Specific Analysis
Table 2 shows 9 nested data subsets out of the possible 1024 subsets.
The nested sequence begins with a single year (1986). This single year's data set is too small to train a neural network to a 3% error tolerance.

Table 2. Backpropagation neural network performance for different data subsets (training years 1980-1989, testing on 1990-1992)
Binary code   Training years             Training   Testing
1111111011    1980-1986, 1988, 1989      1          1
0111111011    1981-1986, 1988, 1989      1          1
0011111011    1982-1986, 1988, 1989      1          1
0001111011    1983-1986, 1988, 1989      1          1
0000111011    1984-1986, 1988, 1989      1          1
0000011011    1985, 1986, 1988, 1989     1          1
0000001011    1986, 1988, 1989           0          0
0000001001    1986, 1989                 0          0
0000001000    1986                       0          0

Nested Sets
A single year (1986) is too small to train a neural network to a 3% error tolerance.
The years 1986, 1988 and 1989 produce the same negative result.
With 1986, 1988, 1989 and 1985, the error moves below the 3% threshold.
Five or more years of data also satisfy the error criterion.
The monotonicity hypothesis is confirmed.

Analysis
Among all other 1023 combinations of years:
only a few combinations of four years satisfy the 3% error tolerance;
practically all five-year combinations satisfy the 3% threshold;
all combinations of more than five years satisfy this threshold.

Analysis
Four-year training data sets produce marginally reliable forecasts.
Example: 1980, 1981, 1986 and 1987, corresponding to the binary vector (1100001100), do not satisfy the 3% error tolerance.

Further Analyses
Three levels of error tolerance Q0 were examined: 3.5%, 3.0% and 2.0%.
The number of neural networks with sound performance goes down from 81.81% to 11.24% when the error tolerance is tightened from 3.5% to 2%.
Networks evaluated at a 3.5% error tolerance are therefore much more often reliable than networks evaluated at a 2.0% error tolerance.

Three Levels of Error Tolerance
Table 3. Overall performance of 1024 neural networks with different error tolerances
Error tolerance   Training   Testing   Number of neural networks   % of neural networks
3.5%              0          0         167                         16.32
                  0          1         3                           0.293
                  1          0         6                           0.586
                  1          1         837                         81.81
3.0%              0          0         289                         28.25
                  0          1         24                          2.35
                  1          0         24                          2.35
                  1          1         686                         67.05
2.0%              0          0         845                         82.60
                  0          1         22                          2.150
                  1          0         31                          3.030
                  1          1         115                         11.24

[Chart: percentage of neural networks for each performance pair <training, testing> (<0,0>, <0,1>, <1,0>, <1,1>) at error tolerances 3.5%, 3.0% and 2.0%.]

Standard Random Choice Method
A random choice of training data at a 2% error tolerance will more often reject the training data as insufficient.
This standard approach does not even let us know how unreliable the result is without running all 1024 subsets of the complete round robin method.

Reliability of the 0.6% Error Tolerance
The 0.6% error reported by Rao and Rao [1993] for 200 weeks (about four years) taken from the 10-year training data is unreliable (see Table 3).

Underlying Mechanism
The number of computations depends on the sequence in which the data subsets Di are tested.
To optimize the testing sequence, Hansel's lemma [Hansel, 1966; Kovalerchuk et al., 1996] from the theory of monotone Boolean functions is applied to so-called Hansel chains of binary vectors.
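As an illustration of what these chains look like, here is a minimal sketch of the standard "grow and cut" construction of Hansel chains over the n-cube of binary vectors. This is a generic construction from the theory of monotone Boolean functions, not the authors' Borland C++ code, and the names Vec, Chain and hansel_chains are illustrative assumptions.

```cpp
// Sketch: decompose {0,1}^n into Hansel chains (a symmetric chain
// decomposition).  Vectors are stored as bit masks; each chain is listed
// in increasing order with respect to the componentwise "<=" relation.
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec   = std::uint32_t;        // one binary vector v = (v1,...,vn) as a bit mask
using Chain = std::vector<Vec>;     // a chain v(1) <= v(2) <= ...

std::vector<Chain> hansel_chains(unsigned n) {
    std::vector<Chain> chains = { {0u, 1u} };       // chains for dimension 1: (0) < (1)
    for (unsigned k = 1; k < n; ++k) {              // grow from dimension k to k + 1
        std::vector<Chain> next;
        const Vec new_bit = Vec(1) << k;
        for (const Chain& c : chains) {
            Chain grown = c;                        // c1..cm with the new coordinate = 0
            grown.push_back(c.back() | new_bit);    // then max(c) with the new coordinate = 1
            next.push_back(grown);
            Chain cut;                              // c1..c(m-1) with the new coordinate = 1
            for (std::size_t i = 0; i + 1 < c.size(); ++i)
                cut.push_back(c[i] | new_bit);
            if (!cut.empty()) next.push_back(cut);
        }
        chains.swap(next);
    }
    return chains;   // every vector of {0,1}^n appears in exactly one chain
}
```

Evaluating the data subsets chain by chain and propagating known Perform values by monotonicity (a 0 on a vector forces 0 on everything below it, a 1 forces 1 on everything above it) is what allows the complete round robin to avoid training a large fraction of the 2^n networks.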
Computational Challenge of the Complete Round Robin Method
Monotonicity and multithreading significantly speed up the computation.
Using monotonicity with 1023 threads decreased the average runtime by about 3.5 times, from 15-20 minutes to 4-6 minutes, for training 1023 neural networks in the case of mixed 1s and 0s in the output.

Error Tolerance and Computing Time
Different error tolerance values can change the output and the runtime; the extreme cases are all 1s or all 0s as outputs.
File preparation is needed to train the 1023 neural networks.
The largest share of time spent in file preparation (41.5%) occurs for files using five years in the data subset (Table 5).

Runtime
Table 4. Runtime for different error tolerance settings (average time to train 1023 neural networks)
Method                                                    1 processor, no threads   1 processor, 1023 threads
Round robin with monotonicity, mixed 1s and 0s as output  15-20 min.                4-6 min.
Round robin with monotonicity, all 1s as output           10 min.                   3.5 min.

Time Distribution
Table 5. Time for backpropagation and file preparation
Set of years   % of time in file preparation   % of time in backpropagation
0000011111     41.5%                           58.5%
0000000001     36.4%                           63.6%
1111111111     17.8%                           82.2%

General Logic of the Software
Exhaustive option: generate the 1024 subsets using a file preparation program and compute backpropagation for all subsets.
Optimized option: generate Hansel chains, store them, and compute backpropagation only for the specific subsets dictated by the chains.

Implemented Optimized Option
1. For a given binary vector, the corresponding training data are produced.
2. Backpropagation is computed, generating the Perform value.
3. The Perform value is used along with the stored Hansel chains to decide which binary vector (i.e., subset of data) will be used next for learning neural networks.

System Implementation
The exhaustive set of Hansel chains is generated first and stored for the desired case.
The 1024 subsets of data are created using a file preparation program.
Based on the sequence dictated by the Hansel chains, we produce the corresponding training data and compute backpropagation to generate the Perform value.
The next vector to be used is chosen from the stored Hansel chains and the Perform value.

User Interface and Implementation
Built to take advantage of Windows NT threading.
Built with Borland C++ Builder 4.0.
Parameters for the neural networks and sub-systems are set up through the interface.
Various options, such as monotonicity, can be turned on or off through the interface.

Multithreaded Implementation
Independent sub-processes: the learning of an individual neural network or of a group of neural networks.
Each sub-process is implemented as an individual thread.
Threads run in parallel on several processors to further speed up the computations.

Threads
Two different types of threads: worker threads and communication threads.
Communication threads send and receive the data.
Worker threads are started by the servers to perform the calculations on the data.
(A simplified sketch of the worker-thread pattern follows the Client description below.)

Client/Server Implementation
Decreases the workload per computer.
The client does the data formatting; the server performs all the data manipulation.
A client connects to multiple servers.
Each server performs work on the data and returns the results to the client.
The client formats the returned results and gives a visualization of them.

Client
Loads the vectors into a linked list.
Attaches a thread to each node.
Connects each thread to an active server.
Threads send the server an activation packet.
Waits for the server to return results.
Formats the results for output.
Sends more work to the server.
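To illustrate the threading idea in portable terms, here is a simplified sketch in which worker threads pull binary vectors (data subsets) from a shared counter, evaluate them and record their Perform values. The original system used Windows NT threads from Borland C++ Builder and one thread per subset distributed over clients and servers; the std::thread pool below and the names train_and_score and run_workers are illustrative assumptions, not the authors' code.

```cpp
// Sketch: evaluate subsets 1 .. n_subsets-1 in parallel worker threads.
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for "build the training file for subset v, run
// backpropagation and compare the error with the tolerance Q0".
static int train_and_score(unsigned v) {
    return v != 0;   // placeholder so the sketch compiles
}

void run_workers(unsigned n_subsets, unsigned n_threads, std::vector<int>& results) {
    results.assign(n_subsets, -1);        // -1 = subset not evaluated yet
    std::atomic<unsigned> next{1};        // next subset to hand out to a worker
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&]() {
            for (unsigned v = next++; v < n_subsets; v = next++)
                results[v] = train_and_score(v);   // Perform value for subset v
        });
    }
    for (std::thread& th : pool) th.join();
}
```

The sub-processes are independent, which is why the same pattern can be spread over several processors or several server machines, as described above. Monotonicity-based pruning is not shown here; it would skip subsets whose Perform value is already implied by earlier results.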
Server
Obtains a message from the client indicating that the client is connected.
Starts work on the data given to it by the client.
Returns the finished packet to the client.
Loops back to wait for the next message.
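The following is a minimal, portable sketch of that receive-work-reply loop. The original server ran on Windows NT and was built with Borland C++ Builder, so the POSIX sockets, the port number, the packet layout (a 32-bit subset mask in, a 32-bit Perform value out) and the train_subset function are all illustrative assumptions, not the authors' protocol.

```cpp
// Sketch: accept a client, then repeatedly receive a subset mask,
// compute its Perform value and send the finished packet back.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical worker: train a backpropagation network on the data subset
// encoded by `mask` and return its Perform value (1 = within tolerance).
static std::uint32_t train_subset(std::uint32_t mask) { return mask != 0; }

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                          // assumed port
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 8);

    for (;;) {                                            // one client after another
        int client = accept(listener, nullptr, nullptr);  // client announces it is connected
        std::uint32_t mask = 0;
        // receive work, compute, return the finished packet, then loop
        while (recv(client, &mask, sizeof(mask), 0) == static_cast<ssize_t>(sizeof(mask))) {
            std::uint32_t perform = htonl(train_subset(ntohl(mask)));
            send(client, &perform, sizeof(perform), 0);
        }
        close(client);
    }
}
```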
System Diagram
[System architecture diagram not reproduced in the text version.]