
Testing neural networks using the complete round robin method
Boris Kovalerchuk, Clayton Todd,
Dan Henderson
Dept. of Computer Science, Central
Washington University, Ellensburg, WA,
98926-7520
[email protected] [email protected]
[email protected]
1
Problem
 Common practice
 Patterns discovered by learning methods such as
NNs are tested on the data before they are used for
their intended purpose.
 Reliability of testing
 The conclusion about the reliability of discovered
patterns depends on the testing methods used.
 Problem
 A testing method may not test critical situations.
2
Result
New testing method
 Tests a much wider set of situations than
the traditional round robin method
 Speeds up required computations
3
Novelty and
Implementation
 Novelty
 The mathematical mechanism based on the theory
of monotone Boolean functions and
 Multithreaded parallel processing.
 Implementation
 The method has been implemented for
backpropagation neural networks and successfully
tested on 1024 neural networks using SP500
data.
4
1. Approach and Method
 Common approach
 1. Select subsets in the data set D: subset Tr for
training and subset Tv for validating the discovered
patterns.
 2. Repeat step 1 several times for different subsets.
 3. Compare
 if the results are similar to each other, then a discovered
regularity can be called reliable for data D.
5
Known Methods
[Dietterich, 1997]
 Random selection of subsets
 bootstrap aggregation (bagging)
 Selection of disjoint subsets
 cross-validated committees
 Selection of subsets according to a
probability distribution
 boosting
6
Problems of sub-sampling
 Different regularities
 for different sub-samples of Tr
 Rejecting and accepting regularities
 heavily depends on a specific splitting of Tr
 Example:
 non-stationary financial time series
 bear and bull market trends for different time
intervals.
7
Splitting-sensitive regularities
for non-stationary data
[Figure: the training sample Tr is split into two parts, A and B, with A' = B and B' = A. Regularity 1 (bear market), found on A, does not work on A' = B; regularity 2 (bull market), found on B, does not work on B' = A. Independent data C follows Tr.]
8
Round robin method
 Eliminates arbitrary splitting by examining
several groups of subsets of Tr.
 The complete round robin method examines
all groups of subsets of Tr.
 Drawback
 2^n possible subsets, where n is the number of groups
of objects in the data set.
 Learning 2^n neural networks is a computational
challenge.
9
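For illustration, the enumeration at the heart of the complete round robin can be sketched in a few lines of C++ (a minimal sketch; the bitmask encoding of year groups is our illustration, not the original code):

```cpp
#include <cstdio>

// Minimal sketch: enumerate all 2^n subsets of n groups.
// Bit k of `mask` set to 1 means group k is included in the training set.
int main() {
    const int n = 10;                        // e.g., ten one-year groups
    for (unsigned mask = 0; mask < (1u << n); ++mask) {
        // In the full method, a neural network would be trained here
        // on the subset encoded by `mask`.
        for (int k = n - 1; k >= 0; --k)
            std::printf("%c", (mask >> k) & 1 ? '1' : '0');
        std::printf("\n");
    }
    return 0;
}
```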
Implementation
 The complete round robin method is applicable
to both
 attribute-based and
 relational data mining methods.
 The method is illustrated for neural
networks.
10
Data
 Let M be a learning method and
 D be a data set of N objects, represented
by m attributes.
 D = {di}, i = 1,…,N, where di = (di1, di2,…, dim).
 Method M is applied to data D for
knowledge discovery.
11
Grouping data
 Example -- a stock time series.
 the first 250 data objects (days) belong to 1980,
 the next 250 objects (days) belong to 1981, and so
on.
 Similarly, half years, quarters and other time
intervals can be used.
 Any of these subsets can be used as training
data.
12
Testing data
 Example 1 (general data)
 Training -- the odd groups #1, #3, #5, #7 and #9
Testing -- #2, #4, #6, #8 and #10
 Example 2 (time series)
 Training -- #1, #2, #3, #4 and #5
 Testing -- #6, #7, #8, #9 and #10
 A third path is also used
 completely independent later data C for testing.
13
Hypothesis of monotonicity
 D1 and D2 are training data sets and
 Perform1 and Perform2 are
 the performance indicators of the models learned from
D1 and D2.
 Binary Perform index,
 1 stands for appropriate performance and
 0 stands for inappropriate performance.
 The hypothesis of monotonicity (HM)
 If D1 ⊇ D2 then Perform1 ≥ Perform2.
(1)
14
Assumption
 If data set D1 covers data set D2, then
performance of method M on D1 should
be better or equal to performance of M on
D2.
 Extra data bring more useful information
than noise for knowledge discovery.
15
Experimental testing of the
hypothesis
 The hypothesis P(M,Di) ≥ P(M,Dj) for
performance is not always true.
 1024 subsets of the ten years were generated,
with a NN trained on each.
 Found 683 pairs of subsets such that Di ⊇ Dj.
 Surprisingly, in this experiment
monotonicity was observed for all of the 683
combinations of years.
16
Discussion
 Experiment
 strong evidence for use of monotonicity along
with the complete round robin method.
 The incomplete round robin method assumes
some kind of independence of the training
subsets used.
 The discovered monotonicity shows that such
independence should at least be tested.
17
Formal notation
 The error Er for data set D={di}, i=1,…,N,
 is the normalized error of all its components di:
Er = [ Σ_{i=1}^{N} (T(di) - J(di))² ] / [ Σ_{i=1}^{N} T(di) ]

 T(di) is the actual target value for di and
 J(di) is the target value forecast delivered by
the discovered model J, i.e., the trained neural
network in our case.
18
Performance
 Performance is measured by the error
tolerance (threshold) Q0 of error Er:
Perform = { 1, if Er ≤ Q0
          { 0, if Er > Q0
19
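The two definitions above transcribe directly into code (a sketch; the function and variable names are ours):

```cpp
#include <cstddef>
#include <vector>

// Normalized error Er from slide 18:
//   Er = sum_i (T(d_i) - J(d_i))^2 / sum_i T(d_i)
double errorEr(const std::vector<double>& T,    // actual targets T(d_i)
               const std::vector<double>& J) {  // model forecasts J(d_i)
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < T.size(); ++i) {
        const double diff = T[i] - J[i];
        num += diff * diff;                     // squared forecast error
        den += T[i];                            // normalizing sum of targets
    }
    return num / den;
}

// Binary performance indicator with error tolerance Q0:
//   Perform = 1 if Er <= Q0, else 0.
int perform(double er, double q0) { return er <= q0 ? 1 : 0; }
```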
Binary hypothesis of
monotonicity
 Combinations of years are coded
 as binary vectors vi = (vi1, vi2,…, vi10), like 0000011111,
with 10 components, from (0000000000) to
(1111111111)
 in total 2^n = 1024 data subsets.
 Hypothesis of monotonicity
 If vi ≥ vj then Performi ≥ Performj
(2)
 Here
 vi ≥ vj ⇔ vik ≥ vjk for all k = 1,...,10.
20
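With data subsets coded as 10-bit vectors, the componentwise ≥ test in (2) reduces to a one-line subset test on bitmasks (a sketch; the encoding is ours):

```cpp
// vi >= vj componentwise iff every 1-component of vj is also 1 in vi,
// i.e., the set of years coded by vj is a subset of the set coded by vi.
bool geq(unsigned vi, unsigned vj) { return (vi & vj) == vj; }
```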
Monotone Boolean Functions
 Not every vi and vj are comparable with each
other by the "≥" relation.
 Perform as a quality indicator Q:
 Q(M, D, Q0) = 1 ⇔ Perform = 1,
 where Q0 is some performance limit.
(3)
 Rewritten monotonicity (2)
 If vi ≥ vj then Q(M, Di, Q0) ≥ Q(M, Dj, Q0)
(4)
 Q(M, D, Q0) is a monotone Boolean function of
D. [Hansel, 1966; Kovalerchuk et al, 1996]
21
Use of Monotonicity
 A method M,
 Data sets D1 and D2, with D2 ⊆ D1.
 Monotonicity
 if the method M does not perform well on
the data D1, then it will not perform well
on the data D2 either.
 under this assumption, we do not need to
test method M on D2.
22
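A sketch of how this assumption saves training runs (our illustration; trainAndTest() is a toy stub standing in for actual backpropagation training):

```cpp
#include <map>

// Toy stub for the expensive step: train a NN on the subset coded by
// `mask` and return its binary Perform value (demo-only monotone rule).
int trainAndTest(unsigned mask) {
    int bits = 0;
    for (unsigned m = mask; m; m >>= 1) bits += m & 1;
    return bits >= 5 ? 1 : 0;                     // "enough years present"
}

bool geq(unsigned vi, unsigned vj) { return (vi & vj) == vj; }

std::map<unsigned, int> known;                    // already-decided subsets

int performOf(unsigned mask) {
    // 1. Try to infer the answer from monotonicity before training:
    for (const auto& kv : known) {
        if (kv.second == 1 && geq(mask, kv.first)) return 1; // superset of a success
        if (kv.second == 0 && geq(kv.first, mask)) return 0; // subset of a failure
    }
    // 2. Otherwise pay for a real training run and remember the result.
    return known[mask] = trainAndTest(mask);
}

int main() {
    performOf(0x01F);   // real run: five years present
    performOf(0x3FF);   // inferred without training: superset of a success
    return 0;
}
```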
Experiment
 SP500
 running method M 250 times instead of the
complete 1024 times.
 Results are shown in the tables below.
23
Experiments with SP500
and Neural Networks
 The backpropagation neural network for
predicting SP500 [Rao, Rao, 1993]
 0.6% prediction error on the SP500 test data
(50 weeks) with 200 weeks (about four years)
of training data.
 Is this result reliable? Will it be sustained for
wider training and testing data?
24
Data for testing reliability
 The same data:
 Training data -- all trading weeks from 1980
to 1989 and
 Independent testing data -- all trading weeks
from 1990-1992.
25
Complete round robin
 Generated
 all 1024 subsets of the ten training years (1980-1989)
 Computed
 the 1024 corresponding backpropagation neural
networks and
 their Perform values.
 3% error tolerance threshold (Q0 = 0.03)
 (higher than the 0.6% used for the smaller data set in [Rao, Rao, 1993]).
26
Performance
Table 1. Performance of 1024 neural networks

Performance          Number of        % of
Training  Testing    neural networks  neural networks
0         0          289              28.25
0         1           24               2.35
1         0           24               2.35
1         1          686              67.05
A satisfactory performance is coded as 1 and
non-satisfactory performance is coded as 0.
27
Analysis of performance
 Consistent performance (training and testing)
 67.05% + 28.25% = 95.3% with 3% error
tolerance
 sound Performance on training data => sound
Performance on testing data (67.05%)
 unsound Performance on training data =>
unsound Performance on testing data (28.25%)
 A random choice of data for training from the ten-year
SP500 data will not produce a regularity in 32.95%
of cases, although regularities useful for
forecasting do exist.
28
Specific Analysis
 Table 2
 9 nested data subsets of the possible
1024 subsets
 Begin the nested sequence with a single
year (1986).
 This single year’s data is too small to train
a neural network to a 3% error tolerance.
29
Specific analysis
9 nested data subsets of the possible 1024 subsets

Table 2. Backpropagation neural network performance for different data subsets
("x" marks a year included in training; testing on independent 1990-92 data)

Binary code   Training years                  Performance
              80 81 82 83 84 85 86 87 88 89   Training  Testing (90-92)
1111111011    x  x  x  x  x  x  x     x  x    1         1
0111111011       x  x  x  x  x  x     x  x    1         1
0011111011          x  x  x  x  x     x  x    1         1
0001111011             x  x  x  x     x  x    1         1
0000111011                x  x  x     x  x    1         1
0000011011                   x  x     x  x    1         1
0000001011                      x     x  x    0         0
0000001001                      x        x    0         0
0000001000                      x             0         0
30
Nested sets
 A single year (1986)
 is too small to train a neural network to a 3%
error tolerance.
 1986, 1988 and 1989,
 produce the same negative result.
 1986, 1988, 1989 and 1985,
 the error moved below the 3% threshold.
 Five or more years of data
 also satisfy the error criteria.
 The monotonicity hypothesis is confirmed.
31
Analysis
 Among all other 1023 combinations of years
 only a few combinations of four years satisfy
the 3% error tolerance,
 practically all five-year combinations satisfy
the 3% threshold,
 all combinations of more than five years satisfy this
threshold.
32
Analysis
 Four-year training data sets produce
marginally reliable forecasts.
 Example:
 1980, 1981, 1986, and 1987, corresponding to the
binary vector (1100001100), do not satisfy the 3%
error tolerance.
33
Further analyses
 Three levels of error tolerance, Q0: 3.5%, 3.0%
and 2.0%.
 The number of neural networks
 with sound performance goes down from 81.82% to
11.24% as the error tolerance moves from 3.5% to 2%.
 NN with 3.5% error tolerance
 are much more reliable than networks with 2.0 %
error tolerance.
34
Three levels of error tolerance
Table 3. Overall performance of 1024 neural networks with different error tolerance

Error      Performance          Number of        % of
tolerance  Training  Testing    neural networks  neural networks
3.5%       0         0          167              16.32
           0         1            3               0.293
           1         0            6               0.586
           1         1          837              81.81
3.0%       0         0          289              28.25
           0         1           24               2.35
           1         0           24               2.35
           1         1          686              67.05
2.0%       0         0          845              82.60
           0         1           22               2.15
           1         0           31               3.03
           1         1          115              11.24
35
[Chart: % of neural networks for each performance pair <training, testing> (<0,0>, <0,1>, <1,0>, <1,1>) at error tolerances of 3.5%, 3.0%, and 2.0%.]
36
Standard random choice
method
 A random choice of training data
 for 2% error tolerance will more often reject the
training data as insufficient.
 This standard approach does not even
let us know how unreliable the result
is without running all 1024
subsets of the complete round robin method.
37
Reliability of 0.6% error tolerance
 Rao and Rao [1993]
 the 0.6% error for 200 weeks (about four
years) from the 10-year training data
 is unreliable (Table 3)
38
Underlying mechanism
 The number of computations depends on
a sequence of testing data subsets Di.
 To optimize the sequence of testing,
Hansel's lemma [Hansel, 1966;
Kovalerchuk et al, 1996] from the theory
of monotone Boolean functions is applied
to so-called Hansel chains of binary
vectors.
39
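The construction of Hansel chains, a symmetric chain decomposition of the n-cube, can be sketched as follows (coding details are ours, not the original software):

```cpp
#include <cstdio>
#include <vector>

// Sketch of Hansel chain generation: a symmetric chain decomposition of
// the n-dimensional binary cube, built dimension by dimension.
// Each chain is an increasing (componentwise) sequence of bitmask vectors.
using Chain = std::vector<unsigned>;

std::vector<Chain> hanselChains(int n) {
    std::vector<Chain> chains = {{0u, 1u}};       // dimension 1: (0) < (1)
    for (int dim = 1; dim < n; ++dim) {
        const unsigned newBit = 1u << dim;
        std::vector<Chain> next;
        for (const Chain& c : chains) {
            // "Grow": the chain with new bit = 0, extended by its top
            // element with new bit = 1.
            Chain grow = c;
            grow.push_back(c.back() | newBit);
            next.push_back(grow);
            // "Cut": all but the top element, with new bit = 1
            // (empty when the chain has length 1).
            if (c.size() > 1) {
                Chain cut(c.begin(), c.end() - 1);
                for (unsigned& v : cut) v |= newBit;
                next.push_back(cut);
            }
        }
        chains.swap(next);
    }
    return chains;                                // every vector appears exactly once
}

int main() {
    const auto chains = hanselChains(10);         // ten one-year groups
    std::printf("%zu chains cover all 1024 vectors\n", chains.size());  // 252
    return 0;
}
```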
Computational challenge of
Complete round robin method
 Monotonicity and multithreading
significantly speed up computing
 Using monotonicity with 1023 threads
decreased the average runtime about 3.5 times,
from 15-20 minutes to 4-6 minutes,
to train 1023 neural networks in the case
of mixed 1’s and 0’s for output.
40
Error tolerance and
computing time
 Different error tolerance values can
change output and runtime.
 extreme cases with all 1’s or all 0’s as
outputs.
 File preparation to train 1023 Neural
Networks.
 The largest share of file preparation time
(41.5%) is taken by files using five years in
the data subset (Table 5).
41
Runtime
Table 4. Runtime for different error tolerance settings

Method                            Average time for 1023 NNs,   Average time for 1023 NNs,
                                  1 processor, no threads      1 processor, 1023 threads
Round robin with monotonicity,    15-20 min                    4-6 min
mixed 1’s and 0’s as output
Round robin with monotonicity,    10 min                       3.5 min
all 1’s as output
3.5 min.
42
Time distribution
Table 5. Time for backpropagation and file preparation

Set of years  % of time in file preparation  % of time in backpropagation
0000011111    41.5%                          58.5%
0000000001    36.4%                          63.6%
1111111111    17.8%                          82.2%
43
General logic of software
 The exhaustive option
 generate 1024 subsets using a file
preparation program and
 compute backpropagation for all subsets
 Optimized option
 generate Hansel chains, store them
 compute backpropagation only for specific
subsets dictated by the chains.
44
Implemented optimized option
 1. For a given binary vector the
corresponding training data are produced.
 2. Backpropagation is computed
generating the Perform value.
 3. The Perform value is used along with the
stored Hansel chains to decide which
binary vector (i.e., subset of data) will be
used next for learning neural networks.
45
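A sketch of this loop (our illustration; the binary search within each chain is a simplification of the evaluation order given by Hansel's lemma, and trainAndTest() is a toy stub standing in for steps 1-2):

```cpp
#include <map>
#include <vector>

using Chain = std::vector<unsigned>;

// Toy stub for steps 1-2: build training data for the subset coded by `v`,
// run backpropagation, threshold the error into a binary Perform value.
int trainAndTest(unsigned v) {
    int bits = 0;
    for (unsigned m = v; m; m >>= 1) bits += m & 1;
    return bits >= 5 ? 1 : 0;              // demo-only monotone stand-in
}

bool geq(unsigned vi, unsigned vj) { return (vi & vj) == vj; }

std::map<unsigned, int> value;             // Perform values, computed or inferred

// Step 3: decide a vector's Perform value, trying monotone inference from
// everything decided so far before paying for a real training run.
int evaluate(unsigned v) {
    auto it = value.find(v);
    if (it != value.end()) return it->second;
    for (const auto& kv : value) {
        if (kv.second == 1 && geq(v, kv.first)) return value[v] = 1;
        if (kv.second == 0 && geq(kv.first, v)) return value[v] = 0;
    }
    return value[v] = trainAndTest(v);
}

// Within one chain the Perform values form a 0...01...1 pattern, so a binary
// search for the 0 -> 1 boundary settles the whole chain in O(log k) runs.
void evaluateChain(const Chain& c) {
    int lo = 0, hi = static_cast<int>(c.size()) - 1;
    while (lo <= hi) {
        const int mid = (lo + hi) / 2;
        if (evaluate(c[mid]) == 1) hi = mid - 1; else lo = mid + 1;
    }
    for (int i = 0; i < static_cast<int>(c.size()); ++i)
        value[c[i]] = (i >= lo) ? 1 : 0;   // everything above the boundary is 1
}

int main() {
    Chain c = {0x001, 0x003, 0x007, 0x00F, 0x01F, 0x03F};  // an increasing chain
    evaluateChain(c);
    return 0;
}
```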
System Implementation
 The exhaustive set of Hansel chains is
generated first and stored for a desired case.
 The 1024 subsets of data are created using a file
preparation program.
 Based on the sequence dictated by the Hansel
Chains, we produce the corresponding training
data and compute backpropagation to generate
the Perform value. The next vector to be used
is based on the stored Hansel Chains and the
Perform value.
46
User Interface and
Implementation
 Built to take advantage of Windows NT
threading.
 Built with Borland C++ Builder 4.0.
 Parameters for the Neural Networks and
sub-systems are set up through the
interface.
 Various options, such as monotonicity, can
be turned on or off through the interface.
47
Multithreaded Implementation
 Independent Sub-Processes
 Learning of an individual neural network or a
group of neural networks.
 Each sub-process is implemented as an
individual thread.
 Run in parallel on several processors to
further speed up computations.
48
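A minimal sketch of the thread-per-learner idea, using portable std::thread in place of the Windows NT threads of the original implementation (trainNetwork() is a placeholder):

```cpp
#include <thread>
#include <vector>

// Placeholder for one sub-process: train a backpropagation network
// on the data subset coded by `mask`.
void trainNetwork(unsigned mask) { (void)mask; /* ... backpropagation ... */ }

// Each learning sub-process runs as its own thread; on a multiprocessor
// machine the threads execute in parallel.
void trainAll(const std::vector<unsigned>& subsets) {
    std::vector<std::thread> workers;
    workers.reserve(subsets.size());
    for (unsigned mask : subsets)
        workers.emplace_back(trainNetwork, mask);   // one thread per network
    for (std::thread& t : workers)
        t.join();                                   // wait for all learners
}

int main() {
    trainAll({0x3FFu, 0x01Fu, 0x3E0u});             // a few example subsets
    return 0;
}
```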
Threads
 Two different types of threads:
 worker threads and
 communication threads
 Communication threads send and receive
the data.
 Worker threads are started by the servers
to perform the calculations on the data.
49
Client/Server Implementation
 Decreases the workload per computer
 The client does the data formatting; the server
performs all the data manipulation.
 A client connects to multiple servers.
 Each server performs work on the data
and returns the results to the client.
 The client formats the returned results
and gives a visualization of them.
50
Client
 Loads vectors into a linked list.
 Attaches a thread to each node.
 Connects each thread to an active server.
 Threads send the server an activation packet.
 Waits for the server to return results.
 Formats results for output.
 Sends more work to the server.
51
Server
 Obtains a message from a client indicating
that the client is connected.
 Starts work on the data given to it by the client.
 Returns the finished packet to the client.
 Loop.
52
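A compact sketch of this loop with POSIX sockets (the original ran on Windows NT; the port number, packet layout, and computePerform() stub are our placeholders, not the original protocol):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Stub for the server's real work: train/evaluate on the subset coded
// by `mask` and return a Perform value.
int computePerform(unsigned mask) { return mask != 0; }

int main() {
    const int srv = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5500);                      // placeholder port
    bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(srv, 1);
    for (;;) {                                        // "Loop."
        const int client = accept(srv, nullptr, nullptr); // client connected
        unsigned mask = 0;
        // Work on each packet of data given by the client...
        while (recv(client, &mask, sizeof(mask), 0) == sizeof(mask)) {
            int perf = computePerform(mask);
            send(client, &perf, sizeof(perf), 0);     // ...return finished packet
        }
        close(client);                                // then wait for the next one
    }
}
```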
System Diagram
53