IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 5, SEPTEMBER 1978

Towards Automatic Auditing of Records

R. C. T. LEE, MEMBER, IEEE, JAMES R. SLAGLE, AND C. T. MONG

Abstract—We computer scientists face at least two problems in promoting the use of computerized data-base systems: 1) some important data might be missing; 2) there might be errors in the data. Both of these problems can be quite serious. If they cannot be solved, it will be quite hard to convince potential users that computerized information systems are useful. In this paper, we shall show that while it is generally impossible to overcome these difficulties entirely, we have succeeded in developing some techniques to overcome these difficulties partially. Using a part of the data, our algorithm detected an error in the book Weyer's Warships of the World 1969. Each of the approximately 2000 warships listed in the book has 18 variables associated with it. It would be difficult for a person to find errors in the book. It is important to point out here that our method does not require any a priori knowledge about the data. For instance, we detected the error in the book Weyer's Warships of the World 1969 without knowing anything about warships.

Index Terms—Auditing of records, clustering analysis, data-integrity problem, error in data, Hamming distance, missing data, multikey sorting, short spanning path.

Manuscript received April 29, 1977; revised December 2, 1977. This work was supported in part by the National Science Council, the Republic of China, under Grant NSC-65M-0204-03 (04).
R. C. T. Lee is with the Institute of Computer and Decision Sciences, National Tsing Hua University, Hsinchu, Taiwan 300, Republic of China.
J. R. Slagle is with the Computer Science Laboratory, Communications Sciences Division, Naval Research Laboratory, Washington, DC 20375.
C. T. Mong is with the Air Force Cadet School, Taiwan, Republic of China.
I. INTRODUCTION

IN THIS PAPER, we are concerned with the problem of automatic auditing of records in a data-base system. By automatic auditing, we mean the following.
1) If there is an invalid blank (missing data) in a record, try to estimate the value of this piece of missing data. For instance, try to estimate the salary of an employee. This problem will be referred to as the missing-data problem.
2) Detect possible errors that might exist in the records. For instance, if a record shows a very limited endurance for a powerful ship, there might be some error in the record and we would like to detect such an error. This problem will be referred to as the data-integrity problem.

Freund and Hartley [6], Naus et al. [10], as well as Fellegi and Holt [4] have all made contributions to solving this problem. Our method is different from theirs in one important respect: ours does not require any a priori knowledge of the data. For instance, our method can be used to detect possible errors in a book concerned with warships. However, we do not have to know anything about warships.

Skinner [11] was the first to use clustering analysis for inductive inference. His clustering method is based upon a property list data structure, which means a property list has to be constructed for every record. Our clustering method is different from his and can be applied directly to a set of records. We further extended Skinner's work to solve the data-integrity problem.

While we cannot say that we can correctly fill in all of the missing data and detect all possible errors, according to our experimental results, we did fill in many data with reasonable accuracy and, furthermore, we did detect some errors which otherwise would be very hard to detect.

II. THE MISSING-DATA PROBLEM AND MULTIKEY SORTING

Let us discuss some basic principles on how to handle the missing-data problem. Imagine that we have some personnel records. We may assume that there is one person A whose record does not contain any information about his salary. Therefore, our problem is: How are we going to make an educated guess and fill in this missing information?

We shall assume that the majority of records do not have any missing data. If this is the case, we can then do the following.
1) Find all of the people who have a background similar to A. Let us assume that they are B, C, D, and E.
2) Let the salaries of B, C, D, and E be S1, S2, S3, and S4, respectively. Then we may estimate that the salary of A is the average of S1, S2, S3, and S4.

The reader can see that this procedure requires finding all of the records of the people who have a background similar to A. This can be accomplished by searching through the entire file. If our file is a large one, this searching can be quite time-consuming. Our method is to sort this set of records into a sequence R1, R2, ..., RM in such a way that similar records are grouped together. After this is done, we can simply find the record Ri which is most similar to the record R containing the missing data. We then retrieve some records around Ri in the sequence and use these records to estimate the value of the missing data.

Thus our method must consist of the following mechanisms.
1) We must be able to measure the degree of similarity between every pair of records.
2) We must be able to determine the degree of similarity between a record with missing data and a record without missing data.
3) We must be able to sort the records in such a way that similar records are grouped together in the sequence.

In the following section, we shall discuss these problems.

III. DISTANCES AND MODIFIED DISTANCES

Throughout this paper, we shall assume that every record is of the form of a vector: Ri = (r_i1, r_i2, ..., r_iN). In general, we may have three types of records.
1) Type I records: Every key assumes numerical values. A typical record is (1.0, 3.1, -1.5, 11.3).
2) Type II records: Every key is symbolic. A typical record is (a, f, g, h).
3) Type III records: Some keys are symbolic and some are numerical. A typical record is (b, c, 160, 5.3, 3.6).

The reader may now wonder how one can possibly have Type III records. Imagine that a record is related to a person and we have five keys as follows: color of hair, body weight, body height, age, religion. There are possibly several distinct colors for a person's hair, such as black, brown, white, red, and so on. The most natural way to represent these colors is to code them differently as follows:

black : a, brown : b, white : c, red : d.

The reader may still wonder why one cannot code these colors by numbers. For instance, we may code them as follows:

black : 1, brown : 2, white : 3, red : 4.

There is a severe problem associated with this kind of approach. Note that the difference between "black" and "red" is 3 while the difference between "black" and "white" is 2. This is of course not reasonable. Therefore, we shall try our best to avoid using numbers.

To measure the degree of similarity between two records, it suffices to measure the distance between them. If the distance is small, these records must be similar. If the distance is large, they must be dissimilar. Since there are three different types of records, we have to define three types of distance functions. In the rest of this section, we shall assume that Ri = (r_i1, r_i2, ..., r_iN) and Rj = (r_j1, r_j2, ..., r_jN).

A. Type I Records

This kind of record is numerical. We can use at least two well-known distance functions.
1) Euclidean distances:

    d_ij = [ Σ_{k=1}^{N} (r_ik - r_jk)^2 ]^{1/2}.

For instance, if Ri = (3.0, 4.1, -1.5, 9.1) and Rj = (4.0, 3.1, -1.7, 8.5), then

    d_ij = ((3.0 - 4.0)^2 + (4.1 - 3.1)^2 + (-1.5 + 1.7)^2 + (9.1 - 8.5)^2)^{1/2}
         = (1.0^2 + 1.0^2 + 0.2^2 + 0.6^2)^{1/2}
         = 1.55.

2) City block distances:

    d_ij = Σ_{k=1}^{N} |r_ik - r_jk|.

For instance, in the previous example,

    d_ij = |3.0 - 4.0| + |4.1 - 3.1| + |-1.5 + 1.7| + |9.1 - 8.5| = 1.0 + 1.0 + 0.2 + 0.6 = 2.8.

No matter whether one uses the Euclidean distances or the city block distances, one should be careful to make sure that all of the variables are properly normalized, so that the units of measurement will not play dominating roles. One popular and good method of normalization is to normalize the variables with respect to their variances so that the variances are all 1.
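The two Type I distances and the variance normalization are easy to mechanize. The following Python sketch is not from the original paper; it simply illustrates the formulas above, and all function names (euclidean, city_block, normalize_by_variance) are our own.

```python
import math

def euclidean(ri, rj):
    """Euclidean distance between two numerical records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ri, rj)))

def city_block(ri, rj):
    """City block distance between two numerical records."""
    return sum(abs(a - b) for a, b in zip(ri, rj))

def normalize_by_variance(records):
    """Rescale every variable so that its variance over the whole file is 1,
    as suggested in the text, so no unit of measurement dominates."""
    n = len(records)
    columns = list(zip(*records))
    scaled = []
    for col in columns:
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        sd = math.sqrt(var) if var > 0 else 1.0  # leave constant columns alone
        scaled.append([x / sd for x in col])
    return [list(row) for row in zip(*scaled)]

if __name__ == "__main__":
    ri = (3.0, 4.1, -1.5, 9.1)
    rj = (4.0, 3.1, -1.7, 8.5)
    print(round(euclidean(ri, rj), 2))   # 1.55, as in the worked example
    print(round(city_block(ri, rj), 1))  # 2.8, as in the worked example
```

In practice one would call normalize_by_variance on the whole file before computing any pairwise distances.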
B. Type II Records

This type of record is nonnumerical, and we shall introduce the so-called Hamming distances. For Hamming distances,

    d_ij = Σ_{k=1}^{N} δ(r_ik, r_jk)

where δ(x_ik, x_jk) = 1 if x_ik ≠ x_jk, and δ(x_ik, x_jk) = 0 otherwise.

For instance, if Ri = (a, b, c, a) and Rj = (a, c, c, d), then

    d_ij = δ(a, a) + δ(b, c) + δ(c, c) + δ(a, d) = 0 + 1 + 0 + 1 = 2.

Just as with the Type I records, we sometimes have to normalize Type II records. For instance, consider the following set of records:

R1: (A, A, B)
R2: (A, B, C)
R3: (A, C, C)
R4: (A, D, B)
R5: (B, D, C).

According to the definition of Hamming distances, the distance between R1 and R2 is the same as that between R4 and R5. However, by examining the records more carefully, one would find that record R5 is actually quite different from all other records in one respect: the value of X1 is B for R5, while the values of X1 are A for all other samples.

To make sure that R5 appears more distinct, we have to modify the definition of Hamming distances previously mentioned by giving X1 more weight than other variables. There probably are many methods to assign more weight. The method we shall introduce is to use the average difference concept. That is, the distance can be defined as follows:

    d(Ri, Rj) = Σ_{k=1}^{N} w_k δ(r_ik, r_jk)

where δ(x_ik, x_jk) = 1 if x_ik ≠ x_jk, and 0 otherwise, and

    w_k = [M(M - 1)/2] / [Σ_{i=1}^{M-1} Σ_{j=i+1}^{M} δ(x_ik, x_jk)]

(M is the total number of records); that is, w_k is the reciprocal of the average distance of variable X_k over all distinct pairs of records. We have tacitly assumed that the denominator of this expression will not become zero, which is a reasonable assumption.

Let us illustrate the foregoing idea through an example. Consider again the set of records R1, ..., R5 above. The weights are calculated as follows:

    w1 = 10/4 = 2.5
    w2 = 10/9 = 1.1
    w3 = 10/6 = 1.67.

The distance between R1 and R2 and the distance between R1 and R5 are therefore

    d12 = 2.5 × 0 + 1.1 × 1 + 1.67 × 1 = 2.77

and

    d15 = 2.5 × 1 + 1.1 × 1 + 1.67 × 1 = 5.27

respectively. The reader can now see that we have achieved our goal, because the distance between R1 and R5 is much greater than that between R1 and R2.
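A minimal Python sketch of this weighted Hamming distance (our own illustration, with our own function names, not code from the paper) reproduces the weights and distances computed above:

```python
from itertools import combinations

def delta(x, y):
    """delta(x, y) = 1 if the two symbols differ, 0 otherwise."""
    return 0 if x == y else 1

def average_difference_weights(records):
    """w_k = (number of distinct record pairs) / (number of pairs differing on variable k);
    assumes every variable differs on at least one pair, as the text does."""
    m = len(records)
    n_pairs = m * (m - 1) // 2
    weights = []
    for k in range(len(records[0])):
        differing = sum(delta(a[k], b[k]) for a, b in combinations(records, 2))
        weights.append(n_pairs / differing)
    return weights

def weighted_hamming(ri, rj, weights):
    return sum(w * delta(a, b) for w, a, b in zip(weights, ri, rj))

if __name__ == "__main__":
    records = [("A", "A", "B"), ("A", "B", "C"), ("A", "C", "C"),
               ("A", "D", "B"), ("B", "D", "C")]
    w = average_difference_weights(records)   # [2.5, 1.11..., 1.66...]
    print(round(weighted_hamming(records[0], records[1], w), 2))  # ~2.78 (2.77 in the text)
    print(round(weighted_hamming(records[0], records[4], w), 2))  # ~5.28 (5.27 in the text)
```

The small discrepancy in the last digit arises only because the text rounds the weights to 1.1 and 1.67 before summing.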
We can proceed to discuss how to define distances among Type III records.

C. Type III Records

As discussed before, a Type III record is a mixed type, because its variables are both numerical and nonnumerical. In this case, we can still have a reasonable way to define distances:

    d_ij = [ Σ_{k=1}^{N} w_k C(x_ik, x_jk)^2 ]^{1/2}

where w_k is the reciprocal of the average distance of variable X_k among all distinct pairs of records, and

    C(x_ik, x_jk) = δ(x_ik, x_jk), if X_k is nonnumerical
    C(x_ik, x_jk) = |x_ik - x_jk|, if X_k is numerical.

For instance, consider the following set of records:

R1 = (A, B, 5.0, 10.0)
R2 = (A, C, 6.0, 11.0)
R3 = (B, C, 7.0, 9.0).

In this case,

    w1 = 3/2 = 1.5
    w2 = 3/2 = 1.5
    w3 = 3/(|5.0 - 6.0| + |6.0 - 7.0| + |5.0 - 7.0|) = 3/4 = 0.75
    w4 = 3/(|10.0 - 11.0| + |10.0 - 9.0| + |11.0 - 9.0|) = 3/4 = 0.75.

Some distances are calculated as follows:

    d12 = (1.5 × 0^2 + 1.5 × 1^2 + 0.75 × 1^2 + 0.75 × 1^2)^{1/2} = (1.5 + 0.75 + 0.75)^{1/2} = (3.0)^{1/2} = 1.7

    d13 = (1.5 × 1^2 + 1.5 × 1^2 + 0.75 × 2^2 + 0.75 × 1^2)^{1/2} = (1.5 + 1.5 + 3.0 + 0.75)^{1/2} = (6.75)^{1/2} = 2.6.
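The mixed distance can be sketched in Python as follows. This is our own illustration, not the authors' code; in particular, deciding whether a key is numerical by its Python type is an assumption about how the records are represented.

```python
from itertools import combinations

def delta(x, y):
    """1 if two symbols differ, 0 otherwise."""
    return 0 if x == y else 1

def contribution(x, y):
    """C(x, y): symbolic mismatch for nonnumerical keys, absolute difference for numerical ones."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y)
    return delta(x, y)

def mixed_weights(records):
    """w_k = reciprocal of the average per-variable distance over all distinct pairs."""
    m = len(records)
    n_pairs = m * (m - 1) // 2
    weights = []
    for k in range(len(records[0])):
        total = sum(contribution(a[k], b[k]) for a, b in combinations(records, 2))
        weights.append(n_pairs / total)     # assumes total > 0, as in the text
    return weights

def mixed_distance(ri, rj, weights):
    return sum(w * contribution(a, b) ** 2 for w, a, b in zip(weights, ri, rj)) ** 0.5

if __name__ == "__main__":
    records = [("A", "B", 5.0, 10.0), ("A", "C", 6.0, 11.0), ("B", "C", 7.0, 9.0)]
    w = mixed_weights(records)                                   # [1.5, 1.5, 0.75, 0.75]
    print(round(mixed_distance(records[0], records[1], w), 2))   # 1.73 (~1.7 in the text)
    print(round(mixed_distance(records[0], records[2], w), 2))   # 2.6
```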
IV. MULTIKEY SORTING BY THE SHORT SPANNING-PATH TECHNIQUE

As we indicated in Section II, we need to sort the records into a sequence in such a way that similar records are grouped together. This can be done by the short spanning-path technique (Slagle et al. [13]), which can be best explained by considering Fig. 1. In Fig. 1(a), we have several points. In Fig. 1(b), these points are connected by a path. The length of this path is the shortest among all possible paths. Because of this special property, we can say that these points are now connected into a sequence g, f, e, a, b, d, c, h. Note that if these points correspond to records, we have successfully grouped similar records together in the sequence.

[Fig. 1. (a) A set of points a-h in the (X1, X2) plane. (b) The same points connected by a spanning path g, f, e, a, b, d, c, h.]

The shortest spanning path is a path which connects all of the points and has a minimal total length. Since shortest spanning paths are rather hard to construct, we choose to find short spanning paths instead. A short spanning path is a spanning path which is short, but not necessarily the shortest. In Slagle et al. [13], an algorithm was given to generate a short spanning path for a set of points. It should be emphasized here that records do not have to contain numerical values. As long as a distance can be defined between every pair of records, a short spanning path can be constructed.

Once a short spanning path is obtained, we can break the longest links to produce clusters. For instance, in Fig. 1 we can obtain three clusters by breaking the two longest links ea and ch. After breaking these two links, we shall have three clusters:

    c1 = (g, f, e), c2 = (a, b, d, c), and c3 = (h).

We indicated in Section II that it is also necessary for us to have a distance defined between a record with missing data and a record without missing data. This can be done by simply ignoring the contribution of the variable where the missing data occur. For instance, let

    R = (r_1, ..., r_{j-1}, *, r_{j+1}, ..., r_N)

where * denotes the missing jth element, and

    Q = (q_1, ..., q_{j-1}, q_j, q_{j+1}, ..., q_N).

If the Euclidean distance is used, then the modified Euclidean distance is

    d(R, Q) = [ Σ_{k=1}^{j-1} (r_k - q_k)^2 + Σ_{k=j+1}^{N} (r_k - q_k)^2 ]^{1/2}.
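The paper relies on the short spanning-path algorithm of Slagle, Chang, and Heller [13]. As a rough stand-in, the following Python sketch builds a spanning path with a simple greedy nearest-neighbor rule and then breaks the longest links to form clusters; the greedy rule is only one way to obtain a path that is short but not necessarily the shortest, and all names here are ours.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def short_spanning_path(records, dist=euclidean, start=0):
    """Greedy nearest-neighbor chain: repeatedly append the unvisited record
    closest to the current end of the path.  A stand-in for the algorithm of
    Slagle et al. [13], not the algorithm itself."""
    unvisited = set(range(len(records)))
    path = [start]
    unvisited.remove(start)
    while unvisited:
        last = path[-1]
        nxt = min(unvisited, key=lambda i: dist(records[last], records[i]))
        path.append(nxt)
        unvisited.remove(nxt)
    return path

def clusters_by_breaking_links(records, path, n_clusters, dist=euclidean):
    """Break the n_clusters - 1 longest links of the path to form clusters."""
    links = [(dist(records[path[i]], records[path[i + 1]]), i)
             for i in range(len(path) - 1)]
    cut_positions = sorted(i for _, i in sorted(links, reverse=True)[:n_clusters - 1])
    clusters, prev = [], 0
    for pos in cut_positions:
        clusters.append(path[prev:pos + 1])
        prev = pos + 1
    clusters.append(path[prev:])
    return clusters
```

For the path of Fig. 1, breaking the two longest links ea and ch in this way (n_clusters = 3) yields exactly the clusters c1, c2, and c3 listed above.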
V. THE PROCEDURE TO ESTIMATE MISSING DATA

Let us assume that we have a set S of records without any missing data and one record R whose jth element is missing. Our procedure to estimate the value of the jth element of R is as follows.

Step 1: Find a short spanning path for S.
Step 2: Use the resulting short spanning path to obtain clusters C1, C2, ..., CL.
Step 3: Among all records in S, find the nearest neighbor of R. (As for how to calculate the distance between R and the other records, see Section III.) Denote this record by Q. Assume that Q is in some cluster Ci.
Step 4: Let Q1, Q2, ..., Qk be the k nearest neighbors of Q along the short spanning path, where k is a prespecified positive integer. (Any Q1, Q2, ..., Qk which is not in cluster Ci is discarded.)
Step 5: If the jth element of R is numerical, the estimated value of the jth element of R is the average of the jth elements of Q1, Q2, ..., Qk. If the jth element of R is symbolic, the estimated value of the jth element of R is the most frequent jth element in Q1, Q2, ..., Qk.

Let us illustrate the foregoing procedure by an example. Consider the data in Table I.

TABLE I
Record            X1    X2    X3    X4
Iris setosa
R1                5.1   3.5   1.4   0.2
R2                4.9   3.0   1.4   0.2
R3                4.7   3.2   1.3   0.2
R4                4.6   3.1   1.5   0.2
R5                5.0   3.6   1.4   0.2
R6                5.4   3.0   1.7   0.4
R7                4.6   3.4   1.4   0.3
R8                5.0   3.4   1.5   0.2
R9                4.4   2.9   1.4   0.2
R10               4.9   3.1   1.5   0.1
Iris versicolor
R11               7.0   3.2   4.7   1.4
R12               6.4   3.2   4.5   1.5
R13               6.9   3.1   4.9   1.5
R14               5.5   2.3   4.0   1.3
R15               6.5   2.8   4.6   1.5
R16               5.7   2.8   4.5   1.3
R17               6.3   3.3   4.7   1.6
R18               4.9   2.4   3.3   1.0
R19               6.6   2.9   4.6   1.3
R20               5.2   2.7   3.9   1.4

The testing record R is a record for an Iris setosa:

    R = (5.4, 3.7, 1.5, 0.1).

Let us assume that the first element (sepal length) of this record is missing. We can now see how the procedure works in this case.

Step 1: A short spanning path is found for S, S being the set of data in Table I. This short spanning path is shown in Table II.

TABLE II
     Elements on the        Distance between consecutive
     short spanning path    records on the path             X1    X2    X3    X4
C1:  R5                     0.0000                          5.0   3.6   1.4   0.2
     R1                     0.1414                          5.1   3.5   1.4   0.2
     R8                     0.1732                          5.0   3.4   1.5   0.2
     R7                     0.4242                          4.6   3.4   1.4   0.3
     R3                     0.2645                          4.7   3.2   1.3   0.2
     R4                     0.2449                          4.6   3.1   1.5   0.2
     R9                     0.3000                          4.4   2.9   1.4   0.2
     R2                     0.5000                          4.9   3.0   1.4   0.2
     R10                    0.1732                          4.9   3.1   1.5   0.1
     R6                     0.6244                          5.4   3.0   1.7   0.4
C2:  R18                    1.8788                          4.9   2.4   3.3   1.0
     R20                    0.8366                          5.2   2.7   3.9   1.4
     R14                    0.5196                          5.5   2.3   4.0   1.3
     R16                    0.7348                          5.7   2.8   4.5   1.3
     R15                    0.8306                          6.5   2.8   4.6   1.5
     R19                    0.2449                          6.6   2.9   4.6   1.3
     R12                    0.4242                          6.4   3.2   4.5   1.5
     R17                    0.2645                          6.3   3.3   4.7   1.6
     R11                    0.7348                          7.0   3.2   4.7   1.4
     R13                    0.2645                          6.9   3.1   4.9   1.5

Step 2: The only long link in the short spanning path is the link between R6 and R18. After this link is broken, we obtain two clusters, as shown in Table II. The first cluster contains all of the Iris setosa and the second cluster contains all of the Iris versicolor.
Step 3: The nearest neighbor of R is R1. R1 belongs to C1.
Step 4: Let us assume that k = 3. As shown in Table II, the three neighbors of R1 on the short spanning path are R5, R1, and R8. The values of X1 for R5, R1, and R8 are 5.0, 5.1, and 5.0, respectively.
Step 5: The estimated value of X1 for R is therefore (5.0 + 5.1 + 5.0)/3, giving 5.03.

The real value of the sepal length of R is 5.4. The error caused by our estimation is (5.4 - 5.03)/5.4 = 0.068 = 6.8 percent.

The reader may ask an important question: How can we determine the value of k? This is a problem which has for a long time puzzled pattern-recognition researchers using the nearest-neighbor searching technique (Duda and Hart [2] and Meisel [9]). Fortunately, this problem has been solved. Dudani [3] showed that the value of k is not critical if each distance is given a weight. Essentially, according to Dudani's result, if a record is quite different from our record with missing data, it should not be considered very important. On the other hand, all of the records similar to our record with missing data should be given close examination. For more detailed information, the reader should consult Dudani [3].
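Steps 3-5 can be sketched in Python as follows. This is our own illustration, not the authors' code: it takes a precomputed path and cluster list (Steps 1 and 2), marks missing positions with None, and, as in the worked example above, lets the nearest complete record itself be one of the k records that contribute to the estimate.

```python
import math
from collections import Counter

def distance_ignoring_missing(r, q):
    """Modified Euclidean distance: skip positions where r has missing data (None)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, q) if a is not None))

def estimate_missing(r, j, records, path, clusters, k=3):
    """Estimate the jth element of the incomplete record r.

    records  -- mapping from record index to complete record
    path     -- record indices in short-spanning-path order (Step 1)
    clusters -- lists of record indices obtained by breaking long links (Step 2)
    """
    # Step 3: nearest complete record Q and the cluster it belongs to.
    q = min(path, key=lambda i: distance_ignoring_missing(r, records[i]))
    cluster = next(c for c in clusters if q in c)
    # Step 4: the k records closest to Q along the path, discarding any outside Q's cluster.
    by_path_position = sorted(path, key=lambda i: abs(path.index(i) - path.index(q)))
    neighbors = [i for i in by_path_position if i in cluster][:k]
    # Step 5: average for numerical keys, majority vote for symbolic keys.
    values = [records[i][j] for i in neighbors]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    return Counter(values).most_common(1)[0][0]
```

With the Table II path, its two clusters, and r = (None, 3.7, 1.5, 0.1), the records used are R5, R1, and R8, and the sketch returns (5.0 + 5.1 + 5.0)/3, about 5.03, matching the estimate in the text.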
VI. CLUSTERING ANALYSIS AND THE DATA-INTEGRITY PROBLEM

In Section V, when estimating the value of a piece of missing data, we divided samples into homogeneous groups. This problem of dividing records into homogeneous groups is called the clustering-analysis problem (Hartigan [7]). The short spanning-path technique can also be used as a clustering-analysis technique, as shown in Section IV.

In Section IV, three clusters were generated. Cluster c3 is an unusual cluster because it contains only one record. We shall call such a cluster a singleton cluster. The record in a singleton cluster is usually quite different from the others and is therefore possibly an error. Of course, if a record is clustered into a singleton cluster, it does not necessarily mean that this record contains an error.

That the clustering-analysis technique can be used to improve the integrity of data was accidentally discovered by the two senior authors of this paper, Lee and Slagle. In the spring of 1975, they were working on a set of personnel records and noted a very special record which was consistently put into a singleton cluster, a cluster containing only one record. After examining this record carefully, they found that almost everything in the record was normal except one peculiar item: the increment of salary from 1963 to 1967. Within five years, this gentleman was said to have quadrupled his salary. This is why the cluster analysis put this record into a singleton cluster. Finally, they looked into other sources and discovered that in 1963 the person involved was making 15 000 dollars a year, not 5000 dollars a year, as the record showed.

If the error in a record is not significant enough, it is possible that it will not cause the record to be significantly different from the others. In such a situation, we cannot expect our method to work. Often it is unimportant to detect small errors.

At this point, let us point out another aspect of the kind of errors we have in mind. Consider Table I again. In Table I, we have 20 samples taken from the famous Iris data first discussed in Fisher [5]. The first ten samples are Iris setosa and the next ten samples are Iris versicolor. Let us now imagine that for record R1, the value of X3 is not recorded as 1.4, but rather as 8.0. Since, for the rest of the records, the value of X3 ranges from 1.3 to 4.9, this error can be considered an out-of-range error. An out-of-range error is not hard to detect because it can be detected by examining each variable alone. In fact, this is a single-variable clustering-analysis problem, and it was discussed thoroughly in Slagle et al. [12]. We are not interested in being able to detect this kind of error.

Suppose for R1 the value of X3 is not 1.4, but rather 3.0. This is not an out-of-range error because 3.0 is well within the range of X3. In fact, if R1 is a record for an Iris versicolor, then X3 = 3.0 is not an error at all. X3 = 3.0 is an error for R1 simply because it is incompatible with the other elements in the same record. That is, when an iris has sepal length equal to 5.1, sepal width equal to 3.5, and petal width equal to 0.2, then it is quite unlikely that its petal length will be equal to 3.0. Putting it more formally, we shall say that the probability that X3 = 3.0, given X1 = 5.1, X2 = 3.5, and X4 = 0.2, is quite small.

Let us consider another example. Imagine that we have some personnel records and there is one person who holds a very high position and yet does not have a high school diploma. It is quite possible that this is an error.

In summary, the reader can see that the kind of error which we are interested in can be detected only by examining records, not by examining individual variables. This is why we have to use clustering analysis to solve this problem.
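A hedged Python sketch of this whole-record check follows. The paper breaks the longest links of the short spanning path by inspection; here we substitute a simple heuristic (break every link longer than factor times the average link length) and report singleton clusters as suspects. Both the rule and the factor value are our own stand-in, and the function names are ours.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_suspect_records(records, path, dist=euclidean, factor=2.0):
    """Break every link that is much longer than the average link of the
    short spanning path, then report the records that end up in singleton
    clusters as possible data-integrity errors."""
    lengths = [dist(records[path[i]], records[path[i + 1]]) for i in range(len(path) - 1)]
    threshold = factor * sum(lengths) / len(lengths)
    clusters, current = [], [path[0]]
    for i, length in enumerate(lengths):
        if length > threshold:          # a long link: start a new cluster
            clusters.append(current)
            current = []
        current.append(path[i + 1])
    clusters.append(current)
    return [c[0] for c in clusters if len(c) == 1]   # indices of singleton records
```

On the corrupted Table I data used in Experiment 2 below (with R1's X3 set to 3.0), only the two links adjacent to R1 (1.4071 and 1.4387 in Table III) exceed twice the average link length of roughly 0.56, so R1 is the single record flagged, which is consistent with the result reported there.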
VII. EXPERIMENTAL RESULTS

Experiments were conducted to test the feasibility of our methods. All of the experiments were conducted by Lee and Mong at the National Tsing Hua University, Taiwan. Two sets of data were used in the experiments.
1) The Iris data. This is the famous data first investigated by Fisher [5]. For more information, consult Fisher's original paper.
2) The warship data. This set of data was obtained from the book Weyer's Warships of the World 1969, compiled by Gerhard Albrecht of West Germany. The English translation edition was published by the U.S. Naval Institute. For every ship used in the experiments, we used eight of the eighteen variables to characterize it. These eight variables are

    X1 = displacement
    X2 = speed
    X3 = endurance
    X4 = horse power
    X5 = length
    X6 = width of the beam
    X7 = draft
    X8 = number of crew.

Experiment 1 (Iris Data, Missing-Information Problem): In this experiment we used the first 40 samples from each kind of Iris as the data without missing information. There are thus a total of 120 samples without missing information. We used the last ten samples from each kind of Iris as the testing samples. We then masked one of the variables for each testing sample according to the following rule: the first variable for the first sample, the second variable for the second sample, and so on. For each estimated missing value, we calculated the percentage error rate. The result is as follows:

    k (the number of neighbors used to predict)    3     5     7     9     11
    Error (%)                                      8.8   8.3   9.7   10.9  9.9

Experiment 2 (Iris Data, Data-Integrity Problem): In this experiment, we used the data in Table I. The petal length (X3) of the first record R1 was deliberately set to 3.0 instead of 1.4. The purpose of the experiment was to test whether our method would detect this error. A short spanning path for the data in Table I is now shown in Table III.

TABLE III
     Records on the          Distance between
     short spanning path     consecutive records     X1    X2    X3    X4
C1:  R13                     0.0000                  6.9   3.1   4.9   1.5
     R11                     0.2645                  7.0   3.2   4.7   1.4
     R17                     0.7348                  6.3   3.3   4.7   1.6
     R12                     0.2645                  6.4   3.2   4.5   1.5
     R19                     0.4242                  6.6   2.9   4.6   1.3
     R15                     0.2449                  6.5   2.8   4.6   1.5
     R16                     0.8306                  5.7   2.8   4.5   1.3
     R14                     0.7348                  5.5   2.3   4.0   1.3
     R20                     0.5196                  5.2   2.7   3.9   1.4
     R18                     0.8366                  4.9   2.4   3.3   1.0
C2:  R1                      1.4071                  5.1   3.5   3.0   0.2
C3:  R6                      1.4387                  5.4   3.0   1.7   0.4
     R5                      0.8062                  5.0   3.6   1.4   0.2
     R8                      0.2236                  5.0   3.4   1.5   0.2
     R10                     0.3316                  4.9   3.1   1.5   0.1
     R2                      0.1732                  4.9   3.0   1.4   0.2
     R3                      0.2999                  4.7   3.2   1.3   0.2
     R9                      0.4358                  4.4   2.9   1.4   0.2
     R4                      0.3000                  4.6   3.1   1.5   0.2
     R7                      0.3316                  4.6   3.4   1.4   0.3

The reader can see that there are two long links in this short spanning path: the link between R18 and R1 and the link between R1 and R6. These two long links ought to be broken and, after they are broken, three clusters will be obtained:

    C1 = (R13, R11, R17, R12, R19, R15, R16, R14, R20, R18)
    C2 = (R1)
    C3 = (R6, R5, R8, R10, R2, R3, R9, R4, R7).

Thus, R1 was singled out as something unusual because it was put into a singleton cluster. We may say that the error was successfully detected. Again, it should be emphasized that X3 = 3.0 is not an out-of-range error. The difficulty of detecting this kind of error should therefore be appreciated.

Experiment 3 (Warship Data, Data-Integrity Problem): In this experiment, we selected fifty warships, including ten cruisers, ten destroyers, ten frigates, ten submarines, and ten mine sweepers. In Fig. 2, we show the resulting short spanning path constructed from these warships. The power of the clustering technique is now clearly demonstrated. For instance, the analysis singled out all of the nuclear powered ships, separated the frigates into large frigates and small frigates, and, among all of the long-range submarines, it further split them into low-speed ones and high-speed ones. There are four singleton clusters which caught our attention. They are C2, C3, C10, and C13. C2 contains a nuclear powered cruiser, and it is the only such cruiser. C13 contains a cruiser with a very wide beam. Both C3 and C10 were caused by errors. Let us now describe these errors in more detail.

[Fig. 2. The short spanning path constructed from the fifty warships. The labeled clusters include nuclear destroyers, a nuclear cruiser (C2), a destroyer (C3, an error), short-range submarines (C7), low-speed long-range submarines (C8), high-speed long-range submarines (C9), a submarine (C10, an error), destroyers (C11), cruisers (C12), a cruiser with a very wide beam (C13), mine sweepers, small frigates, and large frigates.]

1) C10 contains a submarine with a rather short endurance, only 11 200 sea miles. A submarine may have a short range. It just happens that this one has a very large displacement, the highest speed, the longest length, and a relatively deep draft. All of these indicate that this submarine is unique. Checking into this matter more carefully, we discovered that there was an error in the data. The endurance should be 112 000 sea miles instead of 11 200 sea miles. This error was caused by a keypunch error which had never been detected before.
2) C3 contains a destroyer. This is a destroyer belonging to the 9 DG (guided missile destroyers) group mentioned on pp. 140-141 of Weyer's Warships of the World 1969. This group of ships includes Reeves, Halsey, England, and so on. Checking into these data more carefully, we discovered that these destroyers have shallow drafts, only 9.2 ft. This is obviously a mistake. The Naval experts we consulted told us that it should be 19.2 ft instead of 9.2 ft. Thus an error in this book was discovered.

Experiment 4 (Warship Data, Missing-Data Problem): After conducting Experiment 3, the warship data were "cleaned." We discarded all of the nuclear powered warships and replaced them with conventional ones. Both mistakes were corrected. Forty-five warships were used as records without missing information and five (one from each kind) were used as testing records. For each testing record, one variable after another was masked. That is, we produced 8 × 5 = 40 testing records (there were 8 variables for each warship). The result is as follows:

    k        3       5
    Error    6.7%    7.2%

VIII. CONCLUSIONS AND FURTHER RESEARCH

In this paper, we have presented an algorithm to estimate missing data and an algorithm to detect possible errors. Experimental results showed that our approach is feasible.
There are many other clustering-analysis methods, and we do not claim that our clustering method or our distance function is the best. It is our experience that many reasonable clustering methods would work.

It is obvious that our method would be practical only if efficient clustering-analysis methods for very large data bases are available. Some progress has already been made along this line. The reader may consult, for instance, Chang [1] and Lee and Dang [8].

We would like to point out that this kind of auditing is not going to replace ordinary auditing done by human beings. However, it will be a good aid to help auditors of records carry out their duties. At the National Tsing Hua University, we have been encouraging people to use our system to detect possible errors. In fact, many errors have since been detected through the use of this system, and the interest and confidence of our computer users towards the computer have been significantly increased.

REFERENCES

[1] C. L. Chang, "Finding prototypes for nearest neighbor classifiers," IEEE Trans. Comput., vol. C-23, pp. 1179-1184, Nov. 1974.
[2] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley-Interscience, 1973.
[3] S. A. Dudani, "The distance-weighted k-nearest-neighbor rule," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, pp. 325-327, Apr. 1976.
[4] I. P. Fellegi and D. Holt, "A systematic approach to automatic editing and imputation," J. Amer. Stat. Ass., pp. 17-35, Mar. 1976.
[5] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugen., pt. II, pp. 179-188, 1936.
[6] R. J. Freund and H. O. Hartley, "A procedure for automatic data editing," J. Amer. Stat. Ass., pp. 341-352, June 1967.
[7] J. Hartigan, Clustering Analysis. New York: McGraw-Hill, 1975.
[8] R. C. T. Lee and T. T. Dang, "Clustering with merging for large input data," in Proc. National Computer Symp., Republic of China, Taipei, Taiwan, 1976, pp. 4-1-4-18.
[9] W. S. Meisel, Computer-Oriented Approaches to Pattern Recognition. New York: Academic, 1972.
[10] J. I. Naus, T. G. Johnson, and R. Montalvo, "A probabilistic model for identifying errors and data editing," J. Amer. Stat. Ass., pp. 343-350, Dec. 1972.
[11] C. W. Skinner, "A heuristic approach to inductive inference in fact retrieval systems," Commun. Ass. Comput. Mach., vol. 17, pp. 707-712, Dec. 1974.
[12] J. R. Slagle, C. L. Chang, and R. C. T. Lee, "Experiments with some cluster analysis algorithms," Pattern Recognition, vol. 6, pp. 181-187, 1974.
[13] J. R. Slagle, C. L. Chang, and S. Heller, "A clustering and data-reorganizing algorithm," IEEE Trans. Syst., Man, Cybern., vol. SMC-5, pp. 125-128, Jan. 1975.
R. C. T. Lee (A'74-M'75) was born in Shanghai, the Republic of China, in 1939. He received the B.S.E.E. degree from the National Taiwan University, Taipei, Taiwan, in 1961, and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1963 and 1967, respectively.
He has been with the National Cash Register Corp., Hawthorne, CA, the National Institutes of Health, Bethesda, MD, and the Naval Research Laboratory, Washington, DC. In 1975, he joined the National Tsing Hua University, Taiwan, where, from 1975 to 1977, he was the Director of the Institute of Applied Mathematics; he is now the Director of the Institute of Computer and Decision Sciences. His interests are in mechanical theorem proving, pattern recognition, clustering analysis, data-base design, and the application of computers to management. He has published more than 20 papers and is a coauthor of the book Symbolic Logic and Mechanical Theorem Proving.

James R. Slagle received the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge.
He has written numerous articles on artificial intelligence, and his book, Artificial Intelligence (McGraw-Hill), was published in 1971. He is currently Head of the Computer Science Laboratory, Communications Sciences Division of the Naval Research Laboratory, Washington, DC, where his work involves automatic pattern recognition, automatic clustering, and speech analysis.
Dr. Slagle was presented with the award for Outstanding Blind Student of 1959 by President Eisenhower, and was selected as one of the Ten Outstanding Young Men of America by the United States Jaycees in 1969.

C. T. Mong received the B.S. degree in mathematics from the National Cheng Kung University, Taiwan, in 1974, and the M.S. degree from the Institute of Applied Mathematics (Computer Science Section) of the National Tsing Hua University, Taiwan, in 1976.
His interests are essentially in computer data-base design.