IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 5, SEPTEMBER 1978

Towards Automatic Auditing of Records

R. C. T. LEE, MEMBER, IEEE, JAMES R. SLAGLE, AND C. T. MONG
Abstract-We computer scientists face at least two problems in promoting the use of computerized data-base systems: 1) some important
data might be missing; 2) there might be errors in the data. Both of
these problems can be quite serious. If they cannot be solved, it will
be quite hard to convince potential users that computerized information systems are useful.
In this paper, we shall show that while it is generally impossible to
overcome these difficulties entirely, we have succeeded in developing
some techniques to overcome these difficulties partially. Using a part
of the data, our algorithm detected an error in the book Weyer's Warships of the World 1969. Each of the approximately 2000 warships
listed in the book has 18 variables associated with it. It would be difficult for a person to find errors in the book.
It is important to point out here that our method does not require
any a priori knowledge about the data. For instance, we detected the
error in the book Weyer's Warships of the World 1969 without knowing
anything about warships.
Index Terms-Auditing of records, clustering analysis, data-integrity problem, error in data, Hamming distance, missing data, multikey sorting, short spanning path.

I. INTRODUCTION

IN THIS PAPER, we are concerned with the problem of automatic auditing of records in a data-base system. By automatic auditing, we mean the following.

1) If there is an invalid blank (missing data) in a record, try to estimate the value of this piece of missing data. For instance, try to estimate the salary of an employee. This problem will be referred to as the missing-data problem.

2) Detect possible errors that might exist in the records. For instance, if a record shows a very limited endurance for a powerful ship, there might be some error in the record, and we would like to detect such an error. This problem will be referred to as the data-integrity problem.

Freund and Hartley [6], Naus et al. [10], as well as Fellegi and Holt [4], have all made contributions to the solution of this problem. Our method is different from theirs in one important respect: ours does not require any a priori knowledge of the data. For instance, our method can be used to detect possible errors in a book concerned with warships; however, we do not have to know anything about warships.

Skinner [11] was the first to use clustering analysis for inductive inference. His clustering method is based upon a property-list data structure, which means a property list has to be constructed for every record. Our clustering method is different from his and can be applied directly to a set of records. We further extended Skinner's work to solve the data-integrity problem.

While we cannot say that we can correctly fill in all of the missing data and detect all possible errors, according to our experimental results, we did fill in many data with reasonable accuracy and, furthermore, we did detect some errors which otherwise would be very hard to detect.

Manuscript received April 29, 1977; revised December 2, 1977. This work was supported in part by the National Science Council, the Republic of China, under Grant NSC-65M-0204-03 (04).
R. C. T. Lee is with the Institute of Computer and Decision Sciences, National Tsing Hua University, Hsinchu, Taiwan 300, Republic of China.
J. R. Slagle is with the Computer Science Laboratory, Communications Sciences Division, Naval Research Laboratory, Washington, DC 20375.
C. T. Mong is with the Air Force Cadet School, Taiwan, Republic of China.

II. THE MISSING-DATA PROBLEM AND MULTIKEY SORTING

Let us discuss some basic principles on how to handle the missing-data problem. Imagine that we have some personnel records, and suppose that there is one person A whose record does not contain any information about his salary. Our problem is: how are we going to make an educated guess and fill in this missing information?

We shall assume that the majority of records do not have any missing data. If this is the case, we can then do the following.

1) Find all of the people who have a background similar to A's. Let us assume that they are B, C, D, and E.

2) Let the salaries of B, C, D, and E be S1, S2, S3, and S4, respectively. Then we may estimate the salary of A to be the average of S1, S2, S3, and S4.

The reader can see that this procedure requires finding all of the records of the people who have a background similar to A's. This can be accomplished by searching through the entire file. If our file is a large one, this searching can be quite time-consuming. Our method is to sort this set of records into a sequence R1, R2, ..., RM in such a way that similar records are grouped together. After this is done, we can simply find the record Ri which is most similar to R. We then retrieve some records around Ri in the sequence and use these records to estimate the value of the missing data.

Thus our method must consist of the following mechanisms.

1) We must be able to measure the degree of similarity between every pair of records.

2) We must be able to determine the degree of similarity between a record with missing data and a record without missing data.

3) We must be able to sort the records in such a way that similar records are grouped together in the sequence.

In the following section, we shall discuss these problems.

III. DISTANCES AND MODIFIED DISTANCES

Throughout this paper, we shall assume that every record is of the form of a vector: Ri = (r_i1, r_i2, ..., r_iN). In general, we may have three types of records.

1) Type I records: every key assumes numerical values. A typical record is (1.0, 3.1, -1.5, 11.3).

2) Type II records: every key is symbolic. A typical record is (a, f, g, h).

3) Type III records: some keys are symbolic and some are numerical. A typical record is (b, c, 160, 5.3, 3.6).

The reader may now wonder how one can possibly have Type III records. Imagine that a record is related to a person and we have five keys as follows: color of hair, body weight, body height, age, and religion. There are possibly several distinct colors for a person's hair, such as black, brown, white, red, and so on. The most natural way to represent these colors is to code them differently as follows:

black : a
brown : b
white : c
red   : d.

The reader may still wonder why one cannot code these colors by numbers. For instance, we may code them as follows:

black : 1
brown : 2
white : 3
red   : 4.

There is a severe problem associated with this kind of approach. Note that the difference between "black" and "red" is 3, while the difference between "black" and "white" is 2. This is of course not reasonable. Therefore, we shall try our best to avoid using numbers.

To measure the degree of similarity between two records, it suffices to measure the distance between them. If the distance is small, the records must be similar. If the distance is large, they must be dissimilar.

Since there are three different types of records, we have to define three types of distance functions. In the rest of this section, we shall assume that Ri = (r_i1, r_i2, ..., r_iN) and Rj = (r_j1, r_j2, ..., r_jN).

A. Type I Records

This kind of record is numerical. We can use at least two well-known distance functions.

1) Euclidean distance:

d_ij = (Σ_{k=1}^{N} (r_ik - r_jk)^2)^{1/2}.

For instance, if Ri = (3.0, 4.1, -1.5, 9.1) and Rj = (4.0, 3.1, -1.7, 8.5), then

d_ij = ((3.0 - 4.0)^2 + (4.1 - 3.1)^2 + (-1.5 + 1.7)^2 + (9.1 - 8.5)^2)^{1/2}
     = (1.0^2 + 1.0^2 + 0.2^2 + 0.6^2)^{1/2}
     = 1.55.

2) City block distance:

d_ij = Σ_{k=1}^{N} |r_ik - r_jk|.

For instance, in the previous example,

d_ij = |3.0 - 4.0| + |4.1 - 3.1| + |-1.5 + 1.7| + |9.1 - 8.5|
     = 1.0 + 1.0 + 0.2 + 0.6
     = 2.8.

No matter whether one uses the Euclidean distance or the city block distance, one should be careful to make sure that all of the variables are properly normalized, so that the units of measurement will not play dominating roles. One popular and good method of normalization is to normalize the variables with respect to their variances so that the variances are all 1.

B. Type II Records

This type of record is nonnumerical, and we shall introduce the so-called Hamming distance. For Hamming distances,

d_ij = Σ_{k=1}^{N} δ(r_ik, r_jk)

where

δ(x_ik, x_jk) = 1, if x_ik ≠ x_jk;
             = 0, otherwise.

For instance, if Ri = (a, b, c, a) and Rj = (a, c, c, d), then

d_ij = δ(a, a) + δ(b, c) + δ(c, c) + δ(a, d) = 0 + 1 + 0 + 1 = 2.

Just as with the Type I records, we sometimes have to normalize Type II records. For instance, consider the following set of records:

R1: (A, A, B)
R2: (A, B, C)
R3: (A, C, C)
R4: (A, D, B)
R5: (B, D, C).

According to the definition of Hamming distances, the distance between R1 and R2 is the same as that between R4 and R5. However, by examining the records more carefully, one would find that record R5 is actually quite different from all other records in one respect: the value of X1 is B for R5, while the value of X1 is A for all other samples.

To make sure that R5 appears more distinct, we have to modify the definition of Hamming distances previously mentioned by giving X1 more weight than the other variables. There probably are many methods to assign more weight. The method we shall introduce is to use the average-difference concept. That is, the distance can be defined as follows:

d(Ri, Rj) = Σ_{k=1}^{N} w_k δ(r_ik, r_jk)

where

δ(x_ik, x_jk) = 1, if x_ik ≠ x_jk;
             = 0, otherwise

and

w_k = [M(M - 1)/2] / [Σ_{i=1}^{M-1} Σ_{j=i+1}^{M} δ(x_ik, x_jk)]

(M is the total number of records). We have tacitly assumed that the denominator of this expression will not become zero, which is a reasonable assumption.

Let us illustrate the foregoing idea through an example. Consider the set of records R1 through R5 given above. The weights are calculated as follows:

w1 = 10/4 = 2.5,  w2 = 10/9 = 1.1,  w3 = 10/6 = 1.67.

The distance between R1 and R2 and the distance between R1 and R5 are therefore

d12 = 2.5 × 0 + 1.1 × 1 + 1.67 × 1 = 2.77

and

d15 = 2.5 × 1 + 1.1 × 1 + 1.67 × 1 = 5.27

respectively. The reader can now see that we have achieved our goal, because the distance between R1 and R5 is much greater than that between R1 and R2.

We can now proceed to discuss how to define distances among Type III records.

C. Type III Records

As discussed before, a Type III record is of a mixed type, because its variables are both numerical and nonnumerical. In this case, we can still define distances in a reasonable way:

d_ij = (Σ_{k=1}^{N} w_k [C(x_ik, x_jk)]^2)^{1/2}

where w_k is the reciprocal of the average distance of variable X_k among all distinct pairs of records, and

C(x_ik, x_jk) = δ(x_ik, x_jk), if X_k is nonnumerical

and

C(x_ik, x_jk) = |x_ik - x_jk|, if X_k is numerical.

For instance, consider the following set of records:

R1 = (A, B, 5.0, 10.0)
R2 = (A, C, 6.0, 11.0)
R3 = (B, C, 7.0, 9.0).

In this case,

w1 = 3/2 = 1.5
w2 = 3/2 = 1.5
w3 = 3/(|5.0 - 6.0| + |6.0 - 7.0| + |5.0 - 7.0|) = 3/4 = 0.75
w4 = 3/(|10.0 - 11.0| + |10.0 - 9.0| + |11.0 - 9.0|) = 3/4 = 0.75.

Some distances are calculated as follows:

d12 = (1.5 × 0^2 + 1.5 × 1^2 + 0.75 × 1^2 + 0.75 × 1^2)^{1/2}
    = (1.5 + 0.75 + 0.75)^{1/2}
    = (3.0)^{1/2} = 1.7

and

d13 = (1.5 × 1^2 + 1.5 × 1^2 + 0.75 × 2^2 + 0.75 × 1^2)^{1/2}
    = (1.5 + 1.5 + 3.0 + 0.75)^{1/2}
    = (6.75)^{1/2} = 2.6.

We indicated in Section II that it is necessary for us to have a distance defined between a record with missing data and a record without missing data. This can be done by simply ignoring the contribution of the variable where the missing data occur. For instance, let

R = (r1, ..., r_{j-1}, -, r_{j+1}, ..., rN)

and

Q = (q1, ..., q_{j-1}, q_j, q_{j+1}, ..., qN).

If the Euclidean distance is used, then the modified Euclidean distance is

d(R, Q) = (Σ_{k=1}^{j-1} (r_k - q_k)^2 + Σ_{k=j+1}^{N} (r_k - q_k)^2)^{1/2}.

IV. MULTIKEY SORTING BY THE SHORT SPANNING-PATH TECHNIQUE

As we indicated in Section II, we need a sorting of records into a sequence in such a way that similar records are grouped together. This can be done by the short spanning-path technique (Slagle et al. [13]), which can be best explained by considering Fig. 1. In Fig. 1(a), we have several points. In Fig. 1(b), these points are connected by a path. The length of this path is the shortest among all possible paths. Because of this special property, we can say that these points are now connected into a sequence g, f, e, a, b, d, c, h. Note that if these points correspond to records, we have successfully grouped similar records together in the sequence.

The shortest spanning path is a path which connects all of the points and has a minimal total length. Since shortest spanning paths are rather hard to construct, we choose to find short spanning paths instead. A short spanning path is a spanning path which is short, but not necessarily the shortest. In Slagle et al. [13], an algorithm was given to generate a short spanning path for a set of points. It should be emphasized here that records do not have to contain numerical values. As long as a distance can be defined between every pair of records, a short spanning path can be constructed.

Once a short spanning path is obtained, we can break the longest links to produce clusters. For instance, in Fig. 1 we can obtain three clusters by breaking the two longest links ea and ch. After breaking these two links, we shall have three clusters:

c1 = (g, f, e)
c2 = (a, b, d, c)
c3 = (h).

[Fig. 1. (a) A set of points in the X1-X2 plane. (b) A short spanning path connecting the points.]

V. THE PROCEDURE TO ESTIMATE MISSING DATA

Let us assume that we have a set S of records without any missing data and one record R whose jth element is missing. Our procedure to estimate the value of the jth element for R is as follows.

Step 1: Find a short spanning path for S.

Step 2: Use the resulting short spanning path to obtain clusters C1, C2, ..., CL.

Step 3: Among all records in S, find the nearest neighbor of R. (As for how to calculate the distance between R and other records, see Section III.) Denote this record by Q. Assume that Q is in some cluster Ci.

Step 4: Let Q1, Q2, ..., Qk be the k nearest neighbors of Q along the short spanning path, where k is a prespecified positive integer. (Any of Q1, Q2, ..., Qk which is not in cluster Ci is discarded.)

Step 5: If the jth element of R is numerical, the estimated value of the jth element of R is the average of the jth elements of Q1, Q2, ..., Qk. If the jth element of R is symbolic, the estimated value of the jth element of R is the most frequent jth element in Q1, Q2, ..., Qk.

Let us illustrate the foregoing procedure by an example. Consider the data in Table I. The testing record R is a record for an Iris setosa:

R = (5.4, 3.7, 1.5, 0.1).

Let us assume that the first element (sepal length) of this record is missing. We can now see how the procedure works in this case.

Step 1: A short spanning path is found for S, S being the set of data in Table I. This short spanning path is shown in Table II.
TABLE I
(R1-R10: Iris setosa; R11-R20: Iris versicolor)

        x1    x2    x3    x4
R1      5.1   3.5   1.4   0.2
R2      4.9   3.0   1.4   0.2
R3      4.7   3.2   1.3   0.2
R4      4.6   3.1   1.5   0.2
R5      5.0   3.6   1.4   0.2
R6      5.4   3.0   1.7   0.4
R7      4.6   3.4   1.4   0.3
R8      5.0   3.4   1.5   0.2
R9      4.4   2.9   1.4   0.2
R10     4.9   3.1   1.5   0.1
R11     7.0   3.2   4.7   1.4
R12     6.4   3.2   4.5   1.5
R13     6.9   3.1   4.9   1.5
R14     5.5   2.3   4.0   1.3
R15     6.5   2.8   4.6   1.5
R16     5.7   2.8   4.5   1.3
R17     6.3   3.3   4.7   1.6
R18     4.9   2.4   3.3   1.0
R19     6.6   2.9   4.6   1.3
R20     5.2   2.7   3.9   1.4

TABLE II
(Elements on the short spanning path, with the distance between consecutive records on the path)

Cluster  Record  x1    x2    x3    x4    Distance
C1       R5      5.0   3.6   1.4   0.2   0.0000
C1       R1      5.1   3.5   1.4   0.2   0.1414
C1       R8      5.0   3.4   1.5   0.2   0.1732
C1       R7      4.6   3.4   1.4   0.3   0.4242
C1       R3      4.7   3.2   1.3   0.2   0.2645
C1       R4      4.6   3.1   1.5   0.2   0.2449
C1       R9      4.4   2.9   1.4   0.2   0.3000
C1       R2      4.9   3.0   1.4   0.2   0.5000
C1       R10     4.9   3.1   1.5   0.1   0.1732
C1       R6      5.4   3.0   1.7   0.4   0.6244
C2       R18     4.9   2.4   3.3   1.0   1.8788
C2       R20     5.2   2.7   3.9   1.4   0.8366
C2       R14     5.5   2.3   4.0   1.3   0.5196
C2       R16     5.7   2.8   4.5   1.3   0.7348
C2       R15     6.5   2.8   4.6   1.5   0.8306
C2       R19     6.6   2.9   4.6   1.3   0.2449
C2       R12     6.4   3.2   4.5   1.5   0.4242
C2       R17     6.3   3.3   4.7   1.6   0.2645
C2       R11     7.0   3.2   4.7   1.4   0.7348
C2       R13     6.9   3.1   4.9   1.5   0.2645
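The link lengths in Table II can be checked directly from Table I. The sketch below is ours, not the authors' program: the path is copied from Table II rather than generated by the algorithm of Slagle et al. [13], and all identifiers are our own. It recomputes the Euclidean distance of each consecutive pair on the path and finds the longest link.

```python
import math

# Table I (R1-R10 are Iris setosa, R11-R20 Iris versicolor).
TABLE_I = {
    "R1": (5.1, 3.5, 1.4, 0.2), "R2": (4.9, 3.0, 1.4, 0.2),
    "R3": (4.7, 3.2, 1.3, 0.2), "R4": (4.6, 3.1, 1.5, 0.2),
    "R5": (5.0, 3.6, 1.4, 0.2), "R6": (5.4, 3.0, 1.7, 0.4),
    "R7": (4.6, 3.4, 1.4, 0.3), "R8": (5.0, 3.4, 1.5, 0.2),
    "R9": (4.4, 2.9, 1.4, 0.2), "R10": (4.9, 3.1, 1.5, 0.1),
    "R11": (7.0, 3.2, 4.7, 1.4), "R12": (6.4, 3.2, 4.5, 1.5),
    "R13": (6.9, 3.1, 4.9, 1.5), "R14": (5.5, 2.3, 4.0, 1.3),
    "R15": (6.5, 2.8, 4.6, 1.5), "R16": (5.7, 2.8, 4.5, 1.3),
    "R17": (6.3, 3.3, 4.7, 1.6), "R18": (4.9, 2.4, 3.3, 1.0),
    "R19": (6.6, 2.9, 4.6, 1.3), "R20": (5.2, 2.7, 3.9, 1.4),
}

# The short spanning path of Table II, taken as given.
PATH = ["R5", "R1", "R8", "R7", "R3", "R4", "R9", "R2", "R10", "R6",
        "R18", "R20", "R14", "R16", "R15", "R19", "R12", "R17", "R11", "R13"]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between each pair of consecutive records on the path.
links = {(PATH[i], PATH[i + 1]):
         euclidean(TABLE_I[PATH[i]], TABLE_I[PATH[i + 1]])
         for i in range(len(PATH) - 1)}

longest = max(links, key=links.get)   # the link broken in Step 2
```

The longest link comes out as R6-R18, with length about 1.88, matching the 1.8788 entry of Table II; breaking it separates the setosa records from the versicolor records.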
Step 2: The only long link in the short spanning path is the link between R6 and R18. After this link is broken, we obtain two clusters, as shown in Table II. The first cluster contains all of the Iris setosa and the second cluster contains all of the Iris versicolor.

Step 3: The nearest neighbor of R is R1. R1 belongs to C1.

Step 4: Let us assume that k = 3. As shown in Table II, the three neighbors of R1 on the short spanning path are R5, R1, and R8. The values of X1 for R5, R1, and R8 are 5.0, 5.1, and 5.0, respectively.

Step 5: The estimated value of X1 for R is therefore (1/3)(5.0 + 5.1 + 5.0) = 5.03.

The real value of the sepal length of R is 5.4. The error caused by our estimation is

(5.4 - 5.03)/5.4 = 0.068 = 6.8 percent.

The reader may ask an important question: how can we determine the value of k? This is a problem which has for a long time puzzled pattern-recognition researchers using the nearest-neighbor searching technique (Duda and Hart [2] and Meisel [9]). Fortunately, this problem has been solved. Dudani [3] showed that the value of k is not critical if each distance is given a weight. Essentially, according to Dudani's result, if a record is quite different from our record with missing data, it should not be considered very important. On the other hand, all of the records similar to our record with missing data should be given close examination. For more detailed information, the reader should consult Dudani [3].
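The procedure above can be sketched in a few lines. This is our own illustrative code, not the authors' program: the spanning path is copied from Table II rather than generated, the helper names are invented, and, following the worked example, the record Q itself is counted among its k neighbors along the path.

```python
import math

# Table I, keyed by record name (our encoding of the paper's table).
TABLE_I = {
    "R1": (5.1, 3.5, 1.4, 0.2), "R2": (4.9, 3.0, 1.4, 0.2),
    "R3": (4.7, 3.2, 1.3, 0.2), "R4": (4.6, 3.1, 1.5, 0.2),
    "R5": (5.0, 3.6, 1.4, 0.2), "R6": (5.4, 3.0, 1.7, 0.4),
    "R7": (4.6, 3.4, 1.4, 0.3), "R8": (5.0, 3.4, 1.5, 0.2),
    "R9": (4.4, 2.9, 1.4, 0.2), "R10": (4.9, 3.1, 1.5, 0.1),
    "R11": (7.0, 3.2, 4.7, 1.4), "R12": (6.4, 3.2, 4.5, 1.5),
    "R13": (6.9, 3.1, 4.9, 1.5), "R14": (5.5, 2.3, 4.0, 1.3),
    "R15": (6.5, 2.8, 4.6, 1.5), "R16": (5.7, 2.8, 4.5, 1.3),
    "R17": (6.3, 3.3, 4.7, 1.6), "R18": (4.9, 2.4, 3.3, 1.0),
    "R19": (6.6, 2.9, 4.6, 1.3), "R20": (5.2, 2.7, 3.9, 1.4),
}

# The short spanning path of Table II, taken as given (Step 1).
PATH = ["R5", "R1", "R8", "R7", "R3", "R4", "R9", "R2", "R10", "R6",
        "R18", "R20", "R14", "R16", "R15", "R19", "R12", "R17", "R11", "R13"]

def modified_distance(r, q):
    """Modified Euclidean distance: keys missing in r (None) are ignored."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, q) if a is not None))

def estimate(r, j, k=3):
    """Estimate the missing jth element of r (Steps 2-5 of Section V)."""
    # Step 2: break the longest link of the path, giving two clusters.
    links = [modified_distance(TABLE_I[PATH[i]], TABLE_I[PATH[i + 1]])
             for i in range(len(PATH) - 1)]
    cut = links.index(max(links))
    clusters = [PATH[:cut + 1], PATH[cut + 1:]]
    # Step 3: nearest neighbor Q of r among all records.
    q = min(PATH, key=lambda name: modified_distance(r, TABLE_I[name]))
    cluster = next(c for c in clusters if q in c)
    # Step 4: the k records closest to Q along the path, kept to Q's cluster.
    pos = PATH.index(q)
    by_path = sorted(PATH, key=lambda name: abs(PATH.index(name) - pos))
    picks = [name for name in by_path if name in cluster][:k]
    # Step 5 (numerical case): average the jth elements of the picks.
    return sum(TABLE_I[name][j] for name in picks) / len(picks)

r = (None, 3.7, 1.5, 0.1)   # the testing record with sepal length masked
estimate(r, 0)              # ≈ 5.03, as in the worked example
```

One small caveat: with the modified distance, the nearest record to R is actually R5 rather than R1; since R5 sits right next to R1 at the head of the path, the three picks are still R5, R1, and R8, and the estimate is (5.0 + 5.1 + 5.0)/3 ≈ 5.03 either way.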
VI. CLUSTERING ANALYSIS AND THE
DATA-INTEGRITY PROBLEM
In Section V, when estimating the value of a piece of missing
data, we divided samples into homogeneous groups. This
problem of dividing records into homogeneous groups is called
the clustering-analysis problem (Hartigan [7]). The short spanning-path technique can also be used as a clustering-analysis technique, as shown in Section IV. In Section IV, three clusters were generated. Cluster c3 is an unusual cluster because it contains only one record. We shall call such a cluster a singleton cluster. The record in a singleton cluster is usually quite different from the others and is therefore possibly in error. Of course, if a record is clustered into a singleton cluster, it does not necessarily mean that this record contains an error.
That the clustering-analysis technique can be used to improve
the integrity of data was accidentally discovered by the two
senior authors of this paper, Lee and Slagle. In the spring of
1975, they were working on a set of personnel records and
noted a very special record which was consistently put into a
singleton cluster, a cluster containing only one record. After
examining this record carefully, they found that almost everything in the record was normal except one peculiar item: the
increment of salary from 1963 to 1967. Within five years,
this gentleman was said to have quadrupled his salary. This is
why the cluster analysis put this record into a singleton cluster.
Finally, they looked into other sources and discovered that in
1963, the person involved was making 15 000 dollars a year,
not 5000 dollars a year, as the record showed.
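The singleton-cluster check can be sketched as follows. The code is ours: it takes a spanning path as given (here the path the paper reports in Table III, for the Table I data with the petal length of R1 mis-set to 3.0), breaks the longest links, and flags singleton clusters as possible, though not certain, errors.

```python
import math

# Table I with the deliberate error: x3 of R1 set to 3.0 instead of 1.4.
DATA = {
    "R1": (5.1, 3.5, 3.0, 0.2), "R2": (4.9, 3.0, 1.4, 0.2),
    "R3": (4.7, 3.2, 1.3, 0.2), "R4": (4.6, 3.1, 1.5, 0.2),
    "R5": (5.0, 3.6, 1.4, 0.2), "R6": (5.4, 3.0, 1.7, 0.4),
    "R7": (4.6, 3.4, 1.4, 0.3), "R8": (5.0, 3.4, 1.5, 0.2),
    "R9": (4.4, 2.9, 1.4, 0.2), "R10": (4.9, 3.1, 1.5, 0.1),
    "R11": (7.0, 3.2, 4.7, 1.4), "R12": (6.4, 3.2, 4.5, 1.5),
    "R13": (6.9, 3.1, 4.9, 1.5), "R14": (5.5, 2.3, 4.0, 1.3),
    "R15": (6.5, 2.8, 4.6, 1.5), "R16": (5.7, 2.8, 4.5, 1.3),
    "R17": (6.3, 3.3, 4.7, 1.6), "R18": (4.9, 2.4, 3.3, 1.0),
    "R19": (6.6, 2.9, 4.6, 1.3), "R20": (5.2, 2.7, 3.9, 1.4),
}

# Short spanning path reported for these data (Table III), taken as given.
PATH = ["R13", "R11", "R17", "R12", "R19", "R15", "R16", "R14", "R20",
        "R18", "R1", "R6", "R5", "R8", "R10", "R2", "R3", "R9", "R4", "R7"]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def clusters_from_path(path, data, n_cuts):
    """Break the n_cuts longest links of the path into n_cuts + 1 clusters."""
    order = sorted(range(len(path) - 1), reverse=True,
                   key=lambda i: euclidean(data[path[i]], data[path[i + 1]]))
    cuts = sorted(order[:n_cuts])
    groups, prev = [], 0
    for c in cuts:
        groups.append(path[prev:c + 1])
        prev = c + 1
    groups.append(path[prev:])
    return groups

def singletons(groups):
    """Records in singleton clusters are possible (not certain) errors."""
    return [g[0] for g in groups if len(g) == 1]

groups = clusters_from_path(PATH, DATA, n_cuts=2)
singletons(groups)   # ['R1'] -- the corrupted record is singled out
```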
If the error in a record is not significant enough, it is possible that it will not cause the record to be significantly different from the others. In such a situation, we cannot expect our method to work. Often it is unimportant to detect small errors.

At this point, let us point out another aspect of the kind of errors which we have in mind. Consider Table I again. In Table I, we have 20 samples taken from the famous Iris data first discussed in Fisher [5]. The first ten samples are Iris setosa and the next ten samples are Iris versicolor. Let us now imagine that for record R1, the value of X3 is not recorded as 1.4, but rather as 8.0. Since, for the rest of the records, the value of X3 ranges from 1.3 to 4.9, this error can be considered an out-of-range error. An out-of-range error is not hard to detect, because it can be detected by examining each variable alone. In fact, this is a single-variable clustering-analysis problem, which was discussed thoroughly in Slagle et al. [12]. We are not interested in being able to detect this kind of error.

Suppose for R1 the value of X3 is not 1.4, but rather 3.0. This is not an out-of-range error, because 3.0 is well within the range of X3. In fact, if R1 were a record for an Iris versicolor, then X3 = 3.0 would not be an error at all. X3 = 3.0 is an error for R1 simply because it is incompatible with the other elements in the same record. That is, when an iris has sepal length equal to 5.1, sepal width equal to 3.5, and petal width equal to 0.2, it is quite unlikely that its petal length will be equal to 3.0. Putting it more formally, we shall say that the probability that X3 = 3.0, given X1 = 5.1, X2 = 3.5, and X4 = 0.2, is quite small.

Let us consider another example. Imagine that we have some personnel records and there is one person who holds a very high position and yet does not have a high school diploma. It is quite possible that this is an error.

In summary, the reader can see that the kind of error which we are interested in can be detected only by examining whole records, not by examining individual variables. This is why we have to use clustering analysis to solve this problem.

VII. EXPERIMENTAL RESULTS

Experiments were conducted to test the feasibility of our methods. All of the experiments were conducted by Lee and Mong at the National Tsing Hua University, Taiwan. Two sets of data were used in the experiments.

1) The Iris data. This is the famous data set first investigated by Fisher [5]. For more information, consult Fisher's original paper.

2) The warship data. This set of data was obtained from the book Weyer's Warships of the World 1969, compiled by Gerhard Albrecht of West Germany. The English translation edition was published by the U.S. Naval Institute. For every ship used in the experiment, we used eight of the eighteen variables to characterize it. These eight variables are

X1 = displacement
X2 = speed
X3 = endurance
X4 = horse power
X5 = length
X6 = width of the beam
X7 = draft
X8 = number of crew.

Experiment 1 (Iris Data, Missing-Information Problem): In this experiment, we used the first 40 samples from each kind of Iris as the data without missing information. There are thus a total of 120 samples without missing information. We used the last ten samples from each kind of Iris as the testing samples. We then masked one of the variables for each testing sample according to the following rule: the first variable for the first sample, the second variable for the second sample, and so on. For each estimated missing value, we calculated the percentage error rate. The result is as follows:

k (the number of neighbors used to predict):   3     5     7     9     11
Error (percent):                               8.8   8.3   9.7   10.9  9.9

Experiment 2 (Iris Data, Data-Integrity Problem): In this experiment, we used the data in Table I. The petal length of the first record R1 was deliberately set to be 3.0, instead of 1.4. The purpose of the experiment was to test whether our method would detect this error.

A short spanning path for the data in Table I is now shown in Table III.

TABLE III
(Records on the short spanning path, with the distance between consecutive records on the path)

Cluster  Record  x1    x2    x3    x4    Distance
C1       R13     6.9   3.1   4.9   1.5   0.0000
C1       R11     7.0   3.2   4.7   1.4   0.2645
C1       R17     6.3   3.3   4.7   1.6   0.7348
C1       R12     6.4   3.2   4.5   1.5   0.2645
C1       R19     6.6   2.9   4.6   1.3   0.4242
C1       R15     6.5   2.8   4.6   1.5   0.2449
C1       R16     5.7   2.8   4.5   1.3   0.8306
C1       R14     5.5   2.3   4.0   1.3   0.7348
C1       R20     5.2   2.7   3.9   1.4   0.5196
C1       R18     4.9   2.4   3.3   1.0   0.8366
C2       R1      5.1   3.5   3.0   0.2   1.4071
C3       R6      5.4   3.0   1.7   0.4   1.4387
C3       R5      5.0   3.6   1.4   0.2   0.8062
C3       R8      5.0   3.4   1.5   0.2   0.2236
C3       R10     4.9   3.1   1.5   0.1   0.3316
C3       R2      4.9   3.0   1.4   0.2   0.1732
C3       R3      4.7   3.2   1.3   0.2   0.2999
C3       R9      4.4   2.9   1.4   0.2   0.4358
C3       R4      4.6   3.1   1.5   0.2   0.3000
C3       R7      4.6   3.4   1.4   0.3   0.3316

The reader can see that there are two long links
in this short spanning path: the link between R18 and R1 and the link between R1 and R6. These two long links ought to be broken and, after they are broken, three clusters will be obtained:

C1 = (R13, R11, R17, R12, R19, R15, R16, R14, R20, R18)
C2 = (R1)
C3 = (R6, R5, R8, R10, R2, R3, R9, R4, R7).

Thus, R1 was singled out as something unusual because it was put into a singleton cluster. We may say that the error was successfully detected.

Again, it should be emphasized that X3 = 3.0 is not an out-of-range error. The difficulty of detecting this kind of error should therefore be appreciated.

Experiment 3 (Warship Data, Data-Integrity Problem): In this experiment, we selected fifty warships, including ten cruisers, ten destroyers, ten frigates, ten submarines, and ten mine sweepers. In Fig. 2, we show the resulting short spanning path constructed from these warships.

[Fig. 2. The short spanning path and clusters for the fifty warships. Among the clusters are two nuclear destroyers, a nuclear cruiser (C2), a destroyer (C3), short-range submarines (C7), low-speed, long-range submarines (C8), high-speed, long-range submarines (C9), a submarine flagged as an error (C10), destroyers (C11), cruisers (C12), a cruiser with a very wide beam (C13), mine sweepers, and large and small frigates.]

The power of the clustering technique is now clearly demonstrated. For instance, the analysis singled out all of the nuclear powered ships, separated the frigates into large frigates and small frigates, and, among all of the long-range submarines, further split them into low-speed ones and high-speed ones. There are four singleton clusters which caught our attention. They are C2, C3, C10, and C13. C2 contains a nuclear powered cruiser, and it is the only such cruiser. C13 contains a cruiser with a very wide beam. Both C3 and C10 were caused by errors. Let us now describe these errors in more detail.

1) C10 contains a submarine with a rather short endurance, only 11 200 sea miles. A submarine may have a short range. It just happens that this one has a very large displacement, the highest speed, the longest length, and a relatively deep draft. All of these indicate that this submarine is unique. Checking into this matter more carefully, we discovered that there was an error in the data. The endurance should be 112 000 sea miles instead of 11 200 sea miles. This error was caused by a keypunch error which had never been detected before.

2) C3 contains a destroyer. This is a destroyer belonging to the 9 DG (guided missile destroyers) group mentioned on pp. 140-141 of Weyer's Warships of the World 1969. This group of ships includes Reeves, Halsy, England, and so on. Checking into these data more carefully, we discovered that these destroyers are listed with shallow drafts, only 9.2 ft. This is obviously a mistake. The Naval experts we consulted told us that it should be 19.2 ft instead of 9.2 ft. Thus an error in this book was discovered.

Experiment 4 (Warship Data, Missing-Data Problem): After conducting Experiment 3, the warship data were "cleaned." We discarded all of the nuclear powered warships and replaced them with conventional ones. Both mistakes were corrected. Forty-five warships were used as records without missing information, and five warships (one from each kind) were used as testing records. For each testing record, one variable after another was masked. That is, we produced 8 × 5 = 40 testing records (there were 8 variables for each warship). The result is as follows:

k:       3      5
Error:   6.7%   7.2%
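Experiments 1 and 4 report error rates for several values of k; as noted in Section V, the distance-weighted rule of Dudani [3] makes the exact choice of k much less critical. The sketch below is our reading of Dudani's 1976 weighting, not code from the paper: a neighbor at distance d_i receives weight (d_max - d_i)/(d_max - d_min), so the farthest of the k neighbors contributes nothing.

```python
def dudani_estimate(neighbors):
    """Distance-weighted k-NN estimate for a numerical key.

    `neighbors` is a list of (value, distance) pairs for the k nearest
    records. Weighting follows Dudani's rule: w_i = (d_max - d_i) /
    (d_max - d_min), with equal weights when all distances coincide.
    """
    ds = [d for _, d in neighbors]
    d_min, d_max = min(ds), max(ds)
    if d_max == d_min:                     # all neighbors equally far
        return sum(v for v, _ in neighbors) / len(neighbors)
    weights = [(d_max - d) / (d_max - d_min) for d in ds]
    return (sum(w * v for w, (v, _) in zip(weights, neighbors))
            / sum(weights))

# A distant neighbor (value 9.0 at distance 1.0) gets zero weight, so the
# estimate stays between the two nearby values 5.0 and 5.1.
dudani_estimate([(5.0, 0.1), (5.1, 0.2), (9.0, 1.0)])
```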
VIII. CONCLUSIONS AND FURTHER RESEARCH

In this paper, we have presented an algorithm to estimate missing data and an algorithm to detect possible errors. Experimental results showed that our approach is feasible. There are many other clustering-analysis methods, and we do not claim that our clustering method or our distance function is the best. It is our experience that many reasonable clustering methods would work.

It is obvious that our method would be practical only if efficient clustering-analysis methods for very large data bases are available. Some progress has already been made along this line. The reader may consult, for instance, Chang [1] and Lee and Dang [8].

We would like to point out that this kind of auditing is not going to replace ordinary auditing done by human beings. However, it will be a good aid to help auditors of records carry out their duties. At the National Tsing Hua University, we have been encouraging people to use our system to detect possible errors. In fact, many errors have since been detected through the use of this system, and the interest and confidence of our computer users towards the computer have been significantly increased.

REFERENCES

[1] C. L. Chang, "Finding prototypes for nearest neighbor classifiers," IEEE Trans. Comput., vol. C-23, pp. 1179-1184, Nov. 1974.
[2] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley-Interscience, 1973.
[3] S. A. Dudani, "The distance-weighted k-nearest-neighbor rule," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, pp. 325-327, Apr. 1976.
[4] I. P. Fellegi and D. Holt, "A systematic approach to automatic editing and imputation," J. Amer. Stat. Ass., pp. 17-35, Mar. 1976.
[5] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugen., pt. II, pp. 179-188, 1936.
[6] R. J. Freund and H. O. Hartley, "A procedure for automatic data editing," J. Amer. Stat. Ass., pp. 341-352, June 1967.
[7] J. Hartigan, Clustering Analysis. New York: McGraw-Hill, 1975.
[8] R. C. T. Lee and T. T. Dang, "Clustering with merging for large input data," in Proc. National Computer Symp., Republic of China, Taipei, Taiwan, 1976, pp. 4-1-4-18.
[9] W. S. Meisel, Computer-Oriented Approaches to Pattern Recognition. New York: Academic, 1972.
[10] J. I. Naus, T. G. Johnson, and R. Montalvo, "A probabilistic model for identifying errors and data editing," J. Amer. Stat. Ass., pp. 343-350, Dec. 1972.
[11] C. W. Skinner, "A heuristic approach to inductive inference in fact retrieval systems," Commun. Ass. Comput. Mach., vol. 17, pp. 707-712, Dec. 1974.
[12] J. R. Slagle, C. L. Chang, and R. C. T. Lee, "Experiments with some cluster analysis algorithms," Pattern Recognition, vol. 6, pp. 181-187, 1974.
[13] J. R. Slagle, C. L. Chang, and S. Heller, "A clustering and data-reorganizing algorithm," IEEE Trans. Syst., Man, Cybern., vol. SMC-5, pp. 125-128, Jan. 1975.

R. C. T. Lee (A'74-M'75) was born in Shanghai, the Republic of China, in 1939. He received the B.S.E.E. degree from the National Taiwan University, Taipei, Taiwan, in 1961, and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1963 and 1967, respectively.
He has been with the National Cash Register Corp., Hawthorn, CA, the National Institutes of Health, Bethesda, MD, and the Naval Research Laboratory, Washington, DC. In 1975, he joined the National Tsing Hua University, Taiwan, where, from 1975 to 1977, he was the Director of the Institute of Applied Mathematics; he is now the Director of the Institute of Computer and Decision Sciences. His interests are in mechanical theorem proving, pattern recognition, clustering analysis, data-base design, and the application of computers to management. He has published more than 20 papers and is a coauthor of the book Symbolic Logic and Mechanical Theorem Proving.
James R. Slagle received the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge.
He has written numerous articles on artificial intelligence, and his book, Artificial Intelligence (McGraw-Hill), was published in 1971. He is currently Head of the Computer Science Laboratory, Communications Sciences Division of the Naval Research Laboratory, Washington, DC, where his work involves automatic pattern recognition, automatic clustering, and speech analysis.
Dr. Slagle was presented with the award for Outstanding Blind Student of 1959 by President Eisenhower, and was selected as one of the Ten Outstanding Young Men of America by the United States Jaycees in 1969.

C. T. Mong received the B.S. degree in mathematics from the National Cheng Kung University, Taiwan, in 1974, and the M.S. degree from the Institute of Applied Mathematics (Computer Science Section) of the National Tsing Hua University, Taiwan, in 1976.
His interests are essentially in computer data-base design.