Distortions of Taxonomic Structure from Incomplete

Journal of General Microbiology (1983), 129, 1045-1073.
Printed in Great Britain
1045
Distortions of Taxonomic Structure from Incomplete Data on a Restricted
Set of Reference Strains
By P . H . A . S N E A T H
Department of Microbiology, Leicester University, Leicester LEI 7RH, U.K .
(Received I6 June 1982; revised 27 August 1982)
The paper examines how well taxonomic relationships can be estimated when the data are
restricted to the similarities between each of the strains and a small subset of reference strains.
Such data represent a strip from the similarity matrix rather than the complete matrix. The
methods studied were: (a)minimum spanning trees, (b)the definition of one group at a time, and
(c) the calculation of ‘derived matrices’. A derived matrix is a complete matrix obtained solely
from the entries of the incomplete matrix, by treating these as quantitative character states. The
data used were taxonomic distances based on morphological, biochemical and physiological
results, and were selected from a previous study to provide good examples of salient patterns of
taxonomic relationship. The results that were most similar to those from the complete data were
given by derived matrices. Surprisingly little taxonomic distortion occurred, even if the
reference strains were rather few, provided these were suitably chosen. Reference strains should
be well dispersed, because distortion was considerable if all were very similar to one another.
Ideally there should be a reference strain from each cluster, and aids to ensuring this are
discussed.
The method has considerable potential for serological or nucleic acid pairing studies in which
it is usually impracticable to obtain complete data on numerous strains.
INTRODUCTION
Much taxonomic work is based on nucleic acid pairing or serology. With such methods it is
generally impracticable to obtain similarity values between every pair of strains if the strains are
numerous. Instead, it is usual to measure the similarity of the entire set of strains to a restricted
set of reference strains for which nucleic acid preparations or antisera are available. The worker
then.draws taxonomic conclusions from this limited array of similarity values. Many examples
of such studies might be cited, but those of Gasser & Gasser (1971), London et al. (1975), Krych
et al. (1980) and Hebert et al. (1980) are typical.
The question then arises, how accurate are these conclusions from limited data? This is
seldom considered, though Cristofolini (1980) has discussed it briefly. The present paper is an
attempt to assess the best way to use such incomplete data, and to investigate the safe limits of
conclusions drawn from them. The example is from a phenetic study that illustrates various
common taxonomic situations. It was not possible to find a serological study in the literature that
adequately illustrated these, but the findings apply a forteriori to serology and similar work.
Similarity values between all pairs of operational taxonomic units (OTUs) can be arranged as
a square matrix of order t x t where there are t entities to be classified. In numerical taxonomy
the matrix is usually symmetrical, so that only the lower left triangle is shown, but with studies
on serology and nucleic acids this is not so. It is therefore convenient to consider all similarity
matrices as square. With serology and nucleic acid studies the taxonomic process, therefore,
does not start from an n x t table of n characters versus t OTUs, as in most numerical
Ahhreciation : OTU, Operational taxonomic unit.
Downloaded from www.microbiologyresearch.org by
0022- 1287/83/0001-0642 $02.00 0 1983 SGM
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1046
P. H . A . S N E A T H
taxonomies, from which a similarity matrix is calculated. Instead, it starts from the similarity
matrix itself, wherein the serological or nucleic acid values are the analogues of computed
similarity coefficients. The distinction is important and will be referred to repeatedly.
If the data consist of the similarities to a restricted set of reference OTUs, the matrix contains
rows or columns that correspond to the reference OTUs. The problem can then be viewed as how
one may best obtain from the incomplete matrix a taxonomic structure that is as close as possible
to the structure obtainable from the complete matrix.
Three strategies have been commonly used: (a)to define one taxonomic group at a time; (b)to
construct a network of closest neighbours; and (c) to derive a new matrix from the incomplete
similarities and analyse this. These are not the only possible strategies (see Discussion).
In defining one group at a time the worker must choose a reference OTU from which to start.
He must then choose a critical level of similarity which he considers will define an appropriate
taxonomic category such as a species. All OTUs more similar to the reference OTU than this are
allocated to the first species. A second reference OTU is then chosen from the remaining OTUs,
and the process is repeated. This is continued until all, or almost all, OTUs have been grouped
into species. It is convenient to think in terms of dissimilarities, and this implies the choice of a
particular radius about each reference OTU which will circumscribe the members of only one
species. It is clear that the choice of the correct radius will be critical to success. The concept of
defining one group at a time was employed in the algorithm of Rogers & Tanimoto (1960); a
bacterial application is that of Silvestri et al. (1962), although the method did not consider
incomplete matrices.
The commonest network method is to construct a minimum spanning tree. This is made by
joining only the closest pairs of OTUs until all OTUs are linked. The method is closely related to
Single Linkage Clustering (Gower & Ross, 1969) and has been occasionally used in
microbiology, by Kurylowicz et al. (1979, for example, and various modifications have been
used (e.g. Gasser & Gasser, 1971 ;London et al., 1975) when the similarity matrix is incomplete.
Derived matrices were first developed in serology (Lee, 1968; Basford et al., 1968). Some
mathematical aspects are discussed by Sibson (1972). The observed serological cross-reactions
were treated as though they were themselves character states of a complex character that was
determined by a particular reference antiserum. The concept has been discussed by Cristofolini
(1980): it implies that the cross-reactions can be regarded as scores on a serological dimension
defined by the antiserum. The way in which a complete matrix is derived is illustrated later but,
once obtained, the taxonomic structure can be analysed by usual methods such as clustering or
ordination. The process of deriving the matrix introduces distortion. In a preliminary study
(Sneath, 1980) it was noted that c reference OTUs define a (c - 1)-dimensional space, with an
extra dimension for added OTUs, but with the added factor of reflection into one half of the cspace.
It should be clear that incomplete similarity matrices do not permit reconstruction of all
details of the full configuration. Some detail is irretrievably lost. A geographical example will
illustrate this. If distances between North American cities are only measured to New York and
San Francisco, the distances for Winnipeg in the north and Dallas in the south are almost
identical : it will not be possible from this restricted information to say whether Winnipeg and
Dallas are close or distant, and different analytical methods will imply different assumptions.
Minimum spanning trees will imply they are distant. Derived matrices will imply they are close.
The method of constructing one group at a time may imply either, depending on the critical
radius adopted. Unlike geographical relationships (where the number of dimensions p is fixed,
and where therefore p
1 reference points will fix positions of all points), taxonomic
relationships are usually of indefinitely high dimensionality. There cannot be, therefore, a
completely satisfactory solution, but one may hope to find those methods that are best suited to
any particular situation.
+
METHODS
Data. The data are the taxonomic distances between 36 strains of Serratia and allied bacteria (Tables 1 and 2),
selected from the larger study of Grimont et al. (1977) so as to illustrate main taxonomic patterns. Eight strains
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1047
Taxonomy with incomplete data
Table 1 . Strains studied
Phenon* and species
A
B
C1
C2
Serratia marcescens
Serratia rubidaeat
Serratia liquefaciens
Serratia plymuthica
Enterobacter cloacae
Enterobacter aerogenes
Enterobacter agglomerans
Erwinia carotovora subsp. atroseptica
* Phenon
Strain nos in
present study
Respective nos in Grimont
et aI. (1977)
1-8
25-32
9-16
17-24
33
34
35
36
5 , 16, 293, 207, 60, 261, 57, 12.
22, 39, 37, 224, 231, 36, 513, 288.
9, 221, 319, 283, 239, 278, 507, 503.
238, 289, 392, 299, 325, 410, 329, 517.
C1 = NCTC lOOOS$
A1 = NCTC 10006$
E20 = NCTC 9381$
E29 = NCPPB549Q
letteres as in Grimont et al. (1977).
t Listed in Grimont et a!. (1977) and Sneath (1978) as S . marinorubra but according to the Approved
(Skerman et al., 1980) this is a synonym of S . rubidaea.
$ NCTC, National Collection of Type Cultures, London, U.K.
Q NCPPB, National Collection of Plant Pathogenic Bacteria, Harpenden, U.K.
Lists
were chosen from each of four species of Serratia, and four isolated strains (singletons) were added. The distances
the same strains are used as an
were obtained from SsMvalues (based on 244 binary characters) as d = J( 1 - SSM):
illustration in Sneath (1978). The matrix, sr0, of d values (Table 2 ) is the starting point for the subsequent
experiments and represents the reference configuration or ‘true relationships’ against which the other results are
judged.
Representation of taxonomic structure. UPGMA phenograms (Sneath & Sokal, 1973) and Principal Coordinate
ordinations (Cower, 1966)were selected as commonly used methods, together with shaded similarity matrices for
certain cases. Minimum spanning trees were prepared for the cases where there were only a few reference OTUs
by retaining in Table 2 only the rows and columns pertaining to those reference OTUs. The trees were prepared
from these incomplete matrices as follows. Table 3 shows four OTUs of which two are reference OTUs. The
closest pair is 3 and 4, joining at 0.08; this link is therefore made. The next closest pair is 1 and 2, at 0.10. The third
is 2 and 3 at 0.20. These three links join all the OTUs, so the tree is as follows, with lengths of the links in
parentheses: 1 (0.10), 2 (0.20), 3 (0.08), 4.
Numerical analysis by derived matrices. The mathematical conventions are based on those in Sneath (1980). The
starting matrix srO was read by a computer program that allowed the user to specify a given subset of c reference
OTUs. The program then computed the new similarities, S:, between all pairs of OTUs, using as the character
states only those elements of sr0 that related to the reference OTUs. The program permitted a choice of
coefficients for S;: the most frequently used was
where dr,,gand drkgare the distances between OTU j and OTU k,respectively, and an OTU r that is one of the c
reference OTUs. The subscript g refers to the number of iterations of the process of obtaining Sg. In most
experiments, g = 0, but in some there were repeated cycles of calculation.
An illustrative example is shown in Table 3, from which it will be seen that a complete similarity matrix can be
obtained. It may be noted that although the derived matrix is symmetrical, the matrix from which the selected
columns are available need not be, as in Table 3(a). The method is therefore available where the reaction of
antigen a with antiserum b is not the same as that of antigen b with antiserum a. There is no need to average the
corresponding heterologous values. One can operate either with selected rows or selected columns, and thus obtain
similarities between antisera or between antigens (a good example of each is given by Darbyshire et a / . ,1979). The
same principle is true for DNA pairing. If the starting matrix is symmetric, or is made so by averaging, the
resulting derived matrices are, of course, identical.
Other $ coefficients used were d,$,g+ 1, order to illustrate the behaviour with Simple Matching Coefficients,
S,, (which are the complements of squared distances), and correlation coefficients (calculated from the usual
formula over the c reference OTUs, but with c > 2). Correlation coefficients are cosines of angles, cosc), and to
convert them to a form suitable for Principal Coordinates analysis they were transformed to their semichords
The semichords were also used for UPGMA and minimal spanning trees.
J[f(l - COS~)].
Measures of clustering and dimensionality. The tightness of the clusters was measured by the sums of squared
distances of OTUs to the centroid of their own cluster, symbolized as SS,, etc. The four singletons were treated
here as a fifth cluster, and the sum of the five SS terms gave the within-cluster sum of squares, SSw.The ratio
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1048
P. H . A . S N E A T H
Table 2. Taxonomic distances, d, between strains (matrix srO)
OTU
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
0.000
.145 0.000
.158
-205 0.000
.170
.202
.195 0.000
.286
.321
.297
.333 0.000
.272
- 2 9 3 .274
.307
.239 0.000
.257
.281
.290 .308
.202
.182 0.000
.316
- 1 1 8 .318
.321
.295
.257
.259 0.000
.446
- 4 5 1 ,455
,456
.460
.460
- 4 5 3 .432 0.000
.432
.184 0.000
.445
.454
.438
- 4 4 0 - 4 5 5 .454
.458
.477
.465
,452
,207
.214 0.000
.460
- 4 5 6 .456
.469
.467
.449
- 4 4 5 .454
.475
.463
.454
.456
.432
.201
.232
.224 0.000
.455
.446
.243
.232 0.000
.452
.447
.465
.477
.439
.438
.241
.237
.455
.471
.448
,259
.281
.245 0.000
.281
.482
.477
.477
.487
.491
.245
.214
.241
.274
.192 0.000
.497
.471
.474
.465
.272
.468
.464
.485
.235
.472
.354
.346
.377
.324
.359 0.000
.526
.510
.505
.496
.341
.491
.495
.506
.339
.495
.409
-427 .404
.505
.492
.516
.386
.391
.422 0.000
.401
.513
.402
.531
.533
.526
.530
.439
.456
.430
.495
.507
.520
.488
.497
.429
.438
.457
.322 0.000
.432
.503
.499
.438
.507
.442
.420
.487
.SO8
.4 17 .401 .424
.422
.428
.438
.265
.241
.506
- 5 19 .5 1 1 .5 19 .5 14 .490
.404
.407
.447
.341
-232
.472
.471
.483
.415
.432
.420
.498
- 3 9 7 .407
.482
- 4 8 7 .486
.486
.412
.411
.452
.346
.249
.506
- 4 9 8 .496
.508
.406
.432
.423
- 3 9 2 .404
.519
.511
.519
.514
.526
.514
.544
.432
- 4 5 2 .440
- 4 5 6 - 3 8 6 .316
.448
.460 .482
.442
.546
.534
.535
.536
.523
.481
.504
.463
.474
.550
.539
.542
.514
.412
.475
.483
.493
.392
.349
.549
.547
.551
.555
.502
.492
.480
.401
.363
.498
.504
.498
.548
.553
.485
.496
.533
- 5 5 5 ,548
.535
.533
.545
.524
.514
.511
.519
.508
.522
.508
.546
.517
.522
.563
.548
.548
. 5 4 1 .530
.538
.522
.534
.524
.516
.517
.510
.497
.541
.541
.540
.535
.534
.557
.531
.523
.509
.547
.518
.539
,532
.527
.517
.519
.513
-503
.516
.540 .511
.529
.532
.550
.536
.535
.534
,524
.517
.533
.525
.532
.558
.543
.543
.517
.526
-513 -503
.537
- 5 4 0 .534
- 5 4 1 .533
525 .523 - 5 4 0 - 5 2 0 .527
.524
- 5 6 0 - 5 3 2 - 5 3 9 .530
-531 .525
.513
.538
.541
- 5 4 6 .539 .544
.563 - 5 5 6 - 5 5 5 - 5 5 4 .546
- 5 2 0 .548
.527
.525
.552
- 5 5 6 .549
.554
.566
.558
.558
. 5 5 1 .540
- 5 2 7 .517
.530
.505
.495
.537
.546
.523
.531
.522
.539
- 5 3 4 .521
.530
.538
- 5 4 1 .542
.540
.574
.552
.551
.550
.537
.544
.570
- 5 7 4 - 5 6 7 .573
.583
.576
.576
.575
.535
.559
.546
.558
.544
.525
.528
.531
.539
.549
.539
.513
.531
.545
.589
.6 10 .609
.606
.531
.607
.61 1 - 6 0 6 .627
.519
- 5 2 9 .531
.541
.516
.580
.576
.559
.564
.574
.603
- 6 1 3 .587
.607
.582 .589
.598
.510
.589
.583
.571
.580
.589
.592
.564
.594
- 5 6 0 .595
.602
.608
.619
.611
.576
.627
.630 .628
.645
.559
.564
.595
.I51
.760 - 7 5 0 .766
.I27
.752
.746
,750
.657
.666
.657
.646
.628
.643
.657
.653
.669
.672
-
R = SS,/SST (where SS, is the total sum of squares, and also the sum of eigenvalues in the Principal Coordinates
analysis) is a measure of the aggregate tightness of clustering. If R is small the clusters are tight compared to the
distances between them (Table 4).
Principal Coordinates analysis is possible with data that show some negative eigenvalues, although Gower
(1978) has emphasized the need to watch this aspect critically. In the present study all the derived matrices using d
and semichord r were Euclidean distances, so there were no negative eigenvalues larger than expected from
rounding errors during computation. With d 2 , however, substantial negative eigenvalues occurred, and this is
referred to later.
The formulae for ‘effective dimensionality’, n’, are simplified from Sneath (1980).
n’
=
1/Cpi2
1
wherep, is I,/ I!,and the I , values are the t non-negative eigenvalues of the Principal Coordinates analysis. When
the effective dimensionality m‘ of a three-dimensional ordination is calculated, the summations are over only
eigenvalues 1 to 3.
The concept of effective dimensionality is best appreciated from an ordination such as Fig. 9, in which the
configuration can be seen to be almost one-dimensional, although three dimensions are shown. The quantity n’,
however, can be obtained without an ordination from the n x n matrix Wof sums of squares and cross-products
for quantitative or binary characters after subtracting from each character state the mean for that character. This
matrix corresponds to the unrotated configuration in n-space. If A is the diagonal matrix of eigenvalues of W , then
. . ., and their sum is the trace of A*. Further, the
the eigenvalues of W-are the squares of those of W , i.e. 1,2 ,
trace of W (because W is real symmetric) equals the sum of the squares of the elements w of W , which in turn
equals tr(A2). Therefore n’ can be obtained as [tr ( W)12/ w2. The proof is derived from Searle (1966, pp. 177178).
Choice of reference OTUs. The sets of reference OTUs which were chosen as of most interest, were as follows.
Set a. All 36 strains. The derived matrix sral was operated on a second and a third time to give derived matrices
sra2 and sra3. The use of all OTUs causes an even effect referred to as a uniform transformation. When, instead of
taxonomic distances, correlations followed by semichords and squared taxonomic distances were used, derived
matrices srhl and srml, respectively, were produced.
Set h. Strains 7, 12,24 and 28, one from each species of Serratia. Within the species the strains were chosen at
random. The derived matrix was srbl.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1
1049
Taxonomy with incomplete data
Table 2. (continued)
0.000
.288 0.000
.257
.214 0.000
.335
.281
.308 0.000
.359
.336
- 3 1 6 .377 0.000
.374
- 3 8 5 -386
.370 0.000
.369
- 4 9 5 .465
.482
.520
.474
.517 0.000
.481
.479
.508
.487
.466
.505
.145 0.000
.490
.449
.468
.506
.460
.512
.192
.170 0.000
.490
.449
.468
.514
- 4 6 9 - 5 1 2 - 1 9 2 .158
.158 0.000
-502
.469
.487
-516
.482
.521
.202
.158
.170
.192 0.000
-481 .449
.202
.230 0.000
.468
.481
.460
-520
.241
.232
.202
.511
- 1 9 2 .241
.249 0.000
.469
- 4 9 8 .526
.496
.537
.272
.241
.212
.514
.471
.497
.480
.472
- 5 2 3 - 3 4 6 .316
- 3 2 9 .341
.354
- 3 4 6 .34% 0.000
.544
.555
.508
.540
.571
.522
.594
.585
- 5 7 5 .582
-583 -575
.587
.560 0.000
.577
.576
.585
.546
.530
- 5 3 4 .530
.527
.552
.452 0.000
.570
.568
.523
.535
.530
.573
.587
.543
-569
.598
.603
.592
.603
.588
.519
.569 0.000
.606
.529
.610
.597
.598
-701 - 7 0 3
.695
-711
.650
.532
- 6 5 5 .520 0.000
- 6 5 7 - 5 7 6 - 7 0 0 - 6 9 2 .695
.646
.595
- 6 4 9 .636
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Table 3. Example of calculation o j a derived matrix
(a) Complete matrix of immunological distances
Antiserum
Antigen
~
f
1
2
3
4
1
2
3
4
0
0.10
0.26
0.30
0.12
0
0.20
0.23
0-24
0.20
0
0.08
0.3 1
0.23
0.07
0
(h) Incomplete matrix from only two reference antisera, I and 3
Antiserum
Antigen
1
2
3
4
f
1
2
0
0.10
0.26
0-30
-
-
3
4
0.24
0.20
0
0-08
-
3
-
-
(c) Dericed matrix : complete matrix 01'relationships between antigens, obtained
by calculating taxonomic distances between rows in (b)
Antigen
Antigen
1
2
3
4
A
f
>
1
2
3
4
0
0.076
0.250
0.240
0.076
0
0.181
0.165
0.250
0.181
0
0.063
0.240
0.165
0463
0
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
35
36
1050
P. H . A . SNEATH
P
i
P
Fig. 1. Starting configuration: ordination. 0,
Phenon A, Serraria marcescens; 0 , phenon B, S .
ruhidara; A,phenon C1, S . liquejaciens;
phenon C2, S.plymuthica; 0,
singletons. From matrix sr0.
The baseplate is drawn at a height of -0.2 on axis 111. The figure is another representation of that in
Sneath (1978) from a different viewpoint.
a,
Set c. Strains 7, 8, 12 and 15, two from Serratia marcescens and two from Serratia liquefaciens. Strains 8 and 15
were chosen at random from their clusters to add to 7 and 12. The derived matrix was srcl.
Set d . Strains 33, 34, 35 and 36, the four singletons, giving matrix srdl.
Set e. Strains 2 , 4 , 7 and 8, all from Serratia marcescens. Strains 2 and 4 were randomly chosen to add to strains 7
and 8 used in set c. The derived matrix was srel. Correlations with semichords gave matrix srll.
Set f : Strain 2 from Serrutia marcescens gave the matrix srfl.
Set g . Strain 20 from Serratiuplymuthicu was chosen because it was the strain closest to the centroid of all OTUs,
and it gave matrix srgl.
RESULTS
Standards for comparison
The obvious standard for comparison is the starting configuration (Figs 1,2). Recovery of this
structure is the ideal. However, a uniform transformation (Fig. 3) is also acceptable and is easier
to compare with the figures produced by varying the choice of reference OTUs. Figure 3 is also
more appropriate when considering the Sums of Squares in Table 4. It should be noted that these
are directly relevant to the phenograms, not necessarily to the ordinations.
Starting configuration
Figures 1 and 2 show the starting configuration as, respectively, the three-dimensional
ordination and as the phenogram and minimum spanning tree. The four species clusters are well
separated. The closest are S . liquefaciens (phenon C1, solid triangles) and S. pfymuthicum
(phenon C2, open triangles). They are also less compact than S . marcescens (phenon A, open
circles) and S. rubidaea (phenon B, solid circles), and this can be also seen from Table 4 by their
greater Sums of Squares. The total Sum of Squares (representing the scatter of all the strains) is
4.27, and the sum within the clusters is 1.53, so that the measure of clumping, R,is about 0.36
(Table 4).
The clusters appear looser in Fig. 2 than in Fig. 1. This is because the phenogram and
minimum spanning tree use the distances in the full hyperspace, whereas the ordination uses
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
T SOT
0
I
1
I
===+-P
v
I
V
h P
:
r
a
S.0
I
I
I
h
I
t
l
I
1
tl
l l
I
a-a-a--.
0-v-v-0-v-v
ri
\
P
I
0
-
%
-0
Fig. 2. Starting configuration: UPGMA phenogram ( a ) and minimum spanning tree ( 6 ) . They are
Phenon A,
comparable to those in Sneath (1978) with transformation from d2 to d . From matrix s r 0 . 0,
Serratia marcescens; a, phenon B, S. rubidaea; A,phenon C1, S . Iiquefaciens; A,phenon C2, S .
plymuthica; 0,
singletons.
part of the variation in such a way as to accentuate the inter-cluster distances over the distances
within clusters. The UPGMA method gives a phenogram whose cross-bars are average
distances between members of groups. The minimum spanning tree joins OTUs that are closest
to one another, and thus the links between clusters make the clusters appear closer relative to the
appearance in the phenogram. Further, the minimum spanning tree is an unfolded
representation of a tree that is folded up within a hypercube whose sides measure 1. The angles
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
srO
1, 2, 14
sral
3
sra2
4
5
sra3
srbl
6
srcl
7
srdl
8
srel
9, 15
srfl
10
srhl 11
srll
12
srml 13
Matrix
Fig.
no.
4
36
36
36
36
4
4
4
4
1
36
0-831
0.138
0.035
0.01 2
0.158
0.24 1
0.1 11
0.037
0
1-904
2.253
0-011
0-420
0.052
0.0 10
0.003
0.08 1
0-017
0.047
0.004
0
0.856
0
0.00 1
56-7
77.4
86-9
89.4
87.9
97-5
88-5
99.5
100
69.2
100
106.77
0.238
1-8 x 10-2
1.5 x 10-3
1.8 x 10-4
1.8 x 10-2
4.1 x lo-’
1.1 x 10-3
8.3 x lop2
7.9 x 10-2
0.207
1.913
1.1 x 10-4
0.212
1.6 x lo-*
1.8 x 10-3
5.1 x 10-4
1-6 x lo-‘
8.5 x 10-4
9.8 x 10-4
1.3 x 10-3
1.2 x 10-3
0-170
0.124
1.1 x 10-4
Phenon B
0-254
1-8 x 10-2
1.7 x 10-3
2.9 x lo-*
1.8 x 10-2
3.8 x lo-*
9.3 x 10-4
2.7 x 10-3
2.0 x 10-3
0-259
0-462
1-2 x 10-4
Phenon CI
A
0-385
2-8 x lo-’
2.4 x 10-3
3.1 x 10-4
3.5 x 10-2
5.9 x 10-3
3.5 x 10-3
3-9 x 10-3
3.6 x 10-3
0.506
0.547
2.5 x lo-“
Phenon C2
Sums of Squares of:
0.445
3.5 x 10-2
7-1 x 10-3
2.9 x 10-3
1.1 x 10-2
1.3 x lo-’
0-229
1.6 x lo-’
1.5 x lo-’
0.60 1
0.178
9.3 x 10-4
4.273
0.632
0.145
0.047
0-668
0-691
0-409
0.796
0.910
7.972
8.652
0-030
Total
Sum of
Singletons Squares
$ Calculated from the positive eigenvalues only with p ,
=
AJZA (+ ve).
t If calculated from the positive eigenvalues only, this becomes 95% (see text).
0.359
0.182
0.100
0.089
0-147
0.143
0-576
0-134
0.111
0-219
0.373
0-051
h,
2-56
2.07*
2.53$
1*0*
4.96
2-07*
2.98$
1 .o*
2.77
2.74
2.47
2.14
2.57
2.00
2.59
1-19
7.50
4.42
3.17
2.58
3.18
2-10
3.17
1 -20
EG
+I
F
2:
*
?
Ratio R
Effective
of within
to total dimensionality
Sums of
n’
m‘
Squares
and rn’ are the same because all the variation in these cases can be expressed in 3 dimensions or less.
0.508
0.1 18
0.022
0.004
0.122
0.023
0-067
0.030
0
0.879
0.888
0.005
* The values for n’
1-084
0.233
0-069
0.026
0.307
0.410
0-184
0.725
0.9 10
2.735
5.51 1
0.016
Percentage
variation in
No. of
first three
reference
Eigenvalues, A, for
axes (for
OTUs
principal axes
comparison
where f = * - ,
with
f
I1
I11
IV ordinations) Phenon A
applicable 1
Table 4. Parameters of the clusters in the conjigurations corresponding to the figures listed
CL
0
m
1053
Taxonomy with incomplete data
P
A
h
A
r
-
4
A
A
0
A
I1
-
0
d
0.5
Fig. 3. Uniform transformation (first cycle) using all 36 OTUs as reference OTUs: d, ordination,
UPGMA phenogram (a) and minimum spanning tree (b).Conventions as in Figs 1 and 2. From matrix
sra 1.
between the links are arbitrary, and chosen for convenience of display (‘unfolded’ is here used in
the ordinary sense, not in that in the psychometric literature mentioned in the Discussion).
Certain strains stand out in all the representations. There is no markedly aberrant strain in
phenon A, but in phenon B strain 32 is atypical, and is seen to be so in the ordination,
phenogram and minimum spanning tree (where it is respectively the rightmost, the lowest, and
the highest of its cluster). In phenon C l strain 16 is atypical (respectively the highest, lowest and
rightmost in the three representations). In phenon C2 strain 24 is atypical (respectively the
highest, lowest and lowest). Of the singletons, strain 36 stands out as the most aberrant
(rightmost, lowest and lowest respectively). These strains are convenient markers of the amount
of within-cluster detail that is preserved in the later configurations.
Uniform transformation :jirst cycle
Figure 3 was obtained by employing all the OTUs as reference OTUs and calculating new
Euclidian distances by formula 1 with g = 0. It can be seen that this produces an almost uniform
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1054
P. H . A . S N E A T H
Fig. 4. Results of second cycle of uniform transformation with all 36 OTUs as reference OTUs: d,
ordination and UPGMA phenogram. Conventions as in Figs 1 and 2. From matrix sra2.
transformation of the starting configuration, in which no particular influence is exerted by one
OTU rather than another.
Comparison with Fig. 1 shows that the points have markedly condensed. The species clusters
are still well separated and C1 and C2 are still the closest. The OTUs that were furthest out in
Fig. 1 (particularly the singletons) have been relatively pulled inward. The outlying strains of the
clusters are still recognizable in Fig. 3 as being aberrant. The phenogram is similar to that in Fig.
2 except for the shrinkage, though the singletons are relatively closer and form a single group.
The minimal spanning tree is also similar to that in Fig. 2, though it appears to be somewhat
more linear in arrangement.
Although the total Sum of Squares has fallen from 4.27 to 0-63, the shrinkage has been
relatively greater within the clusters than between them, as is seen by the lower value of R =
0.18 (Table 4). The tightness of clustering has thus been somewhat increased.
Uniform transformation :repeated cycles
If the distances of the derived matrix used to prepare Fig. 3 are operated on for a second time
with formula 1 (g = l), with all 36 strains as reference OTUs, further shrinkage takes place (Fig.
4).The results of a third cycle are shown in Fig. 5. These figures are instructive in showing the
more extreme effects of the uniform transformation. The points shrink into a small region,
though the clusters remain separate, and indeed the tightness of clustering increases somewhat
(R in Table 4 decreases).
The singletons move round to fill the gap between phenons A and B (though they link most
closely to C1 and C2). The shrinkage becomes extreme, and this is emphasized by retaining the
same scale of d in Figs 1-5 (the minimum spanning tree becomes too cramped to illustrate in
Figs 4 and 5 ) . Nevertheless, the broad taxonomic structure is preserved, though detail is lost (in
Fig. 5 only the aberrant singleton 36 is still clearly recognizable by its extreme position).
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1055
Taxonomy with incomplete data
0.1
0
Fig. 5. Results of third cycle of uniform transformation with all 36 OTUs as reference OTUs: d,
ordination and UPGMA phenogram. Conventions as in Figs 1 and 2. From matrix sra3.
Eflect of diflerent reference OTUs on derived configurations
Figures 6-10 show the effect of different selections of a few reference OTUs (marked with
stars) on the derived configurations. The choice of reference OTUs was made to illustrate
situations that might well occur in practice.
One reference OTUfrorn each species. In Fig. 6 are shown the results of choosing one reference
strain from each of the four species. The configuration has shrunk about as much as with the
uniform transformation (Fig. 3), as shown by the Total Sums of Squares in Table 4. Each species
is still broadly distinct, but there are pronounced distortions. Each species cluster appears to
have expanded, but the ratio R (Table 4) of 0.147 shows that they are on the whole about as tight
relatively as in Fig. 3. The appearance of expansion is largely because the reference OTUs are
relatively far away from other members of their own cluster and in each case they are peripheral.
Each now appears to be an aberrant member of its own cluster; only strain 24 was aberrant in
the starting configuration. This effect is sufficient to lead to misclustering of two reference
OTUs in the phenogram. In contrast, the singletons have been compressed into a spurious
cluster.
A major effect, therefore, is that strains that are not reference OTUs have been moved toward
the centre. Some original detail is retained: strain 36 is still the most aberrant singleton
(rightmost) and strain 16 is the most aberrant and rightmost member of cluster C l .
The minimum spanning tree also shows each species cluster as distinct, but the singletons are
attached to two clusters. Three of the singletons might be mistaken as members of phenon C2.
The bridging strains could be allocated to the wrong cluster. The most aberrant strains in this
tree are not necessarily the most outlying in the original configuration, as this will depend on the
situation of the reference OTU. Thus the strain at 9 o’clock in species A (strain 4) is not
particularly aberrant in the original configuration.
Two reference OTUs from each of two species. Figure 7 shows the results of choosing two
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1056
P. H. A . SNEATH
0.2
I
I
I
~
0
Fig. 6 . Effect of choosing four reference OTUs (marked with stars), one from each species: d ,
ordination, UPGMA phenogram (a) and minimum spanning tree (b) constructed from only those
distances in the original configuration that relate to the reference OTUs. Conventions as in Figs 1 and 2.
From matrix srbl.
reference OTUs from species A and two from species C1. It can be seen that these two species
have expanded relative to Fig. 3, and this is not due simply to the peripheral position of the
reference OTUs. The Sums of Squares (Table 4) for A and C 1 are much larger than those for B or
C2, but the clusters on the average are about as tightly clustered (R = 0.14).
The reference OTUs appear as atypical members of their clusters, and the other members
tend to lie in a line pointing toward the centre of the configuration. Opposite them, the other two
clusters and the singletons are compressed together: this is also evident in the phenogram, in
which several misclusterings are seen. The misclustered strains include the more aberrant
members of clusters: thus the atypical strain 16 of phenon C1 appears within C2, and the
atypical strain 32 of phenon B has been lost to a spurious cluster of singletons. The most distant
singleton is still strain 36.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1057
Taxonomy with incomplete data
0
0.2
4-G:
I
1
1
0.
0
O
1
0
1
1
1
1
d
1
0.5
Fig. 7. Effect of choosing four reference OTUs, two from species A and two from species Cl : d ,
ordination, UPGMA phenogram (a)and minimum spanning tree (b)constructed as described in Fig. 6.
Conventions as in Figs 1, 2 and 6. From matrix srcl.
The minimum spanning tree shows confusion between clusters B and C2, with a few other
misplaced strains. The singletons are attached to all the clusters.
Four reference OTUs, all singletons. Figure 8 shows the great compression of all strains other
than reference OTUs, and R (Table 4) has increased, reflecting the poor distinctness of the
species clusters. The phenogram shows the same features, with some misclustering. As with Fig.
7, the strains that have strayed into wrong clusters are mainly atypical cluster members (strains
16, 24 and 32 are among them).
The minimum spanning tree shows only two main groups, with species A and B in one, and C1
and C2 in the other as well as two misclusterings.
Four reference OTUs, all from one species. Figure 9 shows the results of choosing all four
reference OTUs from species A. The configuration is somewhat less contracted than Figs 6-8
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1058
P. H . A. SNEATH
I
*
0.2
1
1
0
j_____j__l
.
,
,
.
,
G+
(b)
?
s ?
d
Fig. 8. Effect of choosing four reference OTUs, all singletons: d , ordination, UPGMA phenogram (a)
and minimum spanning tree (6) constructed as described in Fig. 6 . Conventions as in Figs 1 , 2 and 6 .
From matrix srdl.
(Total Sums of Squares = 0.80). The average tightness of clustering is good (R = 0.13), but this
is partly due to the wide separation of phenon A from the rest.
This figure and the Sum of Squares for A shows that species A has greatly expanded by
comparison with the uniform transformation (Fig. 3). The reference strains are again peripheral
and widely separated from one another. A striking feature is the arrangement of all strains that
do not belong to phenon A into almost a straight line leading away from the centre of phenon A.
There is some misclustering of the other groups in the phenogram, but in the minimal spanning
tree it is phenon A which shows most misclustering, or at least would be the most difficult
phenon to interpret correctly.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1059
Taxonomy with incomplete data
I
0.2
I
I
0
1
d
Fig. 9. Effect of choosing four reference OTUs, all from one cluster: d, ordination, UPGMA
phenogram (a) and minimum spanning tree (b) constructed as described in Fig. 6. Conventions as in
Figs 1, 2 and 6. From matrix srel.
Strain 32 is the rightmost member of phenon B in the ordination, and the lowest in the
phenogram; strain 36 is still the most outlying singleton; but little fine detail has been preserved.
As in earlier configurations the strains that are misclustered (e.g. 16 and 24) tend to be the
aberrant members of their own clusters.
A single reference OTU. The extreme case of choosing a single reference OTU is shown in Fig.
10. This case might almost be said to be indeterminate. Strain 2 was chosen as reference OTU
because it is the closest to the centroid of cluster A. The derived configuration is entirely linear,
with all the variation in the first principal axis (A, = Total Sum of Squares, Table 4). The
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1060
P. H. A . S N E A T H
0.2
0
w
0
d
0.5
Fig. 10. Effect of choosing a single reference OTU: d , ordination, UPGMA phenogram (a) and
minimum spanning tree (b)constructed as described in Fig. 6. Conventions as in Figs 1 , 2 and 6. From
matrix srfl.
configuration is less contracted than the earlier derived configurations, but the low R value is
largely due to the gap between phenon A and the rest.
Phenon A is relatively much expanded (as shown by its Sum of Squares in Table 4). The
reference OTU is at one end of the line of points and is very far out. The most typical member of
species A is therefore, both in the ordination and the phenogram, made to seem the least typical.
In contrast, the minimum spanning tree is a single fan, entirely different from the straight line of
the ordination. Though in the minimum spanning tree the reference OTU is made to seem the
most typical because it is in the centre, it should be noted that the same effect would be obtained
whatever strain was used, even if it were the most atypical of all in the study, i.e. strain 36. In all
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1061
Taxonomy with incomplete data
I
I
A I
A
,
Fig. 1 1. Uniform transformation using the correlation coefficient with all 36 OTUs as reference OTUs
followed by calculating semichords: ordination and UPGMA phenogram. Conventions as in Figs 1 and
2 except that the baseplate lies at - 0.4 on the third principal axis. From matrix srhl.
of the representations in Fig. 10 there is considerable misclustering or risk of confusion in
interpreting the structure.
Another study was made with the single OTU 20 from phenon C2, which is the closest to the
grand centroid of the original configuration. The results were similar to Fig. 10; the most typical
strain in the study appeared the least typical in the ordination and phenogram; the other clusters
were intermingled, and the singletons were at the end of the linear array opposite to the
reference OTU. Cluster C2 was now expanded, but there was almost no gap between members
of C2 and those of other clusters.
Derived matrices with other similarity coeficients
There are many possible similarity coefficients that could be used in deriving complete
matrices, but they fall broadly into distance measures and angular measures. The correlation
coefficient is the best known angular measure, and the experiments were repeated using this and
transforming the correlations to semichords as described in Methods. The semichords, though
they are Euclidean distances from the mathematical aspect, are approximately proportional to
the angles for moderate angles, and thus reflect fairly closely the geometric properties expected
from angular measures. In addition, some studies were made with the square of the distance in
formula 1, which is equivalent to the complement of the Simple Matching Coefficient. Only a
few examples are illustrated with figures. It should be noted that minimum spanning trees are
either not applicable or are simple transformations of other minimum spanning trees; they are
therefore not shown.
Correlation coeficient with all OTUs as reference OTUs. Figure 11 shows the equivalent of the
uniform transformation using the correlation coefficient and semichord in place of d . The space
is measured in different units from those derived with d, so one cannot say whether the
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1062
P . H . A. SNEATH
1.0
~
1
1
1
1
0.5
1
1
1
1
0
1
1
‘A
1
Fig. 12. Effect of four reference OTUs from one cluster using the correlation coefficient and
semichords: ordination and UPGMA phenogram. The baseplate lies at -0.4 on the third principal
axis. Other conventions as in Figs 1, 2 and 6. From matrix srll.
configuration has expanded or contracted, but the ratio R shows that the clustering is somewhat
tighter (R = 0.22 as against 0.36 for the original configuration) but about the same as the
uniform transformation with d (R = 0.1 8, see Table 4). The relative compactness of the species
is much the same; thus the Sums of Squares show that C2 is still the most disperse of the four.
Species C1 and C2 are the closest as before.
The taxonomic structure is well preserved, though the singletons now join species B before the
other species join. The main difference from Fig. 3 (which was derived using d ) is that the
clusters, and also the singletons, are almost equidistant from the grand centroid. Most strains lie
at a semichord distance of about 0.5; the coefficient of variation of the semichords is 0.12, which
is smaller than the values for the distances of the strains from the grand centroid in the original
configuration or the uniform transformation based on d (coefficients of variation 0.17 and 0.22,
respectively). The atypical strains 16 and 32 are seen as atypical in this configuration (the
rightmost of their clusters in the ordination, and the lowest in the phenogram).
Correlation coe8cient with four reference OTUs from one species. This analysis (Fig. 12) is
equivalent to Fig. 9 with change of coefficient. On comparing it with that figure it can be seen
that the ordination presents an entirely different appearance. In Fig. 12 there are three loose
clusters, comprised predominantly of species C1, C2 and B, but each also contains a reference
OTU from species A and either additional strains of species A or singletons. There is also one
isolated reference OTU. Species A is obviously much expanded as shown by a large Sum of
Squares. The phenogram shows the same main features. The appearance in Fig. 12 is very like
the minimum spanning tree in Fig. 9 : the species that provided the reference OTUs is disrupted
and mingled with other species. The fine detail of the original configuration is lost, and none of
the aberrant strains is readily recognizable as such.
The tendency for strains to be equidistant from the grand centroid (noted under Fig. 11) is
very obvious in Fig. 12: it is indeed exaggerated by the dispersed appearance of the ordination,
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Taxonomy with incomplete data
1063
1
0.05
1
1
1
1
1
0
1
A
A
A
A
A
A
A
A
-
7
A
A
A
A
A
-0
Fig. 13. Uniform transformation using& with all 36 OTUs: ordination and UPGMA phenograrn. The
baseplate lies at -0-04on the third principal axis. Other conventions as in Figs 1,2 and 6. From matrix
srm 1.
because the coefficient of variation, 0-15,is not very low. Again, the average semichord distance
from the centroid is close to 0.5, as in the last configuration.
Squared taxonomic distance with all OTUs as reference OTUs. Figure 13 shows the derived
configuration obtained by substituting d2 for d in formula 1 and is comparable with Fig. 3. It is
equivalent to using the Simple Matching Coefficient. Because of the change of metric the Sums
of Squares are not directly comparable with those for Fig. 3. However, if one calculates R from
the square roots of the values in Table 4 (which provides some compensation for the change in
metric) it is 0-452, indicating rather loose clustering as appears from Fig. 13.
In this configuration the shorter distances are proportionately smaller than in Fig. 3, and this
has the effect of making the clustering in the phenogram tighter in Fig. 13, relative to the length
of the stems that carry the clusters. The reason that the ordination shows looser clusters is
because the metric is a squared Euclidean distance : this gives negative (‘imaginary’) eigenvalues
and ‘over’ 100% of the variation is contained in the first three dimensions (Table 4). The
negative eigenvalues commenced with the 18th, but were small except for the 35th and 36th.
They raised the trace to 110.9%. The ordination is therefore not strictly comparable with that in
Fig. 3, and comparison of the ordination and phenogram in Fig. 13 will show many
discrepancies in detail which stem from this cause. The phenogram is the ‘better’ representation
in the sense of considering all the variation : it shows that the taxonomic structure is broadly well
preserved; atypical strains 16, 32 and 36 are the lowest of their clusters; detailed differences are
because with the UPGMA method the averages of squared and unsquared distances are not
always monotonic.
Other studies with correlations and squared distances. Other studies, corresponding to Figs. 6-9
but with correlations or squared distances, showed a few new points, which are briefly
summarized as follows.
With correlations the selection of a reference OTU from each cluster gave a configuration
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1064
P. H . A . SNEATH
d
0-0.1
rn
0.1-0.2
0.2-0.3
0.3-0.4
W
a
0.4-0.5
0.5-0-6
0.6-0-7
>0*7
0
0
O O O O O O O O A A A A A A A A A A A ~ ~ A A A @ @ @ ~ @ @ @ @ ~ o ~ u
Fig. 14. Shaded similarity matrix of the original configuration. This is equivalent to that in Sneath
(1978) with transformation from dZ to d. From matrix sr0. Conventions as in Figs 1 and 2.
similar to Fig. 6 but with the singletons pulled in and clustered with species B and C2. The four
species clusters were arranged at the apices of an almost regular tetrahedron of side (semichord)
0.8. The strains were all very close to being 0.5 (semichord) from the grand centroid (coefficient
of variation 0.10).
The configuration from correlations corresponding to Fig. 8 (two reference OTUs from
species A and two from C1) showed that species A was well-defined and far from the other
strains. The other clusters were partly mingled with each other and with the singletons: all the
strains not in cluster A formed a curved band opposite cluster A, centred on a point P about $ of
the way from the grand centroid to cluster A. All the strains were roughly 0.5 (semichord) from
P. Clusters A and C1 were somewhat expanded relative to the others.
The use of correlations with the singletons (comparable with Fig. 8) gave poor taxonomic
structure, as there was much misclustering : only species A retained its identity. The reference
OTUs were to one side of the configuration, and the other strains were not as strongly
compressed as in Fig. 8. The other strains formed a rough arc with a radius of about 0.5
(semichord) facing the singletons, which was centred at a point about halfway between strain
36 and most other strains.
No case with a single reference OTU can be obtained, because a minimum of three reference
OTUs is required to give meaningful correlations. With four reference OTUs, also, all variation
becomes expressible in three semichord dimensions. The main points about the configurations
with correlations were : (a) all points tend to lie at about the same distance from some point P in
the ordination; (b) when correlation coefficients are used, clusters are more disperse than with
distances and occupy broad regions of the arc; (c) reference OTUs are not markedly peripheral.
Analyses with d2 corresponding to Figs 6-10 gave configurations very similar to those with d .
The reference OTUs were much more peripheral, and there was greater expansion of clusters
that contained reference OTUs than in those figures. Otherwise the differences were small. The
negative eigenvalues were larger than for matrix srml. They commenced a little earlier (12th to
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1065
Taxonomy with incomplete data
d
0-0.1
0
0
+O
0.1-0.2
0
0
0.2-0.3
+O
+O
lul
UIO
0-3-0.4
+O
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
0.4-0.5
0.5-0.6
>0.6
0
0
0
a
0
n
0
0
O
0
O
0 0 0 0 0 0 0 0 A A A A A A A A A A A A A A A O O 0 0 0 0 0 0 0 0 0 0 0
0
+ + +
Fig. 15. Shaded similarity matrix derived from four reference OTUs (marked with stars), all from one
cluster. From matrix srel. Conventions as in Figs 1 and 2.
14th eigenvalues), but were still predominantly concentrated in the last few (34th to 36th). The
traces ranged from 137 to 177%. In such instances care must be taken in interpreting the
percentage variation accounted for in the ordinations (see Table 4).
Shaded similarity matrices
Two shaded similarity matrices are shown, Fig. 14 from the starting configuration and Fig. 15
from the derived matrix with d and four reference OTUs from species A.
Figure 14 shows the usual features of such shaded matrices: the four clusters and the
singletons are well displayed and the structure is recognizably similar to that in Figs 1 and 2. In
contrast, Fig. 15 (which corresponds to Fig. 9) shows the confounding of clusters for which no
reference OTU is available. Species A is reasonably clear, but even a different choice of shading
levels does not display well the detail of the other groups. The very aberrant singleton, strain 36,
is however recognizable as the lowest strain in the diagram.
DISCUSSION
Previous examples of distortion
Authors have seldom compared their earlier findings from a few reference OTUs with those
obtained later using more of them. It seems that the problem of distortion is not readily
recognized. Rohlf (in Sneath & Sokal, 1973, p. 299) noted that OTUs that were very dissimilar
might appear spuriously close in a derived matrix, and it may be that this referred to the genus
Dineutes in the study of Basford et al. (1968) on which he was commenting. Dineutes was thought
to be an outlying genus, but the derived matrix from serological data made it appear very close to
certain other genera. As it was not a reference OTU, it may have been drawn inward in the
manner illustrated above. Sommerville & Jones (1972) have noted that it may be difficult to
obtain a consistent picture of the DNA-DNA relationships of the Bacillus cereus complex when
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1066
P. H . A . S N E A T H
different reference strains were employed. This problem may be much greater than is generally
realized.
The fullest discussion is by Cristofolini (1 980). He pointed out that incomplete serological
data made the position of an OTU uncertain, and that the uncertainty was inversely
proportionate to the distance from a reference OTU. Cristofolini’s concept of serological axes
(each corresponding to one antiserum) differs from the model used here, but the implications are
very similar. He illustrates several examples of distortion (Cristofolini, 1980, 1981). Thus Fig. 6
of the latter paper, derived from a single antiserum, shows both the expansion of a cluster of
closely allied species and the misclustering of very different species. Several aspects of the
problem are discussed by Hildebrand et al. (1982).
An example of extreme reduction in dimensionality is given in Fig. 6 of London et al. (1975),
which illustrates the immunological distances of fructose diphosphate aldolase of various
streptococci from one reference aldolase (Streptococcusfaecalis ATCC 27792). The diagram is
described as a phylogenetic map, but this is misleading because its relationships are phenetic
and it is one-dimensional. It also gives the impression that very different species like Streptococcus pyogenes and Streptococcus salivarius are serologically very similar, whereas this can only
be determined by comparing them directly. Such diagrams are useful if their limitations are
realized. However, it is often not clear how they are constructed. Networks such as Figs 8-10 of
Gasser & Gasser (1971) are not minimum spanning trees (they do not show full fan-shaped
arrangements); they appear to be more similar to diagrams from derived matrices.
Types of distortion with derived matrices
Certain mathematical properties of derived matrices (Sneath, 1980) need mention, and can be
illustrated by the present examples.
Reduction in dimensionality. With Euclidean distances (taxonomic distances and semichords
are Euclidean) t points can be represented in t - 1 dimensions. The choice of c reference OTUs
causes the derived configuration to be in c dimensions, i.e. c - 1 for the c reference OTUs plus
an additional one for adding the non-reference OTUs. In addition, the points are reflected about
the hyperplane defined by the c reference OTUs so that they lie in one half of the c-space. This
can be seen from Table 4 for the configurations derived from four reference OTUs: all the
variation is in four dimensions because it will be found that the sum of the first four eigenvalues
equals the Total Sum of Squares. The extreme case is illustrated by Fig. 10, where one reference
OTU gives a one-dimensional space with all points lying on the same side of the reference OTU.
This reduction in dimensionality itself causes difficulty in defining clusters. If clusters are to
be recognized as distinct (whether by cluster analysis or ordination), they must not overlap
heavily in space. Using simplifying assumptions (clusters are hyperspherical multivariate
normal, are equal in dispersion, and are themselves distributed in the same fashion) it can be
shown that the risk of confusion is related to the chi-square distribution (Sneath, 1980).
First, one makes an estimate of the ratio G = s,/sb, where s, is the average standard deviation
of the cluster centroids on each of the n dimensions, and s b is the average standard deviation
within clusters for each dimension. These can often be estimated by the formula in Sneath (1974,
1979a), or obtained by inspection of ordination plots. Next, one requires a critical level of
overlap such that this, or worse overlap, will be likely to cause clusters to be confused. For
clusters of about equal size and density, the centroids must be separated by more than four
average cluster standard deviations if they are to be seen as clearly distinct. In terms of the W
statistic (Sneath, 1977, 1979b), Wmust be over 2.0. Third, look up in tables of chi-square the
probability P that corresponds to
(02)
for n degrees of freedom. With W,,,, = 2 as advised,
P corresponding to chi-square of 8 / ~ is* to be found. This is the probability that any pair of
clusters taken at random overlaps to more than the critical value in n-space. If one wishes to be
reasonably sure that no pair of clusters is likely to overlap, P should be less than 2/q2 where q is
the likely number of clusters in the study.
One may often assume with safety that all clusters are well separated in the full space on n
characters. The reduction to a smaller number of dimensions, m,by ordination or by deriving
new matrices, may cause undue overlap. If so, it is P form degrees of freedom that is of interest.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Taxonomy with incomplete data
1067
These probabilities increase greatly when m becomes small. Thus for two-dimensional
ordination plots (m = 2), clusters must be separated on the average by about 16 standard
deviations if P is to be less than 1% (Sneath, 1980; Table 1). These risks of not recognizing
clusters as distinct (because of overlap due to reducing the dimensionality) apply not only to
derivations with reference OTUs but also to ordinations as customarily employed in taxonomy.
With c reference OTUs it has been noted that the derived space has c dimensions. Due to the
reflection the effective dimensionality is less, something like c - +,Thus, even if the choice of
reference OTUs were such as to produce no distortion (in the sense of affecting OTUs
differentially), there will be serious risks of not distinguishing clusters when very few reference
OTUs are employed.
The risks are increased by differential distortion. An extreme example is shown by Fig. 9. All
four reference OTUs are from one species; the derived configuration, however, is not effectively
four-dimensional, so that m is not effectively c = 4, nor even 3.5; the effective dimensionality
can be seen to be little greater than 1.0 (cf. Fig. 10). It is for this reason that the concept of
‘effective dimensionality’ is introduced, and shown by n’ and m’ in Table 4. It is at present not
possible to give an analytic solution to how ‘effective dimensionality’ depends on the choice of
reference OTUs, but some empirical observations are noted below.
The test for the risk of overlap may be illustrated by Fig. 8 where it is desired to know the
approximate risk of serious overlap between pairs of clusters at random in the three-dimensional
space, if one is given the configuration in four dimensions of the derived matrix. The singletons
are excluded from consideration for simplicity. The principal coordinates permit calculation of
and s,2 = 2 x
so o2
the dispersion between and within clusters, and gives s,2 = 1.63 x
is about 8-15.Then the P for chi-square of 8/a2 = 0.98 is about 0-08 for 4 degrees of freedom and
about 0.19 for 3 d.f. The reduction of dimensions from three to four thus increases the risk of
overlap, and 19% is of the order of magnitude expected on considering the ordination in Fig. 8.
The use of the effective dimensions for degrees of freedom may be justified as being statistically
conservative : this somewhat increases the probabilities in this example.
E3ective dimensionality. Table 4 shows the effective dimensionality of the various
configurations. One may note that n’ refers to the configuration itself and to the phenogram,
whereas rn’ refers to the three-dimensional ordination shown in the corresponding figure. The
latter is only included to give a visual impression of the effective dimensionality of the threedimensional figures; it necessarily lies between 1 and 3. It is n’ that is of most interest, though in
extreme cases it may give misleading results. This quantity is a descriptive statistic, not a test of
whether an eigenvalue is significantly greater than others. Such a test is that of Bartlett (1951 ; a
worked example is given by Hope, 1968, p. 63), but this becomes indeterminate if any
eigenvalue is zero; it is of course based on a specific model that is not readily applicable to the
present situation.
The starting configuration does not have t - 1 = 35 effective dimensions, but only 7.5. This is
because it is not hyperspherical: some axes are very small compared to others, so that they
effectively contribute only a small fraction of a dimension. This may be appreciated by
considering a straight line of points in a three-dimensional space; there is only one dimension of
variation, and Fig. 10 (n’ = 1.0) illustrates this. In contrast, the ordination of Fig. 1 is almost
spherical, so that m’ is almost 3.
The effective dimensionality can be seen to reflect to some extent the adequacy of the
reference OTUs. Figures 6 and 8 have widely spaced reference OTUs, and gave high n’ and
reasonable derived structures, although the inadequacy of choosing singletons as reference
OTUs is not evident from n’. At the other extreme is Fig. 10 (entirely one-dimensional) and Fig.
9 (almost one-dimensional, because all distances are measured from reference OTUs that are
almost at the same point in the hyperspace). Figure 7, with two reference OTUs from each of
two clusters is intermediate (n’ = 2-1), and it will be seen that this configuration is derived from
distances that are measured from what are almost two points in hyperspace. A low value of n’
therefore draws attention to the dangers of inadequate reference OTUs as evidenced by their
positions as well as by their number. It helps to quantify the statements of Cristofolini (1980) and
Hildebrand et al. (1982) that one should not choose reference antisera that are too much alike.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1068
P. H . A . S N E A T H
How to choose reference OTUs to maximize n’ remains to be studied: one might tentatively
consider that reference OTUs known to belong to one tight cluster contributed together only one
effective dimension.
Movement ofpoints toward the simplex of reference OTUs. The region enclosed by the reference
OTUs is mathematically a c-simplex. Other points are drawn in toward this region. This is seen
clearly in Fig. 6 where most OTUs lie within a tetrahedron whose apices are the reference
OTUs. This is the converse of the tendency for reference OTUs to be peripheral.
Dispersal ofpoints close to reference OTUs. Earlier study (Sneath, 1980), showed a tendency for
OTUs to bunch together if they were not reference OTUs. The present work shows that this is a
combination of the movement toward the simplex and a tendency of points close to reference
points to disperse. Figures 7 and 12 show this effect well. One consequence is that outlying
members may be displaced into nearby clusters.
Eflect on equidistant OTUs. A mathematical property of OTUs that are all equidistant from
one another is that all OTUs that are not reference OTUs become compressed to a single point
P. At this point all axes P r l , Pr2 . . , Pr, to the reference OTUs are at right angles, so that each
reference OTU lies along its own characteristic dimension. This extreme distortion was not
noted here, but should be borne in mind in future work.
Efect of iterations. Iterative cycles are useful in showing distortions when carried to an
extreme, but there seems no reason to prefer the results of the second or subsequent iterations.
The resulting configurations do not necessarily stabilize in a taxonomically useful way. The idea
that one could extrapolate backwards from the results of cycles to obtain a relatively likely
original configuration (Sneath, 1980) remains only a conjecture.
Alternative similarity coeficients and cluster methods for derived matrices
Similarity coeficients. There are two aspects to similarity coefficients. The first is the choice of
scaling of the primary observations : this determines properties of the original configuration.
Thus one might employ immunological distances or their logarithms, or convert DNA-DNA
percent pairing values D into J(100 - D) if these transformations gave better original
configurations. Criteria for this need further study. In the same class are coefficients for
determining similarity from the number of common antigens in gel diffusion tests: the Phi
Coefficient has some advantages over the Jaccard Coefficient (Cristofolini & Feoli Chiapella,
1977; Cristofolini, 1980). The second aspect relates to the coefficients used in derivation of the
complete matrix. The results given here suggest that there is not a lot to choose between d and d 2 ,
but that the correlation coefficient renders reference OTUs less peripheral in the derived
configuration than d or d 2 ,but also renders clusters more nearly equidistant. This is partly due to
its mathematical properties: all points are equidistant 0.5 from some point in the full
configuration when the semichord transformation is used, but this is not necessarily so in the
ordinations. The evidence on whether correlations cause more misclustering is inconclusive, but
the present study suggests that it does not recover the original configuration as well as d does.
Clustering methods. UPGMA clustering was used as a standard method, and was superior to
Single Linkage as judged by some preliminary studies. The relation of the latter to minimum
spanning trees suggests that further work here is desirable, particularly on the influence of the
numbers of OTUs in different clusters (Williams et al., 1971 ; Sneath, 1976). The difficulties in
choosing appropriate levels for shading in shaded similarity matrices require emphasis : the
usual levels may reveal little structure.
Minimum spanning trees
Minimum spanning trees are of special interest, because they are one of the few numerical
methods for handling incomplete data. Their properties have been little explored. They tend to
exaggerate the distances within clusters relative to those between them, because the inter-cluster
links are the shortest possible ones. In the present study they have not proved very reliable, but
they are illuminating with reference to another common method of working with incomplete
data, discussed below, to define one group at a time by sequential choice of reference OTUs.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Taxonomy with incomplete data
1069
Defining one group at a time
When defining one group at a time, it is clear that a decisive step is the choice of a critical
distance, or rather radius, to determine group membership. Few workers explain how they
arrive at this, and it is generally implied that experience allows a correct choice of a radius that
defines, for example, a species. This is not the place to consider this in detail, but instead the
dependence of the final results on the choice of reference OTUs will be briefly explored with the
examples reported here.
The first point is that we lack information on the tightness and degree of separation of
bacterial species when defined by serological or DNA-DNA measures. If they formed compact
and widely separated clusters it would not matter what strains were chosen as reference OTUs,
and critical radii would be quickly found from histograms of relationships.
If one attempts this approach it is necessary to know how frequently the choice of an
inappropriate strain as reference OTU would lead to misgroupings. Table 5 shows the safe range
for a critical radius to define the species in the original configuration. It can be seen that almost
any choice of reference OTU will permit choice of an appropriate radius. Only with strain 17,
and possibly with strain 21, will this be difficult. Nevertheless, these choices assume that one
does already know what the species are. If one did not, the choices would be much more difficult.
Some ranges are small. With some strains as reference OTUs, e.g. 13 and 15, it would be difficult
to find a suitable radius from histograms of distances, and this would be scarcely possible if one
started with a singleton. The critical radii, also, would differ with the species: it can be seen from
Table 5 that no single value would be entirely satisfactory, though 0.39 or 0-40 would be
acceptable in most cases. And it should be remembered that these clusters are relatively wellseparated examples. In addition the order in which species are studied may make a difference to
the results. It would be worthwhile to re-examine the method of Rogers & Tanimoto (1960) in the
context of incomplete similarity matrices.
Comparison with the minimum spanning trees raises interesting points. The initial choice of
one reference OTU yields a tree similar to that in Fig. 10. The choice of a second reference OTU
that is clearly too far away to belong to species A would yield a tree with two fans, from which a
third, and fourth reference OTU could be selected. This would ultimately give a minimum
spanning tree similar to that in Fig. 6, which shows quite satisfactory structure. Nevertheless,
any mistake in the choice could become misleading: if one started with singletons (Fig. 8,
implying too big a radius), or accumulated several reference OTUs from one species (Fig. 6,
implying too small a radius), the trees would be difficult to interpret and to improve. It would
however be most instructive to apply the methods of derived matrices, minimum spanning trees,
and defining one group at a time, to serological or DNA data whose detailed taxonomic
structure was known, and also to DNA-RNA diagrams such as those of De Ley et al. (1978).
Analogous situations
There are few parallels to the problem studied here. When one considers that most taxonomy
is based on very incomplete data it is surprising how little attention has been given to the matter.
It is known from experience that moderate percentages of missing values in n x t tables do not
greatly affect numerical taxonomies (Sneath & Sokal, 1973, p. lSl), if these gaps are haphazard,
or at least do not affect a majority of characters of any OTU. Similarly, methods for non-metric
scaling and ordination may be adapted for missing similarity values, and the work of Gleason &
Staelin (1975) suggests that a substantial proportion may be missing without serious effects
provided there is nothing systematic in the pattern of missing values. The pattern is, however,
very systematic in the present problem, so their findings scarcely apply. The process of deriving
a new matrix has analogies to the iterative weighting of characters described by Hogeweg (1976)
where the transformation is determined by the choice of characters rather than OTUs. A
uniform transformation with the correlation coefficient was proposed by Kaneko (1 979). Figure
11 is closely analogous to the results of his algorithm. It seems an open question whether
Kaneko’s procedure is an improvement: one must balance tighter clustering against the risk of
misclustering. Rather similar is the suggestion by Baum (1977) that one should power a
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Table 5 . Dependence of a species radius on the choice of reference OTU
0.354.46 (20)
0.32-0’47 (23)
0.32-0-45 (20)
0.344.45 (20)
0.35-0.47 (20)
0.35-0.45 (20)
0-354.47 (20)
0-354.47 (20)
9
10
11
12
13
14
15
16
0.34-0.40 ( I 7)
0.34-0-40 (17)
0.35-0-40 (21)
0-354.43 (1 7)
0.384.40 (17)
0.32-0.39 (17)
0.36-0.39 (17)
0.38-0-42 (17)
18
19
20
21
22
23
24
17
Strain
no.
33
34
35
36
NA-0.45 (34)
NA-0.45 (33)
NA-0.52 (33)
NA-0.52 (35)
25
26
27
28
29
30
31
32
0-32-0.44 (10)
0.32-0.45 (9)
0-32-045 (10, 12)
0.33-0.46 (9)
0-33-0.44 (10)
0.31-0.45 (13)
0 . 3 1 4 4 (10)
0 . 3 2 4 4 3 (9, 10, 12)
Strain
no.
0.40-0*39* (1 4)
0.36-0.43 (14)
0.37-0.41 (10)
0.38-0.40 (9)
0.39-0.391- (1 0)
0.39-0’43 (9)
0-39-0.46 (27, 29)
0.40-0.48 (9)
Safe range and no.
of nearest strain
to the singleton
Safe range and no.
of nearest strain
outside species
Safe range and no.
of nearest strain
outside species
Safe range and no.
of nearest strain
outside species
Strain
no.
Safe range and no.
of nearest strain
outside species
NA,Not applicable, because there are no other strains of these species.
* The second value is less than the first, so there is no safe range; the distance to strains 14 and 15 of another species is less than 0.40, and these strains would be
misclassified. In addition strains 9, 10 and 13 are only marginally over 0.40 from strain 17.
t The range is very small (0.386-0.392).
Strain
no.
Strain
no.
Singletons
Species C2
Species C1
Species B
*PA--
Species A
The first value in the safe range is the distance from the strain to the farthest strain of its own species. The second value is the distance to the nearest strain
(shown in parentheses) that belongs to another species (or several strains if distances are equal). The distances are from the original configuration (Table
2).
9
x
cd
0
4
0
c-.
Taxonomy with incomplete data
107 1
similarity matrix in the hope of finding clearer taxonomic structure, though Gower (1978) has
drawn attention to certain dangers in this. A related problem of distortion is that which occurs
with ordination of frequency data (see Williamson, 1978). The converse of the present problem
has been studied in psychometrics under the name of multidimensional unfolding; this is to
reconstruct a complete similarity matrix from a rectangle of entries (e.g. Gold, 1973; Davison,
1976).
Eflect of experimental error
The model has assumed that serological or other distances are known exactly. They are of
course, like phenetic distances, subject to some degree of uncertainty. Barely detectable crossreactions correspond to indefinitely large distances, and some figure less than infinity must be
allocated to absence of a cross-reaction. It is notable how little detail is usually given on the
reproducibility or accuracy of serological and DNA values. This is particularly important for
weak reactions, because small differences in weak reactions imply much larger differences in the
apparent positions of OTUs than do differences in strong reactions. Some perspective to the
example used here may be obtained from the formulae in Sneath & Johnson (1972): using these,
the distances in Table 2 and Table 5 have a standard deviation of about 0-03.
Requirements for recovery of taxonomic structure
The method of derived matrices can give good results with appropriate choice of OTUs. Thus
Sgorbati (1979) gave a phenogram of 20 bifidobacteria from results with three reference
antisera. The matrix was derived using d from logarithms of indices of serological dissimilarity.
The phenogram showed seven clusters. The same 20 strains were later tested with five additional
antisera, and a new derived matrix and phenogram was prepared (Sgorbati & London, 1982).
The phenograms were very similar. The same seven clusters were seen in the second with only
minor rearrangements of two clusters. The three reference OTUs were chosen in the first study
so as to be widely spread among the bifidobacteria, and the additional reference OTUs were
chosen where possible from clusters that previously had none.
One may conclude that no cluster in a phenogram from a derived configuration can be relied
upon unless it was built around a reference OTU. In consequence, the current trend in serology is
to construct as complete a matrix of cross-reactions as possible (Cristofolini, 1980) and to
associate additional OTUs with the complete structure (e.g. Gorman et al., 1980).
One can be sure with Euclidean spaces that if a pair of O T U s j and k (that are not reference
OTUs) are shown as distant in a derived configuration then they are indeed distant. One cannot
be sure they are indeed close if they appear to be close; they may be separated by a distance of
up to the minimum of 4,. d k r for all reference OTUs r. This principle has been exploited in
strategies used in cladistics (Goodman & Moore, 1971) and could be considered for the present
problem. It would be interesting to study various estimates of a probable distance d,k by using
formulae such as the minimum of J(d,2, d&)for any r. Other strategies for this problem would
be worth exploring. J. C. Gower (personal communication) notes that one can add a new point
to a Principal Coordinate analysis (Gower, 1968; a taxonomic example is given by Wilkinson,
1970). It would therefore be possible to base the analysis on the c reference OTUs and add new
OTUs to the space thus defined. This could then be followed by clustering or further ordination.
In conclusion, clusters may not be recognizable by any of the methods discussed here unless
they are represented by reference OTUs. This therefore places the taxonomist in difficulty : if he
has already correctly recognized the clusters he will obtain results that confirm them, but if he
has not he may fail to find the clusters at all.
In practice the outlook is less gloomy. The method of defining one group at a time is well
known and can be very successful if care is taken. Minimum spanning trees are probably less
useful but could perhaps be put to better use than in the past, Derived matrices utilize all the
available similarity values and, of the three methods, give the best taxonomic structure with
proper choice of reference OTUs. They may even be applicable in a manner that at first sight
would appear illogical. Much effort has gone into preparing pure antigens and DNA. It would
seem retrograde to make deliberate mixtures of pure preparations. Yet the purification is
+
+
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
1072
P. H. A. SNEATH
normally to separate unwanted material of the same organism. It would seem quite logical to mix
pure antigens or pure DNA preparations from different organisms, or antisera to such
preparations, in the hope that the resulting polyvalent mixtures would behave as if they were
located at the cluster centres. This could offer interesting opportunities for replacing single
reference OTUs by composites that better represent the clusters.
Grateful thanks are due to Barbara Sgorbati who drew attention to this problem, and to P. A. D. Grimont for the
data here. D. Jones, J. C. Gower, M. J . Sackin, M. E. Rhodes-Roberts and E. M. H. Wellington and a number of
other colleagues, are thanked for discussion and advice. The work was supported by a grant from the Medical
Research Council.
REFERENCES
BARTLETT,M. S. (1951). A further note on tests of
significance in factor analysis. British Journal of
Statistical Psychology 4, 1-2.
BASFORD,
N . L., BUTLER,J. E., LEONE,C. A . & ROHLF,
F. J. (1968). Immunologic comparisons of selected
Coleoptera with analysis of relationships using
numerical taxonomic methods. Systematic Zoology
17, 388406.
BAUM,B. R. (1977). Reduction of dimensionality for
heuristic purposes. Taxon 26, 191-195.
CRISTOFOLINI,
G . (1980). Interpretation and analysis of
serological data. In Chemosystematics: Principles and
Practice, pp. 269-288. Edited by F. A. Bisby, J . G .
Vaughn & C. A. Wright. London & New York:
Academic Press.
CRISTOFOLINI,
G . (1981). Serological systematics of the
Leguminosae. In Advances in Legume Systematics,
pp. 513-531. Edited by R. M. Polhill &P. H . Raven.
Kew: Royal Botanic Garden.
CRISTOFOLINI,
G . & FEOLICHIAPELLA,L. (1977).
Serological systematics of the tribe Genisteae (Fabaceae). Taxon 26, 43-56.
DARBYSHIRE,
J. H., ROWELL,J. G., COOK,J. K . A. &
PETERS,R. W. (1979). Taxonomic studies on strains
of avian infections bronchitis virus using neutralization tests in tracheal organ cultures. Archives of
Virology 61, 227-238.
DAVISON,
M. L. (1976). Fitting and testing Carroll’s
weighted unfolding model for preferences. Psychometrika 41, 233-247.
DE LEY,J., SEGERS,P. & GILLIS,M. (1978). Intra- and
intergeneric similarities of Chromobacterium and
Janthinobacterium ribosomal ribonucleic acid cistrons. International Journal of Systematic Bacteriology
28, 154-168.
GASSER,F. & GASSER,C. (1971). Immunological
relationships among lactic dehydrogenases in the
genera Lactobacillus and Leuconostoc. Journal ofBacteriology 106, 113-125.
GLEASON,
T. C. & STAELIN,
R. (1975). A proposal for
handling missing data. Psychometrika 40, 229-252.
GOLD,E. M. (1973). Metric unfolding: data requirement for unique solution and clarification of
Schonemann’s algorithm. Psychometrika 38, 555569.
GOODMAN,
M. & MOORE,G. W. (1971). Immunodiffusion systematics of the Primates. I. The Catar.
rhini. Systematic Zoology 20, 19-62.
GORMAN,
G . C., BUTH,D. G . & WYLES,J. S. (1980).
Anolis lizards of the Eastern Caribbean : a case study
in evolution. 111. A cladistic analysis of albumin
immunological data, and the definition of species
groups. Systematic Zoology 29, 143-158.
GOWER,J. C. (1966). Some distance properties of
latent root and vector methods used in multivariate
analysis. Biometrika 53, 325-3 38.
GOWER, J. C. (1968). Adding a point to vector
diagrams in multivariate analysis. Biometrika 55,
582-585.
GOWER,J. C. (1978). Comment on transformations to
reduce dimensionality. Taxon 27, 353-355.
GOWER,J. C. & Ross, G. J. S. (1969). Minimum
spanning trees and single linkage cluster analysis.
Applied Statistics 18, 54-64.
GRIMONT,
P. A. D., GRIMONT,
F., DULONGDE ROSNAY,
H. L. C. &SNEATH,P. H. A. (1977). Taxonomy of the
genus Serratia. Journal of General Microbiology 98,
39-66.
HEBERT,G . A,, Moss, C. W., MCDOUGAL,L. K.,
BOZEMAN,
F. M., MCKINNEY,
R. M. & BRENNER,
D. J. (1980). The rickettsia-like organisms TATLOCK (1943) and HEBA (1959): bacteria phenotypically similar, but genetically distinct from Legionellapneumophilu and the WIGA bacterium. Annals of
Internal Medicine 92, 45-52.
HILDEBRAND,
D. C., SCHROTH,M. N . & HUISMAN,
0. C. (1982). The DNA homology matrix and nonrandom variation concepts as the basis for the
taxonomic treatment of plant pathogenic and other
bacteria. Annual Review of Phytopathology 20, 235256.
OGEWEG,
P. (1976). Iterative character weighing in
numerical taxonomy. Computers in Biology and
Medicine 6, 199-2 11.
OPE, K . (1968). Methods of Multicariate Analysis.
London : University of London Press.
KANEKO,T . (1979). Correlative similarity coefficient:
new criterion for forming dendrograms. International
Journal of’ Systematic Bacteriology 29, 188-1 93.
KRYCH,V. K., JOHNSON,J. L. & YOUSTEN,A. A.
(1980). Deoxyri bonucleic acid homologies among
strains of Bacillus sphaericus. International Journal of
Systematic Bacteriology 30, 476-484.
KURYLOWICZ,
W., PASZKIEWICZ,
A., WOZNICKA,
W.,
KURZ~TKOWSKI,
W . & SZULGA, T. (1975).
Classification of streptomycetes by different numerical studies. Posepy Higieny i Medycyny DoSwiadczalnej Zeszyty Problemowe Warsaw 29, 7-8 1.
LEE. A . M . (1968). Numerical taxonomy and the
influenza B virus. Nature, London 217, 620-622.
LONDON,J., CHACE,N. M. & KLINE, K. (1975).
Aldolase of lactic acid bacteria : immunological
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55
Taxonomy with incomplete data
relationships among aldolases of streptococci and
Gram-positive nonsporeforming anaerobes. International Journal of Systematic Bacteriology 25, 1 14123.
ROGERS,
D. J. & TANIMOTO,
T. T. (1960). A computer
program for classifying plants. Science 132, 11 151 1 18.
SEARLE,
S. R. (1966). Matrik Algebra for the Biological
Sciences. New York: John Wiley.
SGORBATI,
B. (1 979). Preliminary quantification of
immunological relationships among transaldolases
of the genus BiJdobacterium. Antonie van Leeuwenhoek 45, 557-564.
SGORBATI,
B. & LONDON,J . (1982). Demonstration of
phylogenetic relatedness among members of the
genus BiJdobacterium by means of the enzyme transaldolase as an evolutionary marker. International
Journal of Systematic BacterwlogjJ 32, 37-42.
SIBSON,R. (1972). Order invariant methods for data
analysis. Journal ofthe Royal Statistical Society, B 34,
31 1-338.
SILVESTRI,
L., TURRI,M., HILL,L. R. & GILARDI,
E.
(1962). A quantitative approach to the systematics of
actinomycetes based on overall similarity. Symposia
of the Society for General Microbiology 12, 333360.
SKERMAN,
V. B. D., MCGOWAN,
V. &SNEATH,
P. H. A.
(editors) (1980). Approved lists of bacterial names.
International Journal of Systematic Bacteriology 30,
22 5-420.
SNEATH,P. H . A. (1974). Test reproducibility in
relation to identification. International Journal of
Systematic Bacteriology 24, 508-523.
SNEATH,P. H . A. (1976). Phenetic taxonomy at the
species level and above. Taxon 25, 437-450.
SNEATH,P. H . A. (1977). A method for testing the
distinctness of clusters: a test of the disjunction of
two clusters in Euclidean space as measured by their
1073
overlap. Journal of’the International Association ,for
Mathematical Geology 9, 123-1 43.
SNEATH,P. H . A. (1978). Classification of microorganisms. In Essays in Microhiology, pp. 9/1-9/3 1 ,
Edited by J. R. Norris & M . H. Richmond.
Chichester : John Wiley.
SNEATH,P. H. A. (1979a). BASIC program for a
significance test for clusters in UPGMA dendrograms obtained from squared Euclidean distances.
Computers and Geosciences 5, 127- 137.
SNEATH,P. H. A. (1979b). BASIC program for a
significance test for two clusters in Euclidean space
as measured by their overlap. Computers and
Geosciences 5, 143-155.
SNEATH,
P. H. A. (1980). The probability that distinct
clusters will be unrecognized in low dimensional
ordinations. Classrfication Society Bulletin 4, no. 4,
22-43.
SNEATH,
P. H. A. & JOHNSON, R. (1972). The influence
on numerical taxonomic similarities of errors in
microbiological tests. Journal of’ General Microbiology 72, 377-392.
SNEATH,
P. H. A. & SOKAL,R. R. (1973). Numerical
Taxonomy : the Principles and Practice of Numerical
Classijication. San Francisco: W. H. Freemar,.
SOMERVILLE,
W. J . & JONES, M. L. (1972). DNA
competition studies within the Bacillus cereus group
of bacilli. Journal of General Microbiology 73, 257265.
WILKINSON,
C. (1970). Adding a point to a Principal
Coordinates analysis. Systematic Zoology 19, 258263.
WILLIAMS,
W. T., CLIFFORD,
H. T. & LANCE,G . N.
(197 1). Group-size dependence : a rationale for
choice between numerical classifications. Computer
Journal 14, 1 57- 162.
WILLIAMSON,
M. H . (1978). The ordination of incidence data. Journal of Ecology 66, 9 1 1-920.
Downloaded from www.microbiologyresearch.org by
IP: 88.99.165.207
On: Sun, 18 Jun 2017 11:03:55