Electrophoresis 1999, 20, 3483–3491
Josef Pánek
Jiří Vohradský
Institute of Microbiology,
Czech Academy of
Sciences, Prague, Czech
Republic
Point pattern matching in the analysis of two-dimensional gel electropherograms
In the automation of proteome analysis, matching of two-dimensional (2-D) electropherograms represents a bottleneck in the process. Here we present a point pattern
recognition approach to the matching of spots in 2-D electropherograms. The algorithm
is based on a comparison of spot neighborhoods, converted to point patterns between
reference and compared gels. The neighborhood was characterized by a syntactic
descriptor which minimized the influence of spot displacements. A combined criterion
utilizing the similarity of point patterns and a metric definition of position similarity was
derived. The efficiency and accuracy of the algorithm were tested on a set of 69 gels
with different levels of similarity. For a typical gel the accuracy of matching was higher
than 98% and the number of correctly identified spots was higher than 95%.
Keywords: Point pattern matching / Two-dimensional electrophoresis / Pattern recognition / Proteome
EL 3663
Correspondence: Dr. J. Vohradský, Institute of Microbiology, Czech Academy of Sciences, Vídeňská 1083, 14220 Prague-4, Czech Republic
E-mail: [email protected]
Fax: +42-02-47-222-57
WILEY-VCH Verlag GmbH, 69451 Weinheim, 1999
Proteomics and 2-DE
1 Introduction
Point pattern recognition is one of the most powerful methods of identifying objects in an image. As any two-dimensional (in principle, any-dimensional) object can be simplified to a point pattern, the identification of a point pattern can serve to recognize defined objects. The most intuitive use of point pattern recognition is in the case of "natural" point patterns, such as the comparison of star maps [1, 2]. Here the point pattern was defined by a central point (in principle any star), and all other neighboring points of the star map were defined "by the view" of the central point ("world view"). The formal description of the surroundings was then compared with all possible descriptions from the compared map, and the star with the most similar neighborhood was defined as the matching one. This example illustrates an approach to the problem of matching based on the comparison of neighborhood point patterns.
A two-dimensional electropherogram is a "natural" point pattern as well. Each protein spot in an electropherogram can be characterized by the position of its centroid in the coordinate system of the gel. The electropherogram can then be viewed as a point pattern or a set of smaller local point patterns. Matching of two-dimensional gels, that is, identification of the spots belonging to the same protein on two different gels or a set of gels, can then be viewed as a comparison of point patterns. This approach was used, for example, by Lemkin and Lipkin [3–5] to match spots in the GELLAB system. A similar approach, but using a syntactic description of the neighborhood, was published by Vincens and Tarroux [6].
Other approaches using different features of gel images have also been used. The first systems utilized the principle of transforming the coordinate system of a matched gel to fit the reference gel using a set of user-defined landmark spots. Each landmark spot defined a known match and was introduced by the user. A set of landmark coordinates was then loaded into a least-squares minimizer and the parameters of a known function (usually a polynomial) were computed. The coordinate system of the matched gel was recomputed, using the known function, to the coordinate system of the reference gel. The transformed gels were then overlaid and the matched spots were defined as those closest to each other [7]. This principle was powerful enough to have survived as a basis for modern systems to date [8].
Another approach, first published by Skolnick et al. [9] and Skolnick [10], was based on conversion of a spot point pattern to a specific graph (Gabriel graph) in which the spots became nodes of the graph. The match was defined by comparing the graphs. Variations of this method exist that are based, for example, on a metric description of the graph [11]. The ELSIE/MELANIE system used a comparison of spot clusters, instead of simplified point patterns, for spot matching [12, 13] and a probabilistic criterion for the definition of a correct match. Garrels [14] published a spot matching algorithm based on propagation of a user-defined landmark spot, which served as the initial match. The next match was determined by the neighboring spot of the landmark spot which showed the best match. This spot became a landmark and the algorithm continued until no further sufficiently good match in the surroundings of the new landmark was found. Then the next landmark was defined and the algorithm looped until no new matches were found.
Some of the algorithms mentioned above were integrated into commercially available gel analysis packages. We have not found any publications that analyze the performance of each of the algorithms or cross-compare their efficiency and accuracy. From our empirical experience with some commercial packages, it can be concluded that most of the algorithms work well for similar gels with few changes in both spot density and spot presence. When the distortions and changes in spot intensity become larger, the rate of misidentifications increases and the number of detected matches decreases.
In the era of automated proteome analysis, spot
matching becomes the bottleneck of the whole process,
as it requires time-consuming operator interaction and
manual corrections. The introduction of a highly reliable
and efficient algorithm would greatly improve the efficiency of the whole process. In this paper we present the
point pattern comparison approach for matching pairs of
gels based on a syntactic description of the spot neighborhood. The algorithm for a typical pair of gels shows an
efficiency, defined by the number of detected matches,
higher than 95% and an accuracy higher than 98%.
2 Materials and methods
2.1 Test set
For the testing of algorithms, data from two match sets
created during a study of Streptomyces coelicolor development (30 and 39 gels) were used. Original radioactive
gel images were exposed to a film and processed using
pdQuest (PDI Inc.) gel analysis software. After manual
corrections of matches and mismatches, the matching
was carefully checked again. Data matrices containing
spot position coordinates on different gels in the set were
exported from pdQuest and used as the testing standard.
2.2 Gel distortions
All electropherograms were matched to one standard gel chosen from the gels of the match set. "Distortions" of a gel represent differences in the coordinates of two identical spots in the reference and matched gels. They were always considered relative to the reference gel. The two spots can be connected by an oriented vector whose features can be used to describe the overall level of distortion of the matched gel.
Let $\vec{CM} = M - C$ be the directional vector between the spot center $M = [{}^{m}x,\ {}^{m}y]$ on the reference gel and $C = [{}^{c}x,\ {}^{c}y]$, the same spot on the matched gel. Then
$$|M - C| = \sqrt{({}^{m}x - {}^{c}x)^2 + ({}^{m}y - {}^{c}y)^2} \tag{1}$$
is its length and
$$\alpha_{M-C} = \arctan\frac{{}^{m}y - {}^{c}y}{{}^{m}x - {}^{c}x} \tag{2}$$
is its angle with the horizontal axis. Variances $\sigma_{|M-C|}$ and $\sigma_{\alpha_{M-C}}$ of the directional vectors for all spots together provide a quantitative measure of overall electropherogram distortions.
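The distortion measures described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function names are hypothetical, the angle is expressed in degrees, the population variance is used, and the wrap-around of angles near ±180° is ignored. None of these choices is specified in the text.

```python
import math
from statistics import pvariance

def directional_vector(m, c):
    """Return (length, angle in degrees) of the vector M - C for one spot pair."""
    dx, dy = m[0] - c[0], m[1] - c[1]
    length = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))  # angle with the horizontal axis
    return length, angle

def distortion_measures(reference, matched):
    """Variances of vector lengths and angles over all known spot pairs."""
    vectors = [directional_vector(m, c) for m, c in zip(reference, matched)]
    lengths = [v[0] for v in vectors]
    angles = [v[1] for v in vectors]
    return pvariance(lengths), pvariance(angles)

# Purely translational distortion: every spot is shifted by the same vector,
# so both variances vanish.
reference = [(10.0, 10.0), (50.0, 20.0), (30.0, 70.0)]
matched = [(x - 3.0, y - 4.0) for x, y in reference]
var_len, var_ang = distortion_measures(reference, matched)
```

For a translational shift all vectors are identical, so both variances are zero; unordered distortions give large values for both.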
2.3 Transformation of matched gel
Prior to searching for matching spots, global distortions were eliminated by a polynomial transformation of the coordinate system of the matched electropherogram. The polynomials were used to approximate the coordinates of a small number of known matched spots entered by the user (landmarks), distributed throughout the whole electropherogram. The resulting polynomial approximation scheme was
$${}^{LM,c}x = f_x({}^{LM,m}x,\ {}^{LM,m}y), \tag{3}$$
$${}^{LM,c}y = f_y({}^{LM,m}x,\ {}^{LM,m}y) \tag{4}$$
where $({}^{LM,c}x,\ {}^{LM,c}y)$ and $({}^{LM,m}x,\ {}^{LM,m}y)$ are the landmark coordinates of the matched and reference electropherograms, respectively, followed by the transformation
$${}^{T,c}x = f^{pol}_x({}^{m}x,\ {}^{m}y), \tag{5}$$
$${}^{T,c}y = f^{pol}_y({}^{m}x,\ {}^{m}y) \tag{6}$$
where $({}^{T,c}x,\ {}^{T,c}y)$ were the transformed coordinates of the matched electropherogram and $({}^{m}x,\ {}^{m}y)$ the coordinates of the reference electropherogram. Polynomial coefficients were computed using a Vandermonde matrix construction. The system of linear equations was solved by QR decomposition. Functional values of the polynomials were computed using the Horner scheme.
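A landmark-based polynomial fit of this kind might look as follows in Python. This is a sketch, not the original Matlab routine. The bivariate monomial basis, the modified Gram-Schmidt variant of QR, the plain term-by-term evaluation (instead of a Horner scheme), and all function names are assumptions made for the illustration.

```python
import math

def design_row(x, y, degree):
    """Monomials x**i * y**j with i + j <= degree: one row of a Vandermonde-type matrix."""
    return [x**i * y**j for i in range(degree + 1) for j in range(degree + 1 - i)]

def qr_least_squares(A, b):
    """Least-squares solution of A t = b via modified Gram-Schmidt QR."""
    rows, cols = len(A), len(A[0])
    Q = [[A[r][k] for r in range(rows)] for k in range(cols)]  # stored column-wise
    R = [[0.0] * cols for _ in range(cols)]
    for k in range(cols):
        R[k][k] = math.sqrt(sum(v * v for v in Q[k]))
        Q[k] = [v / R[k][k] for v in Q[k]]
        for j in range(k + 1, cols):
            R[k][j] = sum(q * a for q, a in zip(Q[k], Q[j]))
            Q[j] = [a - R[k][j] * q for a, q in zip(Q[j], Q[k])]
    qtb = [sum(q * v for q, v in zip(Q[k], b)) for k in range(cols)]
    t = [0.0] * cols
    for k in reversed(range(cols)):  # back substitution with upper-triangular R
        t[k] = (qtb[k] - sum(R[k][j] * t[j] for j in range(k + 1, cols))) / R[k][k]
    return t

def fit_transform(ref_landmarks, matched_landmarks, degree=3):
    """Fit matched-gel landmark coordinates as polynomials of reference coordinates."""
    A = [design_row(x, y, degree) for x, y in ref_landmarks]
    cx = qr_least_squares(A, [p[0] for p in matched_landmarks])
    cy = qr_least_squares(A, [p[1] for p in matched_landmarks])
    def transform(x, y):
        row = design_row(x, y, degree)
        return (sum(c * t for c, t in zip(cx, row)),
                sum(c * t for c, t in zip(cy, row)))
    return transform

# Synthetic example: 25 landmarks on a grid, warped by a known smooth distortion.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
warped = [(x + 0.05 * x * y + 0.02, y - 0.03 * x * x) for x, y in grid]
transform = fit_transform(grid, warped, degree=3)
# Sum of distances between fitted and true landmark positions (a residual check).
se = sum(math.dist(transform(x, y), w) for (x, y), w in zip(grid, warped))
```

Because the synthetic distortion is itself a low-degree polynomial, the residual sum is numerically zero here; with real landmarks it stays positive and can be used to compare polynomial degrees.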
2.3.1 Transformation accuracy
For evaluation of the transformation accuracy, a sum of distances SE between spots in the reference and transformed matched gel was chosen:
$$SE = \sum_i \sqrt{({}^{m}x_i - {}^{T,c}x_i)^2 + ({}^{m}y_i - {}^{T,c}y_i)^2}$$
where the index T denotes the transformed matched gel.
2.4 Matching
Due to local distortions and transformation inaccuracy,
matching spot positions do not overlap sufficiently to
unambiguously determine the match. In other words, the
spot in the matched gel closest to a spot in the reference
gel is not necessarily the correct match. All spots in the
neighborhood of this spot in the matched gel have to be
considered as a possible match instead. The matching
process thus investigates all possible matches (candidate
spots). At the beginning a spot in the reference gel (reference spot) was chosen. A spot in the transformed
matched gel closest to the reference spot was found. This
was the first candidate spot for match. From the local
neighborhood of the candidate spot a small set of spots
was chosen as other possible candidates. The accuracy
of matching all candidate spots and their neighborhoods
was determined. Among this set a spot with the best
matching characteristics was chosen as the best possible
match and the value of goodness of match was assigned
to this spot.
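The candidate search can be sketched in Python as below. The sketch assumes that the candidates are simply the few spots of the transformed matched gel nearest to the reference spot, and it uses plain distance as a placeholder for the goodness-of-match value; the function names are hypothetical.

```python
import math

def candidate_spots(reference_spot, transformed_spots, n_candidates=5):
    """Indices of the n_candidates transformed spots closest to the reference spot."""
    order = sorted(range(len(transformed_spots)),
                   key=lambda i: math.dist(reference_spot, transformed_spots[i]))
    return order[:n_candidates]

def best_match(candidates, goodness):
    """Pick the candidate with the best (here: lowest) goodness-of-match value."""
    return min(candidates, key=goodness)

spots = [(0, 0), (10, 0), (1, 1), (5, 5), (2, 0), (30, 30)]
ref = (0.4, 0.4)
cands = candidate_spots(ref, spots, n_candidates=3)
# Placeholder criterion: distance only. The actual criterion combines
# directional vectors and neighborhood similarity.
best = best_match(cands, lambda i: math.dist(ref, spots[i]))
```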
2.4.1 Directional vectors
Goodness of match was characterized by a criterion derived from the directional vector of the candidate match and the directional vectors of the closest landmarks. The length and angle of the directional vectors of the candidate matches were calculated together with the lengths and angles of the directional vectors of the closest landmarks, and the differences between the characteristics of the candidate matches and those of the closest landmarks were calculated. The resulting criterion was defined as $w_d\,{}^{T}|M-C|' + w_{d\Delta}\,d_\Delta' + w_{\alpha\Delta}\,\alpha_\Delta'$, where $w_d$, $w_{d\Delta}$, and $w_{\alpha\Delta}$ are weights, with the sum being 1. They express the different significance of the criterion components ${}^{T}|M-C|$, $d_\Delta$, and $\alpha_\Delta$. ${}^{T}|M-C|$ represents the Euclidean length of the directional vector calculated using the transformed gel, and $d_\Delta$ expresses the difference between the Euclidean lengths of the vectors of the nearest landmarks and the Euclidean length of the vector of the candidate match, where ${}^{LM}n_d$ is the number of the nearest landmarks. Similarly, $\alpha_\Delta$ expresses the differences in absolute angles between the candidate spot and a landmark, where ${}^{LM}n_\alpha$ is the number of the nearest landmarks. $d_\Delta$ and $\alpha_\Delta$ were computed using the original gel coordinates, as the transformation did not retain the original similarity of gel distortions. The criteria were normalized by dividing the individual values by the mean of all values of the candidate matches, i.e., each value $v_i$ of a criterion component was replaced by $v_i' = v_i / \bar{v}$ with $\bar{v} = \frac{1}{n}\sum_{j=1}^{n} v_j$, where i = 1...n, and n is the number of candidate matches.
The distance criteria are size (magnification)-dependent; if images of different sizes are used, they must be normalized to a standard width and height prior to the analysis. This normalization does not influence the image proportions and geometrical properties of the point patterns.
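As an illustration, the weighted criterion could be computed as in the following Python sketch. The weights are the empirical values reported in Section 3.3.1. Averaging the absolute differences over the nearest landmarks is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean

def mean_abs_difference(value, landmark_values):
    """Assumed form of the landmark difference terms: mean absolute difference."""
    return mean(abs(value - lv) for lv in landmark_values)

def normalise(values):
    """Divide each value by the mean over all candidate matches."""
    avg = mean(values)
    return [v / avg if avg else 0.0 for v in values]

def directional_criterion(lengths_t, d_delta, a_delta,
                          w_d=0.8, w_dd=0.05, w_ad=0.15):
    """Weighted sum of the three mean-normalised components, one value per candidate."""
    lt, dd, ad = normalise(lengths_t), normalise(d_delta), normalise(a_delta)
    return [w_d * a + w_dd * b + w_ad * c for a, b, c in zip(lt, dd, ad)]

# Three candidate matches with made-up component values.
scores = directional_criterion(lengths_t=[2.0, 4.0, 6.0],
                               d_delta=[1.0, 1.0, 4.0],
                               a_delta=[0.1, 0.3, 0.2])
best = min(range(len(scores)), key=scores.__getitem__)  # lowest value wins
```

The candidate with the shortest transformed vector and the most landmark-like displacement obtains the lowest value and is selected.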
2.4.2 Spot neighborhoods
Each electrophoretic spot can be assigned a neighborhood consisting of a defined number of the closest spots.
Considering only the spot centroids, the spot neighborhood converts to a point pattern. The idea is that, due to the same electrophoretic conditions, the point patterns of the neighborhoods of matching spots in the reference and compared gels have to be more similar than the point patterns of other spot pairs, even when the point patterns are incomplete (e.g., missing spots). To define the similarity
of spot patterns, an efficient neighborhood descriptor
(ND) has to be derived. The matching is then reduced to
a comparison of the descriptors.
The surroundings of a spot can be divided into two half-planes by a straight line leading through the spot center, or into an inner and an outer region by a circle centered on the spot. Each half-plane (or region) can be assigned a unique identifier (e.g., a character). Generally, an infinite number of half-plane pairs can be created; in practice, 20–40 were used. Each particular segment of a spot's surroundings can be created as an intersection of the given number of half-planes (see Fig. 1). The identifier of the segment is then a union of the identifiers of all half-planes whose nonempty intersection forms the segment (e.g., segment LN from Fig. 1 is
the intersection of half-planes L and N). Descriptors of
neighboring segments thus differ only in one character.
As the distance between segments grows, the number of
different characters in the descriptors grows as well.
Opposite segments have no common character.
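The segment descriptor can be sketched in Python using the two-character (0, 1) alphabet mentioned in the caption of Fig. 1. The sketch assumes evenly spaced dividing lines through the reference point and a set of concentric circles; the radii, the default number of lines, and the function names are hypothetical.

```python
import math

def segment_descriptor(center, point, n_lines=16, radii=(10, 20, 30)):
    """Binary descriptor of the segment in which `point` lies, seen from `center`.

    One character per dividing line (which half-plane) and one per circle
    (inside or outside)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    chars = []
    for k in range(n_lines):
        theta = math.pi * k / n_lines  # direction of the k-th dividing line
        side = math.cos(theta) * dy - math.sin(theta) * dx
        chars.append('1' if side >= 0 else '0')
    dist = math.hypot(dx, dy)
    chars.extend('1' if dist > r else '0' for r in radii)
    return ''.join(chars)

def similarity(a, b):
    """Number of identical characters at the same positions of two descriptors."""
    return sum(x == y for x, y in zip(a, b))

def polar(angle_deg, r):
    """Helper for the example: a point given by angle and distance from the origin."""
    a = math.radians(angle_deg)
    return (r * math.cos(a), r * math.sin(a))

origin = (0.0, 0.0)
d_a = segment_descriptor(origin, polar(10, 5), n_lines=4)    # one segment
d_b = segment_descriptor(origin, polar(50, 5), n_lines=4)    # adjacent segment
d_c = segment_descriptor(origin, polar(190, 5), n_lines=4)   # opposite segment
d_far = segment_descriptor(origin, polar(10, 25), n_lines=4) # same angle, farther out
```

With four lines (eight angular segments), descriptors of adjacent segments differ in exactly one character, while opposite segments share no angular character, as stated above.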
Each spot from the neighborhood of a spot (reference point C) can be assigned an identifier, which is equal to the descriptor of the segment in which the spot lies. This identifier is called the identifier by the view of the reference point. In principle, each point in the pattern can become the reference point. A complete description of the point pattern consisting of n + 1 points is thus given by a matrix of descriptors of all possible mutual point positions (views):
$$ND = \begin{pmatrix} - & C \to 1 & C \to 2 & \cdots & C \to n \\ 1 \to C & - & 1 \to 2 & \cdots & 1 \to n \\ \vdots & & & \ddots & \vdots \\ n \to C & n \to 1 & n \to 2 & \cdots & - \end{pmatrix}$$
where C denotes the match candidate, numbers denote the other neighborhood points, and → denotes the direction of "view" between them; for example, 1 → C denotes the view from point 1 to the reference point (Fig. 2).
The lower triangular part of the description matrix is clearly redundant (view 1 → C bears information equivalent to C → 1). For illustration, the complete descriptor ND of the simple Fig. 2 example is
Figure 1. Neighborhood segments description. (a) C
denotes the matched spot (the reference point here). The
line x = 0 divides the neighborhood plane to the right
(denoted by P ) and left (L) half-planes, and the line y = 0
to the upper (N) and lower (D) half-planes. Neighborhood
segments LN, PN, LD and PD are intersections of the corresponding half-planes L, P, N, and D. Note that the
descriptors of neighboring segments differ only in one
character (e.g., PN and PD ). Descriptors of the opposite
segments have no common characters (e.g., PN and LD).
The same effect can be obtained using only two characters in known order (0, 1), which makes the computer
implementation easier. (b) Example of a more accurate
division of the neighborhood into 8 segments. (c) Concentric circles divide the neighborhood of C into 4 segments
created by intersections of the areas inside and outside
the circles. Descriptors of segments (000, 100, 110, 111)
follow the rules mentioned in (a).
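The more similar the spot neighborhoods are, the more identical characters their NDs have. In Fig. 3, the descriptions of ${}^{m}N$ and ${}^{c}N$ by the view of the C point are ${}^{m}ND_C = \{({}^{m}C \to {}^{m}1), ({}^{m}C \to {}^{m}2), ({}^{m}C \to {}^{m}3), ({}^{m}C \to {}^{m}4)\}$ and ${}^{c}ND_C = \{({}^{c}C \to {}^{c}1), ({}^{c}C \to {}^{c}2), ({}^{c}C \to {}^{c}3)\}$, respectively. Because the two descriptors have different lengths, i.e., $|{}^{m}ND_C| > |{}^{c}ND_C|$, $|{}^{m}N| = 4$ and $|{}^{c}N| = 3$, it is first necessary to find the corresponding point pairs using the similarity matrix
$${}^{m}ND_C + {}^{c}ND_C = \left[({}^{m}C \to {}^{m}i) + ({}^{c}C \to {}^{c}j)\right], \quad i = 1 \ldots {}^{m}n,\ j = 1 \ldots {}^{c}n$$
where ${}^{m}n + 1$ and ${}^{c}n + 1$ are the numbers of ${}^{m}N$ and ${}^{c}N$ points. Operator '+' denotes the similarity between the positions of points, given by the number of identical characters in their descriptors. In the example of Fig. 3, ${}^{m}ND_C$ = {LDEM LDEQ PDOQ PNOM} and ${}^{c}ND_C$ = {PNOQ PNOM LDEQ}. They give a similarity matrix of
$$\begin{array}{c|ccc} & {}^{c}1 & {}^{c}2 & {}^{c}3 \\ \hline {}^{m}1 & 0 & 1 & 3 \\ {}^{m}2 & 1 & 0 & 4 \\ {}^{m}3 & 3 & 2 & 2 \\ {}^{m}4 & 3 & 4 & 0 \end{array}$$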
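The similarity values for this example can be recomputed directly from the two descriptor lists. The Python sketch below assumes that characters are compared position by position and that point pairs are chosen greedily, most similar first; the variable and function names are hypothetical.

```python
# Descriptors of the Fig. 3 example, by the view of the C point.
m_nd = {"m1": "LDEM", "m2": "LDEQ", "m3": "PDOQ", "m4": "PNOM"}
c_nd = {"c1": "PNOQ", "c2": "PNOM", "c3": "LDEQ"}

def similarity(a, b):
    """Number of identical characters at the same positions."""
    return sum(x == y for x, y in zip(a, b))

# Similarity matrix, stored as {(m point, c point): similarity}.
matrix = {(m, c): similarity(dm, dc)
          for m, dm in m_nd.items() for c, dc in c_nd.items()}

def pair_points(matrix):
    """Greedy pairing: repeatedly take the most similar still-unpaired (m, c) pair."""
    pairs, used_m, used_c = [], set(), set()
    for (m, c), s in sorted(matrix.items(), key=lambda kv: -kv[1]):
        if m not in used_m and c not in used_c:
            pairs.append((m, c, s))
            used_m.add(m)
            used_c.add(c)
    return pairs

pairs = pair_points(matrix)
total = sum(s for _, _, s in pairs)  # similarity value of the two neighborhoods
```

For these descriptors the greedy pairing yields the pairs (m2, c3), (m4, c2), and (m3, c1) with similarities 4, 4, and 3, i.e., a neighborhood similarity of 11.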
The resulting pairs of points are those which have the most similar descriptors, i.e., [m2, c3] = 4, [m4, c2] = 4, [m3, c1] = 3. The similarity value of ${}^{m}N$ and ${}^{c}N$ (${}^{m}ND + {}^{c}ND$) is given by the sum of all identical characters in the corresponding spot descriptors: (4 + 4 + 3) = 11.
Figure 2. Simple point pattern of the spot C neighborhood created by 3 surrounding spots represented by points 1, 2, 3. Each point in the neighborhood of C can be assigned an identifier identical to the segment descriptor. For descriptions of the neighborhood segments, refer to Fig. 1. Each point of the pattern can become the reference point. The neighborhood is always described relative to the position of the reference point. (a) Neighborhood segments from the view of point Cr, which is the candidate matched spot and also the reference point. The neighborhood descriptors of this view form the first row of the description matrix. (b) Neighborhood segments from the view of point 1r, another reference point (second row in the description matrix).
Figure 3. Superimposed point patterns of the neighborhoods of the reference spot ${}^{m}C$ and matched spot ${}^{c}C$. The first point pattern consists of points ${}^{m}C$, ${}^{m}1$, ${}^{m}2$, ${}^{m}3$, ${}^{m}4$, i.e., ${}^{m}N = \{{}^{m}C, {}^{m}1, {}^{m}2, {}^{m}3, {}^{m}4\}$. The second consists of points ${}^{c}C$, ${}^{c}1$, ${}^{c}2$, ${}^{c}3$, i.e., ${}^{c}N = \{{}^{c}C, {}^{c}1, {}^{c}2, {}^{c}3\}$.
3 Results
3.1 Gel distortions
Both the visual and quantitative evaluations (Fig. 4) showed considerable differences in the character and size of distortions. Three basic types were identified: unordered, ordered, and translational. Unordered distortions did not show any specific pattern in the distribution of directional vectors; some order could be found only in local areas. In the ordered pattern the directional vectors showed the same direction but the length differed as a function of position. Translational distortions were mostly caused by gel shift; they were very rare. We have thoroughly analyzed our set of 69 gels trying to find some trends, but the distortions were independent of the electrophoretic conditions and other experimental parameters under which the electropherograms were run.
3.2 Transformation
Prior to the matching process, the coordinates of the matched gels were transformed to fit the reference gel coordinate system. Sets of landmarks distributed evenly throughout the gel were selected either manually or by identification of the highest intensity spots (as described in [13]). In our case the landmark selection method based on the identification of spots with the highest intensity did not work for reasons of principle: the electropherograms of the test set came from experiments tracking developmental changes, and the differences in spot intensity between two electropherograms were so great that this method failed. It was applicable only to gels run as repeats. Two methods of transformation were tested: traditional polynomial (used in [7]) and neural networks. Neural networks were tried because of their expected better flexibility. We also expected that some systematic feature common to an electrophoretic run might exist, for which the neural network could be trained. Unfortunately, the gel distortions were found to be random, independent of the gel run and other experimental conditions. The approximation accuracy of the neural network did not show any advantage compared to the polynomial transformation, and therefore neural networks were not used. The analysis of the dependence of the polynomial SE on $\sigma_{|M-C|}$ and $\sigma_{\alpha_{M-C}}$ showed that the approximation accuracy did not depend significantly on distortion features. It was influenced mostly by the polynomial degree and the number and location of landmarks. The best resulting polynomial degree was 3. An approximation using more than 20 landmarks did not decrease SE significantly and therefore was not used. Major changes in the direction and size of distortions were located in the border parts of the gel, as expected. It is evident that an even distribution of landmarks over the gel is essential for a correct transformation.
Figure 4. Examples of gel deformation and their descriptions. (a, b) Vector representation of deformation: (a) strongly unordered deformation; (b) translational deformation. Positions of reference spots are marked by (*); the direction and magnitude of deformation are depicted by oriented vectors. (c, d) Quantitative expressions of the gel deformations shown in (a) and (b). The x (+) (horizontal) and y (*) (vertical) components of displacement vectors are plotted on the vertical axis as a function of the (c) x or (d) y coordinates of the corresponding reference spots (horizontal axis). Variances of displacement vector lengths and angles for (a) were $\sigma_{|M-C|}$ = 28.89, $\sigma_{\alpha_{M-C}}$ = 81.61, and for (b) $\sigma_{|M-C|}$ = 10.82, $\sigma_{\alpha_{M-C}}$ = 3.31.
3.3 Matching
Matching of spots in the transformed electropherograms was accomplished by means of a comparison of spot neighborhoods. Due to the inaccuracy of transformation and local distortions, each spot in the reference gel had several candidates for matching in the matched gel. It was found that, for the determination of a correct match, five possible candidates from the matched gel closest to the reference spot were sufficient. An increase in this number did not raise the number of correctly detected spots; a smaller number of candidate spots caused an increase in the number of incorrect matches. All compared gels were matched to one reference gel.
3.3.1 Directional vectors
The correct match from all candidate matches was determined as $\inf(w_d\,{}^{T}|M-C|_i' + w_{d\Delta}\,d_{\Delta i}' + w_{\alpha\Delta}\,\alpha_{\Delta i}')$, where i = 1...n, and n is the number of possible matches. The final settings were: (i) The values of the weights were estimated empirically as $w_d = 0.8$, $w_{d\Delta} = 0.05$, $w_{\alpha\Delta} = 0.15$. (ii) The numbers of landmarks used to determine directional vector features were estimated empirically as ${}^{LM}n_d = 1$ and ${}^{LM}n_\alpha = 3$.
3.3.2 Neighborhood similarity
The correct match was determined as $\sup({}^{m}ND + {}^{c}ND_i)$, where i = 1...n, and n is the number of possible candidate matches. The final settings were: (i) The best ratio $|{}^{m}N| : |{}^{c}N|$ was found to be 30:20, i.e., for the definition of neighborhood patterns of the reference spots, the 30 nearest spots were used; for the pattern of a candidate matched spot, the 20 nearest spots. The ratio was kept, as it proved to give the highest probability of finding the corresponding point pairs using the difference matrix. (ii) Thirty-two angle segments and ten distance segments were used. (iii) Angles with the line x = 0 going through the reference point of the pattern were computed from the original gels. Distances from the reference point of the pattern were computed from the transformed gels.
3.3.3 Combination of directional vectors and neighborhood similarity
Matching carried out using either the directional vector criterion or the neighborhood similarity criterion alone was not accurate enough. It was impossible to recognize without doubt all correctly and falsely matched spots (i.e., correct matches and unmatched spots of both master and compared electropherograms) by thresholding the values of either the directional vector criterion or the neighborhood similarity. The values of true and falsely assigned spots of both criteria overlapped by 10–20%. The use of a threshold value decreased the number of incorrectly matched spots, but the number of correctly assigned spots decreased as well. Therefore, a spot was considered matched only if it was matched by the two criteria at the same time, i.e., by the MFV together with the NDV combination. Application of this combined criterion caused an increase in the matching accuracy of 10–40%, and the efficiency increased in the range of 0–20%.
3.3.4 Multiple matches
In almost all cases the reference gel contained more spots than the matched gels. Therefore, in certain cases, one spot of the matched gel could be assigned to more than one spot in the reference gel (multiple match) or vice versa. Determination of the correct match from the list of multiple matches was necessary and followed the matching. The correct match was determined using the MFV together with the NDV combination in the same way as in the case of the possible matches. The match with the highest value of the combined criteria was considered to be the correct one; the remaining spots were considered unmatched.
Figure 5. Efficiency and accuracy (vertical axes of a, b) of matching as a function of the percentage of all known matched spots to all spots (horizontal axes). (V) denotes unordered, (*) ordered with low fluctuations, and (*) ordered gel deformations. (a) The efficiency is the ratio of correctly assigned matches to all known matches. (b) The accuracy is the ratio of correctly assigned matches to the found matches.
3.3.5 Match confidence
Confidence of the combined criterion was computed as the ratio of its value for the best candidate match and the median of the combined criterion values of all candidate matches. When histograms of the distribution of confidences for the known correct and false positive matches were plotted, they overlapped to a certain degree. This overlap defined a range of values for high confidence of the correct and incorrect matches and a low confidence region. The distance between the distribution edges allowed us to set a threshold value which undoubtedly defined the correct matches. These spots were defined as new landmarks which were used for a new polynomial transformation, followed by repeated matching. The number of landmarks increased from 25 initial landmarks to 50–500 (5–50%) according to the gel type. The efficiency after the
second round increased in the range of 1–6%, and the accuracy increase was about 0–2.5%. The third round did not show any significant improvement. Figure 5 shows the efficiency and accuracy of matching for different types of distortions as a function of the number of single spots. It is evident that the negative influence of single spots (spots which have no counterpart) clearly predominates over the negative influence of distortion features. If there are fewer than 25% single spots on both gels, the matching results become stable at levels of efficiency of around 95% and accuracy higher than 98%. Unordered deformations do not appear in this region, as they also contained a large number of single spots in both gels. The other types of deformation were evenly distributed in both regions.
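The confidence measure and the promotion of high-confidence matches to landmarks can be sketched in Python as follows. The sketch assumes that a higher combined criterion value is better; the threshold value and all names are hypothetical, since the text gives no numerical threshold.

```python
from statistics import median

def match_confidence(criterion_values):
    """Ratio of the best candidate's criterion value to the median over all candidates."""
    return max(criterion_values) / median(criterion_values)

def promote_to_landmarks(matches, threshold):
    """Keep matches whose confidence reaches the (hypothetical) threshold."""
    return [spot for spot, conf in matches if conf >= threshold]

# Combined criterion values of five candidates for one reference spot (made up).
conf = match_confidence([11, 5, 4, 6, 3])

# (spot id, confidence) pairs for several matched spots (made up).
matches = [("s1", 2.2), ("s2", 1.1), ("s3", 3.0)]
new_landmarks = promote_to_landmarks(matches, threshold=2.0)
```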
3.4 Matching process
The matching process can be summarized as follows:
(i) landmarks setting, (ii) transformation of matched gel,
(iii) matching using composite criterion, (iv) detection of
high confidence matches which become new landmarks,
and (v) go to (ii). Repeating steps (ii)–(v) more than twice brought no improvement in either efficiency or accuracy. The routines were implemented in the Matlab (The MathWorks Inc.) environment using its script language. Matlab is an array-oriented language in which the processing speed is highly influenced by how frequently the individual members of an array are accessed. For this reason we do not estimate the processing speed, which would change greatly when implemented in a lower-level language such as C.
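The loop over steps (ii) to (v) can be outlined in Python as below. This is only a structural sketch: the three callables stand in for the transformation fit, the composite-criterion matching, and the confidence test, and their names and signatures are hypothetical.

```python
def iterative_matching(landmarks, fit, match, is_confident, rounds=2):
    """Repeat: fit transformation, match, promote confident matches to landmarks."""
    matches = []
    for _ in range(rounds):
        transform = fit(landmarks)                          # step (ii)
        matches = match(transform)                          # step (iii)
        landmarks = [m for m in matches if is_confident(m)]  # steps (iv) and (v)
    return matches, landmarks

# Toy stand-ins, only to show the control flow: each round doubles the match count.
matches, landmarks = iterative_matching(
    landmarks=[0, 1],
    fit=lambda lm: len(lm),
    match=lambda t: list(range(2 * t)),
    is_confident=lambda m: True,
)
```

Two rounds are used as the default because further repetitions brought no improvement according to the results above.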
4 Discussion
In this paper we present a point pattern recognition method for matching pairs of electropherograms. The principle consists in comparing point patterns formed by the closest neighborhood of a candidate matched spot with the neighborhood of the spot in the reference gel. The point patterns were converted to a set of syntactic descriptors based on a division of the spot neighborhood into defined segments. Each point of the point pattern can be described by its position in one of the segments of the neighborhood of the reference and candidate spots. Quantitative criteria describing the goodness of match and the similarity of the point patterns were derived.
This approach alone can reliably identify about 90% of the spots in a typical gel pair. This may sound successful but, in practice, it means that for a typical gel with one thousand spots, one hundred are not matched. To improve the results of matching we used another feature of gel distortions. In almost all cases a small area surrounding a spot, where the direction and length of directional vectors are coherent, can be found. Therefore a criterion based on information about the spots surrounding the spot to be matched was derived. The higher the similarity between the directional vectors of the spot surroundings and the directional vector of the candidate spot, the better the value of the criterion. Combining the criteria describing the spot neighborhood pattern and the spot directional vector led to an improvement in matching efficiency and accuracy. In the best cases the efficiency, defined as the proportion of correctly found matches among all known matches, reached the required value of 98%. Moreover, the accuracy, which is the proportion of correct matches among the found matches, reached the same value. For automatic analysis the accuracy of matching is even more important than the efficiency. It is always better to know that the matches found by the algorithm are correct than to find all matches, but with errors. In the latter case a detector of mismatches has to be applied, and we are back to the situation of lower efficiency with higher accuracy.
The principle of spot matching using information about the spot surroundings has been applied previously. The principal disadvantage of this method was the existence of borders in the segmentation of spot neighborhoods. When each segment is described by its unique identifier, the identifier loses information about the neighboring segments. In practice this means that spots close to the borders of a segment in the reference pattern can appear in different segments in the matched pattern, and thus obtain a different descriptor. This invalidates its correct interpretation, the reliability of the whole pattern description, and finally the whole pattern comparison. We tried to bypass this problem by introducing a composite syntactic description which contains information about the whole neighborhood of all segments in its formulation. In the border case mentioned above, the two spot descriptors will differ only in one character, as the adjacent segments differ only in one character of the multicharacter segment descriptor as well. The longer the descriptor is, the lower the influence of one single character and of the error of assignment of a point to a segment will be. Nevertheless, the length of the segment descriptor (proportional to the total number of segments) has its practical limits, proportional to the average distance among spots in the reference and matched gels. We have experimentally derived a reasonable partitioning of the spot surroundings.
A principal question that also exists pertains to the proportion of indistinguishable matches on a gel, i.e., the existence of two or more candidate matches that have identical neighborhoods and almost identical positions. Our estimate is that it represents about 1–2% of all spots in a gel. In this case some independent parameter has to be
introduced. The required feature can be the spot densities
in the neighborhood. This, however, has another drawback:
the gels are run to trace changes in spot densities;
therefore, in principle, the densities have to change, and a
criterion that includes the spot densities is inherently
unreliable. When only a small number of spots change, this
criterion is acceptable, but in other cases it is not.
Therefore, without a priori knowledge of spot density
fluctuations, we cannot apply this criterion and the
level of inaccuracy will remain at 1–2%. Hence
the 98% efficiency is probably the absolute limit. The
advantage of our algorithm is that, in a typical gel, its
efficiency is around 95%, while the accuracy is always higher
than the efficiency. Furthermore, the algorithm can assign
each match a level of significance and can
identify multiple matches (matches for which there is more
than one equivalent candidate).
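Flagging of multiple matches can be illustrated as follows. The sketch assumes that each candidate spot in the compared gel already carries a similarity score between 0 and 1 for a given reference spot, and it treats all candidates whose score lies within a tolerance of the best score as equivalent. The function name, the tolerance value, and the tie rule are hypothetical; the combined criterion actually used by the algorithm is not reproduced here.

```python
from typing import Dict, List, Tuple


def flag_multiple_matches(
    candidates: Dict[str, float], tol: float = 0.02
) -> Tuple[str, List[str]]:
    """Return the best candidate and every candidate within tol of its score.

    candidates: similarity score per candidate spot (higher is more similar)
    tol: hypothetical tolerance below which two scores count as equivalent
    """
    if not candidates:
        raise ValueError("no candidates given")
    best_id = max(candidates, key=candidates.get)
    best_score = candidates[best_id]
    # More than one entry in this list marks the match as a multiple match.
    equivalents = sorted(c for c, s in candidates.items() if best_score - s <= tol)
    return best_id, equivalents


best, equivalents = flag_multiple_matches({"s1": 0.91, "s2": 0.90, "s3": 0.40})
is_multiple = len(equivalents) > 1
print(best, equivalents, is_multiple)
```

A match flagged in this way cannot be resolved from neighborhood and position alone and would need the independent parameter discussed above.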
The total performance of matching is drastically influenced
by the number of unmatched spots (see Fig. 5).
When single spots make up less than 25% of the spots in both
gels, the algorithm correctly finds more than 95% of all matched
spots. The accuracy and efficiency of matching are hardly
influenced by the type of distortion: the points in the graph
of Fig. 5 are distributed evenly for all types of distortions.
Unordered matches do not appear in the high confidence
region because they also contain large proportions of single spots.
When the single spots were removed (data not shown),
the algorithm worked with the same efficiency and accuracy
for this type of distortion as for the other types. The
reason for the strong influence of single spots still lies in
the existence of borders in the spot pattern definition.
Even if the influence of segmentation of the surroundings is
minimized by the type of description we used, the principle
of segmentation, and thus the existence of borders,
does not disappear. We have probably reached the limits
of this approach, and a different principle probably has to be
used to increase the performance of point pattern matching.
This does not mean that, for practical purposes, the
given algorithm cannot be used. In most cases the gel
images appear in the high confidence region, where the
performance is higher than 95%. In any case, any detector
of single spots would greatly increase the performance
of matching.
When looking at a pair of superimposed electropherograms,
we are able to easily recognize similar point patterns
and to find matching pairs of spots by comparing
the neighborhoods. For a novice it is always more difficult
but, in time, the user can identify any matches quite reliably;
the neural network of the brain has become trained.
Mimicking this process is probably the only way to find a
sufficiently reliable spot matching algorithm. The spot
neighborhood similarity seems to be the only reliable spot
pair feature by which the point pattern matching task
could be solved. In principle, we are able to identify
distorted, zoomed, and, to a certain level, incomplete patterns.
This suggests that, for some machine learning approaches,
a dimensionless description of a point pattern
has to be used as the input to the engine. Instead of
developing a single spot detector, which would improve matching
results but would not solve the problem in principle, we
are currently working on this approach. Any two-dimensional
object (in principle, of any dimension) can be reduced
to a point pattern. Matching of point patterns is thus not
strictly limited to matching of electropherograms but has a
much broader impact on pattern recognition in general. A
2-D gel is an excellent test case, as it comprises all
principal problems of point pattern matching on a limited scale.
This work was supported by GAČR grants No. 204/95/0636
and No. 203/98/0422.
Received June 15, 1999
5 References
[1] Murtagh, F., ICPR 92 1992, 174–177.
[2] Murtagh, F., Publications of the Astronomical Society of the
Pacific 1992, 104, 301–307.
[3] Lemkin, P. F., Lipkin, L. E., Comput. Biomed. Res. 1981,
14, 355–380.
[4] Lemkin, P. F., Lipkin, L., Comput. Biomed. Res. 1981, 14,
407–446.
[5] Lemkin, P. F., Lipkin, L. E., in: Allen, R. C., Arnaud, P.
(Eds.), Electrophoresis '81, de Gruyter, Berlin 1981,
pp. 401–411.
[6] Vincens, P., Tarroux, P., Electrophoresis 1987, 8, 100–107.
[7] Anderson, N. L., Taylor, J., Scandora, A. E., Coulter, B. P.,
Anderson, N. G., Clin. Chem. 1981, 27, 1808–1820.
[8] Appel, R. D., Palagi, P. M., Walther, D., Vargas, J. R., Sanchez, J.-C., Ravier, F., Pasquali, C., Hochstrasser, D. F.,
Electrophoresis 1997, 18, 2724–2734.
[9] Skolnick, M. M., Sternberg, S. R., Neel, J. V., Clin. Chem.
1982, 28, 969–978.
[10] Skolnick, M. M., Computer Vision, Graphics, and Image
Processing 1986, 35, 306–332.
[11] Pardowitz, I., Zimmer, H. G., Neuhoff, V., in: Neuhoff, V.
(Ed.), Electrophoresis '84, Verlag Chemie, Weinheim 1984,
pp. 315–316.
[12] Miller, M. J., Olson, A. D., Thorgeirsson, S. S., Electrophoresis 1984, 5, 297–303.
[13] Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D., Hochstrasser, D. F., Electrophoresis 1997, 18, 2735–2748.
[14] Garrels, J. I., J. Biol. Chem. 1989, 264, 5269–5282.