Electrophoresis 1999, 20, 3483–3491

Point pattern matching in the analysis of two-dimensional gel electropherograms

Josef Pánek
Jiří Vohradský

Institute of Microbiology, Czech Academy of Sciences, Prague, Czech Republic

In the automation of proteome analysis, matching of two-dimensional (2-D) electropherograms represents a bottleneck in the process. Here we present a point pattern recognition approach to the matching of spots in 2-D electropherograms. The algorithm is based on a comparison of spot neighborhoods, converted to point patterns, between reference and compared gels. The neighborhood was characterized by a syntactic descriptor which minimized the influence of spot displacements. A combined criterion utilizing the similarity of point patterns and a metric definition of position similarity was derived. The efficiency and accuracy of the algorithm were tested on a set of 69 gels with different levels of similarity. For a typical gel the accuracy of matching was higher than 98% and the number of correctly identified spots was higher than 95%.

Keywords: Point pattern matching / Two-dimensional electrophoresis / Pattern recognition / Proteome
EL 3663

Correspondence: Dr. J. Vohradský, Institute of Microbiology, Czech Academy of Sciences, Vídeňská 1083, 14220 Prague-4, Czech Republic
E-mail: [email protected]
Fax: +42-02-47-222-57

WILEY-VCH Verlag GmbH, 69451 Weinheim, 1999 0173-0835/99/1818-3483 $17.50+.50/0

Proteomics and 2-DE

1 Introduction

Point pattern recognition is one of the most powerful methods for identifying objects in images. As any two-dimensional object (in principle, an object of any dimension) can be simplified to a point pattern, the identification of a point pattern can serve to recognize defined objects. The most intuitive use of point pattern recognition is in the case of "natural" point patterns, such as the comparison of star maps [1, 2]. Here the point pattern was defined by a central point (in principle any star) and all other neighboring points of the star map were defined "by the view" of the central point ("world view"). The formal description of the surroundings was then compared with all possible descriptions from the compared map, and the star with the most similar neighborhood was defined as the matching one.

This example illustrates an approach to the problem of matching based on the comparison of neighborhood point patterns. A two-dimensional electropherogram is one of the "natural" point patterns as well. Each protein spot in an electropherogram can be characterized by the position of its centroid in the coordinate system of the gel. The electropherogram can then be viewed as a point pattern or a set of smaller local point patterns. Matching of two-dimensional gels, that is, identification of the spots belonging to the same protein on two different gels or a set of gels, can then be viewed as a comparison of point patterns. This approach was used, for example, by Lemkin and Lipkin [3–5] to match spots in the GELLAB system. A similar approach, but using a syntactic description of the neighborhood, was published by Vincens and Tarroux [6].

Other approaches using different features of gel images have also been used. The first systems utilized the principle of transforming the coordinate system of a matched gel to fit the reference gel using a set of user-defined landmark spots. Each landmark spot defined a known match and was introduced by the user. A set of landmark coordinates was then loaded into a least squares minimizer and the parameters of a known function (usually a polynomial) were computed. The coordinate system of the matched gel was recomputed, using the known function, to the coordinate system of the reference gel. The transformed gels were then overlaid and the matched spots were defined as those closest to each other [7]. This principle was powerful enough to have survived as a base for modern systems to date [8]. Another approach, first published by Skolnick et al. [9] and Skolnick [10], was based on conversion of a spot point pattern to a specific graph (Gabriel graph) where the spots became nodes of the graph. The match was defined by comparing the graphs. Variations of this method exist that are based, for example, on a metric description of the graph [11]. The ELSIE/MELANIE system used a comparison of spot clusters, instead of simplified point patterns, for spot matching [12, 13] and a probabilistic criterion for the definition of a correct match. Garrels [14] published a spot matching algorithm based on propagation of a user-defined landmark spot, which served as the initial match. The next match was determined by the spot in the neighborhood of the landmark spot which showed the best match. This spot became a landmark and the algorithm continued until no further sufficiently good match was found in the surroundings of the new landmark. Then the next landmark was defined and the algorithm looped until no new matches were found.

Some of the algorithms mentioned above were integrated into commercially available gel analysis packages. We have not found any publications that analyze the performance of each of the algorithms or cross-compare their efficiency and accuracy. From our empirical experience with some commercial packages it can be concluded that most of the algorithms work well for similar gels with small changes in both the spot density and spot presence. When the distortions and changes in spot intensity become larger, the rate of misidentifications increases and the number of detected matches decreases. In times of automated proteome analysis, spot matching becomes the bottleneck of the whole process, as it requires time-consuming operator interaction and manual corrections. The introduction of a highly reliable and efficient algorithm would greatly improve the efficiency of the whole process.
In this paper we present the point pattern comparison approach for matching pairs of gels based on a syntactic description of the spot neighborhood. For a typical pair of gels the algorithm shows an efficiency, defined by the number of detected matches, higher than 95% and an accuracy higher than 98%.

2 Materials and methods

2.1 Test set

For the testing of algorithms, data from two match sets created during a study of Streptomyces coelicolor development (30 and 39 gels) were used. Original radioactive gel images were exposed to a film and processed using pdQuest (PDI Inc.) gel analysis software. After manual corrections of matches and mismatches, the matching was carefully checked again. Data matrices containing spot position coordinates on different gels in the set were exported from pdQuest and used as the testing standard.

2.2 Gel distortions

All electropherograms were matched to one standard gel chosen from the gels of the match set. "Distortions" of a gel represent differences in the coordinates of two identical spots in the reference and matched gels. They were always considered relative to the reference gel. The two spots can be connected by an oriented vector whose features can be used to describe the overall level of distortion of the matched gel.

Let $\overrightarrow{CM} = M - C$ be the directional vector between the spot center $M = [{}^{m}x, {}^{m}y]$ on the reference gel and $C = [{}^{c}x, {}^{c}y]$, the same spot on the matched gel. Then

$|M - C| = \sqrt{({}^{m}x - {}^{c}x)^2 + ({}^{m}y - {}^{c}y)^2}$ (1)

is its length and

$\alpha_{M-C} = \arctan\dfrac{{}^{m}y - {}^{c}y}{{}^{m}x - {}^{c}x}$ (2)

is its angle with the horizontal axis. The variances $\sigma_{|M-C|}$ and $\sigma_{\alpha_{M-C}}$ of the directional vectors of all spots together provide a quantitative measure of the overall electropherogram distortions.

2.3 Transformation of matched gel

Prior to searching for matching spots, global distortions were eliminated by a polynomial transformation of the coordinate system of the matched electropherogram.
The polynomials were used to approximate the coordinates of a small number of known matched spots entered by the user (landmarks), distributed throughout the whole electropherogram. The resulting polynomial approximation scheme was

${}^{LM,c}x = f_x({}^{LM,m}x, {}^{LM,m}y)$, (3)

${}^{LM,c}y = f_y({}^{LM,m}x, {}^{LM,m}y)$ (4)

where $({}^{LM,c}x, {}^{LM,c}y)$ and $({}^{LM,m}x, {}^{LM,m}y)$ are the landmark coordinates of the matched and reference electropherograms, respectively, followed by the transformation

${}^{T,c}x = f^{pol}_x({}^{m}x, {}^{m}y)$, (5)

${}^{T,c}y = f^{pol}_y({}^{m}x, {}^{m}y)$ (6)

where $({}^{T,c}x, {}^{T,c}y)$ are the transformed coordinates of the matched electropherogram and $({}^{m}x, {}^{m}y)$ the coordinates of the reference electropherogram. Polynomial coefficients were computed using a Vandermonde matrix construction. The system of linear equations was solved by QR decomposition. Functional values of the polynomials were computed using the Horner scheme.

2.3.1 Transformation accuracy

For evaluation of the transformation accuracy, the sum of distances SE between spots in the reference and transformed matched gel was chosen:

$SE = \sum_{i} |M_i - {}^{T}C_i|$ (7)

where the index T denotes the transformed matched gel and the sum runs over the spot pairs i.

2.4 Matching

Due to local distortions and transformation inaccuracy, matching spot positions do not overlap sufficiently to determine the match unambiguously. In other words, the spot in the matched gel closest to a spot in the reference gel is not necessarily the correct match. All spots in the neighborhood of this spot in the matched gel have to be considered as possible matches instead. The matching process thus investigates all possible matches (candidate spots). At the beginning, a spot in the reference gel (reference spot) was chosen. The spot in the transformed matched gel closest to the reference spot was found. This was the first candidate spot for a match. From the local neighborhood of the candidate spot a small set of spots was chosen as other possible candidates. The accuracy of matching was determined for all candidate spots and their neighborhoods.
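The two steps described so far, the landmark-based polynomial transformation (Section 2.3) and the selection of candidate spots, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab routines: the polynomial basis is assumed to contain all monomials x^i y^j with i + j not exceeding the degree, the mapping is fitted from matched-gel to reference-gel coordinates, NumPy's QR routine stands in for the original solver, and the polynomial is evaluated through the design matrix rather than by the Horner scheme. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def design_matrix(xy, degree=3):
    """Vandermonde-type matrix: one column per monomial x**i * y**j, i + j <= degree."""
    x, y = xy[:, 0], xy[:, 1]
    cols = [x**i * y**j for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.column_stack(cols)

def fit_polynomial(lm_matched, lm_reference, degree=3):
    """Least-squares polynomial coefficients from landmark pairs, solved via QR."""
    q, r = np.linalg.qr(design_matrix(lm_matched, degree))
    return np.linalg.solve(r, q.T @ lm_reference)  # shape (n_terms, 2)

def transform(coeffs, xy, degree=3):
    """Map matched-gel coordinates into the reference coordinate system."""
    return design_matrix(xy, degree) @ coeffs

def candidate_spots(ref_spot, transformed_spots, k=5):
    """Indices of the k transformed spots closest to a reference spot, nearest first."""
    dist = np.linalg.norm(transformed_spots - ref_spot, axis=1)
    return np.argsort(dist)[:k]

def distort(p):
    """Hypothetical smooth cubic distortion used only to generate test data."""
    x, y = p[:, 0], p[:, 1]
    return np.column_stack([x + 0.05 * x * y + 0.02 * y**2,
                            y - 0.03 * x**2 + 0.01 * x**3])

# Synthetic demonstration (made-up coordinates, not data from the paper).
rng = np.random.default_rng(0)
matched = rng.uniform(0.0, 1.0, size=(200, 2))
reference = distort(matched)
landmarks = rng.choice(200, size=20, replace=False)  # 20 landmark pairs
coeffs = fit_polynomial(matched[landmarks], reference[landmarks])
warped = transform(coeffs, matched)
cands = candidate_spots(reference[0], warped)  # candidates for reference spot 0
```

A cubic basis has ten terms per coordinate, so at least ten well-spread landmarks are needed for the fit to be determined; on real gels the residual after transformation is not zero, which is why several candidates per reference spot are kept.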
Among this set, the spot with the best matching characteristics was chosen as the best possible match and the value of goodness of match was assigned to this spot.

2.4.1 Directional vectors

Goodness of match was characterized by a criterion derived from the directional vector of the candidate match and the directional vectors of the closest landmarks. The length and angle of the directional vector of each candidate match were calculated together with the lengths and angles of the directional vectors of the closest landmarks, and the differences between the characteristics of the candidate match and of the closest landmarks were computed. The resulting criterion was defined as

$w_d\,{}^{T}|M - C|' + w_{dD}\,dD' + w_{aD}\,aD'$,

where $w_d$, $w_{dD}$, and $w_{aD}$ are weights whose sum is 1. They express the different significance of the criterion components ${}^{T}|M - C|$, $dD$, and $aD$. ${}^{T}|M - C|$ represents the Euclidean length of the directional vector calculated using the transformed gel, and $dD$ expresses the difference between the Euclidean lengths of the vectors of the nearest landmarks and the Euclidean length of the vector of the candidate match:

$dD = \sum_{j=1}^{{}^{LM}n_d} \bigl|\,|{}^{LM}M_j - {}^{LM}C_j| - |M - C|\,\bigr|$,

where ${}^{LM}n_d$ is the number of the nearest landmarks. Similarly,

$aD = \sum_{j=1}^{{}^{LM}n_a} \bigl|\alpha_{{}^{LM}M_j - {}^{LM}C_j} - \alpha_{M-C}\bigr|$

expresses the differences in absolute angles between the candidate spot and a landmark, where ${}^{LM}n_a$ is the number of the nearest landmarks. $dD$ and $aD$ were computed using the original gel coordinates, as the transformation did not retain the original similarity of gel distortions. The criteria were normalized by dividing the individual values by the mean of all values of the candidate matches, i.e.

${}^{T}|M - C|'_i = \dfrac{{}^{T}|M - C|_i}{\frac{1}{n}\sum_{k=1}^{n} {}^{T}|M - C|_k}$, $dD'_i = \dfrac{dD_i}{\frac{1}{n}\sum_{k=1}^{n} dD_k}$ and $aD'_i = \dfrac{aD_i}{\frac{1}{n}\sum_{k=1}^{n} aD_k}$,

where i = 1...n, and n is the number of candidate matches.

The distance criteria are size (magnification)-dependent; if images of different sizes are used, they must be normalized to the standard width and height prior to the analysis. This normalization does not influence the image proportions or the geometrical properties of the point patterns.
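The directional vector criterion of Section 2.4.1 can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: the "closest landmarks" are taken to be those nearest to the reference spot, the differences to the landmarks are summed as absolute values, and angle differences are wrapped to the range 0 to 180 degrees. The default weights and landmark counts are the empirical settings reported in Section 3.3.1; the function names and the toy coordinates are hypothetical.

```python
import numpy as np

def length_and_angle(ref, cmp_):
    """Length and angle (degrees) of the directional vectors M - C."""
    d = np.asarray(ref, dtype=float) - np.asarray(cmp_, dtype=float)
    return np.hypot(d[..., 0], d[..., 1]), np.degrees(np.arctan2(d[..., 1], d[..., 0]))

def normalize(values):
    """Divide by the mean over all candidates (left unchanged if the mean is zero)."""
    mean = values.mean()
    return values / mean if mean > 0 else values

def directional_criterion(ref_spot, cand_orig, cand_trans, lm_ref, lm_cmp,
                          weights=(0.8, 0.05, 0.15), n_d=1, n_a=3):
    """Combined criterion for each candidate; the smallest value is the best match."""
    ref_spot = np.asarray(ref_spot, dtype=float)
    # Length of the directional vector in the transformed gel.
    t_len = np.linalg.norm(ref_spot - cand_trans, axis=1)
    # Candidate and landmark vectors in the original (untransformed) coordinates.
    c_len, c_ang = length_and_angle(ref_spot, cand_orig)
    lm_len, lm_ang = length_and_angle(lm_ref, lm_cmp)
    # Assumption: landmarks are ranked by distance to the reference spot.
    order = np.argsort(np.linalg.norm(lm_ref - ref_spot, axis=1))
    d_diff = np.abs(c_len[:, None] - lm_len[order[:n_d]][None, :]).sum(axis=1)
    a_diff = np.abs(c_ang[:, None] - lm_ang[order[:n_a]][None, :])
    a_diff = np.minimum(a_diff, 360.0 - a_diff).sum(axis=1)  # wrap around 360 degrees
    w_d, w_dd, w_ad = weights
    return w_d * normalize(t_len) + w_dd * normalize(d_diff) + w_ad * normalize(a_diff)

# Toy example: every landmark is displaced by (2, 0) between the two gels.
lm_ref = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])
lm_cmp = lm_ref - np.array([2.0, 0.0])
ref_spot = np.array([10.0, 10.0])
cand_orig = np.array([[8.0, 10.0], [10.0, 13.0]])  # candidate 0 follows the landmarks
cand_trans = cand_orig + np.array([2.0, 0.0])      # idealized transformation
crit = directional_criterion(ref_spot, cand_orig, cand_trans, lm_ref, lm_cmp)
best = int(np.argmin(crit))
```

In this toy case the first candidate shares the displacement of the surrounding landmarks and coincides with the reference spot after transformation, so all three components vanish for it and it is selected.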
2.4.2 Spot neighborhoods

Each electrophoretic spot can be assigned a neighborhood consisting of a defined number of the closest spots. Considering only the spot centroids, the spot neighborhood converts to a point pattern. The idea is that, due to the same electrophoretic conditions, the point patterns of the neighborhoods of matching spots in the reference and compared gels have to be more similar than the point patterns of other spot pairs, even when the point patterns are incomplete (e.g., missing spots). To define the similarity of spot patterns, an efficient neighborhood descriptor (ND) has to be derived. The matching is then reduced to a comparison of the descriptors.

The surroundings of a spot can be divided by a straight line leading through the spot center, or by a circle with its center in the middle of the spot, leading to two half-planes. Each half-plane can be assigned a unique identifier (e.g., a character). Generally, an infinite number of half-plane pairs can be created; in practice, 20–40 were used. Each particular segment of a spot's surroundings can be created as an intersection of the given number of half-planes (see Fig. 1). The identifier of the segment is then a union of the identifiers of all half-planes whose nonempty intersection forms the segment (e.g., segment LN in Fig. 1 is the intersection of half-planes L and N). Descriptors of neighboring segments thus differ in only one character. As the distance between segments grows, the number of different characters in the descriptors grows as well. Opposite segments have no common character.

Each spot from the neighborhood of a spot (reference point C) can be assigned an identifier, which is equal to the descriptor of the segment in which the spot lies. This identifier is called the identifier by the view of the reference point. In principle, each point in the pattern can become the reference point.
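The segment descriptor just described can be sketched with two-character codes (0 and 1), as suggested in the caption of Fig. 1. In this minimal sketch each straight line through the reference point contributes one character that records on which side of the line a neighboring point lies, and each concentric circle contributes one character that records whether the point lies outside it. The evenly spaced line directions, the function names, and the radii in the checks are assumptions made for illustration; 16 lines give the 32 angle segments mentioned in Section 3.3.2.

```python
import math

def angle_code(dx, dy, n_lines=16):
    """One character per line through the reference point: '1' on its positive side."""
    chars = []
    for i in range(n_lines):
        theta = math.pi * i / n_lines  # direction of line i
        side = math.cos(theta) * dy - math.sin(theta) * dx  # signed side of the line
        chars.append("1" if side >= 0 else "0")
    return "".join(chars)

def ring_code(dist, radii):
    """One character per concentric circle: '1' if the point lies outside it."""
    return "".join("1" if dist >= r else "0" for r in radii)

def descriptor(ref, pt, radii, n_lines=16):
    """Identifier of point pt 'by the view' of the reference point ref."""
    dx, dy = pt[0] - ref[0], pt[1] - ref[1]
    return angle_code(dx, dy, n_lines) + ring_code(math.hypot(dx, dy), radii)

def similarity(a, b):
    """Number of positions at which two descriptors carry the same character."""
    return sum(x == y for x, y in zip(a, b))
```

With this encoding a point that drifts across a segment border changes only one character of its descriptor, while points in opposite angular segments share none of the angle characters, which is the property required of the descriptor in the text above.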
A complete description of the point pattern consisting of n + 1 points is thus given by a matrix of descriptors of all possible mutual point positions (views):

$ND = \begin{pmatrix} - & C \to 1 & C \to 2 & \cdots & C \to n \\ 1 \to C & - & 1 \to 2 & \cdots & 1 \to n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n \to C & n \to 1 & n \to 2 & \cdots & - \end{pmatrix}$

where C denotes the match candidate, numbers denote the other neighborhood points, and $\to$ denotes the direction of "view" between them; for example, $1 \to C$ denotes the view from point 1 to the reference point (Fig. 2). The lower triangular part of the description matrix is clearly redundant (view $1 \to C$ bears information equivalent to $C \to 1$). For illustration, the complete descriptor ND of the simple example in Fig. 2 is [...]

Figure 1. Neighborhood segments description. (a) C denotes the matched spot (the reference point here). The line x = 0 divides the neighborhood plane into the right (denoted by P) and left (L) half-planes, and the line y = 0 into the upper (N) and lower (D) half-planes. Neighborhood segments LN, PN, LD and PD are intersections of the corresponding half-planes L, P, N, and D. Note that the descriptors of neighboring segments differ in only one character (e.g., PN and PD). Descriptors of opposite segments have no common characters (e.g., PN and LD). The same effect can be obtained using only two characters in a known order (0, 1), which makes the computer implementation easier. (b) Example of a more accurate division of the neighborhood into 8 segments. (c) Concentric circles divide the neighborhood of C into 4 segments created by intersections of the areas inside and outside the circles. Descriptors of segments (000, 100, 110, 111) follow the rules mentioned in (a).

The more similar the spot neighborhoods are, the more identical characters their NDs have. In Fig. 3, the descriptions of ${}^{m}N$ and ${}^{c}N$ by the view of the C point are ${}^{m}ND_C = \{({}^{m}C \to {}^{m}1), ({}^{m}C \to {}^{m}2), ({}^{m}C \to {}^{m}3), ({}^{m}C \to {}^{m}4)\}$ and ${}^{c}ND_C = \{({}^{c}C \to {}^{c}1), ({}^{c}C \to {}^{c}2), ({}^{c}C \to {}^{c}3)\}$, respectively.
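Counting identical characters between two descriptions can be sketched as follows. Because two neighborhoods may contain different numbers of points, the sketch tabulates the character matches for every pair of points and then pairs the points. The descriptor strings are those quoted for the Fig. 3 example in this section; the greedy pairing rule (repeatedly taking the largest remaining entry) and all function names are assumptions of this sketch, since the text does not state how the pairs are extracted from the table.

```python
def similarity(a, b):
    """Number of positions at which two descriptors carry the same character."""
    return sum(x == y for x, y in zip(a, b))

def similarity_matrix(ref_desc, cmp_desc):
    """Rows: points of the reference neighborhood; columns: points of the compared one."""
    return [[similarity(a, b) for b in cmp_desc] for a in ref_desc]

def pair_points(matrix):
    """Greedy pairing (an assumption): take the largest remaining entry each time."""
    entries = sorted(
        ((value, i, j) for i, row in enumerate(matrix) for j, value in enumerate(row)),
        key=lambda e: -e[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for value, i, j in entries:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j, value))
            used_rows.add(i)
            used_cols.add(j)
    return pairs

# Descriptors of the Fig. 3 example, by the view of the C point.
m_nd = ["LDEM", "LDEQ", "PDOQ", "PNOM"]  # reference points m1..m4
c_nd = ["PNOQ", "PNOM", "LDEQ"]          # compared points c1..c3
S = similarity_matrix(m_nd, c_nd)
pairs = pair_points(S)                   # (row index, column index, similarity)
total = sum(value for _, _, value in pairs)
```

For these strings the pairing yields [m2, c3], [m4, c2] and [m3, c1] with similarities 4, 4 and 3, in agreement with the values given in the caption of Fig. 3; point m1 remains unpaired.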
Because the two descriptors have different lengths, i.e., $|{}^{m}ND_C| > |{}^{c}ND_C|$, $|{}^{m}N| = 4$ and $|{}^{c}N| = 3$, it is first necessary to find the corresponding point pairs using the similarity matrix

$\begin{pmatrix} ({}^{m}C \to {}^{m}1) + ({}^{c}C \to {}^{c}1) & \cdots & ({}^{m}C \to {}^{m}1) + ({}^{c}C \to {}^{c}n) \\ \vdots & \ddots & \vdots \\ ({}^{m}C \to {}^{m}n) + ({}^{c}C \to {}^{c}1) & \cdots & ({}^{m}C \to {}^{m}n) + ({}^{c}C \to {}^{c}n) \end{pmatrix}$

where ${}^{m}n + 1$ and ${}^{c}n + 1$ are the numbers of points of ${}^{m}N$ and ${}^{c}N$. The operator '+' denotes the similarity between the positions of points, given by the number of identical characters in their descriptors. In the example of Fig. 3, ${}^{m}ND_C$ = {LDEM, LDEQ, PDOQ, PNOM} and ${}^{c}ND_C$ = {PNOQ, PNOM, LDEQ}. With rows corresponding to points m1 to m4 and columns to points c1 to c3, they give a similarity matrix of

$\begin{pmatrix} 0 & 1 & 3 \\ 1 & 0 & 4 \\ 3 & 2 & 2 \\ 3 & 4 & 0 \end{pmatrix}$

Figure 2. Simple point pattern of the neighborhood of spot C created by 3 surrounding spots represented by points 1, 2, 3. Each point in the neighborhood of C can be assigned an identifier identical to the segment descriptor. For descriptions of the neighborhood segments, refer to Fig. 1. Each point of the pattern can become the reference point. The neighborhood is always described relative to the position of the reference point. (a) Neighborhood segments from the view of point Cr, which is the candidate matched spot and also the reference point. The neighborhood descriptors of this view form the first row of the description matrix. (b) Neighborhood segments from the view of point 1r, another reference point (second row in the description matrix).

Figure 3. Superimposed point patterns of the neighborhoods of the reference spot ${}^{m}C$ and the matched spot ${}^{c}C$. The first point pattern consists of points ${}^{m}C$, m1, m2, m3, m4, i.e., ${}^{m}N$ = {${}^{m}C$, m1, m2, m3, m4}. The second consists of points ${}^{c}C$, c1, c2, c3, i.e., ${}^{c}N$ = {${}^{c}C$, c1, c2, c3}. The resulting pairs of points are those which have the most similar descriptors, i.e., [m2, c3] = 4, [m4, c2] = 4, [m3, c1] = 3. The similarity value of ${}^{m}N$ and ${}^{c}N$ (${}^{m}ND + {}^{c}ND$) is given by the sum of all identical characters in the corresponding spot descriptors: (4 + 4 + 3) = 11.

3 Results

3.1 Gel distortions

Both the visual and quantitative evaluations (Fig. 4) showed considerable differences in the character and size of distortions. Three basic types were identified: unordered, ordered, and translational. Unordered distortions did not show any specific pattern in the distribution of directional vectors; some order could be found only in local areas. In the ordered pattern the directional vectors showed the same direction but their length differed as a function of position. Translational distortions were mostly caused by gel shift; they were very rare. We thoroughly analyzed our set of 69 gels trying to find some trends, but the distortions were independent of the electrophoretic conditions and other experimental parameters under which the electropherograms were run.

3.2 Transformation

Prior to the matching process, the coordinates of the matched gels were transformed to fit the reference gel coordinate system. Sets of landmarks distributed evenly throughout the gel were selected either manually or by identification of the highest intensity spots (as described in [13]). In our case the landmark selection method based on the identification of spots with the highest intensity did not work, for reasons of principle. Electropherograms of the test set came from experiments tracking developmental changes. The differences in spot intensity between two electropherograms were so great that this method failed. It was applicable only to gels in the repeats.

Two methods of transformation were tested: the traditional polynomial transformation (used in [7]) and neural networks. Neural networks were tried because of their expected greater flexibility. We also expected that some systematic feature common to an electrophoretic run might exist for which the neural network could be trained. Unfortunately, the gel distortions were found to be random, independent of the gel run and other experimental conditions. The approximation accuracy of the neural network did not show any advantage compared to the polynomial transformation and therefore neural networks were not used.
The analysis of the dependence of the polynomial SE on $\sigma_{|M-C|}$ and $\sigma_{\alpha_{M-C}}$ showed that the approximation accuracy did not depend significantly on the distortion features. It was influenced mostly by the polynomial degree and by the number and location of landmarks. The best resulting polynomial degree was 3. An approximation using more than 20 landmarks did not decrease SE significantly and therefore was not used. Major changes in the direction and size of distortions were located in the border parts of the gel, as expected. It is evident that an even distribution of landmarks over the gel is essential for a correct transformation.

Figure 4. Examples of gel deformation and their descriptions. (a, b) Vector representation of deformation: (a) strongly unordered deformation; (b) translational deformation. Positions of reference spots are marked by (*); the direction and magnitude of deformation are depicted by oriented vectors. (c, d) Quantitative expressions of the gel deformations shown in (a) and (b). The x (+) (horizontal) and y (*) (vertical) components of the displacement vectors are plotted on the vertical axis as a function of the (c) x or (d) y coordinates of the corresponding reference spots (horizontal axis). Variances of displacement vector lengths and angles for (a) were $\sigma_{|M-C|}$ = 28.89, $\sigma_{\alpha_{M-C}}$ = 81.61, and for (b) $\sigma_{|M-C|}$ = 10.82, $\sigma_{\alpha_{M-C}}$ = 3.31.

3.3 Matching

Matching of spots in the transformed electropherograms was accomplished by means of a comparison of spot neighborhoods. Due to the inaccuracy of transformation and local distortions, each spot in the reference gel had several candidates for matching in the matched gel. It was found that, for the determination of a correct match, the five possible candidates from the matched gel closest to the reference spot were sufficient. A larger number did not raise the number of correctly detected spots; a smaller number of candidate spots caused an increase in the number of incorrect matches. All compared gels were matched to one reference gel.

3.3.1 Directional vectors

The correct match from all candidate matches was determined as $\inf\bigl(w_d\,{}^{T}|M - C|'_i + w_{dD}\,dD'_i + w_{aD}\,aD'_i\bigr)$, where i = 1...n, and n is the number of possible matches. The final settings were: (i) Values of the weights were estimated empirically as $w_d$ = 0.8, $w_{dD}$ = 0.05, $w_{aD}$ = 0.15. (ii) Numbers of landmarks used to determine the directional vector features were estimated empirically as ${}^{LM}n_d$ = 1 and ${}^{LM}n_a$ = 3.

3.3.2 Neighborhood similarity

The correct match was determined as $\sup({}^{m}ND + {}^{c}ND_i)$, where i = 1...n, and n is the number of possible candidate matches. The final settings were: (i) The best ratio $|{}^{m}N| : |{}^{c}N|$ was found to be 30:20, i.e., for the definition of the neighborhood patterns of the reference spots, the 30 nearest spots were used; for the pattern of a candidate matched spot, the 20 nearest spots. The ratio was kept, as it proved to give the highest probability of finding the corresponding point pairs using the similarity matrix. (ii) Thirty-two angle segments and ten distance segments were used. (iii) Angles with the line x = 0 going through the reference point of the pattern were computed from the original gels. Distances from the reference point of the pattern were computed from the transformed gels.

3.3.3 Combination of directional vectors and neighborhood similarity

Matching carried out using either the directional vector criterion or the neighborhood similarity criterion alone was not accurate enough. It was impossible to recognize without doubt all correctly and falsely matched spots (i.e., correct matches and unmatched spots of both master and compared electropherograms) by thresholding the values of either the directional vector criterion or the neighborhood similarity. The values of both criteria for truly and falsely assigned spots overlapped by 10–20%. The use of a threshold value decreased the number of incorrectly matched spots, but the number of correctly assigned spots decreased as well. Therefore, a spot was considered as matched only if it was matched by the two criteria at the same time, i.e., by the MFV together with the NDV combination. Application of this combined criterion increased the matching accuracy by 10–40%, and the efficiency increased in the range of 0–20%.

3.3.4 Multiple matches

In almost all cases the reference gel contained more spots than the matched gels. Therefore, in certain cases, one spot of the matched gel could be assigned to more than one spot in the reference gel (multiple match) or vice versa. Determination of the correct match from the list of multiple matches was necessary and followed the matching. The correct match was determined using the MFV together with the NDV combination in the same way as in the case of the possible matches. The match with the highest value of the combined criteria was considered to be the correct one; the remaining spots were considered unmatched.

3.3.5 Match confidence

The confidence of the combined criterion was computed as the ratio of its value for the best candidate match to the median of the combined criterion values of all candidate matches. When histograms of the distribution of confidences for the known correct and false positive matches were plotted, they overlapped to a certain degree. This overlap defined ranges of values of high confidence for the correct and incorrect matches and a low confidence region. The distance between the distribution edges allowed us to set a threshold value which undoubtedly defined the correct matches. These spots were defined as new landmarks, which were used for a new polynomial transformation followed by repeated matching. The number of landmarks increased from 25 initial landmarks to 50–500 (5–50%) according to the gel type. The efficiency after the second round increased in the range of 1–6%, and the accuracy increase was about 0–2.5%. The third round did not show any significant improvement.

Figure 5 shows the efficiency and accuracy of matching for different types of distortions as a function of the number of single spots. It is evident that the negative influence of single spots (spots which have no counterpart) clearly predominates over the negative influence of distortion features. If there are fewer than 25% single spots on both gels, the matching results become stable at an efficiency of around 95% and an accuracy higher than 98%. Unordered deformations do not appear in this region, as the corresponding gels also contained a large number of single spots. The other types of deformation were evenly distributed in both regions.

Figure 5. Efficiency and accuracy (vertical axes of a, b) of matching as a function of the percentage of all known matched spots relative to all spots (horizontal axes). (V) denotes unordered, (*) ordered with low fluctuations, and (*) ordered gel deformations. (a) The efficiency is the ratio of correctly assigned matches to all known matches. (b) The accuracy is the ratio of correctly assigned matches to the found matches.

3.4 Matching process

The matching process can be summarized as follows: (i) landmark setting, (ii) transformation of the matched gel, (iii) matching using the composite criterion, (iv) detection of high confidence matches, which become new landmarks, and (v) go to (ii). Repeating steps (ii)–(v) more than twice brought no improvement in either efficiency or accuracy. The routines were implemented in the Matlab (The MathWorks Inc.) environment using its script language. Matlab is an array-oriented language in which the processing speed is highly influenced by how frequently the individual members of an array are accessed.
For this reason we do not estimate the processing speed, which would change greatly when implemented in some lower level language as C. Electrophoresis 1999, 20, 3483±3491 In practice this means that spots close to the borders of the segment in the reference pattern can appear in different segments in the matched pattern, and thus obtain a different descriptor. This invalidates its correct interpretation, the reliability of whole pattern description, and finally the whole pattern comparison. We tried to bypass this problem by introducing a composite syntactic description which contains information about the whole neighborhood of all segments in its formulation. In the border case mentioned above, the two spot descriptors will differ only in one character, as the adjacent segments differ only in one character of the multicharacter segment descriptor as well. The longer the descriptor is, the lower the influence of one single character and the error of assignment of a point to a segment will be. Nevertheless, the length of the segment descriptor (proportional to the total number of segments) has its practical limits proportional to the average distance among spots in the reference and matched gels. We have experimentally derived a reasonable partitioning of the spot surroundings. In this paper we present a point pattern recognition method for matching pairs of electropherograms. The principle consists in comparing point patterns formed by the closest neighborhood of a candidate matched spot with the neighborhood of the spot in the reference gel. The point patterns were converted to a set of syntactic descriptors based on a division of the spot neighborhood to defined segments. Each point of the point pattern can be described by its position in one of the segments of the neighborhood of reference and candidate spots. Quantitative criteria describing the goodness of match and similarity of the point patterns were derived. 
This approach alone can reliably identify about 90% of the spots in a typical gel pair. This may sound successful but, in practice, it means that for a typical gel with one thousand spots, one hundred are not matched. To improve the results of matching we used another feature of gel distortions. In almost all cases a small area surrounding a spot, where the direction and length of directional vectors is coherent, can be found. Therefore a criterion that is based on the information about the spots surrounding the spot to be matched was derived. The higher the similarity between the directional vectors of spot surroundings and directional vectors of the candidate spot is, the better the value of the criterion. Combining the criteria describing the spot neighborhood pattern and the spot directional vector led to an improvement in matching efficiency and accuracy. In the best cases the efficiency defined by the proportion of found matches to total known matches reached the required value of 98%. Moreover, the accuracy, which is the proportion of found matches to correct known matches, reached the same value. For automatic analysis the accuracy of matching is even more important than the efficiency. It is always better to know that the matches found by the algorithm were correct than to find all matches, but with errors. In that case a detector of mismatches has to be applied, and we are back to the situation of lower efficiency with higher accuracy. The principle of spot matching using information about its surroundings was already applied previously. The principle disadvantage of this method was the existence of borders of the segmentation of spot neighborhoods. When each segment is described by its unique identifier, the identifier loses information about the neighbor segments. 
4 Discussion

A principal question that remains concerns the proportion of indistinguishable matches on a gel, i.e., two or more candidate matches that have identical neighborhoods and almost identical positions. Our estimate is that these represent about 1–2% of all spots in a gel. In this case some independent parameter has to be introduced. The required feature could be the spot densities in the neighborhood. This, however, has another drawback: gels are run to trace changes in spot densities; therefore, in principle, the densities have to change, and a criterion that includes spot densities is, in principle, unreliable. When only a small number of spots change, this criterion is acceptable, but in other cases it is not. Therefore, without a priori knowledge about spot density fluctuations, we cannot apply this criterion, and the level of inaccuracy will remain at 1–2%. Hence the 98% efficiency is probably the absolute limit. The advantage of our algorithm is that, in a typical gel, its efficiency is around 95%, but the accuracy is always higher than the efficiency. Furthermore, the algorithm can assign each match a level of significance and can identify multiple matches (matches for which there is more than one equivalent candidate).

Any two-dimensional object (in principle, an object of any dimension) can be reduced to a point pattern. Matching of point patterns is thus not strictly limited to matching of electropherograms but has a much broader impact on pattern recognition in general. A 2-D gel is an excellent test case, as it comprises all principal problems of point pattern matching on a limited scale.

The total performance of matching is drastically influenced by the number of unmatched spots (see Fig. 5). When less than 25% of single spots exist in both gels, the algorithm correctly finds more than 95% of all matched spots. The accuracy and efficiency of matching are hardly influenced by the type of distortion: points in the graph of Fig. 5 are distributed evenly for all types of distortions. Unordered matches do not appear in the high confidence region, as they also contain large portions of single spots. When the single spots were removed (data not shown), the algorithm worked with the same efficiency and accuracy for this type of distortion as for the other types. The reason for the strong influence of single spots still lies in the existence of borders in the spot pattern definition. Even if the influence of segmentation of the surroundings is minimized by the type of description we used, the principle of segmentation, and thus the existence of borders, does not disappear. We have probably reached the limits of this approach, and a different principle probably has to be used to increase the performance of point pattern matching. This does not mean that the given algorithm cannot be used for practical purposes. In most cases the gel images appear in the high confidence region, where the performance is higher than 95%. In any case, any detector of single spots would greatly increase the performance of matching.

When looking at a pair of superimposed electropherograms, we are able to easily recognize similar point patterns and to find matching pairs of spots by comparing the neighborhoods. For a novice this is always more difficult but, in time, the user can identify any matches quite reliably; the neural network of the brain has become trained. Mimicking this process is probably the only way to find a sufficiently reliable spot matching algorithm. The spot neighborhood similarity seems to be the only reliable spot pair feature by which the point pattern matching task could be solved. In principle, we are able to identify distorted, zoomed, and, to a certain level, incomplete patterns. This suggests that, for some machine learning approaches, a dimensionless description of the point pattern has to be used as the input to the engine. Instead of developing a detector of single spots, which would improve matching results but would not fundamentally solve the problem, we are currently working on this approach.

This work was supported by GAČR grants No. 204/95/0636 and No. 203/98/0422.

Received June 15, 1999

5 References

[1] Murtagh, F., ICPR 92 1992, 174–177.
[2] Murtagh, F., Publications of the Astronomical Society of the Pacific 1992, 104, 301–307.
[3] Lemkin, P. F., Lipkin, L. E., Comput. Biomed. Res. 1981, 14, 355–380.
[4] Lemkin, P. F., Lipkin, L., Comput. Biomed. Res. 1981, 14, 407–446.
[5] Lemkin, P. F., Lipkin, L. E., in: Allen, R. C., Arnaud, P. (Eds.), Electrophoresis '81, de Gruyter, Berlin 1981, pp. 401–411.
[6] Vincens, P., Tarroux, P., Electrophoresis 1987, 8, 100–107.
[7] Anderson, N. L., Taylor, J., Scandora, A. E., Coulter, B. P., Anderson, N. G., Clin. Chem. 1981, 27, 1808–1820.
[8] Appel, R. D., Palagi, P. M., Walther, D., Vargas, J. R., Sanchez, J.-C., Ravier, F., Pasquali, C., Hochstrasser, D. F., Electrophoresis 1997, 18, 2724–2734.
[9] Skolnick, M. M., Sternberg, S. R., Neel, J. V., Clin. Chem. 1982, 28, 969–978.
[10] Skolnick, M. M., Computer Vision, Graphics, and Image Processing 1986, 35, 306–332.
[11] Pardowitz, I., Zimmer, H. G., Neuhoff, V., in: Neuhoff, V. (Ed.), Electrophoresis '84, Verlag Chemie, Weinheim 1984, pp. 315–316.
[12] Miller, M. J., Olson, A. D., Thorgeirsson, S. S., Electrophoresis 1984, 5, 297–303.
[13] Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D., Hochstrasser, D. F., Electrophoresis 1997, 18, 2735–2748.
[14] Garrels, J. I., J. Biol. Chem. 1989, 264, 5269–5282.