in: Proceedings of the IWVF3 1997, World Scientific, Singapore, 1997.

Object Classification Based on Contours with Elastic Graph Matching

Efthimia Kefalea
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Olaf Rehse
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Christoph von der Malsburg
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
University of Southern California, Dept. of Computer Science and Section for Neurobiology, Los Angeles, CA 90089-2520, USA

We describe a system for the detection, classification and pose estimation of simple objects. The system is robust with respect to surface markings and cluttered background. Recognition is achieved by comparing the image to stored two-dimensional object views. Stored views are represented as labeled graphs and are derived automatically from images of blank object models. Graph nodes are labeled with edge information, graph edges with distance vectors in the image plane. Graphs emphasize occluding boundaries and inner object edges. These are identified by extracting local maxima in the Mallat wavelet transform of the image. Stored graphs are compared to test images by elastic matching. Our experiments demonstrate that the system is capable of fairly reliable shape classification and pose estimation of objects in natural scenes.

Keywords: computer vision, object recognition, Mallat wavelet transform.

1 Introduction

We present an approach to object classification in terms of shape. It contributes to our group's project of developing a service robot that can manipulate objects in natural environments and that can be trained by lay persons. The challenge is to reliably estimate object shape, position, size and pose as a basis for grasping, and to autonomously learn this capability from examples of new object types. During recognition the system has to be robust with respect to surface texture and to cluttered background.

Our system makes no attempt to derive three-dimensional shape directly from visual data. Previous work has shown that this is possible (see for instance [2]), although it turns out to be difficult, requiring either perfect contours as input [9] or the visibility of certain object points (as pointed out in [12]), conditions which are rarely met in natural scenes. It is true that it would suffice to extract 3D information from clean training images, relying during later performance on stored shape information. However, we do not even attempt that here. In our robot system we plan to deduce all required shape information during grasping attempts, an approach already realized in simulation [11]. At present, we distinguish between four classes of shapes: cubes, bricks, spheres and cylinders.

A particular challenge is the volume of raw data (pixel arrays), which forces a great number of degrees of freedom on the system, and the great variability with which a particular object can appear in the image. We deal with the depth rotation problem by using a multi-view approach based upon contour representations. This approach to object classification relies upon object-adapted representations from different viewpoints [7] and is motivated by results from psychophysics [1]. The system's knowledge about shape classes is described by model graphs, their nodes being labeled with sets of local features called jets. Jets represent edge information that is extracted from a small patch of an image by using the multiresolution analysis introduced by Mallat [4].
We describe the edge extraction and model graph creation of our system in sections 2.1 and 2.2, respectively. Input images to be analysed are preprocessed in two steps: in the first step we use the above-mentioned Mallat method, and in the second step a specific confidence-based algorithm introduced by us in [8]. In this way we obtain a reliable edge interpretation of the scene. The preprocessing of input images is discussed in section 2.3. Our matching process is described in section 2.4. It is based on Elastic Graph Matching (EGM) as described in [3], our present version being that of [10]. EGM proceeds by comparing stored model graphs to the image in terms of similarities between stored jets and jets extracted from the image, adapting the location and size of the model graphs until an optimum is found. EGM is a simple algorithmic caricature of Dynamic Link Matching, a neural model based on synchrony coding of feature binding and rapidly reversible synaptic plasticity [5]. We conclude with the results of our experiments in section 3 and a short discussion in section 4.

Figure 1: A two-dimensional Mallat filter in the spatial domain.

2 Description of the System

2.1 Edge Extraction

For the representation of objects the system employs labeled model graphs. A labeled graph G consists of N nodes positioned on contour points of the object at positions $\vec{x}_n$, $n = 1, \ldots, N$, and E edges between them. Edges connecting neighbouring nodes are labeled with distance vectors between node positions. Nodes are labeled with image information referring to features lying on the contour. These labels are called jets. They are derived from linear filter operations in the form of convolutions of the image $I(\vec{x})$ ($\vec{x} \in M^2$, $M := \{1, \ldots, r\}$) with filters $\psi_{s_i}^{(h)}(\vec{x})$ and $\psi_{s_i}^{(v)}(\vec{x})$:

$$\vec{T}_{s_i}(\vec{x}) = \begin{pmatrix} \psi_{s_i}^{(h)} * I(\vec{x}) \\ \psi_{s_i}^{(v)} * I(\vec{x}) \end{pmatrix},$$

where $*$ represents a convolution, $h$ and $v$ stand for horizontal and vertical, and $s_i$ ($s_i = s_0 2^i$, $i \in \mathbb{N}$) is the width of a Gaussian whose derivatives are used as filters. The absolute value

$$a_i(\vec{x}) = |\vec{T}_{s_i}(\vec{x})|$$

and the angle

$$\varphi_i(\vec{x}) = \arctan \frac{T_{s_i}^{(v)}(\vec{x})}{T_{s_i}^{(h)}(\vec{x})}$$

measure the strength and orientation of an intensity change at scale $s_i$ and position $\vec{x}$. Modulus maxima positions mark local maxima of the strength $a_i$, tracing an edge precisely. Fig. 1 shows a two-dimensional Mallat filter in the spatial domain. As we use 5 image resolution levels, a feature vector coding local image information, also called jet $J$, is a collection of $2 \times 5$ values $(a_i, \varphi_i)$, $i = 0, \ldots, 4$, with 0 denoting the highest frequency level.
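As a concrete illustration of this transform, the following Python sketch computes the magnitudes $a_i$ and orientations $\varphi_i$ at five dyadic scales. It is a minimal sketch, not the authors' implementation: we substitute first derivatives of a Gaussian for the Mallat wavelets of Fig. 1, assume s0 = 1 pixel, use the full-quadrant arctan2 in place of arctan, and the function names are ours.

```python
# Sketch of the multiscale edge extraction of Sec. 2.1 (not the authors' code).
import numpy as np
from scipy.ndimage import gaussian_filter

def wavelet_transform(image, n_levels=5, s0=1.0):
    """Magnitude a_i and orientation phi_i of the intensity gradient at the
    dyadic scales s_i = s0 * 2**i, i = 0..n_levels-1 (0 = highest frequency)."""
    image = np.asarray(image, dtype=float)
    mags, phis = [], []
    for i in range(n_levels):
        s = s0 * 2 ** i
        # Horizontal/vertical responses: first derivatives of a Gaussian of
        # width s, standing in for the Mallat wavelets psi^(h), psi^(v).
        th = gaussian_filter(image, sigma=s, order=(0, 1))   # d/dx
        tv = gaussian_filter(image, sigma=s, order=(1, 0))   # d/dy
        mags.append(np.hypot(th, tv))     # a_i(x) = |T_{s_i}(x)|
        phis.append(np.arctan2(tv, th))   # phi_i(x); arctan2 covers all quadrants
    return mags, phis

def jet(mags, phis, x, y):
    """A jet: the 2x5 values (a_i, phi_i) read off at one node position."""
    return [(float(m[y, x]), float(p[y, x])) for m, p in zip(mags, phis)]
```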
Figure 2: Views of different shape classes and their model graphs. Note that these are not test images, but examples of the system's knowledge about shape classes.

2.2 Creation of Model Graphs

Model graphs represent the system's knowledge about shape classes. The system employs three different sizes of model graphs for each shape class. The structure of each model graph is object-adapted, its outline depending on object contours. Fig. 2 depicts model graphs superimposed on the original images. Graphs of different views differ in geometry and local features. In order to be able to classify objects irrespective of viewpoint, we use a multiple-view approach. Due to the way we represent contour information, our system can recognize views of a shape class it has never seen before, using only a limited subset of all possible views. We employ a so-called multigraph structure. Each multigraph consists of a certain number of model graphs representing the same shape from different viewpoints.

The procedure for creating such a multigraph is as follows: we first create a discrete view sphere of an object. We then take object images by rotating the object according to Fig. 3, keeping the object distance constant and varying the angle θ. In the θ-direction we step with a constant stepwidth from θ_min to θ_max, taking n views. Furthermore, we instruct the system as to which model graphs (representing different views) represent the same object, by simply giving identity labels to the corresponding model graphs (see Fig. 4). Since the labeling procedure is an external process, it requires no additional computation time. This is an advantage of our approach in comparison to other multi-view object recognition systems, such as that described in [6].

Figure 3: Sampling of the view sphere used for our multiple-view approach.

Figure 4: Each object is represented by multigraphs.

Our gallery of models consists of simple wooden objects. We have four object classes: cubes, bricks, cylinders and spheres. Within each class there are objects of three different sizes. For each size there are 10 different views. These 30 graphs constitute the multigraph of that object. We recorded 8-bit grey-scale images of 128x128 pixels. Examples of our model gallery are given in Fig. 2.

To create a model graph, a simple segmentation procedure is applied which requires a picture of the object. We transform this picture using the Mallat wavelet transform. All nodes on a square lattice of points with a spacing of 4 pixels are visited in the image. Each node has a maximum of eight neighbours. Nodes not lying on the object's contour are deleted according to the following procedure (a sketch is given at the end of this section):

1. All nodes are deleted for which the average magnitude of the wavelet response at level 1 over a 4x4 square of pixels is below a given threshold value. In this way, contour nodes are favored over those positioned within the object or the background. This step leads to graphs corresponding to connected regions.

2. Among the graphs thus created, only the one with the greatest number of nodes is kept (thus getting rid of clutter).

3. All nodes with 6 neighbours are deleted, as they presumably lie inside the object.

After these steps, all remaining nodes lie on, or directly neighbour, lines of local modulus maxima (computed at scale $s_1$). Some trivial adjustments of node positions may be made manually, in order to eliminate shadows or to have a node come to lie exactly on a modulus maximum. We still optimize the threshold value for each model, although we plan to eliminate this dependence. The nodes of the resulting graph are labeled with the jets $(a_i, \varphi_i)$, $i = 0, \ldots, 4$, and the graph is stored as a model graph. This process of creating graphs has the advantage of positioning the nodes on the contour of objects.
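The three pruning steps can be made concrete with the sketch below, which operates on the boolean lattice of surviving nodes. It is an illustrative reading of the procedure, not the segmentation code acknowledged in section 4: the 4x4 patch alignment, the treatment of "6 neighbours" as "six or more", and the border handling are our assumptions.

```python
# Sketch of the node-pruning steps of Sec. 2.2 (not the authors' code).
import numpy as np
from scipy.ndimage import label

def prune_lattice(a1, threshold, spacing=4):
    """Boolean grid of lattice nodes kept after the three pruning steps.
    `a1` is the wavelet magnitude at scale s1 (level 1); `threshold` is the
    model-specific value mentioned in the text."""
    h, w = a1.shape
    keep = np.zeros((h // spacing, w // spacing), dtype=bool)
    # Step 1: keep nodes whose mean response over a 4x4 patch reaches threshold.
    for gy in range(keep.shape[0]):
        for gx in range(keep.shape[1]):
            y, x = gy * spacing, gx * spacing
            keep[gy, gx] = a1[y:y + 4, x:x + 4].mean() >= threshold
    # Step 2: of the resulting connected regions (8-neighbourhood), keep only
    # the one with the most nodes, discarding clutter.
    labels, n = label(keep, structure=np.ones((3, 3)))
    if n > 1:
        sizes = np.bincount(labels.ravel())[1:]
        keep = labels == (1 + int(np.argmax(sizes)))
    # Step 3: drop nodes with six or more of their eight neighbours present,
    # since they presumably lie inside the object rather than on its contour.
    # (np.roll wraps at the border; that is ignored in this sketch.)
    nbrs = sum(np.roll(np.roll(keep, dy, axis=0), dx, axis=1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    return keep & (nbrs < 6)
```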
2.3 Preprocessing of Input Images

Input test images, showing scenes which are unknown to the system, are preprocessed in two steps. In the first step we use the multiresolution analysis introduced by Mallat and described in section 2.1. Thus we obtain edge information, represented by the absolute values $a_i(\vec{x})$ and the orientations $\varphi_i(\vec{x})$ at different scales $s_i$. The filters extract all edges, whether they represent contours, texture, shadows or noise. From these we have to separate object contours from all other edges.

This is done by a "confidence-based" algorithm described in [8], which assigns a "confidence value" to every detected local edge element. Initially, this value equals the absolute value of the filter output $a_i(\vec{x})$. The confidence values are then modified by a specific algorithm that emphasizes local edge elements which are part of a continuous curve. This proceeds on the assumption that object contours are the dominant structures of an image and that noise, shadows or texture edges are continuous only on a finer scale.

The algorithm which emphasizes continuous curves combines filter results of different scales by searching for a counterpart at scale $s_n$ for each edge element detected at a finer scale $s_i$ ($i < n$). Since there is no one-to-one mapping between edges at different scales (the localization of edge information on coarser scales is imprecise), one has to search a local area for an appropriate counterpart. This is done with the help of a similarity function which measures the degree of similarity between an edge at scale $s_i$ and possible counterparts at scale $s_n$, taking into account the strength and direction of the detected edges.

Figure 5: Preprocessing of images. First row: original grey-level images. Second row: absolute value of Mallat filter results at scale $s_0$, interpreted as unmodified confidence values. Third row: modified confidence values, emphasizing the contours of the object, used as input for the further processing steps. Note that the confidence values assigned to the filter results of higher scales $s_i$ ($i > 0$) can be modified in the same way.

The confidence values assigned to the edge elements at scale $s_i$ are modified by multiplication with the similarity of the best-fitting counterpart. In this way, only the confidence of those local edge elements is emphasized which are also represented at a larger scale and thus belong to dominant structures in the image. As the modified confidence values yield a more stable edge description of a scene, they are used in our system in place of the $a_i(\vec{x})$ extracted by the Mallat filters. Fig. 5 shows some results of this preprocessing. It can be seen that the modified confidence values emphasize the contours as the dominant structures in the images, while disturbing texture is weakened.
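The multiplicative update can be pictured with the toy sketch below. Only the update rule itself (confidence times similarity of the best-fitting counterpart) is taken from the text; the concrete similarity measure, which [8] defines but this paper does not reproduce, and the search radius are illustrative assumptions.

```python
# Toy sketch of the confidence update of Sec. 2.3 (not the algorithm of [8]).
import numpy as np

def modify_confidence(a_fine, phi_fine, a_coarse, phi_coarse, radius=2):
    """Multiply each fine-scale confidence by the similarity of its best
    counterpart at a coarser scale, so that only edge elements that are also
    represented at the larger scale keep a high confidence."""
    h, w = a_fine.shape
    conf = a_fine.copy()                 # initial confidence = |filter output|
    a_max = max(float(a_coarse.max()), 1e-12)
    for y in range(h):
        for x in range(w):
            best = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        # Assumed similarity in [0, 1]: orientation agreement,
                        # weighted by the strength of the coarse edge element.
                        sim = (abs(np.cos(phi_fine[y, x] - phi_coarse[yy, xx]))
                               * a_coarse[yy, xx] / a_max)
                        best = max(best, float(sim))
            conf[y, x] *= best
    return conf
```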
2.4 Elastic Graph Matching

After the preprocessing of an input test image, we search for optimal similarities between it and our model graphs. This process is called elastic graph matching (EGM). A model graph is compared node by node to jet information extracted at the current position in the input image. The similarity between two jets $\vec{J}_1$ and $\vec{J}_2$ is defined as their normalized scalar product,

$$S(\vec{J}_1, \vec{J}_2) = \frac{\vec{J}_1 \cdot \vec{J}_2}{|\vec{J}_1| \, |\vec{J}_2|}.$$

The total similarity of the model graph is optimized by shifting and scaling it. The optimal similarity value for a model graph determines its fit to the image. In order to classify the object in terms of its shape, we match the whole gallery of class models. The model graph with the highest similarity determines the shape class, but it also specifies the size and the position of the object within the image. Due to the multi-view representation of our models, we also obtain a rough estimate of the object's orientation. The complete graph matching process used in this paper proceeds in two steps:

First step: rough location of the object in the image. The graph remains undistorted. The object location corresponds to the position of maximal similarity between the model graph and the input image.

Second step: adaptation of scale and improvement of location. The graph from the first step is allowed to vary in size by scaling it in the x- and y-direction by a common factor, shifting the position of the resulting graph by a few pixels to find maximal similarity. Since we are using model graphs of different sizes for each class, the scale factor always lies between 0.8 and 2.0.
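A compact sketch of the two matching steps might look as follows, assuming a callback that extracts a jet from the preprocessed image at an arbitrary position. The candidate grids, the treatment of a jet as a flat feature vector and the few-pixel refinement offsets are our assumptions; the normalized scalar product and the 0.8-2.0 scale range follow the text.

```python
# Sketch of the two-step EGM of Sec. 2.4 (not the authors' code).
import numpy as np

def jet_similarity(j1, j2):
    """Normalized scalar product S = (J1 . J2) / (|J1| |J2|); jets are
    treated here as flat vectors of their 2x5 entries (an assumption)."""
    j1 = np.asarray(j1, dtype=float).ravel()
    j2 = np.asarray(j2, dtype=float).ravel()
    return float(j1 @ j2) / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-12)

def graph_similarity(model_jets, node_xy, image_jet_at, offset, scale=1.0):
    """Mean node similarity of the model graph placed at `offset`, with all
    node positions scaled by a common factor (the graph stays undistorted)."""
    return float(np.mean([jet_similarity(j, image_jet_at(offset + scale * p))
                          for j, p in zip(model_jets, node_xy)]))

def match(model_jets, node_xy, image_jet_at, positions, scales):
    # Step 1: rough location; scan candidate positions with the rigid graph.
    pos = max(positions, key=lambda o: graph_similarity(
        model_jets, node_xy, image_jet_at, o))
    # Step 2: adapt the common scale factor (0.8..2.0 in the paper) and shift
    # the graph by a few pixels to refine the location.
    candidates = [(s, pos + np.array([dy, dx]))
                  for s in scales for dy in (-2, 0, 2) for dx in (-2, 0, 2)]
    return max(candidates, key=lambda c: graph_similarity(
        model_jets, node_xy, image_jet_at, c[1], c[0]))
```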
3 Results

We have tested our system with objects that are textured and whose shape deviates from that of our stored model classes; see Fig. 6. We also used different backgrounds and natural scenes. For testing purposes we used 180 test images (natural and synthetic). In spite of varying illumination conditions, the system classified 147 of the 180 objects (82%) correctly. Among these test scenes were difficult cases containing several objects in one image, where the system had to find the dominant object in the scene. Some typical examples of matching results are shown in Fig. 6. Recognition against complex background is difficult, since parts of the background may be false targets and can easily be mistaken for parts of the object. By using the confidence-based preprocessing we were able to overcome this difficulty. Difficulties may also arise in cases where part of the object is occluded, as in the last example in Fig. 6, where only the upper part of the can has been classified as a cylinder.

Classification time on a Sparc-20 was 0.5-2 minutes per object, depending on the number of model graphs used for matching. The preprocessing time is under 2.3 s, while matching with one model graph requires 1-1.5 s. In 103 out of 154 attempts (67%) the orientation of the object was estimated correctly. We plan to capture more views of each model in our gallery (at present only 10) in order to achieve better performance.

4 Discussion

Departing from the face recognition system described in [10], we here take up the challenge of classifying unknown objects in spite of varying surface markings and substantial rotation in depth. Central to our approach are our emphasis on object contours and our multi-view representation. We intend to systematically explore the multi-view approach in order to obtain a stable and robust representation for our models with a minimal amount of data. In the future we plan to improve segmentation by using relative motion and by recognizing occluding objects. We also plan to organize the recognition process in a hierarchical way: (i) identification of object position, size and orientation in the image, (ii) coarse shape classification, (iii) fine shape classification and (iv) refinement of pose estimation.

Acknowledgements

We wish to thank Christian Eckes for making available his segmentation algorithm for model graph creation. We also thank Laurenz Wiskott, Michael Pötzsch and Thomas Maurer for fruitful discussions. This work was supported by the German Ministry for Science and Technology (grant 01IN504E9).

Figure 6: Examples of results of the matching procedure. The left column shows test objects; in the middle column the superimposed graphs are model graphs belonging to the objects shown in the right column.

References

1. Biederman I., "Recognition by Components: A theory of human image understanding", Psychological Review, 94, pp. 115-147, 1987.
2. Havaldar P. and Medioni G., "Inference of Segmented, Volumetric Shape from Three Intensity Images", in Proc. of the CVPR, 1996.
3. Lades M., Vorbrüggen J.C., Buhmann J., Lange J., von der Malsburg C., Würtz R.P. and Konen W., "Distortion Invariant Object Recognition in the Dynamic Link Architecture", IEEE Trans. on Computers, 42(3), pp. 300-311, 1993.
4. Mallat S. and Zhong S., "Characterization of Signals from Multiscale Edges", IEEE Trans. on PAMI, 14(7), pp. 710-732, 1992.
5. von der Malsburg C., "The correlation theory of brain function", Internal Report 81-2, MPI für Biophysikalische Chemie, Göttingen, 1981. Reprinted in E. Domany, J.L. van Hemmen and K. Schulten, eds., Models of Neural Networks II, pp. 95-119, Springer, Berlin, 1994.
6. Poggio T. and Edelman S., "A network that learns to recognize three-dimensional objects", Nature, 343, pp. 263-266, 1990.
7. Reiser K., "Learning persistent structure", PhD thesis, Res. Report 584, Hughes Aircraft Co., 1991.
8. Rehse O., Pötzsch M. and von der Malsburg C., "Edge Information: A Confidence Based Algorithm Emphasizing Steady Curves", in Proc. of the Int. Conf. on Artificial Neural Networks, pp. 851-856, Bochum, 1996.
9. Ulupinar F. and Nevatia R., "Perception of 3-D surfaces from 2-D contours", IEEE Trans. on PAMI, 15(1), pp. 3-18, Jan. 1993.
10. Wiskott L., Fellous J.-M., Krüger N. and von der Malsburg C., "Face Recognition and Gender Determination", in Proc. of the International Workshop on Automatic Face- and Gesture-Recognition, Zürich, 1995.
11. Zadel S., "Ein lernfähiges, selbstorganisierendes System zum visuell gesteuerten Greifen bei Robotern", PhD thesis, VDI-Verlag, Düsseldorf, in preparation, 1997.
12. Zerroug M. and Nevatia R., "Segmentation and Recovery of SHGCs from a Single Intensity Image", in Proc. of the European Conference on Computer Vision, pp. 319-340, Stockholm, 1994.