Object Classification Based on Contours with Elastic Graph Matching

in: Proceedings of the IWVF3 1997,
World Scientific, Singapore, 1997.
Efthimia Kefalea
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
Olaf Rehse
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
Christoph von der Malsburg
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
University of Southern California, Dept. of Computer Science and Section for Neurobiology, Los Angeles, CA 90089-2520, USA
We describe a system for the detection, classification and pose estimation of simple objects. The system is robust with respect to surface markings and cluttered background. Recognition is achieved by comparing the image to stored two-dimensional object views. Stored views are represented as labeled graphs and are derived automatically from images of blank object models. Graph nodes are labeled by edge information, graph edges by distance vectors in the image plane. Graphs emphasize occluding boundaries and inner object edges. These are identified by extracting local maxima in the Mallat wavelet transform of the image. Stored graphs are compared to test images by elastic matching. Our experiments demonstrate that the system is capable of fairly reliable shape classification and pose estimation of objects in natural scenes.
Keywords: computer vision, object recognition, Mallat wavelet transform.
1 Introduction
We present an approach to object classification in terms of shape. It contributes to our group's project of developing a service robot that can manipulate objects in natural environments and that can be trained by lay persons. The challenge is to reliably estimate object shape, position, size and pose as a basis for grasping, and to autonomously learn this capability from examples of new object types. During recognition the system has to be robust with respect to surface texture and to cluttered background.
Our system makes no attempt to derive three-dimensional shape directly from visual data. Previous work has shown that this is possible (see for instance [2]), although it turns out that it is difficult and requires the availability of perfect contours as input [9] or visibility of certain object points (as pointed out in [12]), conditions which are rarely met in natural scenes. It is true that it would suffice to extract 3D information from clean training images, relying during later performance on stored shape information. However, we don't even attempt that here. In our robot system we plan to deduce all required shape information during grasping attempts, an approach already realized in simulation [11].
At present, we distinguish between four classes of shapes: cubes, bricks, spheres and cylinders. A particular challenge is the volume of raw data (pixel arrays), which obviously forces a great number of degrees of freedom on the system, and the great variability with which a particular object can appear in the image. We deal with the depth rotation problem by using a multiview approach based upon contour representations. This approach to object classification relies upon object-adapted representations from different viewpoints [7] and is motivated by results from psychophysics [1].
The system's knowledge about shape classes is described by model graphs, their nodes being labeled by sets of local features called jets. Jets represent edge information that is extracted from a small patch of an image by using the multiresolution analysis introduced by Mallat [4]. We describe the edge extraction and model graph creation of our system in sections 2.1 and 2.2, respectively.
Input images to be analysed are preprocessed in two steps. In the first step, we use the above-mentioned Mallat method, and in the second step a specific confidence-based algorithm introduced by us in [8]. In this way, we obtain a reliable edge interpretation of the scene. The preprocessing procedure of input images is discussed in section 2.3.
Our matching process is described in section 2.4. It is based on Elastic Graph Matching (EGM) described in [3], our present version being that described in [10]. EGM proceeds by comparing stored model graphs to the image in terms of similarities between stored jets and jets extracted from the image, adapting the location and size of model graphs until an optimum is found. EGM is a simple algorithmic caricature of Dynamic Link Matching, a neural model based on synchrony coding of feature binding and rapidly reversible synaptic plasticity [5].
We conclude with the results of our experiments in section 3 and a short discussion in section 4.
Figure 1: A two-dimensional Mallat filter in the spatial domain.
2 Description of the System
2.1 Edge Extraction
For the representation of objects the system employs labeled model graphs. A labeled graph G consists of N nodes positioned on contour points of the object at positions \vec{x}_n, n = 1, ..., N, and E edges between them. Edges connecting neighbouring nodes are labeled with distance vectors between node positions. Nodes are labeled with image information referring to features lying on the contour. These labels are called jets. They are derived from linear filter operations in the form of convolutions of the image I(\vec{x}) (\vec{x} \in M^2, M := \{1, ..., r\}) with filters \psi^{(h)}_{s_i}(\vec{x}) and \psi^{(v)}_{s_i}(\vec{x}):

\vec{T}_{s_i}(\vec{x}) = I(\vec{x}) * \begin{pmatrix} \psi^{(h)}_{s_i}(\vec{x}) \\ \psi^{(v)}_{s_i}(\vec{x}) \end{pmatrix},

where * represents a convolution, h and v stand for horizontal and vertical, and s_i (s_i = s_0 2^i, i \in \mathbb{N}) represents the width of a Gaussian the derivatives of which are used as filters. The absolute value a_i(\vec{x}) = |\vec{T}_{s_i}(\vec{x})| and the angle \varphi_i(\vec{x}) = \arctan\left( T^{(v)}_{s_i}(\vec{x}) / T^{(h)}_{s_i}(\vec{x}) \right) measure strength and orientation of an intensity change at scale s_i and position \vec{x}. Modulus maxima positions mark local maxima of the strength a_i, representing precisely the trace of an edge. Fig. 1 shows a two-dimensional Mallat filter in the spatial domain. As we use 5 image resolution levels, a feature vector coding local image information, also called a jet J, is a collection of 2 × 5 values (a_i, \varphi_i), i = 0, ..., 4, with 0 denoting the highest frequency level.

Figure 2: Views of different shape classes and their model graphs. Note that these are not test images, but examples of the system's knowledge about shape classes.
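The multiscale edge measures a_i and \varphi_i can be sketched in code. The following is a minimal illustration, not the authors' implementation: it approximates the Mallat filters by horizontal and vertical Gaussian derivatives at dyadic scales s_i = s_0 2^i, and all function names are our own.

```python
import numpy as np

def _gauss1d(sigma):
    # normalized 1D Gaussian kernel, truncated at 3 sigma
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def _smooth(img, sigma):
    # separable Gaussian smoothing: convolve rows, then columns
    k = _gauss1d(sigma)
    tmp = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, tmp)

def jets(image, s0=1.0, levels=5):
    """Edge strength a_i and orientation phi_i at dyadic scales s_i = s0 * 2**i."""
    result = []
    for i in range(levels):
        smoothed = _smooth(image.astype(float), s0 * 2**i)
        tv, th = np.gradient(smoothed)   # vertical (rows) and horizontal (cols) derivatives
        a = np.hypot(th, tv)             # edge strength a_i
        phi = np.arctan2(tv, th)         # edge orientation phi_i
        result.append((a, phi))
    return result
```

A jet at position (y, x) is then the stack of `(a[y, x], phi[y, x])` over all levels; level 0 carries the highest spatial frequency.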
2.2 Creation of Model Graphs
Model graphs represent the system's knowledge about shape classes. The system employs three different sizes of model graphs for each shape class. The structure of each model graph is object-adapted, its outline depending on object contours. Fig. 2 depicts model graphs superimposed on the original images.
Graphs of different views differ in geometry and local features. In order to be able to classify objects irrespective of viewpoint, we use a multiple-view approach. Due to the way we represent contour information, our system can recognize views of a shape class it has never seen before, by simply using a limited subset of all possible views. We employ a so-called multigraph structure. Each multigraph consists of a certain number of model graphs representing the same shape from different viewpoints. The procedure for creating such a multigraph is as follows:
We first create a discrete view sphere of an object. We then take object images by rotating the object according to Fig. 3, keeping the object distance constant and varying the angle θ. In θ-direction we step with a constant stepwidth from θ_min to θ_max, taking n views. Furthermore, we instruct the system as to which model graphs (representing different views) represent the same object, by simply giving identity labels to the corresponding model graphs (see Fig. 4). Since the labeling procedure is an external process, it needs no further computational time. This is an advantage of our approach in comparison to other multi-view object recognition systems, such as that described in [6].

Figure 3: Sampling of the view sphere used for our multiple-view approach.
Figure 4: Each object is represented by multigraphs.
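The multigraph bookkeeping amounts to a very small data structure. The sketch below uses our own illustrative names (it is not code from the paper); it only shows how a shared identity label ties model graphs of different views of the same object together, with no extra computation:

```python
from dataclasses import dataclass, field

@dataclass
class ModelGraph:
    object_id: str   # identity label: shared by all views of one object
    view: int        # index of the view on the discretized view sphere
    nodes: list      # node labels, e.g. (position, jet) pairs
    edges: list      # edges labeled with distance vectors between nodes

@dataclass
class Multigraph:
    object_id: str
    graphs: list = field(default_factory=list)

    def add(self, graph: ModelGraph):
        # attaching a view is pure bookkeeping, no image processing involved
        assert graph.object_id == self.object_id
        self.graphs.append(graph)

mg = Multigraph("brick_small")
for v in range(10):
    mg.add(ModelGraph("brick_small", view=v, nodes=[], edges=[]))
```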
Our gallery of models consists of simple wooden objects. We have four object classes: cubes, bricks, cylinders and spheres. Within each class there are objects of three different sizes. For each size there are 10 different views. These 30 graphs constitute the multigraph of that object. We recorded 8-bit grey-scale images of 128x128 pixels. Examples of our model gallery are given in Fig. 2.
To create a model graph a simple segmentation procedure is applied which requires a picture of the object. We transform this picture using the Mallat wavelet transform. All nodes on a square lattice of points with a spacing of 4 pixels are visited in the image. Each node has a maximum of eight neighbours. Nodes not lying on the object's contour are deleted according to the following procedure:
1. All nodes are deleted for which the average magnitude of the wavelet response at level 1 over a 4x4 square of pixels is below a given threshold value. In this way, contour nodes are favored over those positioned within the object or the background. This step leads to graphs corresponding to connected regions.
2. Among the thus created graphs, only the one with the greatest number of nodes is kept (thus getting rid of clutter).
3. All nodes with 6 or more neighbours are deleted, as they presumably lie inside the object.
After these steps, all remaining nodes lie on or directly neighbour lines of local modulus maxima (computed at scale s_1). Some trivial precision adjustments of the nodes' positions may be fixed manually, in order to eliminate shadows or to have a node come to lie exactly on a modulus maximum. We still optimize the threshold value for each model, although we plan to eliminate this dependence. For the resulting graph, nodes are labeled with the jets (a_i, \varphi_i), i = 0, ..., 4. The resulting graph is stored as a model graph. This process of creating graphs has the advantage of positioning the nodes on the contour of objects.
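The three deletion steps above can be sketched as follows. This is a hypothetical reimplementation under our own assumptions (grid-aligned 8-neighbourhoods, a placeholder threshold), not the code used in the paper; `strength` stands for the level-1 wavelet magnitude.

```python
import numpy as np

def create_model_graph(strength, spacing=4, threshold=0.1):
    """Grid nodes surviving the three pruning steps, as a set of (y, x) tuples."""
    h, w = strength.shape
    # step 1: keep grid nodes whose average response over a 4x4 patch exceeds threshold
    nodes = set()
    for y in range(0, h - spacing + 1, spacing):
        for x in range(0, w - spacing + 1, spacing):
            if strength[y:y+4, x:x+4].mean() >= threshold:
                nodes.add((y, x))
    # 8-neighbourhood on the grid
    def neighbours(n):
        y, x = n
        return {(y + dy * spacing, x + dx * spacing)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)}
    # step 2: keep only the largest connected component (getting rid of clutter)
    comps, seen = [], set()
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            c = stack.pop()
            if c in comp:
                continue
            comp.add(c)
            seen.add(c)
            stack.extend((neighbours(c) & nodes) - comp)
        comps.append(comp)
    largest = max(comps, key=len) if comps else set()
    # step 3: drop nodes with 6 or more neighbours (presumed interior)
    return {n for n in largest if len(neighbours(n) & largest) < 6}
```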
2.3 Preprocessing of Input Images
Input test images, showing scenes which are unknown to the system, are preprocessed in two different steps. In a first step we use the multiresolution analysis introduced by Mallat and described in section 2.1. Thus we obtain edge information, represented by the absolute values a_i(\vec{x}) and the orientation \varphi_i(\vec{x}) at different scales s_i. The filters extract all edges, which represent contours, texture, shadows and noise. From these we have to separate object contours from all other edges. This is done by a "confidence-based" algorithm described in [8], by assigning a "confidence value" to every detected local edge element. Initially, this is equal to the absolute value of the filter outputs a_i(\vec{x}). The confidence values are then modified by a specific algorithm that emphasizes local edge elements which are part of a continuous curve. This proceeds on the assumption that object contours are the dominant structures of an image and that noise, shadows or texture edges are continuous only on a finer scale.
The algorithm which emphasizes continuous curves combines filter results of different scales by searching for a counterpart at scale s_n for each edge element detected at a finer scale s_i (i < n). Since there is no one-to-one mapping between edges at different scales, localization of edge information on coarser scales being imprecise, one has to search a local area for an appropriate counterpart. This is done with the help of a similarity function which measures the degree of similarity between an edge at scale s_i and possible counterparts at scale s_n, taking into account strength and direction of the detected edges.
Figure 5: Preprocessing of images. First row: original grey level images. Second row: absolute value of Mallat filter results at scale s_0, interpreted as unmodified confidence values. Third row: modified confidence values, emphasizing the contours of the object, used as input for the further processing steps. Note that the confidence values assigned to the filter results of higher scales s_i (i > 0) can be modified in the same way.
The confidence values assigned to the edge elements at scale s_i are modified by multiplication with the similarity of the best-fitting counterpart. In this way, only the confidence of those local edge elements is emphasized which are also represented at a larger scale and thus belong to dominant structures in the image.
As the modified confidence values yield a more stable edge description of a scene, they are used in our system instead of the a_i(\vec{x}) extracted by Mallat filters. Fig. 5 shows some results of this preprocessing. It can be seen that the modified confidence values emphasize the contours as the dominant structures in the images while disturbing texture is weakened.
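The confidence update can be sketched as follows. The concrete similarity function below (orientation agreement weighted by relative strength, searched in a small window) is our own plausible stand-in for the one defined in [8], and all names are ours:

```python
import numpy as np

def modify_confidence(a_fine, phi_fine, a_coarse, phi_coarse, radius=3):
    """Multiply each fine-scale confidence by the similarity of its best
    counterpart in a local window at the coarser scale."""
    h, w = a_fine.shape
    conf = a_fine.copy()
    for y in range(h):
        for x in range(w):
            if a_fine[y, x] == 0:
                continue
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            # orientation agreement: 1 for parallel edges, 0 for orthogonal ones
            dphi = phi_coarse[y0:y1, x0:x1] - phi_fine[y, x]
            orient = 0.5 * (1 + np.cos(2 * dphi))
            # relative strength of the coarse counterpart, capped at 1
            strength = np.minimum(a_coarse[y0:y1, x0:x1] / (a_fine[y, x] + 1e-9), 1.0)
            conf[y, x] *= (orient * strength).max()
    return conf
```

Edge elements with no counterpart at the coarser scale (texture, noise) thus have their confidence suppressed, while elements of dominant curves keep theirs.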
2.4 Elastic Graph Matching
After the preprocessing of an input test image, the process of finding optimal similarities between it and our model graphs follows. This process is called elastic graph matching (EGM). A model graph is compared node by node to jet information extracted at the current position of the input image. The function used to find similarities is called similarity function and is defined as the normalized scalar product of the two jets \vec{J}_1 and \vec{J}_2,

S(\vec{J}_1, \vec{J}_2) = \frac{\vec{J}_1 \cdot \vec{J}_2}{|\vec{J}_1| \, |\vec{J}_2|}.
The total similarity of the model graph is optimized by shifting and scaling it. The optimal similarity value for a model graph determines its fit to the image. In order to classify the object in terms of its shape we use the whole gallery of class models for matching. The model graph with the highest similarity determines the shape class, but it also specifies the size and the position of the object within the image. Due to the multiview representation of our models, we also obtain a rough estimate of the object's orientation. The complete graph matching process used in this paper proceeds in two steps:
First step: rough location of the object in the image. The graph remains undistorted. Object location corresponds to the position with maximal similarity between the model graph and the input image.
Second step: adaptation of scale and improvement of location. The graph from the first step is allowed to vary in size by scaling it in the x- and the y-direction by a common factor, shifting the position of the resulting graph by a few pixels to find maximal similarity. Since we are using model graphs of different sizes for each class, the scale factor is always between 0.8 and 2.0.
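A stripped-down version of the two-step matching loop might look as follows. This is a sketch under our own assumptions: `jet_field` (a precomputed jet per pixel), the coarse grid step, the shift range and the candidate scale set are illustrative choices, not the paper's parameters.

```python
import numpy as np

def jet_similarity(j1, j2):
    # normalized scalar product of two jets
    n = np.linalg.norm(j1) * np.linalg.norm(j2)
    return float(j1 @ j2) / n if n else 0.0

def graph_similarity(node_offsets, model_jets, jet_field, origin, scale=1.0):
    """Average jet similarity of a (scaled) graph placed at `origin`."""
    h, w = jet_field.shape[:2]
    total = 0.0
    for (dy, dx), mj in zip(node_offsets, model_jets):
        y, x = int(origin[0] + scale * dy), int(origin[1] + scale * dx)
        if not (0 <= y < h and 0 <= x < w):
            return -1.0
        total += jet_similarity(mj, jet_field[y, x])
    return total / len(model_jets)

def match(node_offsets, model_jets, jet_field, scales=(0.8, 1.0, 1.4, 2.0)):
    h, w = jet_field.shape[:2]
    # step 1: rough location with the undistorted graph, on a coarse grid
    _, pos = max((graph_similarity(node_offsets, model_jets, jet_field, (y, x)), (y, x))
                 for y in range(0, h, 4) for x in range(0, w, 4))
    # step 2: adapt scale and refine the position by a few pixels
    candidates = [(graph_similarity(node_offsets, model_jets, jet_field,
                                    (pos[0] + dy, pos[1] + dx), s),
                   s, (pos[0] + dy, pos[1] + dx))
                  for s in scales for dy in range(-3, 4) for dx in range(-3, 4)]
    return max(candidates)   # (similarity, scale, position)
```

Classification then amounts to running `match` for every model graph in the gallery and taking the overall winner, which simultaneously yields shape class, size, position and (via the multigraph's view labels) a rough orientation.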
3 Results
We have tested our system with objects that are textured and whose shape deviates from that of our stored model classes, see Fig. 6. We also used different backgrounds and natural scenes. For testing purposes we used 180 test images (natural and synthetic). In spite of varying illumination conditions, the system has been able to classify correctly 147 out of the 180 objects. Among these test scenes were also difficult cases containing several objects in one image, where the system had to find the dominant object in the scene. Some typical examples of the matching results are shown in Fig. 6. Recognition against complex background is difficult since parts of the background may be false targets and can easily be mistaken for parts of the object. By using the confidence-based preprocessing we were able to overcome this difficulty. Difficulties may also arise in cases where part of the object is occluded, as in the last example in Fig. 6 where only the upper part of the can has been classified as a cylinder. Classification time on a Sparc-20 was 0.5-2 minutes per object, depending on the number of model graphs we used for matching. The preprocessing time is under 2.3 sec, while matching with one model graph requires 1-1.5 sec.
In 103 out of 154 attempts (67%) the orientation of the object was estimated correctly. We plan to capture more views of each model in our gallery (at present only 10) in order to achieve better performance.
4 Discussion
Departing from the face recognition system described in [10], we here take up the challenge of classifying unknown objects in spite of varying surface markings and substantial rotation in depth. Central to our approach is our emphasis on object contours and our multiview representation. We intend to systematically explore the multiview approach in order to obtain a stable and robust representation for our models with a minimal amount of data. In the future we plan to improve segmentation by using relative motion and by recognizing occluding objects. We also plan to organize the recognition process in a hierarchical way: (i) identification of object position, size and orientation in the image, (ii) coarse shape classification, (iii) fine shape classification and (iv) refinement of pose estimation.
Acknowledgements
We wish to thank Christian Eckes for making available his segmentation algorithm for model graph creation. We also thank Laurenz Wiskott, Michael Pötzsch and Thomas Maurer for fruitful discussions. This work was supported by the German Ministry for Science and Technology (grant 01IN504E9).

Figure 6: Examples of results of the matching procedure. The left column represents test objects; in the middle column the superimposed graphs are model graphs belonging to the objects shown in the right column.
References
1. Biederman I., "Recognition by Components: A Theory of Human Image Understanding", Psychological Review, 94, pp. 115-147, 1987.
2. Havaldar P. and Medioni G., "Inference of Segmented, Volumetric Shape from Three Intensity Images", in Proc. of the CVPR, 1996.
3. Lades M., Vorbrüggen J.C., Buhmann J., Lange J., v.d. Malsburg C., Würtz R.P. and Konen W., "Distortion Invariant Object Recognition in the Dynamic Link Architecture", IEEE Trans. on Computers, 42(3), pp. 300-311, 1993.
4. Mallat S. and Zhong S., "Characterization of Signals from Multiscale Edges", IEEE Trans. on PAMI, 14(7), pp. 710-732, 1992.
5. v.d. Malsburg C., "The correlation theory of brain function", Intern. Rep. 81-2, MPI Biophysikalische Chemie, Göttingen, 1981. Repr. in E. Domany, J.L. van Hemmen, and K. Schulten, eds, Models of Neural Networks II, pp. 95-119, Springer, Berlin, 1994.
6. Poggio T. and Edelman S., "A network that learns to recognize three-dimensional objects", Nature, 343, pp. 263-266, 1990.
7. Reiser K., "Learning persistent structure", PhD thesis, Res. Report 584, Hughes Aircraft Co., 1991.
8. Rehse O., Pötzsch M. and v.d. Malsburg C., "Edge Information: A Confidence-Based Algorithm Emphasizing Steady Curves", in Proc. of Int. Conf. on Artificial Neural Networks, pp. 851-856, Bochum, 1996.
9. Ulupinar F. and Nevatia R., "Perception of 3-D surfaces from 2-D contours", IEEE Trans. on PAMI, pp. 3-18, Jan. 1993.
10. Wiskott L., Fellous J.-M., Krüger N. and v.d. Malsburg C., "Face Recognition and Gender Determination", in Proc. of the International Workshop on Automatic Face- and Gesture-Recognition, Zürich, 1995.
11. Zadel S., "Ein lernfähiges, selbstorganisierendes System zum visuell gesteuerten Greifen bei Robotern" [a learning, self-organizing system for visually guided robot grasping], PhD thesis, VDI-Verlag, Düsseldorf, in preparation, 1997.
12. Zerroug M. and Nevatia R., "Segmentation and Recovery of SHGCs from a Single Intensity Image", in Proc. of the European Conference on Computer Vision, pp. 319-340, Stockholm, 1994.