Pattern Recognition 36 (2003) 69 – 78
www.elsevier.com/locate/patcog
Evaluation of a hypothesizer for silhouette-based 3-D object
recognition
Boaz J. Super ∗ , Hao Lu
Computer Vision and Robotics Laboratory, Department of Computer Science, University of Illinois at Chicago,
851 S. Morgan St., Chicago, IL 60607, USA
Received 1 May 2001; accepted 2 January 2002
Abstract
Shape retrieval and shape-based object recognition are closely related problems; however, they have different task contexts, performance criteria, and database characteristics. In previous work, we proposed a method for similarity-based 2-D shape retrieval using scale-space part decompositions, part-frequency distributions, and structural indexing. In this paper, we evaluate the use of that shape retrieval method as the hypothesis generation component of silhouette-based 3-D object recognition systems, using a performance criterion and test database appropriate for the new application. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Object recognition; Hypothesis generation; Silhouette; View-based; Scale-space; Index vector; Structural indexing; Shape retrieval
1. Introduction
Shape retrieval and shape-based object recognition are
closely related problems with similar solution methods.
However, the task context and performance criteria are
different. Shape retrieval is a user tool for information retrieval; for example, to help users search a pictorial catalog
of botanical varieties [1], or to index images and video
based on the shapes contained in them [2,3]. A shape retrieval method is successful if many of the retrieved results
are similar to the query (high precision), or many of the
similar items in the database are retrieved (high recall).
Because the ultimate arbiter of similarity is the human
user, these performance measures are necessarily subjective. Object recognition, on the other hand, is typically part
of an automated system. Success can be objectively measured as the proportion of images of test objects correctly
recognized.
∗ Corresponding author. Tel.: +1-312-413-8719; fax: +1-312-413-0024.
E-mail address: [email protected] (B.J. Super).
To avoid the high cost of comparing an image to every object model in a database, object recognition systems often employ a fast hypothesizer to retrieve a set of likely candidate objects, followed by a more costly and more accurate verifier which reduces the set of hypotheses. Ideally, the reduced set will consist of one correct match. In a silhouette view-based system, a set of object silhouettes imaged from different viewpoints is stored for each model object, and a silhouette of the object to be recognized is compared to the stored silhouettes. A shape retrieval engine can be used as a hypothesizer to return a set of silhouettes similar to the input, but it may be the case that only a few, or none, of these silhouettes belong to the correct object model. Thus, high precision or recall is not sufficient; what matters is whether any one of the retrieved similar silhouettes is from a view of the correct object model. The well-known distinction between object shape and object identity prevents these two criteria from being the same.
An alternative scheme is for the hypothesizer to rank the database silhouettes and the verifier to process them in rank order, stopping when a good match is found. In this case, the performance of a particular hypothesizer should be evaluated by measuring the reduction in the expected number of
silhouettes that the verifier must process. This is a measure of the reduction in the cost of the verification stage obtained by using the hypothesizer. We adopt the hypothesis ranking approach in this paper.
In Refs. [4–6], we developed a fast 2-D shape retrieval system, based on scale-space decompositions of silhouettes, part-frequency histograms coded as index vectors, and structural indexing. In experimental evaluations, the method retrieved at least one similar shape in the top three retrieved items 99–100% of the time on two test databases. Good scalability was observed over database sizes from O(10²) to O(10³), with average retrieval times from 0.7 to 7 ms, respectively. Graceful degradation under occlusion was also demonstrated.
The aim of the present paper is to evaluate our 2-D
shape retrieval engine as a hypothesizer for silhouette-based
recognition of isolated 3-D objects, using performance measures appropriate to this new application. Since silhouette databases for 3-D object recognition tend to be qualitatively different from databases used in 2-D shape retrieval applications (e.g., [1,5,7,8]), we use a new test database constructed for this evaluation.
By “isolated object,” we refer to a single object placed
against a contrasting background. This case is of course
much easier to handle than recognition of objects in clutter, and therefore can be solved more effectively using computer vision techniques available today. There are numerous applications in which processing of isolated objects is of interest; for example, digitization of museum collections, archaeological artifacts, and zoological/botanical specimens. Automatic acquisition of silhouettes is particularly reliable in these controlled situations, and can be performed simultaneously with regular image acquisition or 3-D scanning.
Object recognition systems that match an input silhouette
against a database of stored silhouette views have a long
history (e.g., [9–13]). Methods of this type differ in the representation of the silhouette (e.g., Fourier descriptors [9,10], moments [10], positions of maxima in curvature scale space [13], attributed strings [14]), and also differ in the indexing scheme (e.g., geometric hashing [15], structural indexing [16], hierarchical indexing [14]). A detailed discussion of silhouette-based methods for both shape retrieval and object recognition was given in Ref. [6]. However, because both the hypothesizer evaluated here and [13] use curvature scale space, it will be helpful to describe the differences briefly here. In Ref. [13], each significant segment of the silhouette contour is represented by a point in curvature scale space. Matching is performed by alignment of these point sets in scale space. Our method employs curvature scale space in a very different way. First, curvature scale space is used to partition the silhouette contour into overlapping local parts at all scales; then, contour scale space is used to classify the parts into types. Matching is performed not by alignment but by comparing part-type histograms. Efficiency is achieved by indexing on local parts instead of by prefiltering.
Silhouettes have also been employed for purposes besides
view-based object recognition. Silhouettes have been used
for recovery of 3-D shape by shape-from-contour [17] and
by volume intersection [18,19]. Although there are many object recognition methods based on matching images to 3-D models, or matching 3-D models directly to 3-D models, we find that object recognition based on matching image silhouettes is fast and has low storage requirements. Furthermore, acquisition of silhouettes for view-based object recognition can be performed easily and inexpensively using an uncalibrated camera (see Section 2.1 for a description of our imaging platform).
The remainder of the paper is organized as follows. Section 2 describes the object recognition system design. This
section includes a summary of the shape retrieval engine
from Refs. [5,6] that will be evaluated as the hypothesizer.
Section 3 describes the test database we constructed to perform the evaluation. Section 4 quantitatively evaluates the
hypothesizer.
2. System design
A block diagram of a typical silhouette-based object recognition system is shown in Fig. 1. It has four parts: acquisition, silhouette contour extraction, hypothesis generation, and hypothesis verification. The shape retrieval engine from Refs. [5,6] will be used as the hypothesis-generation stage. As the goal of this paper is to evaluate the hypothesizer independently of any specific verifier, we use a general model of a verifier as the last stage. The following sections describe the system.
2.1. Acquisition
Objects were placed on a turntable that could be rotated through 360°, and imaged with a single camera mounted on a stand. The turntable top surface and the background had the same color, and the illumination was controlled, in order to obtain good discrimination between object and background. (An alternative arrangement would be to use backlighting.) Fig. 2 illustrates the imaging station and geometry. Views of the object at different azimuths could be acquired by rotating the turntable. Views at different elevations required adjustment of the camera height. In an environment for automatic digitization of large numbers of objects it would be advantageous to have computer control of both azimuth and elevation. Fig. 3 shows examples of object images. Fig. 5 shows 16 silhouettes of object 14 at elevation 0° and azimuth angles 22.5° apart.
2.2. Silhouette contour extraction
Iterative thresholding was performed to automatically separate the object from the background. Then a boundary-following algorithm extracted the contour corresponding to the object's silhouette [20]. The contour was stored as an ordered list of points. Figs. 4–5 show examples of silhouettes. The contours were expanded to fit the individual panels in the figures to make the silhouettes legible. This normalization is for display only; the method itself does not require normalization.

Fig. 1. Block diagram of silhouette-based object recognition system (image acquisition → silhouette extraction → hypothesis generation → ranking → hypothesis verification → recognized object, with a silhouette database feeding hypothesis generation).

Fig. 2. Image acquisition platform (turntable providing azimuth rotation; camera height providing elevation).

2.3. Hypothesizer

The hypothesizer is the shape retrieval engine we developed in Refs. [4–6]. As this is the module evaluated in this paper, we summarize the method in this section to make the paper self-contained. For a detailed description and a discussion of its relation to alternative methods in the literature, see Ref. [6]. The hypothesizer consists of three components. The first component decomposes the silhouette contour into a collection of scale-space parts. The second component constructs histograms of the part types and represents them as index vectors which are stored in the database. The third component compares the input silhouette's index vector with the index vectors in the database, using structural indexing to perform the comparison efficiently. We review each of the components below.

The term "feature" is used in the literature to mean either a part or a measurement. To prevent confusion, in this paper "feature" will always refer to a measurement.

2.3.1. Scale-space part-based representation of silhouettes

A scale-space method is used to generate a rich and redundant set of parts at multiple scales of the silhouette contour. The method consists of three stages: generation of the contour scale space and the curvature scale space, segmentation of the silhouette contour into scale-space regions, and linking of sequences of neighboring scale-space regions to form shape parts. The advantage of the scale-space approach is that it does not presuppose a single scale of shape analysis, and that the embedded segmentation method is essentially parameter-free.

We assume that an object silhouette is represented by a planar contour c(u) = (x(u), y(u)), where u is the parameter. A two-parameter contour scale space c(u, σ) is generated by convolving the coordinate functions of c(u) by 1-D Gaussians over a range of scales σ:

c(u, σ) = (x(u, σ), y(u, σ)) ≡ c(u) ∗ g_σ(u) = (x(u) ∗ g_σ(u), y(u) ∗ g_σ(u)),

where '∗' denotes convolution and g_σ(u) = (2πσ²)^{−1/2} e^{−u²/2σ²} is a 1-D Gaussian with scale σ [21]. As σ increases, the contour becomes progressively smoother, a process called contour evolution. Fig. 6a shows an example.
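The contour-evolution step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is ours, and we use a normalized periodic (wrap-around) Gaussian kernel applied by FFT, since the contour is closed.

```python
import numpy as np

def smooth_contour(x, y, sigma):
    """Smooth the coordinate functions of a closed contour with a
    periodic 1-D Gaussian of scale sigma (contour evolution)."""
    n = len(x)
    u = np.arange(n)
    # Circular distance so the kernel wraps around the closed contour.
    d = np.minimum(u, n - u).astype(float)
    g = np.exp(-d**2 / (2.0 * sigma**2))
    g /= g.sum()
    # Circular convolution via FFT: c(u, sigma) = c(u) * g_sigma(u).
    xs = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(g)))
    ys = np.real(np.fft.ifft(np.fft.fft(y) * np.fft.fft(g)))
    return xs, ys
```

Repeating this for increasing sigma yields the family of progressively smoother contours of Fig. 6a.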
At any one scale, the silhouette contour is segmented at zero-crossings of its curvature function, which correspond to the contour's inflection points. These points are invariant to translation, rotation, and scaling. By definition, the segments between zero crossings have no internal changes of sign, so they bend in only one direction, forming an alternating sequence of protrusions and concavities. Other types of critical points, such as negative curvature minima [22], could also be used.

To find the zero crossings, a curvature scale space is computed [21]:

κ(u, σ) = [ẋ(u, σ) ÿ(u, σ) − ẍ(u, σ) ẏ(u, σ)] / [ẋ(u, σ)² + ẏ(u, σ)²]^{3/2},

where

ẋ(u, σ) = x(u) ∗ ġ_σ(u),   ẍ(u, σ) = x(u) ∗ g̈_σ(u),
ẏ(u, σ) = y(u) ∗ ġ_σ(u),   ÿ(u, σ) = y(u) ∗ g̈_σ(u)

are smoothed versions of the first and second derivatives of the contour's coordinate functions. κ(u, σ) is the curvature at position u on the contour at scale σ. A zero-crossing graph shows the u-values of the zero crossings as functions of σ, as shown in Fig. 6b.¹ At any one scale σ, the contour is partitioned into segments, represented in the zero-crossing

¹ The zero-crossing graph is called a curvature scale space image in Ref. [21].
Fig. 3. Twenty-five objects in the test set.
Fig. 4. Silhouette contours of the views in Fig. 3.
Fig. 5. Sixteen views of object 14 (dinosaur toy), at azimuth intervals of 22.5°.
Fig. 6. (a) Evolution of a contour. The bold line indicates the original contour. (b) The corresponding zero-crossing graph showing the positions (u) of zero crossings as functions of scale σ.
graph as horizontal spans between successive zero-crossing traces. The segments at adjacent scales are generally not independent. A contour segment typically exists over some range of scales until it is finally smoothed away. This process will now be explained more precisely.

As σ increases, zero crossings shift position, and eventually merge and disappear in pairs, forming the loops in the zero-crossing graph [21].² Zero-crossings can never be created, thus the loops always point downwards. When two zero crossings merge and disappear at some scale σ₁, the three segments containing them are merged into one new segment. That segment persists until it is merged with other segments at a larger scale σ₂. The range of scales (σ₁, σ₂) over which a segment exists will be called the lifetime of that segment. A scale-space segment corresponds to a region in the zero-crossing graph, bounded on the left and right by two zero crossings and above and below by the scales at which zero-crossing merging events occur (see Fig. 7). This is a special case of Witkin's [23] definition of regions in general 1-D signal scale spaces. In the following, the term segment without qualification will refer to a scale-space segment, and contour segment will refer to a section of the original contour c(u).

² The loops in Fig. 6b are not closed because in a practical implementation, σ is sampled discretely.
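The curvature computation and inflection-point detection at a single scale can be sketched as below. This is an illustration only: periodic central differences stand in for the Gaussian-derivative convolutions of the text, and the function name is our own.

```python
import numpy as np

def curvature_zero_crossings(x, y, sigma):
    """Curvature of a closed contour at one scale, and the indices
    where it changes sign (the contour's inflection points)."""
    n = len(x)
    u = np.arange(n)
    d = np.minimum(u, n - u).astype(float)      # circular distance
    g = np.exp(-d**2 / (2.0 * sigma**2))
    g /= g.sum()

    def conv(f):                                # periodic convolution
        return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

    xs, ys = conv(x), conv(y)
    # Periodic central differences approximate the smoothed derivatives.
    xd = (np.roll(xs, -1) - np.roll(xs, 1)) / 2.0
    yd = (np.roll(ys, -1) - np.roll(ys, 1)) / 2.0
    xdd = np.roll(xs, -1) - 2.0 * xs + np.roll(xs, 1)
    ydd = np.roll(ys, -1) - 2.0 * ys + np.roll(ys, 1)
    kappa = (xd * ydd - xdd * yd) / (xd**2 + yd**2) ** 1.5
    # Sign changes between consecutive samples mark zero crossings.
    zc = np.nonzero(np.sign(kappa) != np.sign(np.roll(kappa, -1)))[0]
    return kappa, zc
```

On a convex contour the returned zero-crossing set is empty; a contour with concavities yields an even number of sign changes, delimiting the alternating protrusions and concavities described above.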
The database matching method is based on comparing histograms of the part types occurring in two silhouette shapes. Thus, parts that can distinguish among different shapes are needed. Individual segments may not have sufficient discriminating power, so sequences of segments are used instead. Specifically, a part of length L is defined as a spatial
sequence of L adjacent scale-space segments existing at a common scale (Fig. 7). If the same sequence of scale-space segments (regions) exists at more than one scale it is counted only once. A part persists as long as none of the segments composing it undergo any merges. Parts overlap both in space and in scale, providing redundancy which enhances robustness.

Fig. 7. Diagram of a typical portion of a zero-crossing graph. Six zero crossings (z₁, …, z₆) and their traces as a function of σ are shown. Regions of scale space corresponding to scale-space segments (s₁, …, s₇) are indicated by shading. Parts consist of sequences of regions that are adjacent at a common scale. The parts of length 3 in this example, indicated by the double-headed arrows, are p₁ = (s₁, s₂, s₃), p₂ = (s₂, s₃, s₄), p₃ = (s₃, s₄, s₅), and p₄ = (s₆, s₄, s₅).

In the following, P(S) will denote the set of parts generated from silhouette S. For convenience, only parts of a fixed length L will be used, although this is not necessary. In Ref. [6], we showed that the number of parts of length L is bounded above by (L/2 + 1) times the number of segments in the original silhouette contour. Since L is typically small (L = 3 in this paper), the use of scale-space parts instead of contour segments does not incur a substantial additional cost. Furthermore, the finest scales (σ ≤ 3 pixels) are not used, since structures at those scales are often due to discretization noise. As a result, the number of parts may be less than the number of segments on the original silhouette contour.

2.3.2. Part-type histograms and index vectors

The shape parts are sequences of scale-space regions. This raw multiscale representation is not convenient for database indexing. Instead, the system classifies parts into types, and the database is indexed by the part types. Each silhouette shape will then be represented in compact form by the frequencies with which the part types occur in the silhouette.

The parts are classified into part types by first representing them with local shape features and then quantizing those features into a finite number of classes.

The part feature vector is defined as follows. The total curvature of a segment, curv(s), is measured by integrating the derivative of local orientation along the segment. The length contrast of two neighboring segments s_i, s_{i+1} is measured as

LC(s_i, s_{i+1}) = [l(s_{i+1}) − l(s_i)] / [l(s_{i+1}) + l(s_i)],

where l(s) is the arc length of segment s. Length contrast is between −1 and 1. LC < 0 indicates a long segment followed by a short segment; LC > 0 indicates the reverse; and LC ≈ 0 indicates that the two segments are commensurate in size. Then, denoting the segments of a part of length L by s₁, …, s_L, the part is represented by the (2L−1)-dimensional feature vector

H = (curv(s₁), LC(s₁, s₂), curv(s₂), …, LC(s_{L−1}, s_L), curv(s_L)).

The feature vector is then quantized by quantizing each feature dimension. Each distinct possible quantized feature vector is one part type, corresponding to one hyper-rectangle in the feature space. T will denote the set of part types used by the hypothesizer, which may be a subset of all possible part types. T(S) ⊆ T will denote the set of part types occurring in silhouette S. Note that T(S) is a set of part types, while P(S) is a set of parts; i.e., individual occurrences of part types.

The use of coarse quantization to reduce quantization errors is recommended by Califano and Mohan [24] and Stein and Medioni [16]. Accordingly, segment curvature is quantized into eight non-uniform ranges: (−∞, −π), [−π, −π/2), [−π/2, −π/4), [−π/4, 0), [0, π/4), [π/4, π/2), [π/2, π), [π, ∞), and length contrast is quantized into three ranges: [−1, −1/3), [−1/3, 1/3), and [1/3, 1]. The number of part types of length L is thus 8^L · 3^{L−1}.
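The quantization into 8^L · 3^{L−1} part types can be sketched as follows; encoding the quantized feature vector as a single integer id is our own illustrative choice, not necessarily the paper's data structure.

```python
import math

# Curvature bin edges (radians) and length-contrast bin edges from the text.
CURV_EDGES = [-math.pi, -math.pi / 2, -math.pi / 4, 0.0,
              math.pi / 4, math.pi / 2, math.pi]          # 8 bins
LC_EDGES = [-1.0 / 3.0, 1.0 / 3.0]                        # 3 bins

def _bin(value, edges):
    """Index of the half-open bin containing value."""
    i = 0
    while i < len(edges) and value >= edges[i]:
        i += 1
    return i

def part_type(features):
    """Map a (2L-1)-dimensional feature vector (curv, LC, curv, ...)
    to one integer part-type id in [0, 8**L * 3**(L-1))."""
    type_id = 0
    for k, f in enumerate(features):
        if k % 2 == 0:                       # curvature dimension
            type_id = type_id * 8 + _bin(f, CURV_EDGES)
        else:                                # length-contrast dimension
            type_id = type_id * 3 + _bin(f, LC_EDGES)
    return type_id
```

For L = 3 (five features), the ids range over the 8³ · 3² = 4608 part types quoted in the text.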
The silhouette as a whole is represented by the number of occurrences of each part type in that silhouette. Let T = {t₁, …, t_n} be the set of part types used by the system and let P_S(t_i) = {p_{i1}, …, p_{in_i}} denote the set of occurrences of part type t_i in silhouette S. Define the saliency-weighted part-type histogram of a silhouette S given T as

D*_T(t_i, S) = Σ_{p_{ij} ∈ P_S(t_i)} w(p_{ij}),   for i = 1, …, n,   (1)

where the saliency weight w(p_{ij}) is a function of the part and/or the part type. When w ≡ 1, D*_T is simply the histogram of the part types. In the current implementation, the saliency of a part p is defined as

w(p) = Σ_{i=1}^{L} log₂(σ₂(s_i)/σ₁(s_i)),   (2)

where s₁, …, s_L are the scale-space segments in the part and σ₁(s), σ₂(s) are the smallest and largest scales of segment s, respectively. This saliency function emphasizes parts that have a long scale-space lifetime.
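Eqs. (1)–(2) translate directly into code. The representation of a part as a (type_id, list of per-segment scale lifetimes) pair is a hypothetical encoding chosen for illustration.

```python
import math
from collections import defaultdict

def saliency(part_scales):
    """Eq. (2): sum over the part's segments of log2(sigma2/sigma1),
    where (sigma1, sigma2) is each segment's scale-space lifetime."""
    return sum(math.log2(s2 / s1) for (s1, s2) in part_scales)

def weighted_histogram(parts):
    """Eq. (1): saliency-weighted part-type histogram.
    `parts` is a list of (type_id, [(sigma1, sigma2), ...]) pairs."""
    hist = defaultdict(float)
    for type_id, scales in parts:
        hist[type_id] += saliency(scales)
    return dict(hist)
```

Setting every lifetime ratio to 1 recovers the plain (unweighted) part-type histogram, since each part then contributes zero weight only when all its segments are instantaneous; with w ≡ 1 one would instead add 1 per part.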
The part-type histogram of a silhouette S will be represented by an index vector v̄ = (v₁, …, v_n), where v_i = D*_T(t_i, S) for i = 1, …, n. Since the same set T is used for all silhouettes handled by the system, all index vectors have the same length, and corresponding components refer to the same part type.

The similarity of two silhouettes is measured by comparing part-type histograms using a normalized histogram intersection measure

M(v̄, q̄) = [Σ_{i=1}^{n} min(v_i, q_i)] / max(Σ_{i=1}^{n} v_i, Σ_{i=1}^{n} q_i),   (3)

where v̄, q̄ are the index vectors of two silhouettes. The normalization factor is the maximum of the sums over the two histograms, so that silhouettes that are very dissimilar in complexity will not produce a high similarity score.
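Eq. (3) on sparse index vectors, stored as dicts from part type to weight (a natural but assumed encoding), can be sketched as:

```python
def similarity(v, q):
    """Normalized histogram intersection, Eq. (3): shared mass divided
    by the larger of the two histogram totals."""
    inter = sum(min(w, q[t]) for t, w in v.items() if t in q)
    return inter / max(sum(v.values()), sum(q.values()))
```

Only part types present in both histograms contribute to the numerator, which is what makes the sparse evaluation of Section 2.3.3 possible.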
To summarize: The part feature vectors are used to classify silhouette parts, and to define the meaning of individual components of the index vectors. Silhouettes are compared by matching index vectors, not feature vectors.

We note that the method is invariant to translation, rotation, and scaling since invariant features are used, and is invariant to starting point on the silhouette contour due to the use of part-type histograms for matching.
2.3.3. Fast ranking of silhouette match hypotheses

This section summarizes an efficient method for computing the similarity scores between the input silhouette's index vector and the index vectors of the stored silhouettes.

Let v̄₁, …, v̄_m denote the index vectors of all the silhouettes of all the objects in the database, and let q̄ be the index vector of the input silhouette. The similarity score M(v̄_j, q̄) is computed for all j = 1, …, m and the stored silhouettes are ranked. The sums in the denominator of M (Eq. (3)) can be precomputed for the stored silhouettes; therefore, the expensive part of the computation is the numerator of M, since every component of the input index vector is compared to the corresponding component of every index vector in the database.

Typically, the part type set T is large but the number of part types occurring in any one silhouette is small. Thus, most components of the index vectors are zero and contribute nothing to the similarity score. High speed can be achieved by computing only the non-zero terms of M.

This is accomplished by indexing the database independently by each part type (structural indexing [16]). Each part type t ∈ T that appears in the database points to a list of the silhouettes in the database that contain at least one occurrence of t. The similarity scores M(v̄_j, q̄) for all stored silhouettes j = 1, …, m are then computed as follows. For each non-zero component i of the input index vector q̄, the list of stored silhouettes containing part type t_i is fetched using a hash table. For each entry on the list, corresponding to some stored silhouette S_j, the i-th term of the numerator of M(v̄_j, q̄) is computed and added to the score for silhouette S_j. Normalization of the scores for each silhouette can be performed after the sums have been completed, and all the scores are then ranked. The process is fast: in Ref. [6], the average time to match an input vector to a 1310-silhouette database and to rank all 1310 silhouettes was 7 ms on a Sun Ultra-10 333 MHz computer.
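The structural-indexing scheme above can be sketched as an inverted index. The class and method names are our own; one simplification relative to the text is that silhouettes sharing no part type with the query score zero and are simply omitted from the returned ranking.

```python
from collections import defaultdict

class StructuralIndex:
    """Inverted index from part type to the silhouettes containing it;
    scores accumulate only over the non-zero terms of Eq. (3)."""

    def __init__(self):
        self.postings = defaultdict(list)   # type_id -> [(sil_id, weight)]
        self.totals = {}                    # sil_id -> histogram total

    def add(self, sil_id, hist):
        """Store a silhouette's sparse index vector {type_id: weight}."""
        self.totals[sil_id] = sum(hist.values())
        for t, w in hist.items():
            self.postings[t].append((sil_id, w))

    def rank(self, query):
        """Return (score, sil_id) pairs in descending similarity order."""
        q_total = sum(query.values())
        scores = defaultdict(float)
        for t, qw in query.items():         # only non-zero query components
            for sil_id, w in self.postings.get(t, []):
                scores[sil_id] += min(w, qw)
        ranked = [(s / max(self.totals[j], q_total), j)
                  for j, s in scores.items()]
        return sorted(ranked, reverse=True)
```

The cost of `rank` is proportional to the total length of the fetched posting lists, not to the database's full index-vector dimensionality, which is the point of structural indexing.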
2.4. Model of verifier

Since our goal is to evaluate the hypothesizer independently of any specific verifier, we use a general model of a verifier, which satisfies two properties: (1) the verifier checks candidate matches in the rank order produced by the hypothesizer, and (2) the verifier makes no errors. This model will allow us to measure the hypothesizer's contribution to performance without confounding it with verifier errors. Specifically, we will compute the expected number of verifications performed by the ideal verifier for each of the different hypothesizers.
3. Test database

We constructed a test database for evaluation of the hypothesizer, as follows. Twenty-five common objects and toys shown in Fig. 3 were used. Silhouettes of the views in Fig. 3 are shown in Fig. 4. Sixteen views were taken of each object, at elevation 0° and azimuth spacing of 22.5°. Thus, there were a total of 400 images stored in the database. Each image was 320 × 243 pixels and was in color to enhance the reliability of automatic thresholding. The silhouette contours were represented at one-pixel resolution.

Database index vectors were generated for the silhouettes as described in Section 2.3 above. Significant storage compression was achieved: the original images required 77,760 pixels each; the contours required 685 points on average; and the index vectors contained 70 non-zero components on average.
4. Evaluation

Each of the 400 images was used as a test image in turn. The self-match of the test image to its copy in the database is excluded from all results reported below. Thus, each test image is separated by a minimum of 22.5° from the nearest view of the correct object. This is a fairly stringent test, since a 22.5° viewpoint change can result in a significant change in the silhouette in some cases.

We consider the database to be a difficult one due to high levels of noise in the silhouette contour extraction (see the dinosaur's leg in view 13 of Fig. 5 for an example), and due to the similarity of some of the views of different objects (e.g., objects 20 and 24, or objects 16 and 18, in Fig. 3). Cross-object matches are always counted as incorrect, even in cases where a human observer would confuse views from two similar objects.
Table 1
Statistics of the rank of first correct match for different hypothesizers. For the random ordering case only, the expected value is reported instead of the sample mean

Hypothesizer                          Rank of first correct match
                                      Mean    Median
Part-based (histogram intersection)   5.2     1
None (random ordering)                25.0    18
Global (aspect ratio)                 11.7    6
Combined (part-based & global)        3.6     1

Table 2
Statistics of the rank of first correct match for different hypothesizers, for viewpoint sampling at 45° instead of 22.5° (8 views per object instead of 16). For the random ordering case only, the expected value is reported instead of the sample mean
Fig. 8. Cumulative distribution of rank of first correct hypothesis for the four hypothesizer cases.
The hypothesizer ranked all 399 database silhouettes in descending order of similarity to the test silhouette, where similarity is defined as in Eq. (3). The ideal verifier would test each silhouette in this ordering, stopping when the first correct match is found. The mean rank of the first correct match, computed over all 400 test inputs, was 5.2; the median was 1 (Table 1). The cumulative distribution of the rank of the first correct hypothesis is shown in Fig. 8 (the solid line).

To provide a basis for comparison, the system was compared to an object recognition system with no hypothesizer, and also to an object recognition system with a hypothesizer that used a perceptually salient global feature of silhouettes, the aspect ratio.

The system without a hypothesizer was modeled by presenting the verifier with database silhouettes in arbitrary order. We assumed all orderings of database silhouettes to be equally likely. The cumulative distribution of the rank of the first correct hypothesis is plotted as a dotted line in
Hypothesizer                          Rank of first correct match
                                      Mean    Median
Part-based (histogram intersection)   7.3     2
None (random ordering)                25.0    19
Global (aspect ratio)                 12.2    6
Combined (part-based & global)        6.5     2
Fig. 8. The expected rank of the first correct match is 25, and the median is 18 (Table 1). Thus, the part-based hypothesizer reduced the expected number of candidates the verifier must process by approximately 20 (25 − 5.2), yielding a verification-stage speedup factor of 4.81.
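The random-ordering baseline can be checked analytically: when k correct items hide among m uniformly shuffled candidates, the rank of the first correct item has expectation (m+1)/(k+1). A sketch, assuming m = 399 candidates after excluding the self-match and k = 15 remaining correct views, consistent with the reported expectation of 25 and median of 18:

```python
from fractions import Fraction

def first_match_stats(m, k):
    """Expectation and median of the rank of the first of k correct
    items in a uniformly random ordering of m items."""
    expected = Fraction(m + 1, k + 1)
    p_none = Fraction(1)          # P(no correct item in the first r draws)
    for r in range(1, m + 1):
        p_none *= Fraction(m - k - (r - 1), m - (r - 1))
        if 1 - p_none >= Fraction(1, 2):
            return expected, r    # r is the median rank
```

The verification-stage speedup quoted in the text is then 25/5.2 ≈ 4.81.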
In the system with the global-feature hypothesizer, the aspect ratio was computed as √(E_min/E_max), where E_min, E_max are the minimum and maximum (invariant) second moments. The mean rank of the first correct match was 11.7 and the median was 6. This hypothesizer thus reduced the mean number of hypotheses processed by the verifier by about 13 instead of 20. The verification-stage speedup is more than a factor of two (2.25) greater for the part-based hypothesizer than for the global-feature hypothesizer.
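One reading of this aspect-ratio feature, under the assumption that E_min and E_max are the eigenvalues of the 2 × 2 second central moment matrix of the contour points, can be sketched as:

```python
import numpy as np

def aspect_ratio(x, y):
    """sqrt(Emin/Emax) from the second central moments of the contour
    points; rotation- and translation-invariant by construction."""
    pts = np.stack([x - x.mean(), y - y.mean()])
    cov = pts @ pts.T / len(x)            # 2x2 second central moment matrix
    emin, emax = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return float(np.sqrt(emin / emax))
```

A circle gives a ratio of 1; an elongated shape gives a ratio close to 0.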
We also evaluated a combination of the global-feature and part-based hypothesizers [6]. Specifically, the ranking of the database silhouettes was modified so that silhouettes with aspect ratios within 0.2 of the input silhouette's aspect ratio were ranked before all of the others. Within each group (≤ 0.2 and > 0.2 aspect ratio differences), the ordering based on the similarity measure in Eq. (3) was maintained. The mean rank of the first correct match was 3.6 and the median was 1. The combination of the two similarity measures thus yielded the best results, providing a small but significant additional reduction of 1.6 silhouettes in the mean number of silhouettes processed by the verifier. Comparison of the distributions of all four cases in Fig. 8 shows that the largest contribution to the performance of the combined hypothesizer came from the part-based component.
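The combined hypothesizer's re-ranking is a stable two-group partition; a sketch with hypothetical inputs (a ranked list of silhouette ids and a dict of their aspect ratios):

```python
def combined_rank(ranking, aspect, q_aspect, tol=0.2):
    """Silhouettes with aspect ratio within tol of the query's move
    ahead of the rest; each group keeps the hypothesizer's order."""
    near = [s for s in ranking if abs(aspect[s] - q_aspect) <= tol]
    far = [s for s in ranking if abs(aspect[s] - q_aspect) > tol]
    return near + far
```

Because the partition is stable, the part-based similarity order of Eq. (3) is preserved within each group, exactly as described above.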
As viewpoint sampling resolution becomes sparse, the performance of any silhouette-based object recognition system is expected to decrease. To determine whether the hypothesizer still provides a benefit under these conditions, we repeated the evaluation using views spaced 45° apart in azimuth instead of 22.5° apart. Thus, each test object was represented by 8 views in the database instead of 16. As expected, overall performance decreased; however, the hypothesizer still provided a large benefit, as shown in Table 2. The part-based hypothesizer reduced the mean number of verifications by approximately 18, from 25 to 7.3, yielding a verification-stage speedup of 3.42. The
combined method provided an additional reduction of 0.8 verifications.
5. Conclusion
The aim of this paper was to quantitatively evaluate the use of a part-based shape retrieval engine as a hypothesizer for a silhouette-based object recognition system, using a performance measure and test database appropriate to the new application. We constructed a test database of 25 objects with a total of 400 views, and measured the speedup of the verification stage obtained by using the hypothesizer. Substantial speedups were obtained relative to a recognition system without a hypothesizer, and relative to using a recognition system with a global-feature based hypothesizer. Best results were obtained by combining both global and local (part-based) information, with the greater contribution coming from the local information. Substantial speedups were also obtained on a database with sparse viewpoint sampling.
Acknowledgements

The first author was supported in part by NIH grant 1-RO1
EY11747 during the period of this project. The figures are
used by permission of CVRL and SPIE.

References

[1] G. van der Heijden, M. Worring, Domain concept to feature mapping for a plant variety image database, in: A. Smeulders, R. Jain (Eds.), Image Databases and Multi-Media Search, World Scientific, New Jersey, 1997, pp. 301–308.
[2] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, Efficient and effective querying by image content, J. Intell. Inf. Syst. 3 (1994) 231–262.
[3] M. Flickner, et al., Query by image and video content: the QBIC system, IEEE Comput. 28 (1995) 23–32.
[4] X. Li, B.J. Super, Fast shape retrieval using term frequency vectors, Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries, Ft. Collins, Colorado, 1999, pp. 18–22.
[5] B.J. Super, Visual shape retrieval using multiscale term distributions, Proceedings of the SPIE Conference on Storage and Retrieval for Media Databases, SPIE Proceedings, Vol. 3972, San Jose, CA, 2000, pp. 222–233.
[6] B.J. Super, Fast retrieval of isolated visual shapes, in preparation.
[7] F. Mokhtarian, S. Abbasi, J. Kittler, Efficient and robust retrieval by shape content through curvature scale space, in: A. Smeulders, R. Jain (Eds.), Image Databases and Multi-Media Search, World Scientific, New Jersey, 1997, pp. 51–58.
[8] S. Sclaroff, Distance to deformable prototypes: encoding shape categories for efficient search, in: A. Smeulders, R. Jain (Eds.), Image Databases and Multi-Media Search, World Scientific, New Jersey, 1997, pp. 149–164.
[9] T.P. Wallace, P.A. Wintz, An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors, Comput. Graphics Image Process. 13 (1980) 99–126.
[10] A.P. Reeves, R.J. Prokop, S.E. Andrews, F.P. Kuhl, Three-dimensional shape analysis using moments and Fourier descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 10 (6) (1988) 937–943.
[11] P.J. Besl, R.C. Jain, Three-dimensional object recognition, Comput. Surv. 17 (1985) 75–145.
[12] B. Vijayakumar, D. Kriegman, J. Ponce, Invariant-based recognition of complex curved 3D objects from image contours, Comput. Vision Image Understanding 72 (3) (1998) 287–303.
[13] S. Abbasi, F. Mokhtarian, Affine-similar shape retrieval: application to multiview 3-D object recognition, IEEE Trans. Image Process. 10 (1) (2001) 131–139.
[14] A. Del Bimbo, P. Pala, Shape indexing by multi-scale representation, Image Vision Comput. 17 (1999) 245–261.
[15] Y. Lamdan, H.J. Wolfson, Geometric hashing: a general and efficient model-based recognition scheme, Proceedings of the Second International Conference on Computer Vision, Tarpon Springs, FL, 1988, pp. 238–249.
[16] F. Stein, G. Medioni, Structural indexing: efficient 2-D object recognition, IEEE Trans. Pattern Anal. Mach. Intell. 14 (12) (1992) 1198–1204.
[17] F. Ulupinar, R. Nevatia, Shape from contour, IEEE Trans. Pattern Anal. Mach. Intell. 17 (2) (1995) 120–135.
[18] A. Laurentini, The visual hull concept for silhouette-based image understanding, IEEE Trans. Pattern Anal. Mach. Intell. 16 (2) (1994) 150–162.
[19] K.N. Kutulakos, S.M. Seitz, A theory of shape by space carving, Int. J. Comput. Vision 38 (3) (2000) 199–218.
[20] R. Jain, R. Kasturi, B. Schunck, Machine Vision, McGraw-Hill, New York, 1995.
[21] F. Mokhtarian, A. Mackworth, Scale-based description and recognition of planar curves and two-dimensional shapes, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1) (1986) 34–43.
[22] D.D. Hoffman, W.A. Richards, Parts of recognition, Cognition 18 (1985) 65–96.
[23] A. Witkin, Scale space filtering, Proceedings of the International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, 1983, pp. 1019–1022.
[24] A. Califano, R. Mohan, Multidimensional indexing for recognizing visual shapes, IEEE Trans. Pattern Anal. Mach. Intell. 16 (4) (1994) 373–392.
About the Author—BOAZ J. SUPER received the Ph.D. degree in Computer Science from the University of Texas at Austin in 1992. He
then became a Research Associate and co-founder of the Center for Vision and Image Sciences at the University of Texas at Austin. In
1997 he joined the University of Illinois at Chicago where he is currently an Assistant Professor in the Department of Computer Science.
Dr. Super’s current research interests in computer vision and visual perception include perceptual organization, shape matching, object
recognition, and multimedia retrieval.
About the Author—HAO LU received an M.S. degree in Electrical Engineering and Computer Science from the University of Illinois at
Chicago in 2000 and an M.S. degree in Rhetorical and Technical Communication from Michigan Technological University in 1999. In 2000
he joined Tellabs in Lisle, Illinois where he is currently a software engineer in the Optical Networking Group Department. Hao Lu’s current
work is focused on SONET and other broadband technologies.