2012 IEEE International Conference on Multimedia and Expo

Salient Object Detection through Over-Segmentation

Xuejie Zhang, Zhixiang Ren, Deepu Rajan
School of Computer Engineering
Nanyang Technological University, Singapore
{zhangxuejie,renz0002,asdrajan}@ntu.edu.sg

Yiqun Hu
School of Computer Science & Software Engineering
University of Western Australia, Perth, Australia
[email protected]
Abstract—In this paper we present a salient object detection
model based on an over-segmented image. The input image is
initially segmented by the mean-shift segmentation algorithm
and then over-segmented using a quad mesh into even smaller
segments. Such segmented regions overcome the disadvantage
of using patches or single pixels to compute saliency. Segments
that are similar and spread over the image receive low saliency
and a segment which is distinct in the whole image or in
a local region receives high saliency. We express this as a
color compactness measure which is used to derive the saliency
level directly. Our method is shown to outperform six existing
methods in the literature using a saliency detection database
containing images with human-labeled object contour ground
truth. The proposed saliency model has been shown to be useful
for an image retargeting application.
Keywords-Saliency detection, image segmentation, image retargeting
I. INTRODUCTION
Saliency detection is the process of detecting interesting
visual information in an image. It is used in applications
such as image and video retargeting, object detection and
recognition [1] and video analysis [2].
One of the most popular saliency models is based on
a center-surround mechanism inspired by biological processes in the mammalian vision system [3]. In this model,
the center-surround mechanism is realized through filtering
feature maps of the input image by Difference-of-Gaussian
(DoG) filters. The DoG response maps from different scales
are combined to form a saliency map. Other saliency models analyze visual information in the frequency domain.
By observing the properties of DoG filters in the frequency domain, Achanta et al. proposed the frequency-tuned
saliency model [4], which simplifies the effect of a group of
DoG filters as the difference between a large and a small
Gaussian kernel. The saliency map is generated through
filtering the image by these two kernels only. Another
frequency domain method is the spectral residual method
proposed by Hou and Zhang [5]. This method produces
saliency by obtaining the spectral residual of the image
and removing redundant or non-salient information. Another
approach for saliency detection is based on the theory
of sparse coding. The incremental coding length method [6]
belongs to this category; it models visual information as
sparse basis functions and detects salient feature channels.
A saliency map is produced from the responses to the salient
feature channels. Gopalakrishnan et al. presented a method
to derive saliency through modeling the distributions of color
and orientation [7]. Goferman et al. proposed a context-aware saliency model by comparing each image patch with
other image patches and considering the surrounding context
information [8].
In this paper we propose a simple but effective saliency
detection model through over-segmenting an image and
analyzing the color compactness in the image. The proposed
model detects salient objects with accurate contours, which
is not possible for many existing models. Many saliency
detection models such as [6], [8] utilize image patches as the
processing unit for saliency analysis. Image patches suffer
from the curse of dimensionality. Moreover, patches with a
complex distribution of colors appear more salient or different
from other patches. In our method the input image is over-segmented into small segments with perceptually uniform
color properties. The small segments are the basic processing
units whose color and spatial positions are used for saliency
detection. A color compactness measure is also proposed.
The relationships between the small segments in both spatial and color domains are used for calculating the color
compactness of a region. We demonstrate the effectiveness
of the proposed saliency model on a benchmark data set
and also by applying it for image retargeting with improved
results.
The rest of this paper is organized as follows. Section
II presents our model, including the over-segmentation and
color compactness measure. Section III presents a benchmarking experiment and compares our model with other
saliency models. An image retargeting application is also
presented in this section. Section IV concludes this paper.
II. THE PROPOSED METHOD
We present a simple and efficient method to identify
salient objects in an image in a bottom-up manner. The
first step is to obtain an over-segmented image so that
each segment is small and has uniform color. This helps
to improve saliency detection accuracy when compared to
patch-based techniques which are not suitable for describing
Figure 1: Left: Original image; Top right: Visualization of
mean-shift segmentation of a zoomed region; Bottom right:
Visualization of over-segmentation using a mesh grid.
region appearances. Specifically, these methods produce
high saliency on patches falling on object boundaries since
they are more different from surrounding patches in the high-dimensional patch space [8]. Although multi-scale approaches
have been proposed to alleviate this problem, we show in
the experimental results that such methods, even with the
additional contextual information, are not sufficient to extract
the salient object completely. Following over-segmentation,
each segmented region is compared to all other regions
through a compactness measure that is a function of color
similarity and spatial distance. Indeed, this simple formulation of the compactness measure serves as the measure of
saliency of a segment.
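To make the overall pipeline concrete, the following sketch ties the two steps together. It uses helper functions sketched later in this section (over_segment in Sec. II-A; segment_stats and compactness in Sec. II-B). All names are ours, the input is assumed to be a CIELUV image plus an integer label map from any mean-shift implementation, and this is an illustration of the described pipeline under those assumptions, not the authors' code.

```python
import numpy as np

def detect_saliency(luv, labels):
    """luv: (H, W, 3) CIELUV image; labels: (H, W) integer label map
    from an initial mean-shift segmentation (assumed given)."""
    segs = over_segment(labels, quad=20)        # Sec. II-A sketch
    colors, centers = segment_stats(luv, segs)  # mean color / centroid
    comp = compactness(colors, centers)         # Eq. (1), Sec. II-B
    saliency = comp[segs]                       # each pixel inherits its segment's value
    # Rescale to [0, 1] for display / thresholding.
    return (saliency - saliency.min()) / (np.ptp(saliency) + 1e-12)
```

An optional blur (the paper uses an 8×8 Gaussian kernel with standard deviation 3) can then be applied purely for visual smoothness.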
Figure 2: (a) Input image, (b) Initial mean-shift segmentation, (c) Visualization of initial segmentation labels and (d)
Visualization of segmentation labels after over-segmentation.
A. Image Over-Segmentation
An initial segmentation of the image is done through
the popular mean-shift segmentation [9]. Our objective is
to model the compactness of color information. With the
initial segmentation, the segments vary widely in size and
are of irregular shapes. In these kinds of segments, it is
not meaningful to compute the compactness of color. On
the other hand, considering each pixel individually leads to
computationally expensive algorithms. Hence, we take the
middle ground of over-segmentation which results in small
segments that have uniform chromatic properties while not
suffering from the curse of dimensionality.
Over-segmentation is achieved by overlaying a quad mesh
on the mean-shift segmented image and splitting each segment into smaller segments. We use a quad size of 20×20
pixels. If a quad contains only one label as determined by the
initial segmentation, it is considered as one segment. Even if
a neighboring quad has the same label, it is treated as a
separate segment. This is illustrated in fig. 1, which shows
an eagle image on the left and a visualization of the mean-shift segmentation of a zoomed region on the top right, where
each color represents a region label. The over-segmented
visualization on the bottom right shows that the sky region
is split into several segments although they all share the same
label. If a quad contains more than one label, then each
label is considered as a separate segment. This is shown in the over-segmentation of the wing regions, where each quad may contain
segments from the wing as well as from the background sky. Fig.
2 shows the result of over-segmentation for the entire image:
fig. 2(a) and (b) show the original image and the
mean-shift segmented image, respectively, while fig. 2(c) and
(d) show a visualization of the initial segments and the over-segmented regions, respectively.
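As a minimal sketch of the quad-mesh splitting just described: the 20-pixel quad size comes from the text, the function name is ours, and the integer label map is assumed to come from the initial mean-shift segmentation.

```python
import numpy as np

def over_segment(labels, quad=20):
    """Split an initial segmentation into small segments.

    labels : (H, W) integer label map from the initial (mean-shift)
             segmentation.
    Inside every quad of the mesh, each distinct initial label becomes
    its own new segment, so even a large uniform region is split into
    roughly quad x quad pieces while region boundaries are preserved.
    """
    h, w = labels.shape
    segments = np.zeros((h, w), dtype=np.int64)
    next_id = 0
    for y in range(0, h, quad):
        for x in range(0, w, quad):
            block = labels[y:y + quad, x:x + quad]
            view = segments[y:y + quad, x:x + quad]
            for lab in np.unique(block):
                view[block == lab] = next_id  # one segment per (quad, label)
                next_id += 1
    return segments
```

Each new segment id corresponds to one (quad, initial label) pair, matching the behavior above: a quad with one label yields one segment, and a quad straddling a boundary yields one segment per label.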
B. Compactness Measure

We define saliency as the compactness of color information in a particular segment. Since the segments themselves
are small, comparing the color compactness of a segment
with that of the other segments gives saliency information at a finer resolution.
The two hypotheses that guide our formulation of color
compactness in a segment are the following:

1) When two segments have the same or similar color,
the nearer they are to each other, the higher is the
compactness of this color.
2) Given two segments at a certain distance from each
other, the more similar their colors are, the lower is
the compactness of this color.
Hypothesis 1 is illustrated in fig. 3, where the spread of
the red color implies the color is not compact in the left
image but is compact in the right image, leading to higher
saliency of the color. Hypothesis 2 is illustrated in fig. 4 in
which dissimilar colors of segments separated by the same
distance indicate higher saliency. Thus, in the left image, the
red color has less saliency compared to the right image.
The compactness of color information in an image can be
measured using the relationship between each segment and
the rest of the segments in the image. We express these two
hypotheses as a compactness measure in the following form:
$$\mathrm{COMP}(i) = \frac{1}{N} \sum_{j \neq i} \frac{\varepsilon_1 + C(i, j)}{\varepsilon_2 + P(i, j)} \qquad (1)$$

where C(i, j) is the Euclidean distance between the mean
colors of segments i and j in CIELUV color space, and P(i, j) is
the distance between the centroids of segments i and j in the
image. Both C and P are normalized to the range [0, 1], and N
is the number of segments j considered. The reason for the small constant
ε₁ (taken as 0.1) is that, without ε₁, if two segments
have the same color, i.e., C(i, j) = 0, the compactness
measure would be independent of how far apart the segments
are, which is contrary to the first hypothesis. ε₂ in the
denominator serves to avoid dividing by 0 when P(i, j) = 0.
Of course, this is a rare case in which the centroid of a segment
coincides with that of a neighboring segment, which can happen if one of
them is concave. In order to ignore segments whose color
is very different from the segment under consideration, we
choose a subset of segments over which to compute C(i, j), instead of
all the segments in the image. Thus, for a segment s we consider only those
segments i such that C(s, i) < mean(C(s, j)), j ≠ s. This
implies that we use a flexible color distance (C) threshold
to select the segments over which the compactness is to
be computed instead of using a fixed number. Note that the
spatial distance (P) between a segment s and other segments
is not used as the threshold for segment selection since
we want the color compactness over the entire image.

After deriving the color compactness for each segment,
each pixel in the image is assigned a saliency level equal to
the color compactness of the segment to which it belongs.
We smooth this initial saliency map with a Gaussian kernel
of size 8 × 8 and standard deviation of 3. This is only to
provide a better visual experience of the saliency map by filtering out the over-segmented boundaries; it is optional and
does not affect the final results significantly. Fig. 5(a) shows
the saliency map of the eagle image in fig. 2. Compared to
the result of [8] in fig. 5(b), our method produces a consistently
high saliency level for the salient object in the image, and the
boundary of the eagle is also well preserved. The drawback
of [8] is that it only produces higher saliency for regions
with very high contrast boundaries, such as the boundary of
the eagle’s head and tail, and leaves hollow space with low
saliency inside these boundaries.
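A sketch of Eq. (1) with the flexible color threshold follows. The value ε₁ = 0.1 is from the text; the paper does not state a value for ε₂, so it is set to 0.1 here purely as an assumption, and normalizing C and P by their global maxima is likewise one plausible reading of "normalized to [0, 1]". Function names are ours.

```python
import numpy as np

def segment_stats(luv, segments):
    """Mean CIELUV color and centroid for each segment id 0..N-1."""
    n = int(segments.max()) + 1
    ys, xs = np.indices(segments.shape)
    colors = np.zeros((n, 3))
    centers = np.zeros((n, 2))
    for s in range(n):
        mask = segments == s
        colors[s] = luv[mask].mean(axis=0)
        centers[s] = ys[mask].mean(), xs[mask].mean()
    return colors, centers

def compactness(colors, centers, eps1=0.1, eps2=0.1):
    """Color compactness of Eq. (1); eps2 is an assumed value."""
    n = len(colors)
    # Pairwise color (C) and spatial (P) distances, scaled to [0, 1].
    C = np.linalg.norm(colors[:, None] - colors[None], axis=2)
    P = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    C /= C.max() + 1e-12
    P /= P.max() + 1e-12
    comp = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i
        # Flexible threshold: keep only segments whose color distance
        # to segment i is below the mean color distance from i.
        near = others & (C[i] < C[i, others].mean())
        if not near.any():
            near = others
        comp[i] = np.mean((eps1 + C[i, near]) / (eps2 + P[i, near]))
    return comp
```

Each pixel then inherits the compactness of its segment, as in the pipeline sketch of Sec. II; high values mark colors that are both distinctive and spatially compact.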
Figure 3: Spread of color indicating its compactness.

Figure 4: The distances between the two segments in the left
and right images are the same. Red is more compact in the
right image.

Figure 5: The saliency map of the eagle image: (a) Our result
and (b) The result by [8].

C. Why Over-Segmentation
The reason for over-segmentation with a mesh grid
is to divide the image into small patches with uniform
color properties. It is natural to ask what happens if over-segmentation is not done and the compactness measure is
applied to the initially segmented image. The compactness
of a color might not be correctly derived if the segments
are large. This can be shown through a simple experiment.
Fig. 6(a) shows the ground truth salient region for the
original image shown in fig. 7(a). If the over-segmentation
step in the proposed method is replaced by a normal mean-shift segmentation, the result of saliency detection is as shown
in Fig. 6(b). We can see that in such a scene, where there is
a large background segment spreading across the image, it
is possible that the large segment gains a high saliency level
although it is not compact in the image. Fig. 6(c) is the
result of our method using a mesh grid for over-segmentation
and is clearly closer to human perception and to the ground
truth.
It is possible to set the mean-shift algorithm to an over-segmentation mode by varying its parameters. However,
we show that this type of over-segmentation will also fail.
Consider fig. 7(b), which shows the segmented regions when
the mean-shift algorithm parameters are set as follows:
spatial radius = 2, feature radius = 4, and minimum segment
size = 10. In this case, the entire background still forms
a single large segment that cannot be compared directly
with other segments for color compactness because its size
is much larger than that of the rest of the segments. Fig. 7(c)
shows our over-segmentation result using a mesh grid. This
is suitable for deriving the compactness of color since the
segments are all small. The saliency map obtained by using
over-segmentation by the mean-shift algorithm is shown in
fig. 6(d). Although there is a slightly lower saliency for the
background, it does not get completely eliminated as in the
proposed method shown in fig. 6(c).
Figure 6: (a) Ground truth salient region as labeled by a
human. Saliency map using (b) initial mean-shift segmentation only (the image border is included for illustration),
(c) the proposed over-segmentation method and (d) mean-shift
over-segmentation.

Figure 7: (a) Original image, (b) Mean-shift over-segmentation, and (c) Our over-segmentation.
III. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed salient
object detection method on a benchmark data set of 1000
images selected from the salient object database [10] whose
ground truth is manually labeled [4]. We compare our
performance with six other existing models that report their
results on the same database and whose code is made available. These are Itti’s saliency model (ITTI) [3], frequency
tuned saliency (FT) [4], the spectral residual model (SR)
[5], the incremental coding length method (ICL) [6], the
color and orientation modeling method (COD) [7], and the
context aware saliency detection model (CA) [8]. In the
COD method, only the distribution of color information is
used here for fair comparison with our model because our
model uses color information only. We also demonstrate the
effectiveness of the proposed method through an application
in image retargeting.
A. Saliency Detection
Fig. 8 shows the saliency maps of the six methods and
of our method. Since ITTI models the center-surround
contrast, it tends to follow strong contours, which give high
responses to DoG filters in different frequency bands. It
is not able to mark the whole object when the salient object
is large. The FT method tends to assign higher saliency to
regions with very uniform intensity, which produces poor
quality saliency maps for cases such as rows 1 and 2
in fig. 8, where the bowl and the black table receive high
saliency. The SR method does not generate satisfying results
when the image contains textured regions across the image,
as seen in rows 3, 4 and 7 in the figure. Both FT
and SR do not make use of color information, so they
may miss a salient colored object such as the brown
cross in row 3. The ICL method divides the image into
overlapping patches and detects salient feature channels.
Thus it produces higher saliency at positions with strong
contours and corners because these are rare image patches
in the image. Similar to ICL, the CA method detects only
corners and contours and may need further hole-filling to
mark the entire salient object. From the results of ICL and
CA, it can be seen that using image patches directly as
a processing unit may lead to higher saliency for contours
and corners, as seen in rows 3 and 9. The COD method
also models distributions of color, but since it is based
on patches, the boundaries of the salient objects are not
extracted faithfully; it produces reasonable results for
most images but is not satisfying for the brown cross image.
In summary, our method shows the most promising performance
in detecting salient objects with good contours.

Figure 8: Comparison of saliency maps obtained by different
methods: (a) Original image, (b) Itti’s method [3], (c) Frequency-tuned saliency [4], (d) Spectral residual [5], (e) Incremental
coding length [6], (f) Color and orientation modeling [7], (g)
Context-aware saliency [8], (h) Our method and (i) Ground
truth. Rows are numbered 1 to 9 from top to bottom.

We present a quantitative evaluation of our method and
compare it to the other six methods using ROC curves.
Each saliency map is split into a salient region and a
non-salient region using a threshold and compared with the
human-generated ground truth saliency map. Salient is
treated as positive and non-salient as negative. True positive rate (TPR) is the
ratio of the number of pixels correctly classified as salient
to the total number of salient pixels in the ground truth. False
positive rate (FPR) is the ratio of the number of pixels
wrongly classified as salient to the total number of non-salient
pixels in the ground truth. The ROC curve is a plot of TPR
versus FPR for different thresholds. A larger area under the ROC
curve indicates better agreement with the human-labeled ground
truth saliency maps. Fig. 9 shows the resulting average ROC
curves for 1000 images. The area under the curve of our
method is the largest among all the methods, indicating a
superior agreement with the human-labeled salient object
ground truth.
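For reference, a minimal sketch of this evaluation protocol under the standard pixel-level TPR/FPR definitions; the 256-level threshold sweep and the trapezoidal area estimate are our choices, as the paper reports only the averaged curves and relative areas.

```python
import numpy as np

def roc_points(saliency, gt, levels=256):
    """saliency: 2-D map scaled to [0, 1]; gt: 2-D boolean mask of the
    human-labeled salient object. Returns (FPR, TPR) arrays."""
    pos = max(int(gt.sum()), 1)        # salient pixels in ground truth
    neg = max(int((~gt).sum()), 1)     # non-salient pixels
    tpr, fpr = [], []
    for t in np.linspace(1.0, 0.0, levels):
        pred = saliency >= t           # binarize at this threshold
        tpr.append((pred & gt).sum() / pos)
        fpr.append((pred & ~gt).sum() / neg)
    return np.array(fpr), np.array(tpr)

# Per-image curves are averaged over the 1000 images; the area under
# the averaged curve can be estimated with np.trapz(tpr, fpr).
```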
Figure 9: Average ROC curves for different methods.

Failure cases. Fig. 10 shows two cases in which the
proposed method fails. In the first image, our method only
produces high saliency for the egg yolk regions because
the yellow color is more compact in this image. However,
the human subject marked the whole egg region in the
pan. For the second image, the human subject marked the
monk as the salient object. Our method only produces high
saliency for the red robe of the monk. It is difficult for
saliency detection methods to produce high saliency for the
monk’s face and body without including some kind of face
recognition method. Indeed, there are methods that include
face recognition as part of salient object detection, e.g., [11],
but such methods are not as simple as the proposed method.
However, our method performs very well for the face image
in row 8 of Fig. 8. The other methods listed in Fig. 8 also
produce results that are comparable with or even worse than
the result of our method for these two images.
Figure 10: Failure cases: (a) Original image, (b) Ground
truth salient object and (c) Our results.
B. An Image Retargeting Application

Saliency models can be used in many applications. Here
we demonstrate an image retargeting application combining
our saliency map with the seam carving method [12]. Image
retargeting is the process of adaptively resizing an image
to fit another display size while preserving the important
content of the image. The seam carving method removes
connected seams from the input image to reduce the image
size; the removed seam possesses the lowest gradient energy
among all possible seams. As our method can successfully
determine the salient region in the image, we substitute our
saliency map into the seam carving method in place of the
simple gradient map. Fig. 11 shows the results of retargeting
the input image to 75% of its width. Compared to the original
seam carving method, the saliency maps generated by our
method help to better preserve salient regions in the
image.
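A rough sketch of this substitution: one dynamic-programming seam removal in the style of [12], with the saliency map used directly as the energy term in place of the gradient map. This is our illustration, not the authors' implementation; a blend of saliency and gradient energy would be an equally plausible variant.

```python
import numpy as np

def remove_vertical_seam(image, energy):
    """Remove one 8-connected vertical seam of minimum total energy.
    image: (H, W, 3) array; energy: (H, W) array, here the saliency
    map (high saliency = expensive to remove) instead of gradients."""
    h, w = energy.shape
    cost = energy.astype(np.float64)
    # Dynamic programming: cost[y, x] = energy + cheapest parent above.
    for y in range(1, h):
        left = np.r_[np.inf, cost[y - 1, :-1]]
        right = np.r_[cost[y - 1, 1:], np.inf]
        cost[y] += np.minimum(np.minimum(left, cost[y - 1]), right)
    # Backtrack the cheapest seam from bottom to top.
    seam = np.zeros(h, dtype=np.int64)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    # Drop the seam pixel from every row.
    keep = np.ones((h, w), dtype=bool)
    keep[np.arange(h), seam] = False
    return image[keep].reshape(h, w - 1, 3), energy[keep].reshape(h, w - 1)
```

Retargeting to 75% width then amounts to repeating this for a quarter of the columns; seams route around high-saliency pixels, which is why the salient object is better preserved.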
Figure 11: Illustration of image retargeting using the proposed salient object detection method: (a) Original image,
(b) Proposed saliency map, Image retargeting by (c) Seam
carving [12] and (d) Seam carving using our saliency maps.

IV. CONCLUSIONS

We present a simple and effective method for salient
object detection in images. This method generates saliency
maps with full resolution and preserves object contours
accurately. Based on the experiments using a benchmark
image data set labeled with ground truth salient region, it
outperforms six existing saliency models. This model can be
used in many potential applications such as photo composition, image editing and video compression. We demonstrated
an image retargeting application using the saliency maps
produced by our model.
ACKNOWLEDGMENT
This research is supported by the Singapore National
Research Foundation under its Interactive & Digital Media (IDM) Public Sector R & D Funding Initiative and
administered by the IDM Programme Office (Grant No.
NRF2008IDM-IDM004-032).
REFERENCES
[1] D. Gao and N. Vasconcelos, “Integrated learning of saliency,
complex features, and object detectors from cluttered scenes,”
in CVPR, 2005.
[2] N. Jacobson and T. Q. Nguyen, “Video processing with scale-aware saliency: Application to frame rate up-conversion,” in
ICASSP, 2011.
[3] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based
visual attention for rapid scene analysis,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 20, no. 11,
pp. 1254–1259, 1998.
[4] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk,
“Frequency-tuned salient region detection,” in CVPR, 2009.
[5] X. Hou and L. Zhang, “Saliency detection: A spectral residual
approach,” in CVPR, June 2007.
[6] X. Hou and L. Zhang, “Dynamic visual attention: Searching
for coding length increments,” in NIPS, 2008, pp. 681–688.
[7] V. Gopalakrishnan, Y. Hu, and D. Rajan, “Salient region
detection by modeling distributions of color and orientation,”
IEEE Trans. on Multimedia, vol. 11, no. 5, pp. 892–905, 2009.
[8] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware
saliency detection,” in CVPR, 2010.
[9] D. Comaniciu and P. Meer, “Mean shift: A robust approach
toward feature space analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619,
2002.
[10] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum,
“Learning to detect a salient object,” in CVPR, 2007.
[11] M. Cerf, J. Harel, W. Einhauser, and C. Koch, “Predicting
human gaze using low-level saliency combined with face
detection,” in NIPS, 2007.
[12] S. Avidan and A. Shamir, “Seam carving for content-aware
image resizing,” in SIGGRAPH, 2007.