Automatic Identification of Perceptually Important Regions in an Image

Wilfried Osberger
Space Centre for Satellite Navigation
Queensland University of Technology
GPO Box 2434, Brisbane 4001 Australia
[email protected]

Anthony J. Maeder
School of Engineering
University of Ballarat
PO Box 663, Ballarat 3353 Australia
[email protected]

Abstract

We present a method for automatically determining the perceptual importance of different regions in an image. The algorithm is based on human visual attention and eye movement characteristics. Several features known to influence human visual attention are evaluated for each region of a segmented image to produce an importance value for each factor and region. These are combined to produce an Importance Map, which classifies each region of the image in relation to its perceptual importance. Results indicate that the calculated Importance Maps correlate well with human perception of visually important regions. Importance Maps can be used in a variety of applications, including compression, machine vision, and image databases. Our technique is computationally efficient and flexible, and can easily be extended to specific applications.

1. Introduction

A challenging problem in image processing is the emulation of human vision tasks. The ease with which humans perform complex visual tasks often masks the tremendous computation involved in performing these functions. Despite the inherent difficulty of automatically emulating human visual processing, the benefits of being able to do so have led to continued widespread research in the area.

In many image processing applications (e.g. region-based compression), it is desirable to perform an image segmentation in a manner analogous to humans. Following this segmentation, it is useful to be able to identify which regions a human observer considers to be of the highest perceptual importance (i.e. the areas likely to attract our attention). Studies of visual attention and eye movements [3, 9, 11] have shown that humans generally attend to only a few areas in an image. Even when given unlimited viewing time, subjects will continue to focus on these few areas rather than scan the whole image. These areas are often highly correlated amongst different subjects when viewed in the same context.

In order to automatically determine the parts of an image that a human is likely to attend to, we need to understand the operation of human visual attention and eye movements. Research into the attentional mechanisms of the Human Visual System (HVS) has revealed several low level and high level factors which influence attention and eye movements.
In this paper, we use these factors to develop an Importance Map (IM) [5]. The IM predicts, for each region in
the image, its perceptual importance. To obtain the IM, a
segmentation is first performed. Each region in the image
is then assessed with regard to a number of different low
and high level factors. These factors are combined to produce the overall IM for the image. Such a map is useful
for a wide range of image processing applications, including image compression and machine vision. The method is
flexible and is amenable to extension to image sequences.
2. Detection of important regions by the Human Visual System
In order to efficiently deal with the masses of information present in the surrounding environment, the HVS operates using variable resolution. Although our field of view is
around 180 degrees horizontally and 140 degrees vertically,
we only possess a high degree of visual acuity over a very
small area (around 2 degrees in diameter) called the fovea.
Thus in order to accurately inspect the various objects in
our environment, eye movements are required. Rapid shifts
in the eye's point of gaze (saccades) occur every 100–500 milliseconds, and visual attention mechanisms are used to control these saccades. Our pre-attentive vision [7] operates in parallel, scanning the periphery for important or uncertain areas for the eye to foveate at the next saccade. A very strong relationship therefore exists between eye movements and attention.
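The figures above fix how little of a scene is seen sharply at any instant. As a rough worked example (the viewing distance, screen width, and resolution below are hypothetical, and the small-angle geometry is an approximation):

```python
import math

def foveal_region_px(view_dist_cm: float, screen_width_cm: float,
                     screen_width_px: int, fovea_deg: float = 2.0) -> float:
    """Approximate diameter, in pixels, of the ~2 degree foveal region
    for a viewer at the given distance from the screen."""
    # Physical size subtended by the fovea at this viewing distance.
    fovea_cm = 2.0 * view_dist_cm * math.tan(math.radians(fovea_deg / 2.0))
    px_per_cm = screen_width_px / screen_width_cm
    return fovea_cm * px_per_cm

# Hypothetical setup: viewer 60 cm from a 40 cm wide, 1024-pixel-wide display.
diam = foveal_region_px(60.0, 40.0, 1024)
```

Under these assumed conditions only a patch some tens of pixels across is seen at full acuity, which is why eye movements are essential for inspecting an image.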
2.1. Eye movement correlation between subjects
If there were not a strong correlation between the directions of gaze of different people, eye movements would be impossible to predict, and it would be difficult to make general use of eye movement information. However, studies of human eye movement patterns for both images and video indicate that eye movements are indeed highly correlated amongst subjects. The original work of Yarbus [11]
showed that a strong correlation between viewer eye movements exists, as long as the subjects were viewing the image
in the same context (i.e. with the same instructions and motivation). Yarbus also demonstrated that even if given unlimited viewing time, we will not scan all areas of a scene,
but will instead attend to a handful of important regions
which continually attract our attention. Similar results have
been found for video by Stelmach [10]. This suggests that
eye movements are not idiosyncratic, and that a strong relationship exists between the direction of gaze of different
subjects viewing an image in the same context.
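Between-subject agreement of this kind is commonly quantified by correlating smoothed fixation maps from different viewers. The sketch below illustrates the idea with hypothetical fixation lists; the grid size and the use of Pearson correlation are illustrative choices, not taken from the studies cited:

```python
import math

def fixation_map(fixations, grid=(8, 8)):
    """Histogram normalised fixation coordinates (x, y in [0, 1]) onto a grid."""
    gy, gx = grid
    m = [[0.0] * gx for _ in range(gy)]
    for x, y in fixations:
        i = min(int(y * gy), gy - 1)
        j = min(int(x * gx), gx - 1)
        m[i][j] += 1.0
    return m

def correlation(a, b):
    """Pearson correlation between two equally sized 2-D maps."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    n = len(fa)
    ma, mb = sum(fa) / n, sum(fb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in fa))
    sb = math.sqrt(sum((y - mb) ** 2 for y in fb))
    return cov / (sa * sb)

# Two hypothetical subjects fixating the same two image areas.
subj1 = [(0.30, 0.30), (0.32, 0.28), (0.70, 0.60), (0.71, 0.62)]
subj2 = [(0.29, 0.31), (0.33, 0.30), (0.69, 0.61), (0.72, 0.60)]
r = correlation(fixation_map(subj1), fixation_map(subj2))
```

Highly correlated viewers produce near-identical maps and a correlation close to 1.0.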
2.2. Factors which influence eye movements and attention

In order to automatically determine the importance of the different regions in an image, we need to determine the factors which influence human visual attention. Our attention is controlled by both high and low level factors. High level factors generally involve some feedback process from memory and may involve template matching. Low level processes are generally fast, feed-forward mechanisms involving relatively simple processing. A general observation is that objects which stand out from their surrounds are likely to attract our attention, since one of the main goals of the HVS is to minimise uncertainty. This is also in agreement with Gestalt organisation theories.

Low level factors which have been found to influence visual attention include:

Contrast. The HVS converts luminance into contrast at an early stage of processing. Region contrast is therefore a very strong low-level visual attractor. Regions which have a high contrast with their surrounds attract our attention and are likely to be of greater visual importance [3, 9, 11].

Size. Findlay [3] has shown that region size also has an important effect in attracting attention. Larger regions are more likely to attract our attention than smaller ones. However a saturation point exists, after which the importance due to size levels off.

Shape. Regions whose shape is long and thin (edge-like) have been found to be visual attractors [4, 9]. They are more likely to attract attention than rounder regions of the same area and contrast.

Colour. Colour has been found to be important in attracting attention [1, 7]. Some particular colours (e.g. red) have been shown to attract our attention, and a strong influence occurs when the colour of a region is distinct from the colour of its background.

Motion. Motion has been found to be one of the strongest influences on visual attention [7]. Our peripheral vision is highly tuned to detecting changes in motion, and our attention is involuntarily drawn to peripheral areas undergoing motion distinct from their surrounds.

Other low level factors which have been found to influence attention include brightness, orientation, and line ends.

Several high level factors have also been determined:

Location. Eye-tracking experiments have shown that viewers' eyes are directed at the centre 25% of a screen for a majority of viewing material [2].

Foreground / Background. Viewers are more likely to be attracted to objects in the foreground than those in the background [1].

People. Many studies [4, 9, 11] have shown that we are drawn to focus on people in a scene, in particular their faces, eyes, mouths, and hands.

Context. Viewers' eye movements can be changed dramatically, depending on the instructions they are given whilst observing the image [4, 11].

2.3. Considerations for modeling visual attention

Although many factors which influence visual attention have been identified, little quantitative data exists regarding the exact weighting of the different factors and their interrelationship. Some factors are clearly of very high importance (e.g. motion), but it is difficult to determine exactly how much more important one factor is than another. A particular factor may be more important than another factor in one image, while in another image the opposite may be true. Due to this lack of information, it is necessary to consider a large number of factors when modeling visual attention [6, 7, 12]. This caters for the case when not all of the factors are used all of the time. It is also desirable that the factors used be independent, so that a particular type of factor does not exert undue influence on the overall importance.

High level factors such as context can be very useful in determining a region's importance. In situations where a template of a target is known a priori, viewer eye movements can be modeled with high accuracy. However, in the general case, little is known about the context of viewing and about the content of the scene, so such high level information cannot be used.

3. Algorithm for Importance Map calculation

The basic operation of our automatic region importance classifier can be seen in Figure 1. The original grey-scale image is input and is segmented into homogeneous regions. We have used a recursive split and merge technique to perform the segmentation; it gives satisfactory results and is computationally inexpensive. Our split and merge technique uses the local region variance as the splitting and merging criterion. We have found a variance threshold of 250 to work well for most images. Regions smaller than 16 pixels are merged with their most similar neighbour, to avoid excessively small regions.

[Figure 1: block diagram — the original image is segmented, the segmentation feeds five parallel factor calculations (contrast, size, shape, location, background), and the factors are combined to produce the Importance Map.]
Figure 1. Block diagram for Importance Map calculation.

The segmented image is then analysed by a number of different factors known to influence attention, and an importance is assigned to each region for each factor. We have chosen 5 different importance factors in this algorithm. However, the flexible structure of our approach allows additional factors to easily be incorporated (e.g. if information was known about the image or context a priori). The different factors that we have chosen are:

Contrast of region with background. The contrast importance of a region R_j is calculated as:

    I_contrast(R_j) = | gl(R_j) - gl_n(R_j) |                          (1)

where gl(R_j) is the mean grey level of region R_j, and gl_n(R_j) is the mean grey level of all of the neighbouring regions of R_j. Subtraction is used rather than division, since it is assumed that the grey levels are a perceptually linear space. I_contrast is scaled to the range [0, 1].

Size of region. Importance due to region size is calculated as:

    I_size(R_j) = min( A(R_j) / A_max , 1.0 )                          (2)

where A(R_j) is the area of region R_j in pixels, and A_max is a constant used to prevent excessive weighting being given to very large regions. We have set A_max equal to 1% of the total image area.

Shape of region. Importance due to region shape is calculated as:

    I_shape(R_j) = bp(R_j)^γ / A(R_j)                                  (3)

where bp(R_j) is the number of pixels in region R_j which border with other regions, and γ is a constant. We found a value of γ of 1.75 to provide a good discrimination of shapes. Thus long, thin regions will have a high shape importance, while for rounder regions it will be lower. The final shape importance is normalised to fit in the range [0, 1].

Location of region. Importance due to the location of a region is calculated as:

    I_location(R_j) = c(R_j) / A(R_j)                                  (4)

where c(R_j) is the number of pixels in region R_j which are also in the centre 25% of the image. Thus regions contained entirely within the central quarter of an image will have a location importance of 1.0, and regions with no central pixels will have a location importance of 0.0.

Foreground / Background region. We detect background regions by determining the proportion of the total image border that is contained in each region. Regions with a high number of image border pixels will be classified as belonging to the background and will have a low background/foreground importance, as given by:

    I_bg(R_j) = 1 - b(R_j) / tb                                        (5)

where b(R_j) is the number of pixels in region R_j which also border on the image, and tb is the total number of image border pixels.

By this stage we have assigned, for each region in the image, an importance rating for each of the 5 factors. Each rating has been normalised to be in the range [0, 1]. We now need to combine these 5 factors for each region to produce an overall IM for the image. As mentioned in Section 2.3, there is little quantitative data which indicates the relative importance of these different factors, and this relation is likely to change from one image to the next. We therefore choose to treat each factor as being of equal importance. However, if it were known that a particular factor was of higher importance, a scale factor could easily be incorporated.

We would also like to assign a higher importance to areas which rank very strongly in some factors; a simple averaging of the importance factors would not provide this. We have therefore chosen to square and sum the factors to produce the final IM:

    IM(R_j) = Σ_f [ I_f(R_j) ]^2                                       (6)

where f sums through the 5 importance factors. The final IM is produced by scaling the result so that the region of highest importance has an importance value of 1.0.

4. Results

We have tested our technique on a wide variety of scene types, and the results have been promising. For simple scenes, our method typically identifies the most salient regions in an image. Results for more complex scenes have also been good, provided accurate segmentation of the image was possible.

Figure 2 demonstrates the processes involved in obtaining the IM for the relatively simple Miss America image. As can be seen in Figure 2(b), the split and merge segmentation slightly oversegments the image, with 147 separate regions being produced. The outputs from the 5 importance factors are shown in Figures 2(c)–(g). These factors are combined to produce the final IM in Figure 2(h). This map gives a good estimate of the subjectively important areas in the picture. It can however be seen that the fine segmentation has resulted in an IM which is blotchy around the face. This problem could be solved by a more accurate segmentation, or by a post-smoothing operation.

IMs for more complex scenes have also been generated. Figure 3 shows the IM for the Lighthouse image. This scene has a perceptually dominant lighthouse which attracts our attention, but also contains several areas of secondary perceptual importance such as the houses, the horizon, and the crashing wave. These areas of secondary visual importance have been accurately identified by our IM.

5. Discussion

This paper has presented a novel method for identifying perceptually important regions in an image. The technique is computationally efficient and has been designed in a flexible manner to easily accommodate modifications and application-specific requirements. As demonstrated in Section 4, the results have been promising for both simple and complex images. However, many areas for improvement still remain. The results obtained by our technique are limited by the success of the segmentation, so we are currently investigating improved segmentation methods. Our method currently uses only grey-scale images as input, so inclusion of a colour importance factor is also desirable. We have investigated extending the method to image sequences by including a motion importance factor, and have recently adapted this technique to control an MPEG encoder [8]. In specific applications, a priori information may be known regarding the content of the images or context of viewing, so additional high level importance factors may easily be included (e.g. face detection in a videophone application).

IMs can also be used in other areas such as machine vision, human perception research, and image databases. Any application where it is desirable to focus on the perceptually relevant areas in an image (or sequence of images) and to discard areas of lower perceptual importance could readily benefit from the use of IMs.
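The five importance factors of Section 3 and their squared-sum combination can be sketched end to end on a toy labelled segmentation. This is a minimal illustration rather than the exact implementation: the 4-connected neighbour definition, the per-image scaling of the contrast and shape factors, and the use of 1% of the image area for A_max are assumptions filled in where the extraction leaves details unclear.

```python
from collections import defaultdict

GAMMA = 1.75           # shape exponent given in the paper
A_MAX_FRACTION = 0.01  # assumed: A_max as a fraction of total image area

def importance_map(labels, grey):
    """Per-region Importance Map from a region label map and grey image.
    labels, grey: equal-sized 2-D lists; labels[i][j] is a region id."""
    h, w = len(labels), len(labels[0])
    area = defaultdict(int)
    grey_sum = defaultdict(float)
    border_px = defaultdict(int)        # pixels bordering another region
    image_border_px = defaultdict(int)  # pixels on the image border
    centre_px = defaultdict(int)        # pixels in the central quarter
    neighbours = defaultdict(set)
    for i in range(h):
        for j in range(w):
            r = labels[i][j]
            area[r] += 1
            grey_sum[r] += grey[i][j]
            if i in (0, h - 1) or j in (0, w - 1):
                image_border_px[r] += 1
            if h // 4 <= i < 3 * h // 4 and w // 4 <= j < 3 * w // 4:
                centre_px[r] += 1
            on_region_border = False
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and labels[ni][nj] != r:
                    on_region_border = True
                    neighbours[r].add(labels[ni][nj])
            if on_region_border:
                border_px[r] += 1
    mean = {r: grey_sum[r] / area[r] for r in area}
    a_max = A_MAX_FRACTION * h * w
    total_border = 2 * h + 2 * w - 4
    factors = {}
    for r in area:
        nb = neighbours[r]
        nb_mean = sum(mean[n] for n in nb) / len(nb) if nb else mean[r]
        contrast = abs(mean[r] - nb_mean)                   # eq. (1), unscaled
        size = min(area[r] / a_max, 1.0)                    # eq. (2)
        shape = border_px[r] ** GAMMA / area[r]             # eq. (3), unscaled
        location = centre_px[r] / area[r]                   # eq. (4)
        background = 1.0 - image_border_px[r] / total_border  # eq. (5)
        factors[r] = [contrast, size, shape, location, background]
    # Scale the contrast and shape factors to [0, 1] across regions.
    for k in (0, 2):
        peak = max(f[k] for f in factors.values()) or 1.0
        for f in factors.values():
            f[k] /= peak
    im = {r: sum(v * v for v in f) for r, f in factors.items()}  # eq. (6)
    peak = max(im.values())
    return {r: v / peak for r, v in im.items()}  # top region scaled to 1.0

# Toy 8x8 scene: a bright central square (region 1) on a dark background.
labels = [[0] * 8 for _ in range(8)]
grey = [[20] * 8 for _ in range(8)]
for i in range(3, 5):
    for j in range(3, 5):
        labels[i][j] = 1
        grey[i][j] = 200
im = importance_map(labels, grey)
```

On this toy input the central bright region receives the maximum importance of 1.0, as expected: it is high-contrast, central, and entirely off the image border.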
[Figure 2: eight panels, (a)–(h).]
Figure 2. Importance Map for the Miss America image. Intermediate importance factor calculations are also shown. (a) original image, (b) segmented image, (c) contrast factor, (d) size factor, (e) shape factor, (f) location factor, (g) background factor, and (h) final Importance Map produced by summing (c)–(g). For (c)–(h), lighter regions represent higher importance.
[Figure 3: two panels, (a)–(b).]
Figure 3. Importance Map for the Lighthouse image. (a) original image, and (b) final Importance Map, with lighter regions representing higher importance.
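The recursive split-and-merge segmentation used in Section 3 can be sketched as follows. The paper specifies only the variance criterion (threshold 250) and the 16-pixel minimum region size; the quadtree splitting scheme and the closest-mean merge rule used here are plausible assumptions, not the exact implementation.

```python
def segment(img, thresh=250.0, min_size=16):
    """Quadtree split on grey-level variance, then merge regions smaller
    than min_size into their most similar (closest mean grey) neighbour.
    Returns a label map the same size as img (a 2-D list of grey levels)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    next_id = [0]

    def var(y0, x0, y1, x1):
        vals = [img[i][j] for i in range(y0, y1) for j in range(x0, x1)]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)

    def split(y0, x0, y1, x1):
        # Homogeneous (or single-pixel) blocks become regions; others split.
        if var(y0, x0, y1, x1) <= thresh or (y1 - y0 <= 1 and x1 - x0 <= 1):
            rid = next_id[0]
            next_id[0] += 1
            for i in range(y0, y1):
                for j in range(x0, x1):
                    labels[i][j] = rid
            return
        ym, xm = (y0 + y1 + 1) // 2, (x0 + x1 + 1) // 2
        for a, b, c, d in ((y0, x0, ym, xm), (y0, xm, ym, x1),
                           (ym, x0, y1, xm), (ym, xm, y1, x1)):
            if c > a and d > b:
                split(a, b, c, d)

    split(0, 0, h, w)

    def stats():
        # Per-region area, grey-level total, and 4-connected neighbour sets.
        area, total, nbrs = {}, {}, {}
        for i in range(h):
            for j in range(w):
                r = labels[i][j]
                area[r] = area.get(r, 0) + 1
                total[r] = total.get(r, 0.0) + img[i][j]
                for di, dj in ((0, 1), (1, 0)):
                    ni, nj = i + di, j + dj
                    if ni < h and nj < w and labels[ni][nj] != r:
                        nbrs.setdefault(r, set()).add(labels[ni][nj])
                        nbrs.setdefault(labels[ni][nj], set()).add(r)
        return area, total, nbrs

    # Merge pass: fold undersized regions into their most similar neighbour.
    while True:
        area, total, nbrs = stats()
        small = [r for r in area if area[r] < min_size and nbrs.get(r)]
        if not small:
            break
        r = min(small, key=area.get)
        mean = lambda q: total[q] / area[q]
        target = min(nbrs[r], key=lambda q: abs(mean(q) - mean(r)))
        for i in range(h):
            for j in range(w):
                if labels[i][j] == r:
                    labels[i][j] = target
    return labels
```

On an image made of two uniform halves, the high top-level variance forces one quadtree split, after which each quadrant is homogeneous and survives the merge pass, yielding four regions that cleanly separate the two halves.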
References

[1] B. L. Cole and P. K. Hughes. Drivers don't search: they just notice. In D. Brogan, editor, Visual Search, pages 407–417. Taylor and Francis, 1990.
[2] G. Elias, G. Sherwin, and J. Wise. Eye movements while viewing NTSC format television. SMPTE Psychophysics Subcommittee white paper, Mar 1984.
[3] J. Findlay. The visual stimulus for saccadic eye movement in human observers. Perception, 9:7–21, Sept. 1980.
[4] A. Gale. Human response to visual stimuli. In W. Hendee and P. Wells, editors, The Perception of Visual Information, pages 127–147. Springer-Verlag, 1997.
[5] A. Maeder, J. Diederich, and E. Niebur. Limiting human perception for image sequences. In Proceedings SPIE 2657, pages 330–337, San Jose, Feb 1996.
[6] X. Marichal, T. Delmot, V. De Vleeschouwer, and B. Macq. Automatic detection of interest areas of an image or a sequence of images. In ICIP, pages 371–374, Lausanne, Switzerland, Sep 1996.
[7] E. Niebur and C. Koch. Computational architectures for attention. In R. Parasuraman, editor, The Attentive Brain. MIT Press, 1997.
[8] W. Osberger, A. Maeder, and N. Bergmann. An MPEG encoder incorporating perceptually based quantisation. In Proceedings SPIE 3299, San Jose, Jan 1998.
[9] J. Senders. Distribution of attention in static and dynamic scenes. In Proceedings SPIE 3016, pages 186–194, San Jose, Feb 1997.
[10] L. Stelmach, W. Tam, and P. Hearty. Static and dynamic spatial resolution in image coding: An investigation of eye movements. In Proceedings SPIE 1453, pages 147–152, San Jose, Feb 1992.
[11] A. Yarbus. Eye Movements and Vision. Plenum Press, New York NY, 1967.
[12] J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka, and Y. Matsushita. An outstandingness oriented image segmentation and its application. In ISSPA, pages 45–48, Gold Coast, Australia, Aug 1996.