Proceedings of the 2006 IEEE International Conference on Robotics and Automation
Orlando, Florida - May 2006
Real Time, Online Detection of Abandoned Objects in
Public Areas
Nathaniel Bird, Stefan Atev, Nicolas Caramelli, Robert Martin, Osama Masoud, Nikolaos Papanikolopoulos
Department of Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
{bird, atev, caramel, martin, masoud, npapas}@cs.umn.edu
Abstract – This work presents a method for detecting abandoned
objects in real-world conditions. The method presented here
addresses the online and real time aspects of such systems, utilizes
logic to differentiate between abandoned objects and stationary
people, and is robust to temporary occlusion of potential abandoned
objects. The capacity to avoid detecting still people as abandoned
objects is a major aspect that differentiates this work from others
in the literature. Results are presented on 3 hours 36 minutes of
footage over four videos representing both sparsely and densely
populated real-world situations, which also sets this work apart
from others in the literature.
Index Terms – Automated surveillance, human activities
recognition
I. INTRODUCTION
A. Problem Definition
As color video cameras become more affordable, their use for
security-related monitoring is increasing. Unfortunately, it is
often not possible to have someone watching so many cameras all of
the time, and when someone is watching, their attention is often
divided among several cameras. This is unfortunate because one of
the purposes
of installing the cameras is to detect unusual events and
dangerous behaviors before anyone gets hurt. This paper
addresses this issue by presenting an algorithm for automated
detection of abandoned objects. Detecting such objects is
useful because bombs placed in abandoned luggage are a
prevalent and horrific terrorist weapon. The prospect of being
able to detect such weapons before they are used makes this
research valuable.
The problem addressed by this work is to locate
abandoned objects in a single color camera’s field of view as
quickly and reliably as possible. We define an abandoned
object to be a stationary object that has not been touching a
person (someone had to leave it) for some time threshold.
There are many requirements for such a system. Among these
are that the method must work online in real time, it must stay
active around the clock, it must not detect still people as
abandoned objects, and it must be able to detect abandoned
objects even if they are occluded by moving crowds of people
for periods of time. In addition, we wish to use color image
data because most security cameras these days are color, and
the additional information provided is very useful for vision
tasks. Thus far, no method has been introduced in the
literature that takes all of these requirements into
consideration.
Fig. 1 Example abandoned object left in a scene.
The requirements specify that video processing must be
real time and online. If this were not the case, detection might
occur too late to take proper action, and detecting an
abandoned object only after it may have exploded is not
useful. The algorithm presented here runs in real time on a
Pentium 4 system running Windows XP, and all results
presented were computed in real time.
B. Literature Review
There are several systems that have been presented that
can detect abandoned objects. Most of those in the literature use
only intensity-level images, such as the system presented by Sacchi
and Regazzoni [5] and the W4 system [4]. Lately, however, some have
moved on to use color images as well, such as the system presented
by Yang, et al. [7] and the system by Beynon, et al. [3]. The work
presented here utilizes color information as a useful cue for
background segmentation. All of these prior works present results
only on scenes with very few, non-occluding people.
It should be noted that the system presented by Beynon, et
al. [3] utilizes multiple cameras focused on the same scene,
which, while interesting, is not practical for many areas where
automated abandoned object detection may be desirable.
The system developed must be able to stay active around
the clock, thus requiring a dynamically learned background
model that can change as the natural lighting does throughout
the day. A variation of the background segmentation method
presented by Stauffer and Grimson [6] was used so that small
changes in the background can be learned. Special care is
required when using this method since abandoned objects and
still people should not be learned into the background model.
The system must be able to deal with people who stop and
sit for extended periods of time and not regularly detect them
as abandoned objects. A logic-based system is introduced to
classify detected objects as either an abandoned object or a
still person.
The remainder of this paper is organized as follows.
Section II discusses the method used, providing a brief
description of the standard low-level vision algorithms and a
more detailed look at the logical-level image interpretation.
Section III presents the results, including a description of the
testing procedures used.
Finally, Section IV discusses
conclusions from this work and possible areas for future
research.
II. METHOD DESCRIPTION
The system presented here uses standard algorithms to
perform the low-level image processing to detect and track
blobs.
Using the data collected from these low-level
algorithms, a two-tiered high-level system is used to detect still
objects and to determine which still objects correspond to
actual abandoned objects and which do not. By using such
logic, a person sitting or sleeping on a bench will not be
classified as an abandoned object, while an often occluded true
abandoned object will still be classified correctly. This two-tiered
system uses a short-term and a long-term logic to make
these decisions.
A. Low-Level Processing
Background segmentation is performed using the mixture
of Gaussians method presented by Stauffer and Grimson [6],
but using a complete covariance matrix for every pixel as
described by Atev, et al. [1]. Four Gaussians are used to
represent the color at every pixel. In the interest of real time
operation, after an initialization period, not all frames need to
be processed. Processing only every tenth frame is still a high
enough rate considering the temporal scale at which the events
of interest occur.
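To make the low-level stage concrete, the sketch below shows a
per-pixel mixture-of-Gaussians update in the spirit of [6]. It is a
minimal illustration rather than the system's implementation: the
full covariance matrices of [1] are simplified to one variance per
Gaussian, and the learning rate, match threshold, and background
test are assumed values.

import numpy as np

K = 4              # Gaussians per pixel, as stated above
ALPHA = 0.01       # learning rate (assumed value)
MATCH_SIGMA = 2.5  # match threshold in standard deviations (assumed)

def classify_pixel(x, means, variances, weights):
    """Return True if RGB sample x (length-3 array) is background.

    means: (K, 3) per-Gaussian RGB means; variances: (K,) isotropic
    variances (a simplification of the full covariances in [1]);
    weights: (K,) mixture weights.
    """
    for k in range(K):
        d2 = float(np.sum((x - means[k]) ** 2))
        if d2 < (MATCH_SIGMA ** 2) * variances[k]:
            # Matched component: pull it toward the new sample.
            means[k] += ALPHA * (x - means[k])
            variances[k] += ALPHA * (d2 - variances[k])
            weights[:] = (1 - ALPHA) * weights
            weights[k] += ALPHA
            # Assumed background test: a well-supported component.
            return weights[k] > 0.25
    # No match: replace the weakest Gaussian with the new sample.
    k = int(np.argmin(weights))
    means[k], variances[k], weights[k] = x, 900.0, 0.05
    return False  # foreground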
The background segmentation method is restricted to user-specified
regions of interest in the image. No background
learning is performed on the areas outside this region mask.
The purpose of this is to block out areas of the image where
any background changes detected can safely be considered
noise (such as walls), and to remove areas that are too far from
the camera for accurate abandoned object identification. This
mask is further updated by the long-term logic to prevent
abandoned objects from being learned into the background, as
is explained later. See Fig. 2 for an example of the region
mask.
Noise reduction of the binary foreground mask is
performed using the structural noise reduction algorithm
described by Bevilacqua [2]. The results of this noise reduction
method are qualitatively similar to those of an erode/dilate
operation, but the method more accurately preserves the underlying
shape of the object. Thus, regions corresponding to objects in the
scene, such as people, are emphasized while random noise is
suppressed.
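The structural algorithm of [2] is not reproduced here. As a rough
stand-in under that caveat, the sketch below applies a plain
morphological opening and closing, which the text above describes
as the qualitatively similar, though less shape-preserving,
alternative.

import numpy as np
from scipy import ndimage

def denoise_foreground(fg_mask):
    # Remove isolated foreground specks, then close small holes.
    kernel = np.ones((3, 3), dtype=bool)
    opened = ndimage.binary_opening(fg_mask, structure=kernel)
    return ndimage.binary_closing(opened, structure=kernel)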
Fig. 2 (a) Example scene and (b) region mask. White indicates an area
which is not processed while black indicates that it is.
Blob extraction is then performed on the binary
foreground mask. Correlating the blobs detected in the last
frame with the blobs detected in the current frame, blob
tracking is performed. Blob tracking maintains the current
state of internal variables such as which blobs in the current
frame overlap with which blobs in the previous frame and how
long the blobs have been tracked in a one-to-one manner. This
information is used by the short-term logic to classify the blobs
based upon their recent behavior.
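A minimal sketch of the frame-to-frame overlap correlation is given
below. It assumes blobs arrive as integer-labeled images (0 for
background), a common convention that is not necessarily the
system's actual data layout.

import numpy as np

def correlate_blobs(prev_labels, curr_labels):
    """Map each current blob id to the previous blob ids it overlaps."""
    overlap = {}
    both = (prev_labels > 0) & (curr_labels > 0)
    pairs = np.stack([prev_labels[both], curr_labels[both]], axis=1)
    for p, c in np.unique(pairs, axis=0):
        overlap.setdefault(int(c), set()).add(int(p))
    # Inverting this mapping reveals which blobs continue one-to-one
    # and which frames contain splits or merges.
    return overlap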
B. Short-Term Logic
Blobs are subdivided into four types and are classified as
one of these by the short-term logic. The classification
depends upon their recent behavior within the scene. The blob
types are Abandoned Object (A), Person (P), Still Person (SP),
and Unknown (U). The four blob behaviors considered for
classification are blob creation, blob splits, blob merges, and
blob centroid velocity.
We define a Person Group (PG) as a special set of blobs
used when a person (P) or still person (SP) blob splits. Since
these blob types are by definition a person, then if such a blob
splits, at least one of the new blobs must be a person. A PG is
created containing the new blobs that split from a P or an SP.
All blobs contained within a PG are classified as U until one
of them becomes a P, at which point the person group is
disbanded and all other blobs within it can be classified
normally once more. This is to stop a sitting person from
being incorrectly classified as an abandoned object if they
place a bag beside them that splits from their blob. Without
PGs, both blobs resulting from this split would be classified as
U, and later classified as A. The system remembers that at
least one of these split blobs must be a P or SP by keeping
track of PGs.
We define v_θ to be the threshold velocity above which an
abandoned object, still person, or unknown becomes a person.
Blob velocity is taken to be the centroid velocity of a blob.
The following rules govern the blob classification:
New blob:
    new → U
Velocity changes:
    (v_U = 0) ∧ (U ∉ PG) → A
    (v_P = 0) → SP
    (v_A > v_θ) → P
    (v_SP > v_θ) → P
    (v_U > v_θ) → P, and if U ∈ PG, that PG is removed
Splits:
    A → A_1, A_2
    P → (U_1, U_2) ∈ new PG
    SP → (U_1, U_2) ∈ new PG
    U → U_1, U_2, and if U ∈ PG_i then (U_1, U_2) ∈ PG_i
Merges:
    P ∧ P → P
    P ∧ A → P
    P ∧ SP → P
    P ∧ U → P
    SP ∧ SP → SP
    SP ∧ A → SP
    SP ∧ U → SP
    A ∧ A → A
    A ∧ U → A
    U ∧ U → U
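The velocity and merge rules above can be transcribed directly into
a transition function, as in the sketch below; the value of v_θ and
the handling of Person Group removal by the caller are assumptions.

V_THETA = 5.0  # v_θ in pixels per processed frame (assumed value)

def velocity_update(label, speed, in_person_group):
    # Velocity-change rules from Section II-B.
    if label == "U" and speed == 0 and not in_person_group:
        return "A"
    if label == "P" and speed == 0:
        return "SP"
    if label in ("A", "SP", "U") and speed > V_THETA:
        return "P"  # for a U in a PG, the caller disbands that PG
    return label

PRECEDENCE = ("P", "SP", "A", "U")

def merge_update(a, b):
    # Merge rules: the merged blob takes the dominant label.
    return min(a, b, key=PRECEDENCE.index)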
Fig. 3 Example result plot.
C. Long-Term Logic
The long-term logic takes as input the blob labels from the
short-term logic at every frame. It complements the short-term
logic, which does a good job of initial identification of abandoned
objects and still people, by maintaining the position of stationary
objects (abandoned objects and still people) and then ending the
alert if they are no longer present.
The long-term logic maintains a set of potential abandoned
objects and a set of still people. Potential abandoned objects and
still people are stored identically: as a contour in the image
plane corresponding to their location, along with a timestamp of
when they were first detected.
When an A type blob is found, it is first checked to ensure
that it does not overlap with any items in the potential
abandoned object or still person sets. If it overlaps an existing
object, it is ignored. Otherwise, the blob contour is copied
into the potential abandoned object set and the initial detection
timestamp is set. In a similar manner, when a SP type blob is
found, it is checked to ensure that it does not overlap with any
items in the potential abandoned object or still person sets. If
it overlaps an existing object, it is ignored. Otherwise, it is
added to the still person set.
All items in the potential abandoned object set and the
still person set are checked every time a frame is processed. If
their corresponding area in the binary foreground mask (from
the low-level processing) is filled less than some percentage, p,
then the item is dropped. Appropriate values for p were
empirically found to be between 75% and 80%.
A user defined time threshold t defines how long an item
must exist in the potential abandoned object set before an
alarm is triggered. Thus, if during a check it is discovered that
t time has elapsed since an item was added to the potential
abandoned object set, an alarm is triggered for that object.
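The following sketch consolidates the bookkeeping described in this
subsection. The record layout, the bounding-box simplification of
the stored contours, and the helper functions are illustrative
assumptions; p and t take example values inside the ranges discussed
in the text.

import time
import numpy as np

P_FILL = 0.78    # fill-fraction threshold p (75-80% per the text)
T_ALARM = 60.0   # alarm time threshold t in seconds (user defined)

potential, still_people, alarms = [], [], []

def overlaps(a, b):
    # Assumed helper: axis-aligned (x, y, w, h) box intersection test.
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2] and
            a[1] < b[1] + b[3] and b[1] < a[1] + a[3])

def fill_fraction(fg_mask, box):
    # Assumed helper: fraction of foreground pixels inside the box.
    x, y, w, h = box
    return float(fg_mask[y:y + h, x:x + w].mean())

def on_still_blob(label, box):
    # Ignore A/SP blobs that overlap an already-stored item.
    if any(overlaps(box, it["box"]) for it in potential + still_people):
        return
    item = {"box": box, "t0": time.time()}
    (potential if label == "A" else still_people).append(item)

def check_items(fg_mask):
    # Drop items whose area is no longer filled; raise due alarms.
    for items in (potential, still_people):
        items[:] = [it for it in items
                    if fill_fraction(fg_mask, it["box"]) >= P_FILL]
    for it in potential:
        if time.time() - it["t0"] >= T_ALARM and not it.get("alarmed"):
            it["alarmed"] = True   # the real system also masks the
            alarms.append(it)      # region out of background learning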
After the alarm is triggered, the long-term logic adjusts a
mask used by the background segmentation module so that it
will not improperly learn the abandoned object into the
background. This is only performed after an object has been
positively identified as an abandoned object so that normal
background changes are not erroneously prevented from being
learned into the background.
D. Image Similarity
The Image Similarity module is used by the long-term
logic as a cue as to whether a potential abandoned object is
really an abandoned object or just a very still person. When
the potential abandoned object is first detected, a copy of the
image at its location is saved. At every time step, the area
immediately surrounding the object (a “halo” excluding the
object) is checked for significant foreground activity.
Significant foreground detection in this area indicates that
something may be occluding the object. When there is no
foreground detected in the halo region, it is unlikely that the
object is occluded. If the object is unoccluded, then it should
look the same as it did when it was first discovered. Thus, if
there is no noticeable foreground activity in the halo, the
current image is compared pixel-by-pixel to the stored image
for the potential abandoned object, and an average per-pixel
difference is calculated. An exponential running average of
this difference is then updated. If the value of the exponential
running average exceeds an empirically determined threshold,
the potential abandoned object is deemed to be moving too
much to be a stationary object. Thus, it is then reclassified
from being a potential abandoned object to being a still
person.
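A minimal sketch of this check appears below. The smoothing factor,
the difference threshold, and the stored snapshot layout are assumed
values, not those determined empirically by the system.

import numpy as np

BETA = 0.1          # exponential running-average factor (assumed)
DIFF_THRESH = 20.0  # mean per-pixel difference threshold (assumed)

def similarity_update(item, frame, halo_fg_fraction):
    # Skip the comparison when foreground in the halo suggests that
    # the object may be occluded.
    if halo_fg_fraction > 0.0:
        return False
    x, y, w, h = item["box"]
    region = frame[y:y + h, x:x + w].astype(float)
    diff = float(np.mean(np.abs(region - item["snapshot"])))
    item["avg_diff"] = (1 - BETA) * item["avg_diff"] + BETA * diff
    # True means: moving too much to be stationary; reclassify the
    # potential abandoned object as a still person.
    return item["avg_diff"] > DIFF_THRESH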
III. RESULTS
A. Alarm Description
To test whether an alarm is accurate or not, meaningful
data about the alarm is required. The following information is
recorded for every alarm (an illustrative record layout follows the
list):
1) Identification Number: Unique number used to
distinguish alarms from each other.
2) Start Time: Timestamp corresponding to the time a
potential abandoned object is first detected by the system.
3) Trigger Time: Timestamp corresponding to when the
method sent notification in the form of an alarm.
4) End Time: Timestamp corresponding to when the
abandoned object ceases to be an abandoned object.
5) Image-Plane Location: Location in the camera image
plane where the alarm takes place. This is a bounding box
outlining the abandoned object.
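An illustrative record holding these five fields might look as
follows; the class name and field types are assumptions rather than
the format actually used.

from dataclasses import dataclass

@dataclass
class Alarm:
    ident: int           # 1) unique identification number
    start_time: float    # 2) potential object first detected
    trigger_time: float  # 3) alarm notification sent
    end_time: float      # 4) object ceases to be abandoned
    bbox: tuple          # 5) image-plane bounding box (x, y, w, h)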
B. Ground Truth
The ground truth for a given video sequence is determined
manually for every sequence by a human operator.
Essentially, the operator watches the video sequence and
marks the frame number and image-plane location for the start,
trigger, and end times for every alarm that takes place in that
sequence. This information is then recorded in a machine-readable
format that can be used later to test the performance
of the module.
C. PED/PAT Score Description
The scoring of the performance on a video sequence is
evaluated by using Percent Events Detected (PED) and
Percent Alarms True (PAT) scores. The PED score represents
the ratio of real alarms in the ground truth that were
successfully detected by the module to the total number of
alarms in the ground truth. See Eqn. (1). The PAT score
represents the ratio of alarms that correspond to real alarms in
the ground truth to the total number of alarms detected by the
module (Eqn. (2)). A high PED score indicates that the
module detects most objects that should trigger an alarm. A
high PAT score indicates that the module rarely triggers false
alarms.
    PED = Number of Real Alarms Detected / Number of Real Alarms    (1)

    PAT = Number of Real Alarms Detected / Total Number of Alarms Detected    (2)
Matching of alarms detected by the module to the ground
truth is performed as follows. Every alarm detected by the
module is compared to every alarm in the ground truth, in
order to find the differences between their locations and
timestamps. A candidate match is declared if there is
sufficient spatial proximity and/or overlap between the two
alarms as well as a temporal distance below a specific
tolerance. The candidate matches will usually result in a
many-to-many relationship and can be represented by edges in a
bipartite graph. The next step is to find a one-to-one
relationship. This can be done by finding a maximum
cardinality matching. Because several such matches may exist,
we are interested in the one that has the edges with the least
temporal difference. We therefore give each edge a weight
equal to the timestamp difference between the two alarms and
then find the minimum-weighted maximum-cardinality
matching. Although this does not affect the computed
PED/PAT scores numerically, it is useful for manual checking
of the results. The PED score is calculated to be the maximum
number of these one-to-one matches divided by the number of
alarms in the ground truth. The PAT score is the maximum
number of one-to-one matches divided by the number of
alarms detected by the module.
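As a sketch of this matching step, the Hungarian method from scipy
can stand in for whatever solver was actually used: giving
non-candidate pairs a prohibitively large cost yields a
maximum-cardinality matching over candidate edges that, among such
matchings, minimizes the total timestamp difference. The candidate
test and record layout below are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def boxes_overlap(a, b):
    # Assumed helper: axis-aligned (x, y, w, h) box intersection test.
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2] and
            a[1] < b[1] + b[3] and b[1] < a[1] + a[3])

def match_alarms(detected, truth, max_dt):
    # BIG exceeds any possible sum of candidate weights, so minimizing
    # total cost first maximizes the number of candidate matches and
    # then minimizes their total timestamp difference.
    BIG = 1e9
    cost = np.full((len(detected), len(truth)), BIG)
    for i, d in enumerate(detected):
        for j, g in enumerate(truth):
            dt = abs(d["trigger"] - g["trigger"])
            if dt <= max_dt and boxes_overlap(d["bbox"], g["bbox"]):
                cost[i, j] = dt
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < BIG]

The PED and PAT scores then follow as the number of returned pairs
divided by the number of ground-truth alarms and detected alarms,
respectively.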
TABLE I: TEST VIDEO SEQUENCES
Name: BSHigh
Length: 49m, 03s
Population: Sparse
Description: This sequence was
taken simultaneously with
BSLow, two stories above the
region of interest. Two
participants alternated
wandering around, leaving small
and large packages to test
abandoned object detection
capabilities. Light extra foot
traffic from the general public.
Name: BSLow
Length: 48m, 23s
Population: Sparse
Description: Same as BSHigh,
but only one story above the
region of interest.
Name: MTC8
Length: 1h, 44m, 55s
Population: Dense
Description: This sequence was
captured using one of the
cameras installed at a transit
station. Two individuals wander
around and leave abandoned
objects while the general public
uses the station as normal.
Name: MTC5
Length: 13m, 32s
Population: Sparse
Description: This sequence was
captured using one of the
cameras installed at a transit
station. Two individuals wander
around and leave abandoned
objects while the general public
uses the station as normal.
D. Overall Score Description
The PED and PAT scores for a given video sequence give
a good intuitive feel for how well the algorithm did on that
video in two separate regards, but these scores are lacking in
that they do not provide a very good metric to describe how
well the algorithm does overall. Thus, we define an overall
score as follows:
    Score = ((1 − log_2(x)) × TrueDetected) / (TotalDetected − TotalEvents × log_2(x))    (3)
In Eqn. (3), TrueDetected is the number of true positives,
TotalDetected is the total number of alarms issued,
TotalEvents is the number of alarms in the ground truth data,
and x is the relative importance we wish to give to the PED
score over the PAT score. For instance x = 0 will be a PAT
plot, x = 0.5 will weight PED and PAT equally, and x = 1 will
be a PED plot. For the tests reported here, we use a value of x =
0.75 because we consider finding true alarms more important than
some false positives.
Fig. 4 Result plots for the test sequences described in Table I. PED
is red, PAT is green, and Score is blue. (a) BSHigh; (b) BSLow; (c)
MTC8; (d) MTC5.
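For concreteness, a worked sketch of Eqns. (1)-(3) follows; the
counts in the example call are invented purely for illustration.

import math

def scores(true_detected, total_detected, total_events, x=0.75):
    ped = true_detected / total_events                   # Eqn. (1)
    pat = true_detected / total_detected                 # Eqn. (2)
    lg = math.log2(x)
    score = ((1 - lg) * true_detected /
             (total_detected - total_events * lg))       # Eqn. (3)
    return ped, pat, score

# Example with made-up counts: 8 true positives out of 10 alarms
# issued, against 12 ground-truth events.
print(scores(true_detected=8, total_detected=10, total_events=12))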
E. Result Plots
The result plots are created by varying the time threshold
(temporal tolerance) between 0 and 120 seconds and plotting
the PED, PAT, and overall score for the given sequence using
that time threshold on a scale of 0-100%. The distance
threshold is left unchanged. See Fig. 3 for an example of a
result plot. The result plots indicate what percentage of events
the system can be expected to report to an operator within a given
time after the event takes place. Looking at
Fig. 3, we can see that the PED, PAT, and overall score
functions all rise monotonically as the time threshold
advances.
It is important to note that a low score at very early values
of the time threshold does not mean that nothing will ever get
detected. If the PED score at 40s is 40% while the PED score
at 80s is 60%, 40% of the events in the video are detected
within 40s of their occurrence, while another 20% are detected
between 40s and 80s of their occurrence. The extra 20% detected
between 40s and 80s are still detected; they simply take longer to
be detected than the first 40%. Thus, the result
plots are useful tools for demonstrating when an abandoned
object will be detected after it is placed within a scene.
F. Test Sequences
The test video sequences used are described in Table I. In
total, these test sequences represent approximately 3 hours 36
minutes of footage. All videos have instances of abandoned
objects and still people. There are four videos taken from two
different venues. Each video sequence tests a different
viewpoint.
The densities of people within the scenes are noted in the
sequence descriptions. In a sparse scene, people do not
usually occlude one another (there are not many crowds). In a
dense scene, there are crowds of people so people occlude
each other all the time.
It should be noted that three separate video sequences from the
same camera were agglomerated into the results for MTC8, totaling
the 1h, 44m, 55s length listed in Table I. All of these sequences
were densely populated.
G. Results
The four test sequences were pre-recorded using portable
camcorders or existing security cameras, and then captured at
320x240 in RGB color. For testing, these sequences were
processed on a 3.4 GHz Pentium 4 computer running
Windows XP. The algorithm processed this digital video in
real time, so it is reasonable to expect that similar results
would be found if the video were coming live from a remote
sensor and not pre-recorded.
The result plots for the four example videos can be seen in
Fig. 4. As can be seen in this figure, BSHigh scored extremely
well for both PED and PAT scores. Thus, the method does not
throw many spurious alarms and correctly identifies most true
abandoned objects in the scene. BSLow also scores very well
for the PED score. The PAT score for BSLow is somewhat
low because of segmentation failures due to the large potted
bush in the scene (see Table I). It can be seen how a low PAT
score affects the overall score by comparing the result plots for
BSHigh and BSLow. The result plot for MTC5 shows a
very good PED score but a much poorer PAT score.
It is interesting to note that the three sequences depicting
sparsely populated scenes (BSHigh, BSLow, and MTC5) all
have very uniform overall scores. They are all around 70%.
This indicates that the method should provide fairly uniform
results in such circumstances.
Unfortunately, the results for the densely populated MTC8
video sequences were not as good as the others. The poor
results on MTC8 compared to the sparsely populated scenes are due
to several factors, primarily that blobs in crowded scenes do not
conform to the assumptions of the Short-Term Logic as closely as
blobs in sparse scenes do. Essentially, the Short-Term Logic
predicts that blobs associated with people will have a measurable
velocity, but when a group of people with nonzero individual
velocities walk past each other, the blob corresponding to the
group may well have near-zero velocity. Situations like these can
lead to misclassifications that prevent proper abandoned object
detection. It is important to note that even though the results
are not as good for densely populated scenes as they are for
sparse scenes, dense scenes represent a very real challenge to
any computer vision algorithm. It is hoped that providing
baseline scores for such real-world scenes will encourage
vision researchers to take real-world locations into account and
not stick to safe, sparse, laboratory tests.
IV. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented a method to detect
abandoned objects that works online in real time, uses color
data, can adapt to scene changes around the clock, does not
detect still people as abandoned objects, and detects
abandoned objects even if they are occluded by moving
crowds of people for periods of time. The results presented
show that the method works best in sparsely populated areas
where people are regularly detected separately, ensuring good
input to the short-term logic, which accurately characterizes
the behavior of blobs corresponding to individuals. The
results for densely populated scenes are not as good, indicating
that future research should look into defining a short-term
logic that characterizes the behavior of blobs corresponding to
crowds. For instance, it may be possible to use skin tone
detection or the periodic blob motion corresponding to gait for
person (P type blob) classification in the Short-Term Logic.
ACKNOWLEDGEMENTS
We would like to thank the Department of Homeland
Security, the ITS Institute at the University of Minnesota, and
the National Science Foundation (through grant IIS-0219863)
for their generous support of this research. We would also like
to thank the anonymous reviewers for their constructive
comments.
REFERENCES
[1] S. Atev, O. Masoud, and N. Papanikolopoulos, “Practical mixtures of
Gaussians with brightness monitoring,” Proceedings of IEEE Intelligent
Transportation Systems Conference 2004, pp. 423-428.
[2] A. Bevilacqua, “Effective object segmentation in a traffic monitoring
application,” Proceedings of Indian Conference on Computer Vision,
Graphics, and Image Processing, December 2002.
[3] M. D. Beynon, D. J. Van Hook, M. Seibert, A. Peacock, and D. Dudgeon,
“Detecting abandoned packages in a multi-camera video surveillance
system,” Proceedings of the IEEE Conference on Advanced Video and
Signal Based Surveillance, pp. 221-228, July 2003.
[4] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: real-time surveillance
of people and their activities,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 22, no. 8, pp. 809-830, August 2000.
[5] C. Sacchi and C. S. Regazzoni, “A distributed system for detection of
abandoned objects in unmanned railway environments,” IEEE
Transactions on Vehicular Technology, vol. 49, no. 5, September 2000.
[6] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models
for real-time tracking,” Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, vol. 2, pp. 2246-2252, June 1999.
[7] T. Yang, Q. Pan, S. Z. Li, and J. Li, “Multiple layer based background
maintenance in complex environment,” Proceedings of the Third
International Conference on Image and Graphics, pp. 112-115,
December 2004.