Rule-based real-time detection of context-independent events in video shots ?

Aishy Amer a, Eric Dubois b, Amar Mitiche c

a Electrical and Computer Engineering, Concordia University, Montréal, Québec, Canada. Email: [email protected]
b School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada. Email: [email protected]
c INRS-Télécommunications, Université du Québec, Montréal, Québec, Canada. Email: [email protected]

? This work was supported, in part, by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Abstract
The purpose of this paper is to investigate a real-time system to detect context-independent events in video shots. We test the system in video surveillance environments with a fixed camera. We assume that objects have been segmented (not necessarily perfectly) and reason with their low-level features, such as shape, and mid-level features, such as trajectory, to infer events related to moving objects.
Our goal is to detect generic events, i.e., events that are independent of the context of where or how they occur. Events are detected based on a formal definition of these events and on approximate but efficient world models. This is done by continually monitoring changes in the features and behavior of video objects. When certain conditions are met, events are detected. We classify events into four types: primitive, action, interaction, and composite.
Our system includes three interacting video processing layers: enhancement to estimate and reduce additive noise, analysis to segment and track video objects, and interpretation to detect context-independent events. The contributions in this paper are the interpretation of spatio-temporal object features to detect context-independent events in real time, the adaptation to noise, and the correction and compensation of low-level processing errors at higher layers where more information is available.
The effectiveness and real-time response of our system are demonstrated by extensive experimentation on indoor and outdoor video shots in the presence of multi-object occlusion, different noise levels, and coding artifacts.
Key words:
Event detection, context independence, motion interpretation, occlusion, split,
noise adaptation, video surveillance.
Preprint submitted to Elsevier Science
5 February 2005
1 Introduction
The subject of the majority of video is related to moving objects, in particular
people, that perform activities and interact to create events (Pentland (2000);
Bobick (1997)). The human visual system is strongly attracted to moving objects that create luminance changes (Nothdurft (1993); Franconeri and Simons (2003); Abrams and Christ (2003)).
Generally, a video displays multiple objects with their low-level features (e.g.,
shape), mid-level features (e.g., trajectory), high-level features (e.g., events
and meaning), and syntax (i.e., spatio-temporal relation between objects).
High-level object features are generally related to the movement of objects.
In this paper, we concentrate on context-independent events with the aim of
providing assistance to users of a video system, for example, in video surveillance to detect object removal, or in visual fine arts applications to extract
interacting persons.
The significant increase of video data in various domains such as video surveillance requires efficient and automated event detection systems. In this paper,
we propose a real-time, unsupervised event detection system. Our system is
based on three video processing layers: video enhancement to reduce noise
and artifacts, video analysis to extract low- and mid-level features, and video
interpretation to infer events.
The remainder of this paper is organized as follows. Section 2 classifies events
into four categories, Section 3 discusses related work, Section 4 gives an
overview of our system, Section 5 proposes our approach to detect context-independent events, Section 6 presents experimental results, and Section 7
contains a conclusion.
2 A classification of video events
An event is an interpreted spatio-temporal relation involving one or several
objects (Benitez et al. (2002); Benitez and Chang (2003); Bremond et al.
(2004)) such as “Object A deposits object B” or “Object A disappears”. In
this paper, we look at events that involve variations of object features at two
successive time instants of a video shot. (A video sequence contains, usually,
many shots. To facilitate event detection, a video sequence has to first be
segmented into shots (see, e.g., Naphade et al. (1998); Bouthemy et al. (1997)).
In this paper, the term video refers to a video shot.)
A video event expresses a particular behavior of a finite set of objects in a
sequence of a small number of consecutive frames of a video shot. In the
simplest case, an event is the appearance of a new object into the scene. In
more complex cases, an event starts when a higher-level behavior of objects
changes.
An event consists of a context-independent (or fixed-meaning) component and a context-dependent component associated with a time and a location. For example, the
event deposit has a fixed context-independent meaning, “the object is added
to the scene”, that is common to all applications but deposit may have a
variable meaning in different contexts such as “the object is added at a specific
significant location”. Events are generally applicable when they convey fixed
meaning independently of context of the input video. A generally applicable
event-detection system detects context-independent events.
Note that we distinguish here between application independence and context
independence. For example, moving fast is context independent as the high-level details of this event do not matter. The threshold to determine which
motion is fast is, however, application dependent. Deposit is context independent as it is detected independently of where and what is being deposited.
The importance of the position of the deposit is context dependent.
We distinguish between four types of events: primitive (e.g., “Object A moves” or “Object C appears”), action (e.g., “Object A moves too fast”), interaction (e.g., “Object A deposits Object C”), and composite events (e.g., “Object A stands”; note that the event stand may be concluded following the events move, slow motion, and stop).
Primitive events describe basic object behavior while action, interaction, and
composite events describe non-basic behavior of objects. A primitive event
is defined directly from the temporal features of a moving object, i.e., a “simple” change of object motion state. Primitive events represent the finest granularity of events. Action, interaction, and composite events are related to “non-simple”
(“complex”) behavior of objects. An action event is related to one moving
object achieving a task and an interaction event is related to two or more objects. A composite event may consist of a combination of 1) several primitive
events involving possibly multiple objects or 2) primitive, action, and interaction events. Composite events directly depend on the goals of the underlying
application.
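As a minimal illustration of this classification, the sketch below models an event record in Python. The names (EventType, Event, event_type, frame, objects) are our own placeholders, not identifiers from the described system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class EventType(Enum):
    PRIMITIVE = "primitive"      # e.g., enter, appear, exit, disappear, move, stop
    ACTION = "action"            # e.g., moves too fast, stays long, dominant
    INTERACTION = "interaction"  # e.g., occlude, deposit, remove, split, at an obstacle
    COMPOSITE = "composite"      # e.g., stand, sit, walk, change direction

@dataclass
class Event:
    name: str                    # event label, e.g., "deposit"
    event_type: EventType        # one of the four classes above
    frame: int                   # time instant n at which the event is detected
    objects: List[int] = field(default_factory=list)  # IDs of the involved objects
```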
3 Related work
Research in the area of detecting high-level video content such as events, especially related to people, has become a central topic in video processing (Bobick
(1997); Lippman et al. (1998); Pentland (2000)). In recent years, research interest in video content detection has shifted from motion and tracking toward
detection and recognition of activities and events.
Most event detection systems are developed for smaller scope applications,
for example, hand sign applications or Smart-Camera based cooking (IEEE
PAMI (2000); Lippman et al. (1998); Vasconcelos and Lippman (1997); Bobick
(1997)). In these systems, prior knowledge is inserted in the event recognition
inference system and the focus is on recognition and logical formulation of
events and actions.
Little work on context-independent detection exists and such systems are rarely
tested in real applications, e.g., in the presence of illumination changes, noise,
or multiple objects with occlusion. The system of Courtney (1997) is based on
prediction and nearest-neighbor matching. The system is able to detect events
such as deposit. It can operate in simple environments where one human is
tracked under the assumption of translational motion. It is limited to applications of indoor environments, does not address occlusion, and is sensitive to
noise. Moreover, the definition of events is limited to a few environments.
The event detection system for indoor surveillance applications of Stringa and
Regazzoni (1998) consists of object detection and event detection modules.
The event detection module classifies objects using a neural network. The
classification includes: abandoned object, person, and object. The system is
limited to one abandoned object event in unattended environments. The definition of abandoned object is specific to the given application. The system
cannot associate abandoned objects and the person who deposited them.
The system of Haering et al. (2000) is designed for off-line processing applications and uses domain knowledge to facilitate detection of events (wildlife hunt
events). The system of Haritaoglu et al. (2000) tracks several people simultaneously and uses appearance-based models to identify people. It determines
whether a person is carrying an object, and can segment the object from the
person. It also tracks body parts such as head or hands. The system imposes,
however, restrictions on the object movements. Objects are assumed to move
upright and with little to no occlusion. Moreover, it can only detect a limited
set of events.
This paper contributes a real-time solution to detect events related to one or
several objects in the presence of challenges such as noise and occlusion. The
solution is based on plausibility rules where no prior knowledge, initialization,
or object models are assumed. This paper contributes, furthermore, a solution to explicitly account for changing object features, e.g., due to object occlusion, splitting, or deformation. This is done by adaptation to video noise, by explicit detection of object occlusion and erroneous object segmentation, and by monitoring features over time. In addition, we introduce plausibility rules to handle errors and confirm temporal reliability and consistency of object matching.
The proposed system is designed for real-time event-oriented applications such
as video surveillance.
4 Overview of the proposed system
It is widely agreed that modules of a layered video system cannot be optimized independently of each other. In our system, we study video processing
modules with emphasis on integration into a layered system. To balance effectiveness and efficiency, the proposed system aims 1) at stable modules that
forego the need for feature precision, 2) at context-independence, 3) at error correction by higher modules where more information is available, 4) at
real environments including multiple objects, occlusions and artifacts, and 5)
at low computational cost. Stability means handling inaccuracies and false alarms, for example, avoiding that an object split is mistaken for a deposit event.
To achieve the above goals, our layered system is divided into simple but effective tasks avoiding complex operations, and involves three interacting processing layers: video enhancement (i.e., noise estimation, Amer et al. (2002),
and noise reduction, Jostschulte and Amer (1998)), video analysis (i.e., object
segmentation and tracking, Amer (2003b,a)), and video interpretation (i.e.,
event detection as described in this paper).
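To make the layering concrete, the following sketch shows how such a per-frame, three-layer loop could be organized. The five callables stand in for the cited enhancement, analysis, and interpretation modules; their names and signatures are assumptions, not the authors' interfaces.

```python
def process_shot(frames, estimate_noise, reduce_noise,
                 segment_objects, track_objects, detect_events):
    """Per-frame loop over the three layers of the proposed system (a sketch)."""
    previous_objects = []
    for n, frame in enumerate(frames):
        # Layer 1: video enhancement (noise estimation and reduction)
        sigma_n = estimate_noise(frame)
        enhanced = reduce_noise(frame, sigma_n)
        # Layer 2: video analysis (pixels -> objects -> tracked video objects)
        regions = segment_objects(enhanced, sigma_n)
        objects = track_objects(regions, previous_objects)
        # Layer 3: video interpretation (video objects -> events)
        events = detect_events(objects, previous_objects, n)
        yield objects, events
        previous_objects = objects
```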
In this paper, events are defined by approximate but efficient world models. This is done by continually monitoring changes in the behavior and low-level features of the scene's objects. When certain conditions are met, events are
detected. To achieve higher applicability, events are detected independently
of the context of the video. Several context-independent events are rigorously
defined and automatically detected using features detected following video
analysis (i.e., object segmentation, motion estimation, and tracking). We do
not, however, assume perfect video analysis. Possible errors of the video analysis are subsequently corrected or compensated at higher modules.
Fig. 1 displays a block diagram of the proposed system which can be viewed as
a system of methods and algorithms to build automated scene interpretation.
Besides applications such as video surveillance, outputs of the proposed system can serve as input to a video understanding or symbolic reasoning system.
[Fig. 1 block diagram: video shot frames I(n-1), I(n) → video enhancement (noise estimation and/or noise reduction, σn) → enhanced video → video analysis (motion-based object segmentation: pixels to objects; object-based motion estimation; voting-based object tracking: objects to video objects) → spatio-temporal object features → video interpretation (analysis and interpretation of low-level features; event detection and classification: video objects to events) → results and requests (events and objects).]
Fig. 1. Block diagram of the proposed system. σn represents the estimated noise
standard deviation. “Requests” depend on the specific design of an event-based
surveillance system. For instance, a request can be “online” streaming of events to
a control site or “offline” storage of events in a given format.
At two successive time instants, the system outputs a list of objects with
their low- and mid-level features and related events. Information is passed
to the next frame through a linked list of objects and their features which
are set or updated throughout the layered system. Features include identity
(ID), age (the number of frames since the object appeared), perimeter, area, minimum bounding box (MBB), start position, split data (e.g., current status and ID of the object it split from), occlusion data (e.g., status and
ID of the object that occludes, or is occluded by, this object), recent motion,
corresponding confidence, link to the corresponding object, and related events.
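A minimal sketch of such a per-object record is given below; the field names follow the list above, but the concrete types and defaults are our assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class VideoObject:
    obj_id: int                                        # identity (ID)
    age: int = 0                                       # frames since the object appeared
    perimeter: float = 0.0
    area: float = 0.0
    mbb: Tuple[int, int, int, int] = (0, 0, 0, 0)      # (r_min, r_max, c_min, c_max)
    start_position: Tuple[int, int] = (0, 0)           # centroid at first appearance
    split_from: Optional[int] = None                   # ID of the object it split from
    occluded_by: Optional[int] = None                  # ID of the occluding object
    motion: Tuple[float, float] = (0.0, 0.0)           # recent displacement (wx, wy)
    confidence: float = 1.0                            # confidence of the match
    matched_to: Optional[int] = None                   # link to corresponding object in I(n-1)
    events: List[str] = field(default_factory=list)    # events detected for this object
```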
5 Detection of context-independent events
In this section, we describe in detail the video interpretation module (Fig. 1)
that detects events based on a formal definition of these events and on approximate
but efficient world models. We propose perceptual descriptions of events that
can be used and easily extended in a wide range of surveillance applications.
For generality, event detection is not based on geometry of objects but on
their features and relations over time.
The video analysis module (Fig. 1) represents objects by their spatio-temporal
low-level features in temporally linked lists. Object features are stored for each
frame and compared as frames of a shot arrive. To detect events, the video
interpretation module simultaneously monitors the behavior and features of
each object in the scene. Events are automatically detected by combining
trajectory information and spatial features, such as size and location. When
specific conditions are met, events related to these conditions are detected.
Behavior monitoring is done on-line, i.e., object data is analyzed as it arrives
and events are detected as they occur.
In the following, define:
• I(n), I(n − 1), I(q), frames of the input shot at time instants n, n − 1, and q;
• Oic, Ojc, segmented objects in the current frame I(n);
• Okp, Olp, segmented objects in the previous frame I(n − 1);
• Mli : Olp → Oic, a function assigning Olp ∈ I(n − 1) to an Oic ∈ I(n);
• M0i : ∅ → Oic, a zero-match function matching an Oic ∈ I(n) to no object in I(n − 1);
• Ml0 : Olp → ∅, a zero-match function matching an Olp ∈ I(n − 1) to no object in I(n);
• Bi, the MBB of Oic;
• gi, the age of Oic, i.e., the number of frames since its appearance or entrance;
• ei = (exi, eyi), the centroid of Oic;
• dij, the distance between the centroids of Oic and Ojc;
• rmin, the upper row of the MBB of an object;
• rmax, the lower row of the MBB of an object;
• cmin, the left column of the MBB of an object; and
• cmax, the right column of the MBB of an object.
The following are samples of generic events detected automatically by our
system following the classification in Section 2. Note that thresholds used in
these sample events are adaptive as described in Section 5.6.
5.1 Primitive events
Primitive events are related to “simple” behavior of objects. In the following
we define the events enter, appear, exit, disappear, move, and stop.
Oic enters at time instant n if 1) Oic ∈ I(n), 2) Oic ∉ I(n − 1), i.e., zero match M0i : ∅ → Oic, and 3) ei is at the border of I(n) (this condition depends on the object motion and the frame rate). Our definition aims at detecting object entrance as soon as an object portion becomes visible. In some applications, only entering objects of specific size, motion, or age are of interest. This can easily be integrated into our definition.
Oic appears at time instant n if 1) Oic ∈ I(n), 2) Oic ∉ I(n − 1), i.e., zero match in I(n − 1): M0i : ∅ → Oic, and 3) ei is not at the border of I(n). Note that an object can either enter or appear at a time instant n.
Okp exits at time instant n if 1) Okp ∈ I(n − 1), 2) Okp ∉ I(n), i.e., zero match in I(n): Mk0 : Okp → ∅, 3) ek is at the border of I(n − 1), and 4) gk > tg, where gk is the age of Okp and tg is the age threshold (Sec. 5.6).
Okp disappears at time instant n if 1) Okp ∈ I(n − 1), 2) Okp ∉ I(n), i.e., zero match Mk0 : Okp → ∅ in I(n), 3) ek is not at the border of I(n − 1), and 4) gk > tg.
Oic moves at time instant n if 1) Mki : Okp → Oic and 2) MED(|wic(n − q)|) > ts with 0 ≤ q ≤ m, where MED(·) is the median filter and wic(n − q) denotes the motion of Oic in frame I(n − q). Condition 2) means that the median of the motion magnitudes of Oic ∈ I(n) over the last m frames is larger than the motion threshold ts. m is the filter order (Sec. 5.6). Note that there is no delay in detecting this event because motion data from previous frames are available.
Oic stops at time instant n if 1) Mki : Okp → Oic and 2) MED(|wic(n − q)|) ≤ ts with 0 ≤ q ≤ m.
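As an illustration, a sketch of the enter/appear and move/stop rules follows. It assumes the VideoObject record sketched above, a per-object motion history, and plain parameters for the adaptive thresholds of Sec. 5.6; the border margin and default values are our assumptions.

```python
from statistics import median

def detect_enter_or_appear(obj, frame_shape, border_margin=8):
    """Primitive events enter/appear: obj has no match in I(n-1) (zero match);
    'enter' if its centroid lies near the frame border, otherwise 'appear'.
    border_margin is an assumed pixel tolerance, not a value from the paper."""
    if obj.matched_to is not None:          # a match M_ki exists, so neither event applies
        return None
    rows, cols = frame_shape
    r, c = obj.start_position               # centroid at first appearance
    at_border = (r < border_margin or c < border_margin or
                 r > rows - border_margin or c > cols - border_margin)
    return "enter" if at_border else "appear"

def detect_move_or_stop(motion_history, t_s=1.0, m=3):
    """Primitive events move/stop: median of the motion magnitudes over the
    last m frames compared against the motion threshold t_s (Sec. 5.6)."""
    recent = motion_history[-(m + 1):]      # frames I(n-q), 0 <= q <= m
    if not recent:
        return "stop"
    magnitudes = [(wx ** 2 + wy ** 2) ** 0.5 for wx, wy in recent]
    return "move" if median(magnitudes) > t_s else "stop"
```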
5.2 Action events
An action event is related to “non-simple” (“complex”) behavior of a single object. In this paper, an action is detected when an object moves abnormally or when the object is dominant.
Oic moves abnormally if its movement is either frequent (i.e., fast motion or short stay) or rare (i.e., slow motion or long stay).
Oic stays long in the scene if 1) gi > tg , i.e., Oic does not exit the scene after a
time tg, and 2) di < tsl, i.e., the distance di between the current I(n) and past
I(q) position of Oic (q < n) is less than a threshold tsl (Sec. 5.6).
Oic moves too fast (or moves too slow) if the object speed in the last m frames
is larger (smaller) than a threshold tmf (tms ) (Sec. 5.6).
Oic is dominant if it performs a “significant” predefined event such as it has
the largest speed, age, or size, at time instant n.
Note that for a complete description of action events, some details on the
underlying application are required. For example, the event moving too fast
requires details about the camera position to be complete.
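A sketch of the moves-too-fast/too-slow and stays-long checks might look as follows. The thresholds t_mf, t_ms, t_g, t_sl are the application-dependent values of Sec. 5.6, and computing an average speed from the stored motion history is our simplification of "the object speed in the last m frames".

```python
def detect_speed_action(motion_history, t_mf=9.0, t_ms=3.0, m=3):
    """Action events: object speed over the last m frames compared with the
    move-fast (t_mf) and move-slow (t_ms) thresholds of Sec. 5.6."""
    recent = motion_history[-m:]
    if not recent:
        return None
    speeds = [(wx ** 2 + wy ** 2) ** 0.5 for wx, wy in recent]
    avg_speed = sum(speeds) / len(speeds)
    if avg_speed > t_mf:
        return "moves too fast"
    if avg_speed < t_ms:
        return "moves too slow"
    return None

def detect_stays_long(obj, current_centroid, t_g=25, t_sl=20.0):
    """Action event 'stays long': the object has not exited after t_g frames and
    its distance from a past (here: start) position stays below t_sl."""
    r0, c0 = obj.start_position
    r, c = current_centroid
    return obj.age > t_g and ((r - r0) ** 2 + (c - c0) ** 2) ** 0.5 < t_sl
```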
5.3 Interaction events
An interaction event is related to “non-simple” behavior of two or more objects. In this paper, we define the events occlude/occluded, expose/exposed,
remove/removed, deposit/deposited, split, and at an obstacle.
Object occlusion can result from either real object occlusion (i.e., objects
physically get connected) or segmentation errors (i.e., object images overlap
although objects are distant in world coordinates). In both cases, an object
segmentation outputs one region consisting of the occluding and occluded
objects.
Oic occludes Ojc at time instant n if Eq. 1 (see Section 5.5) is valid and if the size (area) of Oic is larger than that of Ojc. In occlusion, at least two objects are involved. The object with the larger area is defined as the occluding object, the other the occluded object; exposure is detected when occlusion ends.
Okp removes Olp at time instant n if 1) Okp and Olp were occluded in I(n − 1), 2) Olp ∉ I(n), i.e., zero match Ml0 : Olp → ∅, and 3) the area Al of Olp is smaller than that of Okp, i.e., Al/Ak < ta, ta being the size threshold (Sec. 5.6). Note that removal is detected after occlusion. When occlusion is detected the system predicts the occluded objects. In case of removal, the features of the removed object can change significantly and the proposed system may not be able to follow the removed object. The conditions for removal are checked and, if they are met, removal is declared. The object with the larger area is the remover, the other is the removed object.
Oic deposits Ojc at time instant n if 1) Okp ∈ I(n − 1), Oic, Ojc ∈ I(n), and Mki : Okp → Oic, 2) Ojc ∉ I(n − 1), i.e., no match M0j : ∅ → Ojc, 3) Aj/Ai < ta, ta being the size threshold, 4) [Ai + Aj ≈ Ak] ∧ [(Hi + Hj ≈ Hk) ∨ (Wi + Wj ≈ Wk)], where Ai, Hi, and Wi are the area, height, and width of Oic, 5) Oic changes in height or width between I(n − 1) and I(n) at the MBB side s, and 6) Ojc is close to a side s of the MBB of Oic, s ∈ {rmini, rmaxi, cmini, cmaxi}. Define djs as the distance between the MBB side s and Ojc. Ojc is close to s if 1 < djs < tjs with the closeness threshold tjs (Sec. 5.6).
Deposit is considered only if the distance between the deposited object and the depositor is large (i.e., the depositor moves away from the deposited object). Otherwise Ojc is assumed to have split from Oic and is merged back into Oic. To reduce false alarms, deposit is declared only if the deposited object remains in the scene for some time. Thus the system differentiates between deposit and segmentation errors. It can also differentiate between stopping objects (e.g., a seated person) and deposited objects.
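A condensed sketch of the deposit test follows; it checks the size-ratio, size-conservation, and closeness-to-an-MBB-side conditions above. Implementing the ≈ comparisons with a relative tolerance (rel_tol) and omitting condition 5) are our simplifications.

```python
from math import isclose

def detect_deposit(depositor, new_obj, parent_prev, t_a=0.4, t_js=18, rel_tol=0.15):
    """Interaction event 'deposit': the depositor O_i^c (matched to O_k^p = parent_prev)
    leaves behind a new, unmatched, much smaller object O_j^c = new_obj (a sketch)."""
    # 2) the new object has no match in I(n-1)
    if new_obj.matched_to is not None:
        return False
    # 3) the deposited object is much smaller than the depositor
    if new_obj.area / depositor.area >= t_a:
        return False
    # 4) areas and one MBB dimension of the two current objects add up to the parent's
    hi = depositor.mbb[1] - depositor.mbb[0]; wi = depositor.mbb[3] - depositor.mbb[2]
    hj = new_obj.mbb[1] - new_obj.mbb[0];     wj = new_obj.mbb[3] - new_obj.mbb[2]
    hk = parent_prev.mbb[1] - parent_prev.mbb[0]; wk = parent_prev.mbb[3] - parent_prev.mbb[2]
    if not isclose(depositor.area + new_obj.area, parent_prev.area, rel_tol=rel_tol):
        return False
    if not (isclose(hi + hj, hk, rel_tol=rel_tol) or isclose(wi + wj, wk, rel_tol=rel_tol)):
        return False
    # 6) the new object lies close to (but not on) one side of the depositor's MBB
    r, c = new_obj.start_position
    r_min, r_max, c_min, c_max = depositor.mbb
    d_side = min(abs(r - r_min), abs(r - r_max), abs(c - c_min), abs(c - c_max))
    return 1 < d_side < t_js
```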
Okp splits into Oic1 and Oic2 at time instant n if Eq. 2 is valid (see Section 5.5). An object split can be real (in the case of an object deposit) or due to object segmentation errors. The main difference between deposit and split is that a split object stays close to the splitter, while a depositor moves away from the deposited object and they become widely separated.
Often, objects move close to static background objects (obstacles) which may occlude part of the moving objects. This is particularly frequent in traffic scenes where objects move close to road signs. An object segmentation is not able in such cases to detect pixels occluded by an obstacle. A moving object is then split into two or more regions. This is different from an object split because a gradual, rather than abrupt, change of object features occurs. The proposed system detects objects at obstacles by monitoring the features of each object, Oic, in the scene. If a gradual decrease, or increase, of a feature of Oic is detected, Oic is labeled accordingly.
Olp is at an obstacle 1 if 1) Olp, Okp ∈ I(n − 1) have appeared or entered, 2) Olp has no corresponding object in I(n), i.e., zero match in I(n) with Ml0 : Olp → ∅, 3) Ak (area) was monotonically decreasing in the last m frames, 4) Olp and Okp are somewhat similar, i.e., Mlk : Olp → Okp, 5) the motion direction of Olp does not change if matched to Oic ∈ I(n), and 6) Olp is close to Okp, where Al was monotonically increasing in the last m frames and Okp has a corresponding object, i.e., Mki : Okp → Oic.
1 While the depicted transition figures for at an obstacle show two objects, the original object gets its ID back when motion at the obstacle is detected.
5.4 Composite events
A composite event is related to two or more primitive, action, or interaction
events. We propose here the following perceptual definitions of the composite
events stand, sit, walk, lost, found, approach a restricted site, change direction,
change speed, and special movement. Our main goal is to show the extensibility
of the proposed system.
The events stand and sit are characterized by a continual change in the height and width of the object MBB. Oic stands if 1) its motion in the last m frames is very small and 2) it experiences a continual strong increase in its MBB height and a continual slight decrease in its MBB width.
Oic sits if it experiences a continual increase of the width and a decrease of the height of its MBB. In both events, height and width need to be compared to their values at the time instant before they started to change.
Oic walks if it experiences a continual moderate movement over a significant
period of time (i.e., for a given number of frames). When objects walk they
experience motion without an abrupt change in their MBB.
Oic is lost if it has no corresponding object in the current frame and occlusion
was previously reported but not removal. It is similar to the event disappear.
Some applications require the search for lost objects even if they are not in
the scene. To allow the system to find lost objects, features, such as ID, size,
or motion, of lost objects need to be stored for future reference. If the event
lost is detected and a new object appears in the scene which shows similar
features to the lost object, the objects can be matched and the event found
declared.
The event approaches a restricted site is straightforward to detect when the
location of the restricted site is predefined. In this case, the direction of the
object’s motion and distance to the site can be monitored to detect the event.
Based on a registered motion direction, the motion directions in the last m frames previous to I(n) are compared with the registered motion direction. If the current motion direction deviates from the motion direction in each of the m frames, the event changes direction can be declared. Similarly, the event changes speed can be detected.
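A sketch of such a change-of-direction check is shown below; representing directions as angles and using a fixed angular tolerance (angle_tol) are our assumptions about how "deviating" would be measured.

```python
from math import atan2, pi

def detect_change_direction(motion_history, registered_direction, m=3, angle_tol=pi / 6):
    """Composite event 'changes direction': the motion direction deviates from the
    registered direction by more than angle_tol in each of the last m frames."""
    def angle(w):
        wx, wy = w
        return atan2(wy, wx)

    def deviation(a, b):
        d = abs(a - b) % (2 * pi)
        return min(d, 2 * pi - d)          # smallest angular difference

    recent = motion_history[-m:]
    if not recent:
        return False
    return all(deviation(angle(w), registered_direction) > angle_tol for w in recent)
```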
Often, a scene contains events which have never occurred before or occur
rarely. This is application dependent. The event special movement can be
defined as a chain of simple events, for example, enters, moves through the
scene, and disappears.
5.5 Monitoring the video analysis
A good event detection technique should account for errors of previous steps.
Object segmentation, for example, is likely to output erroneous object masks
and features. This system accounts for the correction of two major errors,
object occlusion and split, based on plausibility rules and prediction strategies.
To monitor occlusion or split, our system monitors changes at all sides of the
object bounding box, both horizontal and vertical displacements, and uses
the distance between objects. The corrected new object information is then
used to update the previous analysis steps on the previous frame so that the
correction is done between two successive time instants.
5.5.1 Monitoring object occlusion
Occlusion may occur when two objects physically meet or when their images
overlap while being distant. This means that the minimum bounding boxes of
the two objects physically overlap in the frame. When objects get connected,
at least one side of the object’s four MBB sides will experience a significantly
large displacement. Define:
• Oic ∈ I(n), the occlusion of Okp ∈ I(n − 1) and Olp ∈ I(n − 1) in I(n) where
Mki : Okp → Oic ;
• dkl , the distance of the centroids of Okp and of Olp ;
• wkp = (wx , wy ), the current displacement of Okp , i.e., between I(n − 2) and
I(n − 1).
• wvl , the vertical displacement of the lower MBB row of Okp ;
• wvu , the vertical displacement of the upper MBB row of Okp ;
• whr , the horizontal displacement of the right MBB column of Okp ;
• whl , the horizontal displacement of the left MBB column of Okp ; and
• tw , td , the displacement and distance thresholds (inversely proportional to
the frame rate R, see Sec. 5.6).
Object occlusion is detected (Eq. 1) if 1) the current displacement and the
displacement of one of the four sides of the MBB of Okp significantly differ, 2)
the displacement of that MBB side is outward, and 3) there exists an object
Olp ∈ I(n − 1) close to Okp .
[(|wx − whr| > tw) ∧ (whr > 0) ∧ (dkl < td)] ∨
[(|wx − whl| > tw) ∧ (whl > 0) ∧ (dkl < td)] ∨
[(|wy − wvl| > tw) ∧ (wvl > 0) ∧ (dkl < td)] ∨
[(|wy − wvu| > tw) ∧ (wvu > 0) ∧ (dkl < td)].    (1)
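A direct transcription of Eq. 1 into code might read as follows. The MBB-side displacements are passed in explicitly, since how they are measured (and the sign convention for "outward") belongs to the analysis layer; t_w and t_d are the adaptive thresholds of Sec. 5.6.

```python
def occlusion_condition(w, side_disp, d_kl, t_w, t_d):
    """Eq. 1: occlusion is flagged if, for some MBB side, the side displacement
    differs strongly from the object displacement w = (wx, wy), points outward
    (positive by convention here), and a neighboring object lies within t_d."""
    wx, wy = w
    w_hr, w_hl, w_vl, w_vu = side_disp   # right/left column and lower/upper row displacements
    terms = [
        (abs(wx - w_hr) > t_w) and (w_hr > 0),
        (abs(wx - w_hl) > t_w) and (w_hl > 0),
        (abs(wy - w_vl) > t_w) and (w_vl > 0),
        (abs(wy - w_vu) > t_w) and (w_vu > 0),
    ]
    return d_kl < t_d and any(terms)
```

The split test of Eq. 2 in Section 5.5.2 is identical in structure, except that the side displacement must be inward (negative under this convention) and the distance is measured between the two current regions.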
If occlusion is detected then all (occluding and occluded) objects are labeled. This labeling allows the system to continue monitoring objects in subsequent frames even if they become completely occluded (i.e., even if the occlusion conditions in Eq. 1 are no longer met). This is important since occluded objects might reappear.
If occlusion is detected, the object Oic (the occlusion of Okp and Olp) is split into different objects (two or more, depending on how many objects are connected). Splitting is done by predicting the related objects of I(n − 1) onto I(n) based on their estimated displacements. After prediction, the lists of objects and their features in I(n) and I(n − 1) are updated, for example, by adding Olp to I(n − 1).
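The separation step could be sketched as follows. Shifting the previous objects' MBBs by their estimated displacements and keeping the predictions that overlap the merged region is our reading of the prediction strategy, not the authors' exact procedure.

```python
def separate_occluded_region(merged_mbb, prev_objects):
    """Predict each previously tracked object into the current frame by shifting
    its MBB with its last displacement; keep only predictions that overlap the
    merged (occlusion) region. A hypothetical sketch of prediction-based splitting."""
    def shift(mbb, w):
        r_min, r_max, c_min, c_max = mbb
        wx, wy = w                                  # wx: columns, wy: rows
        return (r_min + int(wy), r_max + int(wy), c_min + int(wx), c_max + int(wx))

    def overlaps(a, b):
        return not (a[1] < b[0] or b[1] < a[0] or a[3] < b[2] or b[3] < a[2])

    predictions = {}
    for obj in prev_objects:
        predicted = shift(obj.mbb, obj.motion)
        if overlaps(predicted, merged_mbb):
            predictions[obj.obj_id] = predicted     # predicted MBB inside the occlusion
    return predictions
```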
5.5.2 Monitoring object split
A segmentation method may divide an object into several regions. Note that
when objects split, at least one side of the object’s four MBB sides will experience significantly large displacement. Assume Okp ∈ I(n − 1) is erroneously
split in I(n) into Oic1 and Oic2 and Mki1 : Okp → Oic1 . Define:
• di12 , the distance between Oic1 , Oic2 ; and
• w = (wx , wy ), the current displacement of Okp between I(n−2) and I(n−1).
• (Recall the displacement tw and the distance td thresholds.)
An object split is declared (Eq. 2) if 1) the current displacement and the
displacement of one of the four sides of the MBB of Okp significantly differ, 2)
the displacement of that MBB side is inward, and 3) there is an object Oic2
close to Oic1 in I(n).
[(|wx − whr| > tw) ∧ (whr < 0) ∧ (di12 < td)] ∨
[(|wx − whl| > tw) ∧ (whl < 0) ∧ (di12 < td)] ∨
[(|wy − wvl| > tw) ∧ (wvl < 0) ∧ (di12 < td)] ∨
[(|wy − wvu| > tw) ∧ (wvu < 0) ∧ (di12 < td)].    (2)
If splitting is detected, then the two regions Oic1 and Oic2 are merged into one object Oic so as to reduce the total number of regions to process and thus improve algorithm performance.
5.5.3 Temporal consistency
We have applied the following plausibility adaptation that aims at temporal
consistency of the event detection:
• Objects are followed once they enter and also during occlusion.
• Two objects are matched only if the estimated motion direction is consistent
with those in previous frames.
• If, during the matching (e.g., after merging), a better correspondence than
a previous one is found, the matching is revised.
5.6 Selection of thresholds
Due to various distortions such as noise, feature inaccuracies are possible. The
proposed system uses thresholds to compensate for such distortions. We adapt
most of these thresholds to parameters of the input video such as frame rate
R, frame size F , and object size A. For example, if objects are large, then a
larger error can be tolerated than when they are small. This adaptation allows
better distinction at smaller sizes and stronger matching at larger sizes and
thus keeps the thresholds useful for the entire sequence even as the size of the object varies.
The proposed system requires the specification of the following thresholds:
• tg : the age (temporal) threshold with typical values between 10 and 25
(e.g., see the exit event). tg is a function of the frame rate and the minimal
allowable object speed.
• ts : the motion threshold with typical values between 0 and 2 (e.g., see the
move event).
• tsl : the stay-for-long (temporal) threshold is a function of the frame rate R,
the object motion w, and the frame size F .
• tmf : the move-fast (temporal) threshold with typical values between 8 and
10. tmf is inversely proportional to R. Due to the projective nature of video
shots these thresholds should depend on the position of the camera (but the
camera position is currently not integrated in our system).
• tms : the move-slow (temporal) threshold with typical values between 2 and
4.
• tw : the displacement (temporal) threshold which is inversely proportional
to R (e.g., see the occlusion event).
• td : the distance (temporal) threshold which is inversely proportional to R
(e.g., see the split event).
• ta : the size (spatial) threshold with typical values between 0.25 and 0.50
(e.g., see the remove event).
• tjs : the closeness (spatial) threshold with typical values between 15 and 20
pixels (e.g., see the deposit event).
• m: the order threshold (number of previous frames) with typical values
between 3 and 5 (e.g., see obstacle event). m depends on A and on the
position of the camera.
To adapt a “spatial” threshold tsa from the list above, we apply Eq. 3, which aims at better distinction at small object and frame sizes and stronger matching at large sizes:

tsa = tsamin                                                  if A ≤ Amin
tsa = tsamin + (tsamax − tsamin) · A/Amax + tsamin · Fmax/F   if Amin < A ≤ Amax      (3)
tsa = tsamax                                                  if A > Amax

with Amax = a · F, Amin = b · F, and b = 0.0001 · a. A represents the object size, Amin (Amax) the minimum (maximum) size, F the frame size, and Fmax the maximum (e.g., HDTV) frame size. Typical values for a are between 0.1 and 0.25. tsamin and tsamax were set experimentally.
To adapt a “temporal” threshold tta to the frame rate R, we applied Eq. 4

tta = α · (1/R)      (4)

with the value of α between 190 and 300. Eq. 4 keeps our system effective when objects are more distant between frames, i.e., at low R.
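A compact sketch of both adaptation rules follows; the parameter names mirror Eqs. 3 and 4 (as reconstructed above), and the default constants are only illustrative stand-ins for the experimentally set values.

```python
def adapt_spatial_threshold(A, F, t_min, t_max, a=0.2, F_max=1920 * 1080):
    """Eq. 3: adapt a spatial threshold between t_min and t_max according to
    object size A and frame size F (a sketch of the reconstructed formula)."""
    A_max = a * F
    A_min = 0.0001 * a * F            # b = 0.0001 * a
    if A <= A_min:
        return t_min
    if A > A_max:
        return t_max
    return t_min + (t_max - t_min) * A / A_max + t_min * F_max / F

def adapt_temporal_threshold(R, alpha=250.0):
    """Eq. 4: temporal thresholds are inversely proportional to the frame rate R."""
    return alpha / R
```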
6 Results and applications
To show the performance of our system, we report on experiments on various
video shots, containing over 10,000 frames with multi-object occlusion, noise,
and artifacts, of indoor and outdoor real environments with a fixed video
camera.
The main difficulties in outdoor shots are illumination changes, occlusion, and
shadows. We used both test sequences commonly referenced in related work
and others captured in typical surveillance environments (using different video cameras such as a Web-Cam, Lew et al. (2004)). The sequences used include multiple (at least two) objects (people and vehicles) moving in different environments. Most scenes include changes in illumination and shadowed regions, and the target objects vary in features such as speed and shape.
The proposed system, implemented in C, provides a real-time response for video sequences at rates of up to 10 frames per second on a multitasking SUN Ultra-SPARC 360 MHz without specialized hardware. Computational cost is measured in seconds per CIF (352×288) frame. No attention was paid to optimizing the software. The computational time of the proposed three-layer system depends on the size of the involved objects, with an average of 0.14 seconds per frame and a worst case of 0.35 seconds. The worst case results from the separation of large occluded objects, where the tracking processing time rises significantly from 0.001 to 0.2 seconds per frame. We are currently investigating a more efficient implementation of the separation of objects to increase the performance of the proposed system to 30 frames per second.
In the following two sections, we show the reliability of the proposed system
using sample results for event-based video summarization and key-frame extraction. The system’s reliability is due to 1) the spatio-temporal threshold
adaptation and plausibility rules for temporal reliability and consistency and
2) the explicit detection of object occlusion and segmentation errors.
6.1 Event-based video summary
The system is applied here to summarize video shots based on “significant”
events. For a given application, the selection of “significant” events can be
easily integrated into our system. In some applications, for instance, a detailed description of the occlusion process is needed, e.g., a list of all frames during an occlusion. Tables 1–3 show samples of video shot summaries.
Table 1
Summary of the shot ‘Stair’ (frames 1-1475).

Frame | Event     | ObjID | Age | Status    | Position (start/present) | Motion (present) | Size (start/present)
172   | Enter     | 2     | 8   | Move      | (312,248)/(308,230)      | (-2,-2)          | 5746/5746
216   | Enter     | 3     | 8   | Move      | (184,186)/(167,169)      | (-2,-2)          | 7587/7587
234   | Exit      | 2     | 69  | Exit      | (312,248)/(337,282)      | (3,16)           | 11803/180
479   | Appear    | 4     | 8   | Stop      | (128,104)/(127,100)      | (0,0)            | 211/211
547   | Disappear | 3     | 338 | Disappear | (184,186)/(114,67)       | (0,-1)           | 6374/137
608   | Enter     | 7     | 8   | Move      | (120,88)/(125,79)        | (1,1)            | 2536/2536
678   | Enter     | 8     | 8   | Move      | (138,85)/(131,85)        | (-1,0)           | 5308/5308
803   | Exit      | 7     | 202 | Exit      | (120,88)/(337,282)       | (3,20)           | 3432/199
812   | Disappear | 8     | 141 | Disappear | (138,85)/(121,92)        | (0,0)            | 4334/73
916   | Enter     | 9     | 8   | Move      | (11,151)/(23,159)        | (3,0)            | 2866/2866
1179  | Exit      | 9     | 270 | Exit      | (11,151)/(127,72)        | (0,0)            | 4955/1388
1290  | Enter     | 16    | 8   | Stop      | (123,72)/(128,77)        | (0,0)            | 2737/2737
1363  | Exit      | 16    | 80  | Exit      | (123,72)/(83,154)        | (-5,1)           | 3844/15910
Table 2
Summary of the shot ‘Highway’ (frames 1-300).

Frame | Event                | ObjID | Age | Status    | Position (start/present) | Motion (present) | Size (start/present)
8     | Appear               | 1     | 8   | Move      | (306,216)/(272,167)      | (-5,-5)          | 1407/1407
11    | Enter                | 2     | 8   | Move      | (344,205)/(340,199)      | (-1,-1)          | 543/543
45    | Appear               | 3     | 8   | Stop      | (148,39)/(148,39)        | (0,0)            | 148/148
158   | Appear               | 4     | 8   | Move      | (142,69)/(138,74)        | (-1,1)           | 157/157
182   | Disappear            | 1     | 181 | Disappear | (306,216)/(173,41)       | (0,0)            | 809/34
200   | Appear               | 5     | 8   | Move      | (336,266)/(295,191)      | (-8,-8)          | 1838/1838
201   | Move Fast            | 5     | 9   | Move      | (336,266)/(288,181)      | (-8,-8)          | 1588/1588
204   | Exit                 | 4     | 53  | Exit      | (142,69)/(8,273)         | (-9,14)          | 191/471
204   | Occlusion            | 5     | 12  | Move      | (336,266)/(271,157)      | (-5,-8)          | 1103/1103
204   | Occlusion (by Obj 5) | 2     | 201 | Stop      | (344,205)/(289,129)      | (0,0)            | 822/555
269   | Exit                 | 3     | 231 | Exit      | (148,39)/(3,230)         | (-6,7)           | 156/391
279   | Abnormal             | 2     | 276 | Stop      | (344,205)/(291,129)      | (0,0)            | 822/563
Table 3
Summary of the shot ‘FloorSurvey’ (frames 1-826).

Frame | Event                 | ObjID | Age | Status  | Position (start/present) | Motion (present) | Size (start/present)
36    | Appear                | 1     | 8   | Move    | (126,140)/(123,135)      | (0,-1)           | 320/320
268   | is Deposit (by Obj 1) | 3     | 8   | Stop    | (121,140)/(121,140)      | (0,0)            | 539/539
405   | Occlusion             | 3     | 145 | Stop    | (121,140)/(120,141)      | (0,0)            | 541/555
405   | Occlusion (by Obj 3)  | 1     | 377 | Move    | (126,140)/(83,109)       | (1,0)            | 840/2422
411   | Removal (by Obj 1)    | 3     | 150 | Removal | (121,140)/(104,132)      | (0,0)            | 541/1451
787   | Appear                | 18    | 8   | Move    | (105,68)/(108,86)        | (0,0)            | 91/91
825   | Exit                  | 1     | 796 | Exit    | (126,140)/(9,230)        | (-10,7)          | 840/247
6.2 Event-based key-frame detection
Key frames best represent the content of a video in an abstract manner. Key-frame abstraction transforms an entire shot into a small number of representative frames. This way, important content is maintained while redundancies
are removed. Key frames based on events are appropriate when on-line alerts
to specific events are required.
In a surveillance environment, key events may be rare or infrequent. The
attention of human operators may thus decrease and significant events may
be missed. Our system identifies events of interest as they occur and human
operators can focus their attention on these events accordingly.
Figs. 2-8 display key frames of indoor and outdoor shots that are detected automatically by the proposed system. Here, only objects performing events are annotated (ID and MBB); key frames of primitive events such as exit are not always shown due to space constraints. As can be seen, the system performs well both in indoor and outdoor shots, even those with heavy noise, multiple objects, and/or local illumination changes. This results from the handling of inaccuracies and false alarms. For example, the system is able to differentiate between deposited, split, and at-obstacle objects.
6.3 Reliability and limitations of the proposed system
As mentioned, the proposed system has been applied to over 10,000 frames of both indoor and outdoor real-world sequences with different levels of difficulty. For example, we used
• frequently-referenced shots (e.g., Fig. 2);
• shots overlaid with heavy white noise (up to 25 dB PSNR) or including
coding artifacts (e.g., Fig. 3);
• shots where we only processed every fifth frame (e.g., Fig. 4);
• shots with multiple occlusion (e.g., Fig. 5, 6 and 7);
• shots with multiple objects of temporally varying shapes and of lengthy
stay (e.g., Fig. 6); and
• shots with illumination changes and shadow (e.g., Fig. 7 and 8).
In our simulations, we have used the same parameter settings throughout, and no significant detection errors of events or objects were found. Some object boundaries deviate, however, by a few pixels, or events are detected with a very short delay of a few frames, without affecting the overall system performance.
Our system proposes approximate but efficient world models to formally define
widely-applicable events. Since objects are monitored and some information of
their history is stored, our system takes into account the effect of time-varying
behavior of objects. Note, however, that a complete definition of a particular
event in a given application depends on details such as camera position, and
context such as importance of an event.
The system has been applied to monitor multiple objects, but it was not
tested in crowded scenes and under sudden global illumination change. Fig. 6
shows, however, how the system successfully detects events in a busy street
scene with very small objects. It successfully handles objects of various sizes,
and also situations when object size changes throughout the video shot (e.g.,
Fig. 6). We do not assume perfect segmentation, but we handle possible segmentation errors at a higher level where more information is available. As the goal of this paper is to detect context-independent video content, no application-dependent information such as camera position was integrated.
Fig. 2. Key frames of the ‘Hall’ shot (300 frames). Key events: O1 enters, O5 enters,
O6 is deposited by O1 .
Fig. 3. Key frames of the ‘Hall’ shot overlaid with 25 dB PSNR noise (300 frames).
Key events: O2 enters, O4 enters, O5 is deposited by O2 .
Fig. 4. Key frames of the ‘Hall’ shot where only every 5th frame is considered. Key events:
O1 enters, O2 enters, O3 is deposited by O1 .
Fig. 5. Key frames of the ‘FloorSurvey’ shot (826 frames). Key events: O1 deposits O3 , O1 occludes O3 , O1 removes O3 .
Fig. 6. Key frames of the ‘Urbicande’ shot (300 frames). Key events: O1 occludes
O6 , O1 occludes O7 , abnormal O1 stays long. (The original frames are rotated by
90◦ to the right to comply with the CIF format.)
Fig. 7. Key frames of the ‘ParkingSurvey’ shot (979 frames). Key events: O1 and
O2 enter, O5 enters, O1 , O2 , and O5 occlude.
Fig. 8. Key frames of the ‘Stair’ shot (1475 frames) with three entrances (front,
back, and stairs). This is a noisy sequence with illumination changes and shadow.
Person A enters the back door, passes the front glass door, exits, returns and exits
from the back door. Person B enters from the stairs, approaches the back door,
passes through the front door, exits, returns and goes up the stairs. Key events: O3 ,
O7 , O9 enter; O9 moves; O9 moves; O11 enters.
7 Conclusion
There are few real-time detection schemes concerning events in video shots.
Many event detection schemes are context-dependent or constrained to narrow
applications. We propose a computational system to automatically and efficiently detect generic video events. Several context-independent events have
been rigorously defined and automatically detected using features following
segmentation, motion estimation and object tracking methods. The proposed
events are sufficiently broad to assist applications such as monitoring the removal of objects, traffic objects, or object behavior. On the other hand, refinements of the proposed events can be easily integrated into our system.
The reliability and real-time (up to 10 frames per second) response of the proposed system has been demonstrated by extensive experimentation on indoor
and outdoor shots containing over 10,000 frames with object occlusion, noise,
and coding artifacts.
In this system, special consideration is given to inaccuracies and false alarms.
For example, the system is able to differentiate between deposited objects,
split objects, and objects at an obstacle. Errors are corrected or compensated
at a higher level where more information is available.
References
Abrams, R., Christ, S., 2003. Motion onset captures attention. Psychological
Science 14, 427–432.
Amer, A., Jan. 2003a. Memory-based spatio-temporal real-time object segmentation. In: Proc. SPIE Int. Conf. on Real-Time Imaging. Vol. 5012.
Santa Clara, California, USA, pp. 10–21.
Amer, A., Jan. 2003b. Voting-based simultaneous tracking of multiple video
objects. In: Proc. SPIE Int. Conf. on Image and Video Communications and
Processing. Vol. 1. Santa Clara, California, USA, pp. 500–511, to appear.
Amer, A., Mitiche, A., Dubois, E., Sep. 2002. Reliable and fast structure-oriented video noise estimation. In: Proc. IEEE Int. Conf. Image Processing.
Vol. 1. Rochester, USA, pp. 840–843.
Benitez, A. B., Chang, S.-F., November 2003. Extraction, description and application of multimedia using MPEG-7. In: Asilomar Conference on Signals, Systems, and Computers, Invited Paper in Special Session on Document Image Processing. Monterey, CA.
Benitez, A. B., Rising, H., Jorgensen, C., Leonardi, R., Bugatti, A., Hasida,
K., Mehrotra, R., Tekalp, M., Ekin, A., Walker, T., September 2002. Semantics of multimedia in MPEG-7. In: IEEE International Conference on Image
Processing (ICIP). Rochester, New York.
Bobick, A., 1997. Movement, activity, and action: the role of knowledge in the
perception of motion. Tech. Rep. 413, M.I.T. Media Laboratory.
Bouthemy, P., Gelgon, M., Ganansia, F., Nov. 1997. A unified approach to
shot change detection and camera motion characterization. Tech. Rep. 1148,
Institut National de Recherche en Informatique et en Automatique.
Bremond, F., Maillot, N., Thonnat, M., Vu, V.-T., May 2004. Ontologies for
video events. Tech. Rep. RR-5189, Institut National de Recherche en Informatique et en Automatique.
Courtney, J., 1997. Automatic video indexing via object motion analysis. Pattern Recognit. 30 (4), 607–625.
Franconeri, S. L., Simons, D. J., 2003. Moving and looming stimuli capture
attention. Perception and Psychophysics 65 (7), 999–1010.
Haering, N., Qian, R., Sezan, I., Sep. 2000. A semantic event detection approach and its application to detecting hunts in wildlife video. IEEE Trans.
Circuits Syst. Video Techn. 10 (6), 857–868.
Haritaoglu, I., Harwood, D., Davis, L. S., Aug. 2000. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Machine
Intell. 22 (8), 809–830.
IEEE PAMI, vol. 22, no. 8, Aug. 2000. Special section on video surveillance.
IEEE Trans. Pattern Anal. Machine Intell.
Jostschulte, K., Amer, A., Oct. 1998. A new cascaded spatio-temporal noise
reduction scheme for interlaced video. In: Proc. IEEE Int. Conf. Image
Processing. Vol. 2. Chicago, USA, pp. 493–497.
Lew, S., Hoang, T.-L., Li, M., Woo, T., Li-Cheong-Man, P., May 2004. Video
processing and transmission over IP for indoor video surveillance applications. Tech. Rep. Final Year Project, ECE, Concordia Univ. of Montréal.
Lippman, A., Vasconcelos, N., Iyengar, G., Nov. 1998. Human interfaces to
video. In: Proc. 32nd Asilomar Conf. on Signals, Systems, and Computers.
Asilomar, CA. Invited paper.
Naphade, M., Mehrotra, R., Ferman, A., Warnick, J., Huang, T., Tekalp, A.,
1998. A high performance algorithm for shot boundary detection using multiple cues. In: Proc. IEEE Int. Conf. Image Processing. Vol. 2. Chicago, IL,
pp. 884–887.
Nothdurft, H. C., 1993. The role of features in preattentive vision: Comparison
of orientation, motion and color cues. Vision Research 33, 1937–1958.
Pentland, A., Jan. 2000. Looking at people: Sensing for ubiquitous and wearable computing. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 107–119.
Stringa, E., Regazzoni, C., Oct. 1998. Content-based retrieval and real time
detection from video sequences acquired by surveillance systems. In: Proc.
IEEE Int. Conf. Image Processing. Chicago, IL, pp. 138–142.
Vasconcelos, N., Lippman, A., Oct. 1997. Towards semantically meaningful
feature spaces for the characterization of video content. In: Proc. IEEE Int.
Conf. Image Processing. Vol. 1. Santa Barbara, CA, pp. 25–28.