Rule-based real-time detection of context-independent events in video shots ?

Aishy Amer a, Eric Dubois b, Amar Mitiche c

a Electrical and Computer Engineering, Concordia University, Montréal, Québec, Canada. Email: [email protected]
b School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada. Email: [email protected]
c INRS-Télécommunications, Université du Québec, Montréal, Québec, Canada. Email: [email protected]

? This work was supported, in part, by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Abstract

The purpose of this paper is to investigate a real-time system to detect context-independent events in video shots. We test the system in video surveillance environments with a fixed camera. We assume that objects have been segmented (not necessarily perfectly) and reason with their low-level features, such as shape, and mid-level features, such as trajectory, to infer events related to moving objects. Our goal is to detect generic events, i.e., events that are independent of the context of where or how they occur. Events are detected based on a formal definition of these events and on approximate but efficient world models. This is done by continually monitoring changes and behavior of features of video objects. When certain conditions are met, events are detected. We classify events into four types: primitive, action, interaction, and composite. Our system includes three interacting video processing layers: enhancement to estimate and reduce additive noise, analysis to segment and track video objects, and interpretation to detect context-independent events. The contributions of this paper are the interpretation of spatio-temporal object features to detect context-independent events in real time, the adaptation to noise, and the correction and compensation of low-level processing errors at higher layers where more information is available. The effectiveness and real-time response of our system are demonstrated by extensive experimentation on indoor and outdoor video shots in the presence of multi-object occlusion, different noise levels, and coding artifacts.

Key words: Event detection, context independence, motion interpretation, occlusion, split, noise adaptation, video surveillance.

Preprint submitted to Elsevier Science, 5 February 2005

1 Introduction

The subject of the majority of video is related to moving objects, in particular people, that perform activities and interact to create events (Pentland (2000); Bobick (1997)). The human visual system is strongly attracted to moving objects creating luminance change (Nothdurft (1993); Franconeri and Simons (2003); Abrams and Christ (2003)). Generally, a video displays multiple objects with their low-level features (e.g., shape), mid-level features (e.g., trajectory), high-level features (e.g., events and meaning), and syntax (i.e., the spatio-temporal relations between objects). High-level object features are generally related to the movement of objects. In this paper, we concentrate on context-independent events with the aim of providing assistance to users of a video system, for example, in video surveillance to detect object removal, or in visual fine arts applications to extract interacting persons.

The significant increase of video data in various domains such as video surveillance requires efficient and automated event detection systems. In this paper, we propose a real-time, unsupervised event detection system.
Our system is based on three video processing layers: video enhancement to reduce noise and artifacts, video analysis to extract low- and mid-level features, and video interpretation to infer events.

The remainder of this paper is organized as follows. Section 2 classifies events into four categories, Section 3 discusses related work, Section 4 gives an overview of our system, Section 5 proposes our approach to detect context-independent events, Section 6 presents experimental results, and Section 7 contains a conclusion.

2 A classification of video events

An event is an interpreted spatio-temporal relation involving one or several objects (Benitez et al. (2002); Benitez and Chang (2003); Bremond et al. (2004)) such as “Object A deposits object B” or “Object A disappears”. In this paper, we look at events that involve variations of object features at two successive time instants of a video shot. (A video sequence usually contains many shots. To facilitate event detection, a video sequence has first to be segmented into shots (see, e.g., Naphade et al. (1998); Bouthemy et al. (1997)). In this paper, the term video refers to a video shot.)

A video event expresses a particular behavior of a finite set of objects in a sequence of a small number of consecutive frames of a video shot. In the simplest case, an event is the appearance of a new object in the scene. In more complex cases, an event starts when a higher-level behavior of objects changes.

An event consists of a context-independent (fixed-meaning) component and a context-dependent component, each associated with a time and a location. For example, the event deposit has a fixed context-independent meaning, “the object is added to the scene”, that is common to all applications, but deposit may have a variable meaning in different contexts such as “the object is added at a specific significant location”. Events are generally applicable when they convey a fixed meaning independently of the context of the input video. A generally applicable event-detection system detects context-independent events. Note that we distinguish here between application independence and context independence. For example, moving fast is context independent as the high-level details of this event do not matter. The threshold to determine which motion is fast is, however, application dependent. Deposit is context independent as it is detected independently of where and what is being deposited. The importance of the position of the deposit is context dependent.

We distinguish between four types of events: primitive (e.g., “Object A moves” or “Object C appears”), action (e.g., “Object A moves too fast”), interaction (e.g., “Object A deposits Object C”), and composite events (e.g., “Object A stands”; note that the event stand may be concluded following the events move, slow motion, stop). Primitive events describe basic object behavior while action, interaction, and composite events describe non-basic behavior of objects. A primitive event is defined directly from the temporal features of a moving object, i.e., a “simple” change of the object motion state. Primitive events represent the finest granularity of events. Action, interaction, and composite events are related to “non-simple” (“complex”) behavior of objects. An action event is related to one moving object achieving a task and an interaction event is related to two or more objects.
A composite event may consist of a combination of 1) several primitive events involving possibly multiple objects or 2) primitive, action, and interaction events. Composite events directly depend on the goals of the underlying application.

3 Related work

Research in the area of detecting high-level video content such as events, especially related to people, has become a central topic in video processing (Bobick (1997); Lippman et al. (1998); Pentland (2000)). In recent years, research interest in video content detection has shifted from motion and tracking toward detection and recognition of activities and events. Most event detection systems are developed for smaller-scope applications, for example, hand-sign applications or smart-camera-based cooking (IEEE PAMI (2000); Lippman et al. (1998); Vasconcelos and Lippman (1997); Bobick (1997)). In these systems, prior knowledge is inserted in the event recognition inference system and the focus is on recognition and logical formulation of events and actions. Little work on context-independent detection exists and such systems are rarely tested in real applications, e.g., in the presence of illumination changes, noise, or multiple objects with occlusion.

The system of Courtney (1997) is based on prediction and nearest-neighbor matching. The system is able to detect events such as deposit. It can operate in simple environments where one human is tracked under the assumption of translational motion. It is limited to applications of indoor environments, does not address occlusion, and is sensitive to noise. Moreover, the definition of events is limited to a few environments.

The event detection system for indoor surveillance applications of Stringa and Regazzoni (1998) consists of object detection and event detection modules. The event detection module classifies objects using a neural network. The classification includes: abandoned object, person, and object. The system is limited to one abandoned-object event in unattended environments. The definition of abandoned object is specific to the given application. The system cannot associate abandoned objects with the person who deposited them.

The system of Haering et al. (2000) is designed for off-line processing applications and uses domain knowledge to facilitate detection of events (wildlife hunt events). The system of Haritaoglu et al. (2000) tracks several people simultaneously and uses appearance-based models to identify people. It determines whether a person is carrying an object, and can segment the object from the person. It also tracks body parts such as the head or hands. The system imposes, however, restrictions on the object movements. Objects are assumed to move upright and with little to no occlusion. Moreover, it can only detect a limited set of events.

This paper contributes a real-time solution to detect events related to one or several objects in the presence of challenges such as noise and occlusion. The solution is based on plausibility rules where no prior knowledge, initialization, or object models are assumed. This paper contributes, furthermore, a solution to explicitly account for changing object features, e.g., due to object occlusion, splitting, or deformation. This is done by adaptation to video noise, by explicitly detecting object occlusion and erroneous object segmentation, and by monitoring features over time. In addition, we introduce plausibility rules to handle errors and confirm temporal reliability and consistency of object matching.
The proposed system is designed for real-time event-oriented applications such as video surveillance.

4 Overview of the proposed system

It is widely agreed that modules of a layered video system cannot be optimized independently of each other. In our system, we study video processing modules with emphasis on their integration into a layered system. To balance effectiveness and efficiency, the proposed system aims 1) at stable modules that forego the need for feature precision, 2) at context independence, 3) at error correction by higher modules where more information is available, 4) at real environments including multiple objects, occlusions, and artifacts, and 5) at low computational cost. Stability aims at handling inaccuracies and false alarms, such as declaring a split rather than a deposit event. To achieve the above goals, our layered system is divided into simple but effective tasks avoiding complex operations, and involves three interacting processing layers: video enhancement (i.e., noise estimation, Amer et al. (2002), and noise reduction, Jostschulte and Amer (1998)), video analysis (i.e., object segmentation and tracking, Amer (2003b,a)), and video interpretation (i.e., event detection as described in this paper).

In this paper, events are defined by approximate but efficient world models. This is done by continually monitoring changes and behavior of low-level features of the scene's objects. When certain conditions are met, events are detected. To achieve higher applicability, events are detected independently of the context of the video. Several context-independent events are rigorously defined and automatically detected using features detected following video analysis (i.e., object segmentation, motion estimation, and tracking). We do not, however, assume perfect video analysis. Possible errors of the video analysis are subsequently corrected or compensated at higher modules.

Fig. 1 displays a block diagram of the proposed system, which can be viewed as a system of methods and algorithms to build automated scene interpretation. Besides applications such as video surveillance, outputs of the proposed system can serve a video understanding or a symbolic reasoning system.

Fig. 1. Block diagram of the proposed system: video enhancement (noise estimation and/or noise reduction), video analysis (motion-based object segmentation, object-based motion estimation, and voting-based object tracking, i.e., pixels to objects to video objects), and video interpretation (analysis and interpretation of low-level features, event detection and classification). σn represents the estimated noise standard deviation.

“Requests” depend on the specific design of an event-based surveillance system. For instance, a request can be “online” streaming of events to a control site or “offline” storage of events in a given format. At two successive time instants, the system outputs a list of objects with their low- and mid-level features and related events. Information is passed to the next frame through a linked list of objects and their features which are set or updated throughout the layered system. Features include identity (ID), age (the number of frames since the object occurred), perimeter, area, minimum bounding box (MBB), start position, split data (e.g., current status and ID of the object that it was split from), occlusion data (e.g., status and ID of the object that occludes, or is occluded by, this object), recent motion, corresponding confidence, a link to the corresponding object, and related events.
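To make this per-object record concrete, the following is a minimal sketch of such a linked feature record in Python. The class and field names are our own illustrative choices (the original system was implemented in C), not the authors' data structure.

```python
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class ObjectRecord:
    """Per-object feature record passed from frame to frame (illustrative sketch)."""
    obj_id: int                                      # identity (ID)
    age: int = 0                                     # frames since the object appeared/entered
    perimeter: float = 0.0
    area: float = 0.0
    mbb: Tuple[int, int, int, int] = (0, 0, 0, 0)    # (r_min, r_max, c_min, c_max)
    start_position: Tuple[int, int] = (0, 0)         # centroid at first observation
    split_from: Optional[int] = None                 # ID of the object this one split from
    occluded_with: Optional[int] = None              # ID of occluding/occluded partner
    recent_motion: List[Tuple[float, float]] = field(default_factory=list)  # last m displacements
    confidence: float = 1.0
    prev: Optional["ObjectRecord"] = None            # link to the corresponding object in I(n-1)
    events: List[str] = field(default_factory=list)  # events detected for this object
```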
5 Detection of context-independent events

In this section, we describe in detail the video interpretation module (Fig. 1) that detects events based on a formal definition of these events and on approximate but efficient world models. We propose perceptual descriptions of events that can be used and easily extended in a wide range of surveillance applications. For generality, event detection is not based on the geometry of objects but on their features and relations over time.

The video analysis module (Fig. 1) represents objects by their spatio-temporal low-level features in temporally linked lists. Object features are stored for each frame and compared as frames of a shot arrive. To detect events, the video interpretation module simultaneously monitors the behavior and features of each object in the scene. Events are automatically detected by combining trajectory information and spatial features, such as size and location. When specific conditions are met, events related to these conditions are detected. Behavior monitoring is done on-line, i.e., object data is analyzed as it arrives and events are detected as they occur.

In the following, define:

• $I(n)$, $I(n-1)$, $I(q)$, frames of the input shot at time instants $n$, $n-1$, or $q$;
• $O_i^c$, $O_j^c$, segmented objects in the current frame $I(n)$;
• $O_k^p$, $O_l^p$, segmented objects in the previous frame $I(n-1)$;
• $M_{li}: O_l^p \to O_i^c$, a function assigning $O_l^p \in I(n-1)$ to an $O_i^c \in I(n)$;
• $M_{0i}: \emptyset \to O_i^c$, a zero-match function matching an $O_i^c \in I(n)$ to no object in $I(n-1)$;
• $M_{l0}: O_l^p \to \emptyset$, a zero-match function matching an $O_l^p \in I(n-1)$ to no object in $I(n)$;
• $B_i$, the MBB of $O_i^c$;
• $g_i$, the age of $O_i^c$, i.e., the number of frames since its appearance or entrance;
• $e_i = (e_{xi}, e_{yi})$, the centroid of $O_i^c$;
• $d_{ij}$, the distance between the centroids of $O_i^c$ and $O_j^c$;
• $r_{\min}$, the upper row of the MBB of an object;
• $r_{\max}$, the lower row of the MBB of an object;
• $c_{\min}$, the left column of the MBB of an object; and
• $c_{\max}$, the right column of the MBB of an object.

The following are samples of generic events detected automatically by our system following the classification in Section 2. Note that the thresholds used in these sample events are adaptive as described in Section 5.6.

5.1 Primitive events

Primitive events are related to “simple” behavior of objects. In the following we define the events enter, appear, exit, disappear, move, and stop.

$O_i^c$ enters at time instant $n$ if 1) $O_i^c \in I(n)$, 2) $O_i^c \notin I(n-1)$, i.e., zero match $M_{0i}: \emptyset \to O_i^c$, and 3) $e_i$ is at the border of $I(n)$ (this condition depends on the object motion and the frame rate). Our definition aims at detecting object entrance as soon as an object portion becomes visible. In some applications, only entering objects of specific size, motion, or age are of interest. This can be easily integrated into our definition.

$O_i^c$ appears at time instant $n$ if 1) $O_i^c \in I(n)$, 2) $O_i^c \notin I(n-1)$, i.e., zero match in $I(n-1)$: $M_{0i}: \emptyset \to O_i^c$, and 3) $e_i$ is not at the border of $I(n)$. Note that an object can either enter or appear at a time instant $n$.
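As an illustration, a minimal sketch of the enter/appear test in Python. The helper name and the border margin are illustrative assumptions; the paper only states that the border condition depends on object motion and frame rate.

```python
def classify_new_object(centroid, frame_shape, has_prev_match, border_margin=8):
    """Return 'enter', 'appear', or None for an object in I(n).

    centroid:        (x, y) object centroid e_i
    frame_shape:     (height, width) of I(n)
    has_prev_match:  True if the object has a corresponding object in I(n-1)
    border_margin:   illustrative pixel margin defining 'at the border'
    """
    if has_prev_match:          # not a zero match M_0i, so neither enter nor appear
        return None
    h, w = frame_shape
    x, y = centroid
    at_border = (x < border_margin or y < border_margin or
                 x >= w - border_margin or y >= h - border_margin)
    return "enter" if at_border else "appear"
```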
$O_k^p$ exits at time instant $n$ if 1) $O_k^p \in I(n-1)$, 2) $O_k^p \notin I(n)$, i.e., zero match in $I(n)$: $M_{k0}: O_k^p \to \emptyset$, 3) $e_k$ is at the border of $I(n-1)$, and 4) $g_k > t_g$, where $g_k$ is the age of $O_k^p$ and $t_g$ is the age threshold (Sec. 5.6).

$O_k^p$ disappears at time instant $n$ if 1) $O_k^p \in I(n-1)$, 2) $O_k^p \notin I(n)$, i.e., zero match $M_{k0}: O_k^p \to \emptyset$ in $I(n)$, 3) $e_k$ is not at the border of $I(n-1)$, and 4) $g_k > t_g$.

$O_i^c$ moves at time instant $n$ if 1) $M_{ki}: O_k^p \to O_i^c$ and 2) $\mathrm{MED}(|w_i^c \in I(n-q)|) > t_s$ with $0 \le q \le m$, where $\mathrm{MED}(\cdot)$ is a median filter. Condition 2) means that the median of the motion magnitudes of $O_i^c \in I(n)$ in the last $m$ frames is larger than the motion threshold $t_s$. $m$ is the filter order (Sec. 5.6). Note that there is no delay in detecting this event because motion data at previous frames are available.

$O_i^c$ stops at time instant $n$ if 1) $M_{ki}: O_k^p \to O_i^c$ and 2) $\mathrm{MED}(|w_i^c \in I(n-q)|) \le t_s$ with $0 \le q \le m$.

5.2 Action events

An action event is related to “non-simple” (“complex”) behavior of a single object. In this paper, an action is detected when an object moves abnormally or when the object is dominant. $O_i^c$ moves abnormally if its movement is either frequent (i.e., fast motion or short stay) or rare (i.e., slow motion or long stay).

$O_i^c$ stays long in the scene if 1) $g_i > t_g$, i.e., $O_i^c$ does not exit the scene after a time $t_g$, and 2) $d_i < t_{sl}$, i.e., the distance $d_i$ between the current $I(n)$ and a past $I(q)$ position of $O_i^c$ ($q < n$) is less than a threshold $t_{sl}$ (Sec. 5.6).

$O_i^c$ moves too fast (or moves too slowly) if the object speed in the last $m$ frames is larger (smaller) than a threshold $t_{mf}$ ($t_{ms}$) (Sec. 5.6).

$O_i^c$ is dominant if it performs a “significant” predefined event, for example, if it has the largest speed, age, or size at time instant $n$.

Note that for a complete description of action events, some details of the underlying application are required. For example, the event moving too fast requires details about the camera position to be complete.

5.3 Interaction events

An interaction event is related to “non-simple” behavior of two or more objects. In this paper, we define the events occlude/occluded, expose/exposed, remove/removed, deposit/deposited, split, and at an obstacle.

Object occlusion can result from either real object occlusion (i.e., objects physically get connected) or segmentation errors (i.e., object images overlap although the objects are distant in world coordinates). In both cases, the object segmentation outputs one region consisting of the occluding and occluded objects.

$O_i^c$ occludes $O_j^c$ if Eq. 1 (see Section 5.5) is valid and if the size (area) of $O_i^c$ is larger than that of $O_j^c$. In occlusion, at least two objects are involved. The object with the larger area is defined as the occluding object, the other as the occluded object; exposure is detected when occlusion ends.

$O_k^p$ removes $O_l^p$ if 1) $O_k^p$ and $O_l^p$ were occluded in $I(n-1)$, 2) $O_l^p \notin I(n)$, i.e., zero match $M_{l0}: O_l^p \to \emptyset$, and 3) the area $A_l$ of $O_l^p$ is smaller than that of $O_k^p$, i.e., $A_l/A_k < t_a$, $t_a$ being the size threshold (Sec. 5.6). Note that removal is detected after occlusion. When occlusion is detected the system predicts the occluded objects. In case of removal, the features of the removed object can change significantly and the proposed system may not be able to follow the removed object. Conditions for removal are checked and if they are met, removal is declared. The object with the larger area is the remover, the other is the removed object.
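A minimal sketch of the removal test, assuming occlusion in the previous frame has already been flagged and the areas from I(n-1) are available. The function and parameter names are illustrative; $t_a$ is the size threshold of Sec. 5.6.

```python
def is_removed(was_occluded_together, has_match_in_current,
               area_removed, area_remover, t_a=0.4):
    """Declare a remove event for object O_l (candidate removed object).

    was_occluded_together: O_k and O_l were labeled as occluded in I(n-1)
    has_match_in_current:  O_l has a corresponding object in I(n) (zero match if False)
    area_removed, area_remover: areas A_l and A_k measured in I(n-1)
    t_a: size-ratio threshold (typical values 0.25-0.50, Sec. 5.6)
    """
    return (was_occluded_together
            and not has_match_in_current
            and area_remover > 0
            and area_removed / area_remover < t_a)
```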
$O_i^c$ deposits $O_j^c$ if 1) $O_k^p \in I(n-1)$, $O_i^c, O_j^c \in I(n)$, and $M_{ki}: O_k^p \to O_i^c$, 2) $O_j^c \notin I(n-1)$, i.e., no match $M_{0j}: \emptyset \to O_j^c$, 3) $A_j/A_i < t_a$, $t_a$ being the size threshold, 4) $[A_i + A_j \approx A_k] \wedge [(H_i + H_j \approx H_k) \vee (W_i + W_j \approx W_k)]$, where $A_i$, $H_i$, and $W_i$ are the area, height, and width of $O_i^c$, 5) $O_i^c$ changes in height or width between $I(n-1)$ and $I(n)$ at the MBB side $s$, and 6) $O_j^c$ is close to a side $s$ of the MBB of $O_i^c$, $s \in \{r_{\min_i}, r_{\max_i}, c_{\min_i}, c_{\max_i}\}$. Define $d_{js}$ as the distance between the MBB side $s$ and $O_j^c$. $O_j^c$ is close to $s$ if $1 < d_{js} < t_{js}$ with the closeness threshold $t_{js}$ (Sec. 5.6).

Deposit is considered only if the distance between the deposited object and the depositor is large (i.e., the depositor moves away from the deposited object). Otherwise $O_j^c$ is assumed to have split from $O_i^c$ and is merged back into $O_i^c$. To reduce false alarms, deposit is declared only if the deposited object remains in the scene for some time. Thus the system differentiates between deposit and segmentation errors. It can also differentiate between stopping (e.g., a seated person) and deposited objects.

$O_k^p$ splits into $O_{i1}^c$ and $O_{i2}^c$ if Eq. 2 is valid (see Section 5.5). An object split can be real (in the case of object deposit) or due to object segmentation errors. The main difference between deposit and split is that a split object is close to the splitter, while a depositor moves away from the deposited object and they become widely separated.

Often, objects move close to static background objects (obstacles) which may occlude part of the moving objects. This is particularly frequent in traffic scenes where objects move close to road signs. An object segmentation is not able in such cases to detect pixels occluded by an obstacle. A moving object is then split into two or more regions. This is different from object split because, rather than abrupt, gradual change of object features occurs. The proposed system detects objects at obstacles by monitoring the features of each object, $O_i^c$, in the scene. If a gradual decrease, or increase, of a feature of $O_i^c$ is detected, $O_i^c$ is labeled accordingly.

$O_l^p$ is at an obstacle (1) if 1) $O_l^p, O_k^p \in I(n-1)$ have appeared or entered, 2) $O_l^p$ has no corresponding object in $I(n)$, i.e., zero match in $I(n)$ with $M_{l0}: O_l^p \to \emptyset$, 3) $A_l$ (area) was monotonically decreasing in the last $m$ frames, 4) $O_l^p$ and $O_k^p$ are somehow similar, i.e., $M_{lk}: O_l^p \to O_k^p$, 5) the motion direction of $O_l^p$ does not change if matched to $O_i^c \in I(n)$, and 6) $O_l^p$ is close to $O_k^p$, where $A_k$ was monotonically increasing in the last $m$ frames and $O_k^p$ has a corresponding object, i.e., $M_{ki}: O_k^p \to O_i^c$.

(1) While the depicted transition figures for at an obstacle show two objects, the original object gets its ID back when motion at the obstacle is detected.
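A compact sketch of the deposit test on MBB features, assuming conditions 1) and 5) (matching of the depositor and the change of its MBB at side s) are verified by the caller. The tolerance used for the “≈” comparisons and all names are illustrative assumptions; the paper does not specify how the approximate equalities are evaluated.

```python
def approx(a, b, rel_tol=0.15):
    """Illustrative approximate-equality test for condition 4."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-6)

def is_deposit(depositor, deposited, predecessor, t_a=0.4, t_js=18):
    """Deposit test for O_i (depositor), O_j (deposited), O_k (predecessor in I(n-1)).

    Each argument is a dict with keys 'area', 'height', 'width',
    'new' (True if no match in I(n-1)), and 'dist_to_side'
    (distance of O_j to the changing MBB side s of O_i).
    """
    if not deposited["new"]:                                # condition 2: zero match M_0j
        return False
    if deposited["area"] / depositor["area"] >= t_a:        # condition 3: A_j / A_i < t_a
        return False
    sizes_add_up = (approx(depositor["area"] + deposited["area"], predecessor["area"])
                    and (approx(depositor["height"] + deposited["height"], predecessor["height"])
                         or approx(depositor["width"] + deposited["width"], predecessor["width"])))
    if not sizes_add_up:                                    # condition 4
        return False
    return 1 < deposited["dist_to_side"] < t_js             # condition 6: close to side s
```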
5.4 Composite events

A composite event is related to two or more primitive, action, or interaction events. We propose here the following perceptual definitions of the composite events stand, sit, walk, lost, found, approach a restricted site, change direction, change speed, and special movement. Our main goal is to show the extensibility of the proposed system.

The events stand and sit are characterized by continual change in the height and width of the object MBB. $O_i^c$ stands if 1) its motion in the last $m$ frames is very small and 2) it experiences a continual strong increase in its MBB height and a continual slight decrease in its MBB width. $O_i^c$ sits if it experiences a continual increase of the width and decrease of the height of its MBB. In both events, height and width need to be compared to their values at the time instant before they started to change.

$O_i^c$ walks if it experiences a continual moderate movement over a significant period of time (i.e., for a given number of frames). When objects walk they experience motion without an abrupt change in their MBB.

$O_i^c$ is lost if it has no corresponding object in the current frame and occlusion was previously reported but not removal. It is similar to the event disappear. Some applications require the search for lost objects even if they are not in the scene. To allow the system to find lost objects, features, such as ID, size, or motion, of lost objects need to be stored for future reference. If the event lost is detected and a new object appears in the scene which shows features similar to the lost object, the objects can be matched and the event found declared.

The event approaches a restricted site is straightforward to detect when the location of the restricted site is predefined. In this case, the direction of the object's motion and its distance to the site can be monitored to detect the event.

Based on a registered motion direction, the motion direction in the last $m$ frames previous to $I(n)$ is compared with the registered motion direction. If the current motion direction deviates from the motion direction in each of the $m$ frames, the event changes direction can be declared. Similarly, the event changes speed can be detected.

Often, a scene contains events which have never occurred before or occur rarely. This is application dependent. The event special movement can be defined as a chain of simple events, for example, enters, moves through the scene, and disappears.
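A rough sketch of how the stand/sit rules could be monitored over the last m frames. The strict monotonicity test and the reuse of the motion threshold $t_s$ are our illustrative assumptions, not the paper's exact criteria.

```python
def mbb_trend(values):
    """Sign of the overall trend of MBB heights/widths (+1, -1, or 0).
    A simple monotonicity check; the paper does not prescribe a particular test."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d > 0 for d in diffs):
        return 1
    if all(d < 0 for d in diffs):
        return -1
    return 0

def stand_or_sit(heights, widths, motions, t_s=1.0):
    """Classify a composite stand/sit event from the last m frames.

    heights, widths: MBB heights/widths over the last m frames (oldest first)
    motions:         motion magnitudes over the same frames
    t_s:             motion threshold (small motion required for stand)
    """
    small_motion = max(motions) <= t_s
    if small_motion and mbb_trend(heights) > 0 and mbb_trend(widths) < 0:
        return "stand"
    if mbb_trend(widths) > 0 and mbb_trend(heights) < 0:
        return "sit"
    return None
```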
5.5 Monitoring the video analysis

A good event detection technique should account for errors of previous steps. Object segmentation, for example, is likely to output erroneous object masks and features. This system accounts for the correction of two major errors, object occlusion and split, based on plausibility rules and prediction strategies. To monitor occlusion or split, our system monitors changes at all sides of the object bounding box, both horizontal and vertical displacements, and uses the distance between objects. The corrected new object information is then used to update the previous analysis steps on the previous frame so that the correction is done between two successive time instants.

5.5.1 Monitoring object occlusion

Occlusion may occur when two objects physically meet or when their images overlap while the objects are distant. This means that the minimum bounding boxes of the two objects overlap in the frame. When objects get connected, at least one of the object's four MBB sides will experience a significantly large displacement. Define:

• $O_i^c \in I(n)$, the occlusion of $O_k^p \in I(n-1)$ and $O_l^p \in I(n-1)$ in $I(n)$, where $M_{ki}: O_k^p \to O_i^c$;
• $d_{kl}$, the distance between the centroids of $O_k^p$ and $O_l^p$;
• $w_k^p = (w_x, w_y)$, the current displacement of $O_k^p$, i.e., between $I(n-2)$ and $I(n-1)$;
• $w_{vl}$, the vertical displacement of the lower MBB row of $O_k^p$;
• $w_{vu}$, the vertical displacement of the upper MBB row of $O_k^p$;
• $w_{hr}$, the horizontal displacement of the right MBB column of $O_k^p$;
• $w_{hl}$, the horizontal displacement of the left MBB column of $O_k^p$; and
• $t_w$, $t_d$, the displacement and distance thresholds (inversely proportional to the frame rate $R$, see Sec. 5.6).

Object occlusion is detected (Eq. 1) if 1) the current displacement and the displacement of one of the four sides of the MBB of $O_k^p$ significantly differ, 2) the displacement of that MBB side is outward, and 3) there exists an object $O_l^p \in I(n-1)$ close to $O_k^p$:

$$
[(|w_x - w_{hr}| > t_w) \wedge (w_{hr} > 0) \wedge (d_{kl} < t_d)] \vee
[(|w_x - w_{hl}| > t_w) \wedge (w_{hl} > 0) \wedge (d_{kl} < t_d)] \vee
[(|w_y - w_{vl}| > t_w) \wedge (w_{vl} > 0) \wedge (d_{kl} < t_d)] \vee
[(|w_y - w_{vu}| > t_w) \wedge (w_{vu} > 0) \wedge (d_{kl} < t_d)]. \tag{1}
$$

If occlusion is detected then all (occluding and occluded) objects are labeled. This labeling allows the system to continue monitoring objects in subsequent frames even if they become completely occluded (i.e., the occlusion conditions in Eq. 1 are no longer met). This is important since objects might reappear. Note that the labeling is important to help detect occlusion even if the occlusion conditions in Eq. 1 are not met.

If occlusion is detected, the occlusion region $O_i^c$ is split into different objects (two or more, depending on how many objects got connected). Splitting is done by predicting the related objects of $I(n-1)$ onto $I(n)$ based on their estimated displacements. After prediction, the lists of objects and their features in $I(n)$ and $I(n-1)$ are updated, for example, by adding $O_l^p$ to $I(n-1)$.

5.5.2 Monitoring object split

A segmentation method may divide an object into several regions. Note that when objects split, at least one of the object's four MBB sides will experience a significantly large displacement. Assume $O_k^p \in I(n-1)$ is erroneously split in $I(n)$ into $O_{i1}^c$ and $O_{i2}^c$ and $M_{ki1}: O_k^p \to O_{i1}^c$. Define:

• $d_{i12}$, the distance between $O_{i1}^c$ and $O_{i2}^c$; and
• $w = (w_x, w_y)$, the current displacement of $O_k^p$ between $I(n-2)$ and $I(n-1)$
(recall the displacement threshold $t_w$ and the distance threshold $t_d$).

An object split is declared (Eq. 2) if 1) the current displacement and the displacement of one of the four sides of the MBB of $O_k^p$ significantly differ, 2) the displacement of that MBB side is inward, and 3) there is an object $O_{i2}^c$ close to $O_{i1}^c$ in $I(n)$:

$$
[(|w_x - w_{hr}| > t_w) \wedge (w_{hr} < 0) \wedge (d_{i12} < t_d)] \vee
[(|w_x - w_{hl}| > t_w) \wedge (w_{hl} < 0) \wedge (d_{i12} < t_d)] \vee
[(|w_y - w_{vl}| > t_w) \wedge (w_{vl} < 0) \wedge (d_{i12} < t_d)] \vee
[(|w_y - w_{vu}| > t_w) \wedge (w_{vu} < 0) \wedge (d_{i12} < t_d)]. \tag{2}
$$

If splitting is detected, then the two regions $O_{i1}^c$ and $O_{i2}^c$ are merged into one object $O_i^c$ so as to reduce the total number of regions to process and thus improve algorithm performance.

5.5.3 Temporal consistency

We have applied the following plausibility adaptations that aim at temporal consistency of the event detection:

• Objects are followed once they enter and also during occlusion.
• Two objects are matched only if the estimated motion direction is consistent with those in previous frames.
• If, during the matching (e.g., after merging), a better correspondence than a previous one is found, the matching is revised.
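A minimal sketch of the MBB-side displacement tests of Eqs. 1 and 2, assuming the per-side displacements and thresholds defined above are available; the dictionary layout and function name are illustrative.

```python
def mbb_side_event(w_obj, side_disp, dist_to_neighbor, t_w, t_d, outward=True):
    """Evaluate the disjunction of Eq. 1 (outward=True, occlusion) or Eq. 2 (outward=False, split).

    w_obj:            (w_x, w_y), current displacement of O_k between I(n-2) and I(n-1)
    side_disp:        dict of displacements of the four MBB sides:
                      'hr' (right column), 'hl' (left column), 'vl' (lower row), 'vu' (upper row)
    dist_to_neighbor: d_kl for occlusion, d_i12 for split
    t_w, t_d:         displacement and distance thresholds (Sec. 5.6)
    """
    w_x, w_y = w_obj
    terms = [(w_x, side_disp['hr']), (w_x, side_disp['hl']),
             (w_y, side_disp['vl']), (w_y, side_disp['vu'])]
    for w_ref, w_side in terms:
        differs = abs(w_ref - w_side) > t_w
        direction_ok = (w_side > 0) if outward else (w_side < 0)
        if differs and direction_ok and dist_to_neighbor < t_d:
            return True
    return False

# Usage: occlusion if mbb_side_event(w, disp, d_kl, t_w, t_d, outward=True)
#        split     if mbb_side_event(w, disp, d_i12, t_w, t_d, outward=False)
```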
5.6 Selection of thresholds

Due to various distortions such as noise, feature inaccuracies are possible. The proposed system uses thresholds to compensate for such distortions. We adapt most of these thresholds to parameters of the input video such as the frame rate $R$, the frame size $F$, and the object size $A$. For example, if objects are large, then a larger error can be tolerated than when they are small. This adaptation allows better distinction at smaller sizes and stronger matching at larger sizes, and thus keeps the thresholds useful for the entire sequence even when the size of an object varies.

The proposed system requires the specification of the following thresholds:

• $t_g$: the age (temporal) threshold with typical values between 10 and 25 (e.g., see the exit event). $t_g$ is a function of the frame rate and the minimal allowable object speed.
• $t_s$: the motion threshold with typical values between 0 and 2 (e.g., see the move event).
• $t_{sl}$: the stay-for-long (temporal) threshold, a function of the frame rate $R$, the object motion $w$, and the frame size $F$.
• $t_{mf}$: the move-fast (temporal) threshold with typical values between 8 and 10. $t_{mf}$ is inversely proportional to $R$. Due to the projective nature of video shots these thresholds should depend on the position of the camera (but the camera position is currently not integrated in our system).
• $t_{ms}$: the move-slow (temporal) threshold with typical values between 2 and 4.
• $t_w$: the displacement (temporal) threshold, which is inversely proportional to $R$ (e.g., see the occlusion event).
• $t_d$: the distance (temporal) threshold, which is inversely proportional to $R$ (e.g., see the split event).
• $t_a$: the size (spatial) threshold with typical values between 0.25 and 0.50 (e.g., see the remove event).
• $t_{js}$: the closeness (spatial) threshold with typical values between 15 and 20 pixels (e.g., see the deposit event).
• $m$: the order threshold (number of previous frames) with typical values between 3 and 5 (e.g., see the at-an-obstacle event). $m$ depends on $A$ and on the position of the camera.

To adapt a “spatial” threshold $t_{sa}$ from the list above, we apply Eq. 3, which aims at better distinction at small object and frame sizes and stronger matching at large sizes:

$$
t_{sa} =
\begin{cases}
t_{sa_{\min}} & : A \le A_{\min} \\
t_{sa_{\min}} + (t_{sa_{\max}} - t_{sa_{\min}})\,\dfrac{A}{A_{\max}}\,\dfrac{F}{F_{\max}} & : A_{\min} < A \le A_{\max} \\
t_{sa_{\max}} & : A > A_{\max}
\end{cases}
\tag{3}
$$

$$
A_{\max} = a \cdot F, \quad A_{\min} = b \cdot F, \quad b = 0.0001 \cdot a.
$$

$A$ represents the object size, $A_{\min}$ ($A_{\max}$) the minimum (maximum) size, $F$ the frame size, and $F_{\max}$ the maximum (e.g., HDTV) frame size. Typical values for $a$ are between 0.1 and 0.25. $t_{sa_{\min}}$ and $t_{sa_{\max}}$ were set experimentally.

To adapt a “temporal” threshold $t_{ta}$ to the frame rate $R$, we applied Eq. 4

$$
t_{ta} = \alpha \frac{1}{R}
\tag{4}
$$

with the value of $\alpha$ between 190 and 300. Eq. 4 keeps our system effective when objects are more distant between frames, i.e., at low $R$.
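The adaptation of Eqs. 3 and 4 can be sketched as follows. The middle case of Eq. 3 follows the reconstruction above, and the default constants (e.g., the HDTV reference frame size) are only illustrative.

```python
def adapt_spatial_threshold(A, F, t_min, t_max, a=0.2, F_max=1920 * 1080):
    """Spatial threshold adaptation of Eq. 3.

    A:     object size (pixels); F: frame size (pixels)
    F_max: reference maximum (e.g., HDTV) frame size -- illustrative default
    t_min, t_max: experimentally set bounds t_sa_min and t_sa_max
    """
    A_max = a * F
    A_min = 0.0001 * a * F          # b = 0.0001 * a
    if A <= A_min:
        return t_min
    if A > A_max:
        return t_max
    return t_min + (t_max - t_min) * (A / A_max) * (F / F_max)

def adapt_temporal_threshold(R, alpha=250):
    """Temporal threshold adaptation of Eq. 4: t_ta = alpha / R (alpha between 190 and 300)."""
    return alpha / R
```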
6 Results and applications

To show the performance of our system, we report on experiments on various video shots, containing over 10,000 frames with multi-object occlusion, noise, and artifacts, of indoor and outdoor real environments with a fixed video camera. The main difficulties in outdoor shots are illumination changes, occlusion, and shadows. We used both test sequences commonly referenced in related work and others captured in typical surveillance environments (using different video cameras such as a Web-Cam, Lew et al. (2004)). The sequences used include multiple (at least two) objects (people and vehicles) moving in different environments. Most scenes include changes in illumination, the target objects have various features such as speed and shape, and most have shadowed regions.

The proposed system, implemented in C, provides a response in real time for video sequences at a rate of up to 10 frames per second on a multitasking SUN Ultra-SPARC 360 MHz without specialized hardware. Computational cost is measured in seconds and given per CIF (352x288) frame. No attention was paid to optimizing the software. The computational time of the proposed three-layer system depends on the size of the involved objects, with an average of 0.14 seconds per frame and a worst case of 0.35 seconds. The worst case is a result of the separation of large occluded objects, where the tracking processing time rises significantly from 0.001 to 0.2 seconds per frame. We are currently investigating a more efficient implementation of the separation of objects to increase the performance of the proposed system to 30 frames per second.

In the following two sections, we show the reliability of the proposed system using sample results for event-based video summarization and key-frame extraction. The system's reliability is due to 1) the spatio-temporal threshold adaptation and plausibility rules for temporal reliability and consistency and 2) the explicit detection of object occlusion and segmentation errors.

6.1 Event-based video summary

The system is applied here to summarize video shots based on “significant” events. For a given application, the selection of “significant” events can be easily integrated into our system. In some applications, for instance, a detailed description of the occlusion process is needed, e.g., listing all frames during occlusion. Tables 1-3 show samples of video shot summaries.

Table 1. Summary of the shot ‘Stair’ (frames 1-1475).

Frame | Event     | ObjID | Age | Status    | Position start/present | Motion present | Size start/present
172   | Enter     | 2     | 8   | Move      | (312,248)/(308,230)    | (-2,-2)        | 5746/5746
216   | Enter     | 3     | 8   | Move      | (184,186)/(167,169)    | (-2,-2)        | 7587/7587
234   | Exit      | 2     | 69  | Exit      | (312,248)/(337,282)    | (3,16)         | 11803/180
479   | Appear    | 4     | 8   | Stop      | (128,104)/(127,100)    | (0,0)          | 211/211
547   | Disappear | 3     | 338 | Disappear | (184,186)/(114,67)     | (0,-1)         | 6374/137
608   | Enter     | 7     | 8   | Move      | (120,88)/(125,79)      | (1,1)          | 2536/2536
678   | Enter     | 8     | 8   | Move      | (138,85)/(131,85)      | (-1,0)         | 5308/5308
803   | Exit      | 7     | 202 | Exit      | (120,88)/(337,282)     | (3,20)         | 3432/199
812   | Disappear | 8     | 141 | Disappear | (138,85)/(121,92)      | (0,0)          | 4334/73
916   | Enter     | 9     | 8   | Move      | (11,151)/(23,159)      | (3,0)          | 2866/2866
1179  | Exit      | 9     | 270 | Exit      | (11,151)/(127,72)      | (0,0)          | 4955/1388
1290  | Enter     | 16    | 8   | Stop      | (123,72)/(128,77)      | (0,0)          | 2737/2737
1363  | Exit      | 16    | 80  | Exit      | (123,72)/(83,154)      | (-5,1)         | 3844/15910

Table 2. Summary of the shot ‘Highway’ (frames 1-300).

Frame | Event                | ObjID | Age | Status    | Position start/present | Motion present | Size start/present
8     | Appear               | 1     | 8   | Move      | (306,216)/(272,167)    | (-5,-5)        | 1407/1407
11    | Enter                | 2     | 8   | Move      | (344,205)/(340,199)    | (-1,-1)        | 543/543
45    | Appear               | 3     | 8   | Stop      | (148,39)/(148,39)      | (0,0)          | 148/148
158   | Appear               | 4     | 8   | Move      | (142,69)/(138,74)      | (-1,1)         | 157/157
182   | Disappear            | 1     | 181 | Disappear | (306,216)/(173,41)     | (0,0)          | 809/34
200   | Appear               | 5     | 8   | Move      | (336,266)/(295,191)    | (-8,-8)        | 1838/1838
201   | Move Fast            | 5     | 9   | Move      | (336,266)/(288,181)    | (-8,-8)        | 1588/1588
204   | Exit                 | 4     | 53  | Exit      | (142,69)/(8,273)       | (-9,14)        | 191/471
204   | Occlusion            | 5     | 12  | Move      | (336,266)/(271,157)    | (-5,-8)        | 1103/1103
204   | Occlusion (by Obj 5) | 2     | 201 | Stop      | (344,205)/(289,129)    | (0,0)          | 822/555
269   | Exit                 | 3     | 231 | Exit      | (148,39)/(3,230)       | (-6,7)         | 156/391
279   | Abnormal             | 2     | 276 | Stop      | (344,205)/(291,129)    | (0,0)          | 822/563
Table 3. Summary of the shot ‘FloorSurvey’ (frames 1-826).

Frame | Event                    | ObjID | Age | Status  | Position start/present | Motion present | Size start/present
36    | Appear                   | 1     | 8   | Move    | (126,140)/(123,135)    | (0,-1)         | 320/320
268   | Is Deposited (by Obj 1)  | 3     | 8   | Stop    | (121,140)/(121,140)    | (0,0)          | 539/539
405   | Occlusion                | 3     | 145 | Stop    | (121,140)/(120,141)    | (0,0)          | 541/555
405   | Occlusion (by Obj 3)     | 1     | 377 | Move    | (126,140)/(83,109)     | (1,0)          | 840/2422
411   | Removal (by Obj 1)       | 3     | 150 | Removal | (121,140)/(104,132)    | (0,0)          | 541/1451
787   | Appear                   | 18    | 8   | Move    | (105,68)/(108,86)      | (0,0)          | 91/91
825   | Exit                     | 1     | 796 | Exit    | (126,140)/(9,230)      | (-10,7)        | 840/247

6.2 Event-based key-frame detection

Key frames best represent the content of a video in an abstract manner. Key-frame abstraction transforms an entire shot into a small number of representative frames. This way important content is maintained while redundancies are removed. Key frames based on events are appropriate when on-line alerts to specific events are required. In a surveillance environment, key events may be rare or infrequent. The attention of human operators may thus decrease and significant events may be missed. Our system identifies events of interest as they occur so that human operators can focus their attention on these events accordingly.

Figs. 2-8 display key frames of indoor and outdoor shots that are detected automatically by the proposed system. Here only objects performing events are annotated (ID and MBB), and key frames of primitive events such as Exit are not always shown due to space constraints. As can be seen, the system performs well both in indoor and outdoor shots, even those with heavy noise, multiple objects, and/or local illumination changes. This results from the handling of inaccuracies and false alarms. For example, the system is able to differentiate between deposited, split, and at-obstacle objects.

6.3 Reliability and limitations of the proposed system

As mentioned, the proposed system has been applied to over 10,000 frames of both indoor and outdoor real-world sequences with different levels of difficulty. For example, we used

• frequently-referenced shots (e.g., Fig. 2);
• shots overlaid with heavy white noise (up to 25 dB PSNR) or including coding artifacts (e.g., Fig. 3);
• shots where we only processed every fifth frame (e.g., Fig. 4);
• shots with multiple occlusions (e.g., Figs. 5, 6, and 7);
• shots with multiple objects of temporally varying shapes and of lengthy stay (e.g., Fig. 6); and
• shots with illumination changes and shadow (e.g., Figs. 7 and 8).

In our simulations, we used the same parameter settings throughout and no significant detection errors of events or objects were found. Some object boundaries deviate, however, by a few pixels, or events are detected with a very short delay of a few frames, not affecting the overall system performance.

Our system proposes approximate but efficient world models to formally define widely-applicable events. Since objects are monitored and some information of their history is stored, our system takes into account the effect of time-varying behavior of objects. Note, however, that a complete definition of a particular event in a given application depends on details such as camera position, and context such as the importance of an event. The system has been applied to monitor multiple objects, but it was not tested in crowded scenes or under sudden global illumination change. Fig. 6 shows, however, how the system successfully detects events in a busy street scene with very small objects.
It successfully handles objects of various sizes, and also situations where the object size changes throughout the video shot (e.g., Fig. 6). We do not assume perfect segmentation, but we handle possible segmentation errors at a higher level where more information is available. As the goal of this paper is to detect context-independent video content, no application-dependent information such as camera position was integrated.

Fig. 2. Key frames of the ‘Hall’ shot (300 frames). Key events: O1 enters, O5 enters, O6 is deposited by O1.

Fig. 3. Key frames of the ‘Hall’ shot overlaid with 25 dB PSNR noise (300 frames). Key events: O2 enters, O4 enters, O5 is deposited by O2.

Fig. 4. Key frames of the ‘Hall’ shot where only every 5th frame is considered. Key events: O1 enters, O2 enters, O3 is deposited by O1.

Fig. 5. Key frames of the ‘FloorSurvey’ shot (826 frames). Key events: O1 deposits O3, O1 occludes O3, O1 removes O3.

Fig. 6. Key frames of the ‘Urbicande’ shot (300 frames). Key events: O1 occludes O6, O1 occludes O7, abnormal O1 stays long. (The original frames are rotated by 90° to the right to comply with the CIF format.)

Fig. 7. Key frames of the ‘ParkingSurvey’ shot (979 frames). Key events: O1 and O2 enter, O5 enters, O1, O2, and O5 occlude.

Fig. 8. Key frames of the ‘Stair’ shot (1475 frames) with three entrances (front, back, and stairs). This is a noisy sequence with illumination changes and shadow. Person A enters the back door, passes the front glass door, exits, returns and exits from the back door. Person B enters from the stairs, approaches the back door, passes through the front door, exits, returns and goes up the stairs. Key events: O3, O7, O9 enter; O9 moves; O11 enters.

7 Conclusion

There are few real-time detection schemes concerning events in video shots. Many event detection schemes are context dependent or constrained to narrow applications. We propose a computational system to automatically and efficiently detect generic video events. Several context-independent events have been rigorously defined and automatically detected using features delivered by segmentation, motion estimation, and object tracking methods. The proposed events are sufficiently broad to assist applications such as monitoring the removal of objects, traffic objects, or behaviors of objects. On the other hand, refinements of the proposed events can be easily integrated into our system.

The reliability and real-time (up to 10 frames per second) response of the proposed system have been demonstrated by extensive experimentation on indoor and outdoor shots containing over 10,000 frames with object occlusion, noise, and coding artifacts. In this system, special consideration is given to inaccuracies and false alarms. For example, the system is able to differentiate between deposited objects, split objects, and objects at an obstacle. Errors are corrected or compensated at a higher level where more information is available.

References

Abrams, R., Christ, S., 2003. Motion onset captures attention. Psychological Science 14, 427–432.

Amer, A., Jan. 2003a. Memory-based spatio-temporal real-time object segmentation. In: Proc. SPIE Int. Conf. on Real-Time Imaging. Vol. 5012. Santa Clara, California, USA, pp. 10–21.

Amer, A., Jan. 2003b. Voting-based simultaneous tracking of multiple video objects. In: Proc. SPIE Int. Conf. on Image and Video Communications and Processing. Vol. 1. Santa Clara, California, USA, pp. 500–511, to appear.
Amer, A., Mitiche, A., Dubois, E., Sep. 2002. Reliable and fast structure-oriented video noise estimation. In: Proc. IEEE Int. Conf. Image Processing. Vol. 1. Rochester, USA, pp. 840–843.

Benitez, A. B., Chang, S.-F., November 2003. Extraction, description and application of multimedia using MPEG-7. In: Asilomar Conference on Signals, Systems, and Computers, Invited Paper in Special Session on Document Image Processing. Monterey, CA.

Benitez, A. B., Rising, H., Jorgensen, C., Leonardi, R., Bugatti, A., Hasida, K., Mehrotra, R., Tekalp, M., Ekin, A., Walker, T., September 2002. Semantics of multimedia in MPEG-7. In: IEEE International Conference on Image Processing (ICIP). Rochester, New York.

Bobick, A., 1997. Movement, activity, and action: the role of knowledge in the perception of motion. Tech. Rep. 413, M.I.T. Media Laboratory.

Bouthemy, P., Gelgon, M., Ganansia, F., Nov. 1997. A unified approach to shot change detection and camera motion characterization. Tech. Rep. 1148, Institut National de Recherche en Informatique et en Automatique.

Bremond, F., Maillot, N., Thonnat, M., Vu, V.-T., May 2004. Ontologies for video events. Tech. Rep. RR-5189, Institut National de Recherche en Informatique et en Automatique.

Courtney, J., 1997. Automatic video indexing via object motion analysis. Pattern Recognit. 30 (4), 607–625.

Franconeri, S. L., Simons, D. J., 2003. Moving and looming stimuli capture attention. Perception and Psychophysics 65 (7), 999–1010.

Haering, N., Qian, R., Sezan, I., Sep. 2000. A semantic event detection approach and its application to detecting hunts in wildlife video. IEEE Trans. Circuits Syst. Video Techn. 10 (6), 857–868.

Haritaoglu, I., Harwood, D., Davis, L. S., Aug. 2000. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Machine Intell. 22 (8), 809–830.

IEEE PAMI, vol. 22, no. 8, Aug. 2000. Special section on video surveillance. IEEE Trans. Pattern Anal. Machine Intell.

Jostschulte, K., Amer, A., Oct. 1998. A new cascaded spatio-temporal noise reduction scheme for interlaced video. In: Proc. IEEE Int. Conf. Image Processing. Vol. 2. Chicago, USA, pp. 493–497.

Lew, S., Hoang, T.-L., Li, M., Woo, T., Li-Cheong-Man, P., May 2004. Video processing and transmission over IP for indoor video surveillance applications. Tech. Rep. Final Year Project, ECE, Concordia University, Montréal.

Lippman, A., Vasconcelos, N., Iyengar, G., Nov. 1998. Human interfaces to video. In: Proc. 32nd Asilomar Conf. on Signals, Systems, and Computers. Asilomar, CA, invited paper.

Naphade, M., Mehrotra, R., Ferman, A., Warnick, J., Huang, T., Tekalp, A., 1998. A high performance algorithm for shot boundary detection using multiple cues. In: Proc. IEEE Int. Conf. Image Processing. Vol. 2. Chicago, IL, pp. 884–887.

Nothdurft, H. C., 1993. The role of features in preattentive vision: Comparison of orientation, motion and color cues. Vision Research 33, 1937–1958.

Pentland, A., Jan. 2000. Looking at people: Sensing for ubiquitous and wearable computing. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 107–119.

Stringa, E., Regazzoni, C., Oct. 1998. Content-based retrieval and real time detection from video sequences acquired by surveillance systems. In: Proc. IEEE Int. Conf. Image Processing. Chicago, IL, pp. 138–142.

Vasconcelos, N., Lippman, A., Oct. 1997. Towards semantically meaningful feature spaces for the characterization of video content. In: Proc. IEEE Int. Conf. Image Processing. Vol. 1. Santa Barbara, CA, pp. 25–28.