Proceedings of the 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013)

Unusual Event Detection using Sparse Spatio-Temporal Features and Bag of Words Model

Balakrishna Mandadi, EEE Department, Indian Institute of Technology Guwahati, Guwahati, India, [email protected]
Dr. Amit Sethi, EEE Department, Indian Institute of Technology Guwahati, Guwahati, India, [email protected]

Abstract— We present a system for unusual event detection in video from a single fixed surveillance camera. Instead of taking a binary or multi-class supervised learning approach, we take a one-class classification approach, assuming that the training dataset contains only usual events. The videos are modeled using a bag-of-words model for documents, where the words are prototypical sparse spatio-temporal feature descriptors extracted along moving objects in the observed scene. We learn a probabilistic model of the training data as a corpus of documents, each containing a probabilistic mixture of latent topics, using the Latent Dirichlet Allocation framework; topics in turn are modeled as probabilistic mixtures of words. Unusual events are video clips that deviate probabilistically by more than a threshold from the distribution of the usual events. Our results indicate the potential to learn usual events from a few examples, reliable flagging of unusual events, and sufficient speed for practical applications.

Keywords— Automated surveillance; unusual, abnormal, rare event detection; Latent Dirichlet Allocation

I. INTRODUCTION

The ubiquitous presence of surveillance cameras, driven by increased threats to public life and property, calls for automatic methods to detect abnormal activities. Most present-day surveillance systems have no automation for detecting illegal or unusual activities, and it is expensive and impractical to assign a human observer to monitor every camera. Hence the need for automatic methods keeps growing.

The notion of an unusual event is difficult to define, because such events are unpredictable and change with the scene of observation. For example, loitering is a usual activity in a public park but unusual in an airport. In general, unusual activities can be defined as those whose occurrence is rare or unexpected in a given scenario; given a long video sequence, events that recur over time are considered usual. With myriad possible events in a video scene, it is difficult to label all event types and classify them with supervised learning. Furthermore, since examples of unusual events will be few, if present at all, learning would have to cope with highly unbalanced classes. Treating the problem as one-class classification is therefore a better choice.

The problem of detecting unusual events can be approached in different ways. Tracking-based methods [1] are a good choice in constrained environments where only a limited set of activities is possible. These methods use the tracks of moving objects to model activities based on the speed, size, direction, and silhouettes of objects along those tracks [2]. By tracking, the activity of one object can be separated from other co-occurring activities, and complex motion patterns can be modeled easily from the extracted trajectories. However, these methods are sensitive to tracking errors: explicit tracking of objects in crowded scenes is complex and easily prone to errors due to frequent occlusions.

The other kind of approach [1] directly uses motion feature vectors instead of tracks to describe video clips. Here, simple low-level visual features such as motion histograms form the feature set. Because there is no detection and tracking, a particular activity cannot be separated from other simultaneously co-occurring activities; for example, the motion of a pedestrian cannot be separated from that of a car without tracking. To alleviate this problem, mixture models are used to learn the various co-occurring activities in the scene. We adopt this kind of approach rather than an explicit tracking-based method.
Unusual activity detection is a component of automated surveillance that provides an early warning, to be further examined by security personnel for higher-level decisions. Zhong and Shi [3] were among the first to use the "hard to describe" but "easy to verify" notion of unusual events in video. By slicing a long video sequence into small segments and representing them with simple motion features, they clustered the segments using importance feature signal extraction; unusual segments are those with small inter-cluster similarity. This approach is well suited to finding interesting events in a large pool of video, but it may not detect unusual events in a live video stream. Despite its offline nature, it became popular for its simplicity and inspired later methods that use more powerful statistical models to work online. Wang et al. [1] and Varadarajan et al. [4] used natural language processing models, namely Latent Dirichlet Allocation (LDA) [5] and Probabilistic Latent Semantic Analysis (pLSA) [6], to model activities by treating every video segment as a document and quantized optical flow features as the vocabulary. Our approach resembles that of [1], but we use a different set of features to represent a video and different abnormality measures to detect unusual events. Since motion information from optical flow alone may not model the various activities efficiently, we use quantized spatio-temporal corner descriptors, which capture both the shape and the motion of the moving objects in the observed scene. To model the activities we also use the LDA model, as in [1], but we do not use the likelihood measure of abnormality, which can remain high for an unusual activity that co-occurs with a usual one. Instead, we detect unusual events by measuring the deviation of a new test clip's inferred parameters from those of the model using probabilistic distances, namely the Bhattacharyya distance and the Kullback-Leibler (KL) divergence. In principle the activity model could be supervised or unsupervised, but it is very difficult to obtain good training examples of unusual events in crowded environments, so we model the activities seen during training with unsupervised learning.
In general, we learn a probability distribution over the training examples and mark test cases that are improbable under this distribution as unusual. The distribution should be flexible and loose enough not to raise false alarms for usual activities in the test examples, yet not so loose that it misses obviously unusual cases.

The rest of the paper is organized as follows. Section II gives the general mathematical formulation of the problem. Section III discusses the video features and representation. Section IV discusses the generative model, LDA. Section V gives experimental results, followed by the conclusion.

II. MATHEMATICAL FORMULATION

Let $V(x, y, t)$ represent a small video clip of $T$ frames, which may contain a non-separable set of activities (e.g., a pedestrian walking while a car moves on the road), where $x$, $y$ are the spatial coordinates and $t$ is the temporal coordinate. Let $\mathbf{V} = \{V_1, V_2, V_3, \ldots, V_N\}$ be the training dataset, obtained by slicing a long video sequence into overlapping or non-overlapping equal-length segments. The dataset $\mathbf{V}$ can be labeled (usual or unusual) or unlabeled, depending on whether a supervised or unsupervised approach is used to decide that a video event is unusual. A video clip can be represented by features in many ways (we use a histogram of sparse feature descriptors). In general, let $\lambda_{V_i}(l)$ be the feature representation of a clip $V_i$, $i = 1, \ldots, N$, where $l$ is the feature parameter. The original dataset $\mathbf{V}$ can then be replaced by the feature dataset

$$\mathbf{V} \equiv \lambda_{\mathbf{V}}(l) = \{\lambda_{V_1}(l), \lambda_{V_2}(l), \ldots, \lambda_{V_N}(l)\} \qquad (1)$$

To learn the various activities or motion patterns, a probabilistic model is generally fitted to the training data. Let $M_{\mathbf{V}}(l; \theta)$ be the generative model fitted to the training dataset $\lambda_{\mathbf{V}}$, where $\theta$ is the parameter vector learned during fitting. As $\theta$ is tuned to $\lambda_{\mathbf{V}}$, $M_{\mathbf{V}}(l; \theta)$ captures certain beliefs about the environment $\mathbf{V}$.

Now, given a new test video clip $V_s$ of the same duration and its feature representation $\lambda_{V_s}(l)$, its deviation $D_s$ from the model is calculated using an appropriate probabilistic distance (PD) measure,

$$D_s = PD\big(\lambda_{V_s}(l),\; M_{\mathbf{V}}(l; \theta)\big) \qquad (2)$$

The test clip $V_s$ is flagged as unusual if $D_s > D_{th}$, a threshold predetermined from the training sequence.
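To make the formulation concrete, the following is a minimal sketch of this one-class flagging pipeline, not the authors' implementation. The callables `feature_histogram`, `fit_model`, and `prob_distance` are placeholders for the bag-of-words extraction of Section III, the LDA fitting of Section IV, and the KL/Bhattacharyya measures of Section IV; choosing $D_{th}$ as the maximum training deviation follows the description in Section V.

```python
def flag_unusual(train_clips, test_clips, feature_histogram, fit_model, prob_distance):
    """One-class flagging of test clips as in Section II (illustrative sketch).

    feature_histogram(clip)  -> 1-D word-count vector, lambda_V in the text
    fit_model(histograms)    -> learned generative model M_V(l; theta)
    prob_distance(h, model)  -> scalar deviation D_s, e.g. KL or Bhattacharyya
    """
    train_hists = [feature_histogram(c) for c in train_clips]
    model = fit_model(train_hists)

    # Threshold D_th: maximum deviation of the training clips from the model.
    d_th = max(prob_distance(h, model) for h in train_hists)

    # A test clip V_s is flagged as unusual when its deviation D_s exceeds D_th.
    return [prob_distance(feature_histogram(c), model) > d_th for c in test_clips]
```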
III. VIDEO REPRESENTATION

Given the success of sparse spatio-temporal (ST) corner descriptors for action recognition, we use the same features to represent a video clip. Several algorithms are available for detecting spatio-temporal interest points; the most popular are Harris3D [7] and the periodic detector [8]. Harris3D, developed by Laptev and Lindeberg, extends the popular Harris corner detector [9] to video. The periodic detector proposed by Dollar et al. [8] uses linear filters along the temporal direction to detect interest points. Because of its robustness and simplicity in our experiments, we use the periodic detector of Dollar et al. [8] (source code was downloaded from the authors' homepage). A feature detection algorithm usually evaluates a response function at each location. The response function of the periodic detector uses separable 1D temporal Gabor filters and is of the form

$$R(x, y, t) = \big(J * g(\sigma) * h_{ev}\big)^2 + \big(J * g(\sigma) * h_{od}\big)^2 \qquad (3)$$

where $R$ is the response, $J$ is the image intensity at location $(x, y, t)$, $g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}$ is the spatial smoothing kernel, and $h_{ev}$ and $h_{od}$ are a quadrature pair of temporal Gabor filters,

$$h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}, \qquad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2},$$

with $\omega = 4/\tau$. These Gabor filters are band-pass in nature. The response function is evaluated at every location along the three dimensions, and the locations that fire a high response are the spatio-temporal corners. Figure 1 shows the ST corners (interest points) detected in a small video clip, projected onto a single frame; the ST features correspond to corner points of the moving objects in the scene.

Figure 1: (a) One of the frames from a 5 s test video clip. (b) The 3D interest points detected by the periodic detector at different time instances, projected onto the frame shown in (a).

To represent a video clip efficiently using the detected ST corners, cuboids of a certain size are extracted around the interest points [8]. We empirically determined 9x9x17 to be a good cuboid size for the tested scenes; in general it depends on the size of the interesting objects and their pace. To capture the shape and motion of the moving objects, each extracted cuboid is described by a feature vector of 3D gradients.
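As an illustration of the detector response in (3), here is a minimal sketch for a grayscale video stored as a (T, H, W) NumPy array. SciPy's `gaussian_filter` and `convolve1d` stand in for the spatial smoothing and the separable temporal filtering; the values of sigma and tau are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def periodic_response(video, sigma=2.0, tau=2.5):
    """Periodic-detector response (3) for a video given as a (T, H, W) float array."""
    omega = 4.0 / tau
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)

    # Spatial smoothing with a 2-D Gaussian, applied frame by frame (axis 0 is time).
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))

    # Separable 1-D temporal filtering with the quadrature Gabor pair.
    even = convolve1d(smoothed, h_ev, axis=0, mode='nearest')
    odd = convolve1d(smoothed, h_od, axis=0, mode='nearest')
    return even ** 2 + odd ** 2

# Interest points are local maxima of the response above a threshold; cuboids
# (e.g. 9x9x17) are then cut around them and described by 3-D gradient vectors.
```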
Vocabulary: To build the vocabulary for the bag-of-words model, the obtained descriptors are quantized using k-means clustering; the cluster centers are the prototype words that form the vocabulary. The number of clusters $k$ depends on the variety of descriptors in the observed scene and can be determined empirically. To accommodate unseen descriptors that may be part of unusual events during testing, a counter prototype is defined for each prototype: descriptors that are nearest to a prototype center but lie farther from it than the farthest training descriptor of that cluster are assigned to the counter prototype. The location of moving objects is also important for unusual activity detection. To encode location, the frame is divided into $H$ horizontal and $V$ vertical patches, and the count of detected corner points within each patch is appended to the vocabulary histogram. The total number of possible words in the vocabulary is therefore $2k + HV$. Our feature representation $\lambda_V$ of a video clip is this histogram of the prototype descriptors present in the clip.
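As an illustration of the vocabulary and document construction just described (not the authors' code), the sketch below quantizes cuboid descriptors with scikit-learn's KMeans, adds a counter-prototype bin per cluster using the farthest training descriptor of that cluster as the radius, and appends the H x V positional counts of the interest points. The helper names and the (row, col) point convention are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, k=500):
    """Quantize training cuboid descriptors into k prototypes and record each
    cluster's radius (distance to its farthest member) for counter prototypes."""
    km = KMeans(n_clusters=k, n_init=10).fit(train_descriptors)
    d = np.linalg.norm(train_descriptors - km.cluster_centers_[km.labels_], axis=1)
    radius = np.array([d[km.labels_ == i].max() for i in range(k)])
    return km, radius

def clip_histogram(descriptors, points, km, radius, frame_hw, H=6, V=8):
    """Bag-of-words document for one clip: 2k prototype / counter-prototype bins
    plus H*V positional bins counting interest points per spatial patch."""
    k = km.n_clusters
    hist = np.zeros(2 * k + H * V)
    labels = km.predict(descriptors)
    d = np.linalg.norm(descriptors - km.cluster_centers_[labels], axis=1)
    for lab, dist in zip(labels, d):
        # Descriptors farther from the center than any training member of the
        # cluster fall into the counter-prototype word for that cluster.
        hist[lab if dist <= radius[lab] else k + lab] += 1
    for y, x in points:  # (row, col) coordinates of the detected interest points
        r = min(int(y) * H // frame_hw[0], H - 1)
        c = min(int(x) * V // frame_hw[1], V - 1)
        hist[2 * k + r * V + c] += 1
    return hist
```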
By treating every video clip as a document and the prototype feature descriptors as the vocabulary, we model the activities using the LDA model, explained next.

IV. LATENT DIRICHLET ALLOCATION

LDA [5] is a probabilistic generative model used successfully in document processing to extract the semantic meaning of documents. A document is modeled as an unordered collection of words drawn from a defined vocabulary: in a bag-of-words model only the counts of individual words matter, while their order is neglected, and the histogram of words captures their co-occurrence to model different topics. In our case the documents are video clips and the vocabulary is the set of prototype descriptors defined in the previous section. LDA assumes that a document (video clip) is generated from a random mixture of $K$ latent topics (atomic activities); topics are sets of co-occurring words captured from the corpus of documents during training. LDA is a fully generative model that does not suffer from the overfitting problem of its predecessor pLSA [10, 11].

The LDA model can be represented graphically as shown in Figure 2. There are three levels to the representation. The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled only once when generating a corpus. The variables $\theta_d$ are document-level parameters, sampled once per document. Finally, $z_{dn}$ and $w_{dn}$ are word-level variables, sampled once for every word in each document.

Figure 2: Graphical representation of the LDA model as in [5] (the shaded node is the observed variable).

Since LDA is a widely accepted model, we directly adopt the equations and notation needed for our problem; see [5] for a complete explanation. The multinomial parameter of topics over a document, $\theta$, is treated as random with a known Dirichlet distribution:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\; \theta_1^{\alpha_1 - 1}\cdots\theta_k^{\alpha_k - 1}$$

where $\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\, dt$ is the Gamma function, defined for $x$ not equal to zero or a negative integer.

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topic labels $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)$$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $\mathbf{z}$, we obtain the marginal distribution of a document (video clip):

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta \qquad (5)$$

Taking the product of the marginal probabilities of the single documents, we obtain the probability of a corpus $D$:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)\, d\theta_d \qquad (6)$$

The parameters $\alpha$ and $\beta$ of the model described above must be learned from a training corpus (a set of related documents). The inference problem is to compute the posterior distribution of the hidden variables given a document:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)} \qquad (7)$$

The denominator in (7), given by (5), is intractable due to the coupling between $\theta$ and $\beta$. Since the posterior is intractable for exact inference, the authors of [5] proposed a variational Bayes approximation that relaxes the link between the coupled parameters. The general idea is to approximate the complex posterior with a simple, free variational distribution by minimizing the difference between them. The free variational posterior is

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \qquad (8)$$

where $\gamma$ and $(\phi_1, \phi_2, \ldots, \phi_N)$ are the free variational parameters, estimated so as to minimize the difference between the true posterior (7) and the variational posterior (8). A tight lower bound on the log likelihood translates directly into the following optimization problem [5]:

$$(\gamma^{*}, \phi^{*}) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi)\, \|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big) \qquad (9)$$

where $D(\cdot \| \cdot)$ is the Kullback-Leibler divergence. The optimal values of the free variational parameters are found by setting the first derivatives to zero:

$$\phi_{ni} \propto \beta_{i w_n} \exp\{E_q[\log(\theta_i) \mid \gamma]\} \qquad (10)$$

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \qquad (11)$$

$$E_q[\log(\theta_i) \mid \gamma] = \psi(\gamma_i) - \psi\!\left(\sum_{j=1}^{k}\gamma_j\right) \qquad (12)$$

where $\psi$ is the first derivative of the $\log\Gamma$ function. Equations (10) and (11) are coupled and are solved iteratively in an expectation-maximization (EM) fashion. The model parameters $\alpha$ and $\beta$ are learned with the variational EM algorithm explained in [5]. Our aim is to infer the topic mixture weights $\gamma_{ts}$ present in a new test document (video clip) using (8) and (9); this variational mixture parameter is then compared with the model parameter $\alpha$ to determine the deviation between the two.

Measures of abnormality: Since both $\gamma_{ts}$ and $\alpha$ are parameters of Dirichlet distributions, statistical distance measures such as the Kullback-Leibler (KL) divergence and the Bhattacharyya distance can be used to measure the deviation between them. The KL divergence between two Dirichlet distributions [12] is given by

$$D_{KL}(\alpha_a, \alpha_b) = \int_{X} p_a(X; \alpha_a) \ln\frac{p_a(X; \alpha_a)}{p_b(X; \alpha_b)}\, dX \qquad (13)$$

$$D_{KL}(\alpha_a, \alpha_b) = \ln\Gamma\!\left(\sum_{i=1}^{k}\alpha_{ai}\right) - \ln\Gamma\!\left(\sum_{i=1}^{k}\alpha_{bi}\right) + \sum_{i=1}^{k}\big[\ln\Gamma(\alpha_{bi}) - \ln\Gamma(\alpha_{ai})\big] + \sum_{i=1}^{k}(\alpha_{ai} - \alpha_{bi})\left[\psi(\alpha_{ai}) - \psi\!\left(\sum_{j=1}^{k}\alpha_{aj}\right)\right] \qquad (14)$$

The Bhattacharyya distance between two Dirichlet distributions [13] with parameters $\alpha_a$ and $\alpha_b$ is given by

$$D_{BC}(\alpha_a, \alpha_b) = -\ln\left\{\int_{X} \sqrt{p_a(X; \alpha_a)\, p_b(X; \alpha_b)}\, dX\right\} \qquad (15)$$

$$D_{BC}(\alpha_a, \alpha_b) = \ln\Gamma\!\left(\sum_{i=1}^{k}\frac{\alpha_{ai} + \alpha_{bi}}{2}\right) - \sum_{i=1}^{k}\ln\Gamma\!\left(\frac{\alpha_{ai} + \alpha_{bi}}{2}\right) + \frac{1}{2}\sum_{i=1}^{k}\big[\ln\Gamma(\alpha_{ai}) + \ln\Gamma(\alpha_{bi})\big] - \frac{1}{2}\left[\ln\Gamma\!\left(\sum_{i=1}^{k}\alpha_{ai}\right) + \ln\Gamma\!\left(\sum_{i=1}^{k}\alpha_{bi}\right)\right] \qquad (16)$$
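Both measures are simple functions of the Dirichlet parameters; the sketch below evaluates (14) and (16) with SciPy's gammaln and digamma. Here one argument would be the inferred $\gamma_{ts}$ of a test clip and the other the learned corpus parameter $\alpha$, both length-$K$ vectors.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_a, alpha_b):
    """KL divergence between Dir(alpha_a) and Dir(alpha_b), as in (14)."""
    a, b = np.asarray(alpha_a, float), np.asarray(alpha_b, float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            + np.sum(gammaln(b) - gammaln(a))
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

def dirichlet_bhattacharyya(alpha_a, alpha_b):
    """Bhattacharyya distance between Dir(alpha_a) and Dir(alpha_b), as in (16)."""
    a, b = np.asarray(alpha_a, float), np.asarray(alpha_b, float)
    m = (a + b) / 2.0
    return (gammaln(m.sum()) - np.sum(gammaln(m))
            + 0.5 * np.sum(gammaln(a) + gammaln(b))
            - 0.5 * (gammaln(a.sum()) + gammaln(b.sum())))
```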
V. EXPERIMENTAL RESULTS

Since no suitable benchmark was available at the time, we collected our own videos on our campus. We provide results using both optical flow features and spatio-temporal features. The dataset consists of a 45-minute video, of which 35 minutes, containing the general activities seen daily, are used for training. To test for unusual events, an acted video was captured containing unusual activities such as driving in the wrong direction and entering a restricted lawn area. A small part of the normal video and the acted video were set aside for testing. The video sequence is sliced into small non-overlapping segments of 5 seconds, which are treated as documents, giving a total of 370 training segments and 158 test segments. Of the 158 test clips, the first 58 contain daily usual activities and the remaining ones contain unusual activities. Each segment is represented as a histogram of quantized optical flow features [1] and as a histogram of quantized ST corner descriptors as explained in Section III. The results are discussed below.

Optical flow as vocabulary: The optical flow [14] vocabulary for the bag-of-words model is created as explained in [1], with added bins for magnitude. The 240x320 frame is divided into 24x32 spatial patches, each containing 10x10 pixels. By dividing the magnitude and direction of the optical flow over a patch into 4 bins each, a 16-bin histogram per patch is computed by counting the pixels that fall into each bin, and the per-patch histograms are concatenated to represent a video segment (document). The total vocabulary size equals the length of the concatenated histogram, i.e., 24x32x16 = 12,288. The number of topics $K$ is empirically determined to be 20. The optical flow histograms of the training segments are used to learn the model parameters. Given a new clip, its topic distribution is inferred from the model by iterating (10), (11), and (12), and its deviation from the model parameter is measured with the KL divergence (14). As the KL divergence curve in Figure 3 shows, the separation between the unusual clips (59-158) and the usual clips (1-58) is not very distinguishable, because noisy optical flow motion features represent the video insufficiently. This problem is alleviated by representing the video with sparse ST corner descriptors, as shown in Figure 4. Figure 3 also shows some frames from the unusual video segments corresponding to high deviation from the learned model.

Figure 3: (a) KL divergence abnormality measure for the test video clips using optical flow as vocabulary. (b) Frames corresponding to high KL divergence: (1, 4, 5, 6) people moving in the restricted lawn area; (2, 3) bicycles moving in the wrong direction. (Bounding boxes are hand-labeled to visualize the unusual activities in the scene.)
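A minimal sketch of the per-patch optical flow histogram described above follows. The paper cites Lucas-Kanade flow [14]; for brevity this sketch uses OpenCV's Farneback flow as a stand-in, and `max_mag` is an assumed saturation value for the magnitude bins. A clip's document would be the sum of such histograms over its consecutive frame pairs.

```python
import numpy as np
import cv2

def flow_histogram(prev_gray, next_gray, patch=10, nbins=4, max_mag=8.0):
    """Concatenated 16-bin (4 direction x 4 magnitude) histograms per 10x10 patch."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    mag_bin = np.minimum((mag / max_mag * nbins).astype(int), nbins - 1)
    ang_bin = (ang / (2 * np.pi) * nbins).astype(int) % nbins

    h, w = prev_gray.shape
    rows, cols = h // patch, w // patch              # 24 x 32 patches for 240 x 320
    hist = np.zeros((rows, cols, nbins * nbins))
    for r in range(rows):
        for c in range(cols):
            mb = mag_bin[r*patch:(r+1)*patch, c*patch:(c+1)*patch].ravel()
            ab = ang_bin[r*patch:(r+1)*patch, c*patch:(c+1)*patch].ravel()
            np.add.at(hist[r, c], mb * nbins + ab, 1)  # count pixels per joint bin
    return hist.ravel()                               # length 24*32*16 = 12,288
```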
ST features as vocabulary: The vocabulary for the ST corner descriptors is built as explained in Section III, by quantizing the descriptors with k-means clustering with an empirically chosen k = 500. For location information, the frame is divided into 6x8 patches and the ST interest points falling in each patch are counted. The total vocabulary size is therefore 2k + HV = 1,048, which is much smaller than that of the optical flow representation. The number of topics $K$ in the mixture model is manually set to 20; increasing $K$ yields repetitive atomic activities or redundant mixtures of activities that are already learned. The topic distribution of a test clip is inferred as explained in Section IV. Instead of a likelihood measure of abnormality, which can remain high for an unusual activity co-occurring with a usual one, we use the KL divergence and the Bhattacharyya distance, (14) and (16), as measures of abnormality. The separation between the unusual video clips and the usual clips (1-58) is clearly evident in Figure 4. The Bhattacharyya distance and KL divergence curves are almost identical except for their numerical range. The threshold is calculated as the maximum distance of the training clips from the model parameter. Some frames of the unusual video segments corresponding to temporally local maxima of abnormality are shown in Figure 4; most of the detected unusual activities involve driving in the wrong direction and entering the restricted lawn area, which were not seen during training.

Figure 4: (a) KL divergence and (b) Bhattacharyya distance curves for the test datasets using the ST feature representation. (c) Some frames corresponding to unusual video (high divergence): (1, 3, 5) motion in the restricted lawn area; (2, 6) driving in the wrong direction; (4) unusual hand movement of the observer, recorded while capturing the videos. (Bounding boxes are hand-labeled to visualize the unusual activity.)

Due to the limited training video used in the experiments, a few patterns that are expected to be usual are detected as unusual; these are labeled as false alarms (FA) in Figure 4. The false alarms correspond to a car reversing, hence moving in the wrong direction, and taking a U-turn, neither of which was seen widely during training; examples are shown in Figure 5. The present implementation has a latency of 5 seconds, i.e., there is a 5-second delay before an unusual event in a live video stream can be computed and flagged. The latency can easily be reduced by taking overlapping clips. To allow spatial localization, we also tried dividing the frame into patches and learning a separate LDA model for each patch, but the results were not satisfactory because a patch cannot accommodate a whole activity, so everything appears new.

Figure 5: (a) Frames corresponding to normal video events detected as unusual (false alarms). (b) Frames corresponding to unusual events detected as usual (missed detections).

VI. CONCLUSION

For one-class learning of the activities in an observed scene, the video representation and the choice of generative model are the primary considerations. Previous methods used optical flow features to represent the video and language models to describe the scene. Motion features from optical flow alone may not be adequate everywhere, especially in outdoor scenes where optical flow gives erroneous results. To alleviate this problem, we used a robust representation of video clips based on sparse interest points, which have been successful in action recognition. To learn the various co-occurring activities, we used LDA, a popular bag-of-words generative model. Experimental results show that ST features outperform optical flow features for detecting unusual events in outdoor scenarios. Future work can address the video representation and online learning of the model parameters. Recently, deep learning techniques [15] have become popular for learning informative features from video without hand-designed features, and a recently proposed online version of LDA [16] could be used directly to continuously update the model to a changing environment.
REFERENCES

[1] X. Wang, X. Ma, and W. E. L. Grimson, "Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 539–555, 2009.
[2] C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 747–757, 2000.
[3] H. Zhong, J. Shi, and M. Visontai, "Detecting Unusual Activity in Video," in CVPR (2), 2004, pp. 819–826.
[4] J. Varadarajan and J.-M. Odobez, "Topic Models for Scene Analysis and Abnormality Detection," in 9th International Workshop on Visual Surveillance, 2009.
[5] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[6] T. Hofmann, "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization," 2000.
[7] I. Laptev and T. Lindeberg, "Space-time Interest Points," in ICCV, 2003, pp. 432–439.
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," in VS-PETS, 2005.
[9] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. Fourth Alvey Vision Conference, 1988, pp. 147–151.
[10] M. Steyvers and T. Griffiths, "Probabilistic topic models," in Handbook of Latent Semantic Analysis, vol. 427, pp. 424–440, 2007.
[11] H. Shan, A. Banerjee, and N. C. Oza, "Discriminative Mixed-Membership Models," in Proceedings of the Ninth IEEE International Conference on Data Mining, 2009, pp. 466–475.
[12] W. D. Penny, "Kullback-Liebler Divergences of Normal, Gamma, Dirichlet and Wishart Densities," Wellcome Department of Cognitive Neurology, 2001.
[13] T. W. Rauber, A. Conci, T. Braun, and K. Berns, "Bhattacharyya probabilistic distance of the Dirichlet density and its application to Split-and-Merge image segmentation," in 15th International Conference on Systems, Signals and Image Processing, 2008, pp. 145–148.
[14] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," 1981, pp. 674–679.
[15] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3361–3368.
[16] M. D. Hoffman, D. M. Blei, and F. Bach, "Online learning for latent Dirichlet allocation," in NIPS, 2010.