Spatio-temporal Event Mining for Surveillance Video

An Interactive Framework for Retrieving Incidents in Transportation
Surveillance Video Databases
Xin Chen
Advisor: Dr. Chengcui Zhang
Department of Computer and Information Sciences, University of Alabama at Birmingham
1. Introduction and Motivation
3. Object Tracking & Trajectory Modeling
2. Related Work & Merits of the Proposed Work
Goal: Learn and retrieve semantic events in video databases, such as accidents in transportation surveillance videos.
Vehicle Segmentation and Tracking:
Related Work:
Segmentation -- Simultaneous Partition and
Class Parameter Estimation (SPCPE) [1]
algorithm.
(xcentroid, ycentroid)
Three Major Learning Algorithms for Event Detection from Videos:

Hidden Markov Model (HMM)[5]

Belief Networks [4]

Self-Organizing Map (SOM) [7]
Relevance Feedback:

Rui et al. [6] proposed using RF in Content-based Image
Retrieval.
Main Challenges:
“Semantic Gap” – A gap between high-level semantic
concepts and low-level video features.
In Information Retrieval, there is no prior knowledge from which to
construct the training set for learning.
Relevance Feedback is well known in Content-Based Image
Retrieval. We apply it to semantic video retrieval.
A mapping between spatiotemporal trajectories and the Neural
Network input nodes is designed. The Neural Network model for time
series forecasting is adapted to spatiotemporal semantic event
detection.
The Proposed Learning Framework:
Window Sliding: Extract trajectory segments by sliding window -- a way
to partition time series data. Continuity of the data is kept.
Window Size=6
Step Size=1
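The window sliding step can be sketched as follows. This is a minimal illustration, not the poster's implementation: the function name `sliding_windows` and the example trajectory are hypothetical, and the defaults simply mirror the window size and step size shown above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(points: Sequence[T], window_size: int = 6, step: int = 1) -> List[List[T]]:
    """Return every run of `window_size` consecutive points, advancing by `step`.

    Consecutive windows overlap, so the continuity of the series is kept.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(points) - window_size
    # When the trajectory is shorter than one window, the range is empty.
    return [list(points[s:s + window_size]) for s in range(0, last_start + 1, step)]


# Hypothetical trajectory of eight sample points, x1 ... x8.
trajectory = [f"x{i}" for i in range(1, 9)]
segments = sliding_windows(trajectory)
for segment in segments:
    print(segment)
```

With eight sample points, a window of six, and a step of one, this yields three overlapping segments.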

Sudden Change of the Velocity (vdiff)

Minimum Distance from the nearest vehicle (mdist)
A Sample Point: $\alpha_i = [1/\mathit{mdist}_i,\ \mathit{vdiff}_i,\ \theta_i]$
Trajectory: $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$
[Figure: window sliding. A fixed-size window of consecutive sample points ($x_{k-5}, \ldots, x_k$) advances one step at a time along the trajectory $x_1, x_2, \ldots, x_n$. Labels: Time, Step, M1, Hidden Layer, Output Layer.]
Sudden Change of the Direction (θ)
Based on the Neural Network for Time Series Data.
Prediction
Detection
Relevance Feedback is incorporated.
Sampling: Sample centroids along a trajectory at the rate of 5 frames per
sampling point.
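A possible sketch of the sampling step, together with one way of deriving two of the accident features from the sampled centroids. The poster does not give formulas for vdiff or θ, so the definitions below (change in speed and absolute turn angle between consecutive displacement vectors) are assumptions, as are the function names and the synthetic track. The minimum-distance feature mdist is omitted because it needs the positions of the other vehicles.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sample_centroids(centroids: List[Point], frames_per_sample: int = 5) -> List[Point]:
    """Keep one centroid every `frames_per_sample` frames."""
    return centroids[::frames_per_sample]


def velocity_and_direction_changes(samples: List[Point]) -> List[Tuple[float, float]]:
    """For each interior sample, return (vdiff, theta).

    Assumed definitions (not from the poster):
      vdiff = |speed leaving the point - speed entering the point|
      theta = absolute turn angle between the two displacement vectors
    """
    changes = []
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        v_in = (cur[0] - prev[0], cur[1] - prev[1])
        v_out = (nxt[0] - cur[0], nxt[1] - cur[1])
        vdiff = abs(math.hypot(*v_out) - math.hypot(*v_in))
        # Difference of headings, wrapped into (-pi, pi] before taking abs.
        turn = math.atan2(v_out[1], v_out[0]) - math.atan2(v_in[1], v_in[0])
        theta = abs(math.atan2(math.sin(turn), math.cos(turn)))
        changes.append((vdiff, theta))
    return changes


# Synthetic track: 10 frames moving right, then 10 frames moving up.
centroids = [(float(f), 0.0) for f in range(11)]
centroids += [(10.0, float(f - 10)) for f in range(11, 21)]

samples = sample_centroids(centroids)
changes = velocity_and_direction_changes(samples)
print(samples)
print(changes)
```

In this synthetic track the speed stays constant, so vdiff is zero everywhere, while the right-angle turn shows up as a θ of about π/2 at the middle sample.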
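[Figure: neural network for time series data with input, hidden, and output layers. Labels: inputs $x_{t-m}, \ldots, x_{t-3}, x_{t-2}, x_{t-1}$; weights $w_{ij}$, $v_i$; M2.]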
Experimental Results:
Initial Iteration:
Tested on two video clips: one taken in a tunnel, featuring single-vehicle accidents
(2504 frames, 109 trajectory sequences), and another taken at an intersection in
Taiwan, featuring multiple-vehicle accidents (592 frames, 168 sequences).
Sampling rate is 5 frames per sampling point; window size is 3.
Five iterations of user relevance feedback are performed - Initial (no feedback),
First, Second, Third, and Fourth.
The top 20 video sequences are returned to the user.
The percentage of relevant sequences (accuracy) within the top 5, 10, 15 and 20 is
calculated.
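The accuracy measure described above can be illustrated with a short sketch. The relevance flags are made up for illustration and are not results from the poster; the function name `accuracy_at_k` is likewise hypothetical.

```python
from typing import List, Sequence


def accuracy_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k returned sequences that the user marked relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k


# Hypothetical user judgments for the top 20 returned sequences, best rank first.
flags: List[bool] = [
    True, True, True, True, False,      # ranks 1-5
    True, False, True, False, False,    # ranks 6-10
    False, True, False, False, False,   # ranks 11-15
    False, False, False, False, False,  # ranks 16-20
]

for k in (5, 10, 15, 20):
    print(k, accuracy_at_k(flags, k))
```

Dividing by k rather than by the number of returned items assumes that at least k sequences are always returned, which holds here since the top 20 are shown.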
Results are compared with the Weighted Relevance Feedback method [6].
The user specifies an event of interest (e.g., traffic
accidents).
There is no relevance feedback.
Top sequences are returned to the user by
heuristic:
$\max_i \left( \dfrac{1}{\mathit{mdist}_i^{2}} + \mathit{vdiff}_i^{2} + \theta_i^{2} \right)$
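A minimal sketch of how such an initial ranking could work, assuming each sequence is stored as a list of sample points (1/mdist, vdiff, θ) and the score is the largest squared magnitude of any sample point in the sequence. The sequence names and feature values are hypothetical.

```python
from typing import Dict, List, Tuple

SamplePoint = Tuple[float, float, float]  # (1/mdist, vdiff, theta)


def initial_score(sequence: List[SamplePoint]) -> float:
    """Largest value of (1/mdist)^2 + vdiff^2 + theta^2 over the sequence."""
    return max(sum(c * c for c in point) for point in sequence)


def rank_sequences(sequences: Dict[str, List[SamplePoint]], top_n: int = 20) -> List[str]:
    """Return the names of the top_n sequences, highest score first."""
    ordered = sorted(sequences, key=lambda name: initial_score(sequences[name]), reverse=True)
    return ordered[:top_n]


# Hypothetical sequences; values are illustrative only.
sequences = {
    "seq_a": [(0.1, 0.2, 0.0), (0.1, 0.1, 0.1)],
    "seq_b": [(0.5, 2.0, 1.0), (0.2, 0.1, 0.0)],
    "seq_c": [(1.0, 0.0, 0.0)],
}

ranked = rank_sequences(sequences)
print(ranked)
```

Taking the maximum rather than the mean means a single abrupt sample point is enough to push a sequence to the top, which suits short events such as collisions.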
Subsequent Iterations:
The user gives feedback to the retrieval results.
Training data = $[x_{t-2},\ x_{t-1},\ x_t,\ \mathit{fd}_k,\ \mathit{opt}]$
The learning algorithm refines and returns the
retrieval results to the user.
The whole process goes through several iterations
until a satisfactory result is obtained.
[Figures: two "Retrieval Results" charts plotting Accuracy (0 to 1) against feedback iteration (Initial, 1st, 2nd, 3rd, 4th) for Weighted_RF and the Proposed Framework.]
System Overview
[Figure: system overview diagram. Labels: Raw Video, Object Tracking, Trajectory Modeling, Event Modeling, Learning and Retrieval, Interface, Retrieval Results, Feedbacks, Refined Results, Trajectories, Models, Feedback, Metadata.]
Hidden Layers:
• One hidden layer with sigmoid transfer; the number of nodes equals that in the input layer.
• One output layer with linear transfer; a single output node scores the likelihood of an event in the sequence.
Initial Weights:
• First layer – random weights
• Second layer – multiple linear regression weight initialization
Search Algorithm:
• Conjugate Gradient
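The network design above might be sketched as follows with NumPy. This is only an illustration under several assumptions: the input is a flattened window of three sample points (nine values), the bias terms, random scale, and training data are made up, and the conjugate gradient training stage is left out. Only the forward pass and the two-stage weight initialization are shown.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def init_network(X: np.ndarray, y: np.ndarray, rng: np.random.Generator):
    """Random first layer; second layer fitted by multiple linear regression."""
    n_in = X.shape[1]
    # Hidden layer has as many nodes as the input layer.
    W1 = rng.normal(scale=0.5, size=(n_in, n_in))
    b1 = rng.normal(scale=0.5, size=n_in)
    H = sigmoid(X @ W1 + b1)
    # Regress the targets on the hidden activations plus an intercept column.
    H1 = np.hstack([H, np.ones((len(H), 1))])
    w2, *_ = np.linalg.lstsq(H1, y, rcond=None)
    return W1, b1, w2


def score(X: np.ndarray, W1: np.ndarray, b1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Sigmoid hidden layer followed by a single linear output node."""
    H = sigmoid(X @ W1 + b1)
    return H @ w2[:-1] + w2[-1]


rng = np.random.default_rng(0)
X = rng.random((40, 9))                      # hypothetical: 40 windows, 3 points x 3 features
y = rng.integers(0, 2, size=40).astype(float)  # hypothetical relevance labels (1 = event)

W1, b1, w2 = init_network(X, y, rng)
scores = score(X, W1, b1, w2)
print(float(np.mean((scores - y) ** 2)))
```

Fitting the output weights by regression gives the later search a starting point that already matches the feedback labels in the least-squares sense, rather than starting from a fully random network.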
A semantic video retrieval framework is
proposed.
The neural network is applied to event
detection from video sequences, a special
type of time series data.
A mapping between spatiotemporal trajectories
and network input nodes is developed.
The proposed work incorporates the
Relevance Feedback in interactive video
retrieval.
The event models for other general events will
be constructed and tested.
More video data will be collected with the
associated metadata for normalizing all the
videos before the storage and retrieval.
The framework will be extended to include
query by example, query by sketches, and the
customized combination of query types.
References:
Future Work:
[Figure: Fourth Iteration Retrieval Results of the 1st Clip (tunnel)]
Conclusions:
Experiment Setup:
8. Conclusions and Future Work
7. Experiments
Learning and Retrieval Process:
# frames is the typical sequence length of an event.
Input Nodes:
$x_t = \alpha_i = [1/\mathit{mdist}_i,\ \mathit{vdiff}_i,\ \theta_i]$
fdk
6. Learning and Retrieval (2)
Neural Network Design:
Window Size: $\text{window size} = \dfrac{\#\,\text{frames}}{\text{sampling rate}}$
xt-2
Feedback

Centroids:
X
Data Preparation:
Accident Features:
θ
Trajectory:
5. Learning and Retrieval (1)
Traffic Accidents:



Y
Tracked Vehicles
and Their Centroids
4. Event Modeling


Fitted Curve
Merits of the Proposed Framework:
An interactive semantic video retrieval framework is proposed.
The user guides the learning and retrieval process through
Relevance Feedback.
The Neural Network for Time Series data is the learning algorithm.
The proposed framework is especially useful in mining and
retrieving data from large multimedia databases.
This framework can be tailored to apply to many fields.
Experimental results show the effectiveness of the proposed framework.

Advantage: The trajectory can be described by
only a few coefficients. Derivatives on the curve
are velocities.
Tracking -- Distinguish the static objects from
mobile objects in the frame.
Solution: Relevance Feedback (RF) [6]

The Least Squares Curve Fitting method is used
to model the trajectories of vehicles. A trajectory
is represented by a kth degree polynomial:
$y = a_0 + a_1 x + \dots + a_k x^k$
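As an illustration of the curve fitting step, the sketch below fits a polynomial to centroid coordinates with NumPy's `polyfit` and evaluates the derivative of the fitted curve. The centroid values are synthetic and the choice of degree 2 is an assumption; the poster does not state the degree used.

```python
import numpy as np

# Synthetic centroid coordinates lying on y = 1 + 0.5 x + 0.25 x^2.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
ys = 1.0 + 0.5 * xs + 0.25 * xs ** 2

# np.polyfit returns coefficients from the highest degree down to a_0.
coeffs_high_first = np.polyfit(xs, ys, deg=2)
a = coeffs_high_first[::-1]  # reorder so that a[0] = a_0, a[1] = a_1, ...

# Derivative of the fitted curve, evaluated at x = 2.
slope_at_2 = np.polyval(np.polyder(coeffs_high_first), 2.0)

print(a)
print(slope_at_2)
```

The whole trajectory is then summarized by the k + 1 coefficients in `a`, and the derivative can be evaluated at any point along the curve without going back to the raw centroids.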
Vehicle Segment
Neural Network is Used:

Mostly in forecasting trends in Time Series Data [2]

Rarely in detecting spatiotemporal patterns [3]
Summary:
Trajectory Modeling:
[Figure: Fourth Iteration Retrieval Results of the 2nd Clip (intersection)]
[1] Chen, S.-C., Shyu, M.-L., Peeta, S., and Zhang, C. 2003. Learning-Based
Spatio-Temporal Vehicle Tracking and Indexing for Transportation
Multimedia Database Systems. IEEE Transactions on Intelligent Transportation
Systems, Vol. 4, No. 3, pp. 154-167.
[2] Davey, N., Hunt, S.P., and Frank, R.J. 2000. Time Series Prediction and
Neural Networks. Journal of Intelligent and Robotic Systems, Vol. 31.
[3] Gao, D., Kinouchi, Y., Ito, K., and Zhao, X. 2005. Neural Networks for
Event Extraction from Time Series: a Backpropagation Algorithm
Approach. Future Generation Computer Systems, Vol. 21, pp. 1096-1105.
[4] Huang, T., Koller, D., Malik, J., Ogasawara, G., Rao, B., Russell, S.,
and Weber, J. 1994. Automatic Symbolic Traffic Scene Analysis Using
Belief Networks. In Proceedings of the National Conference on Artificial
Intelligence.
[5] Kamijo, S., Matsushita, Y., and Ikeuchi, K. 2000. Traffic Monitoring
and Accident Detection at Intersections. IEEE Transactions on Intelligent
Transportation Systems, Vol. 1, No. 2, pp. 108-118.
[6] Rui, Y., Huang, T.S., and Mehrotra, S. 1997. Content-based Image
Retrieval with Relevance Feedback in MARS. In Proceedings of the
International Conference on Image Processing, pp. 815-818.
[7] Xie, D., Hu, W., Tan, T., and Peng, J. 2004. Semantic-based Traffic
Video Retrieval Using Activity Pattern Analysis. In Proceedings of the IEEE
International Conference on Image Processing (ICIP).