From Observables to Situation Assessment

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS
Reasoning About Threats: From Observables
to Situation Assessment
Gertjan J. Burghouts and Jan-Willem Marck
Abstract—We propose a mechanism to assess threats based on observables. Observables are properties of persons, i.e., their behavior and their interaction with other persons and objects. We consider observables that can be extracted from sensor signals and intelligence. In this paper, we discuss situation assessment based on observables for threat assessment. In the experiments, the assessment is evaluated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework, where the setup is such that conclusions can be drawn concerning: 1) the accuracy and robustness of an architecture to assess situations with respect to threats; and 2) the architecture's dependence on the underlying observables in terms of their false positive and negative rates. One of the interesting conclusions is that discriminative assessment of threatening situations can be achieved by combining generic observables. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples. In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into observables that can be classified by a conditional random field (CRF). The implemented CRF classifies the crowd dynamics successfully with up to 80% accuracy.
Index Terms—Architecture, evaluation framework, information
processing, observables, situation understanding, threat recognition.
I. INTRODUCTION
SITUATION understanding is relevant to the fields of security (e.g., robbery and vandalism), public safety (e.g., aggression and riots), and health care (e.g., incidents with elderly people). Technology is making its entrance in solutions for these domains. More recently, research institutes and technology providers have also focused on the domain of antiterrorism, aiming for solutions that detect threats at an early stage [1]. Detecting a threat at an early stage is important, as it enables security professionals to mitigate the situation. In this paper, we focus on technology to recognize the stages that build up to potential threats.
Manuscript received December 30, 2009; revised September 20, 2010 and January 25, 2011; accepted March 19, 2011. This paper was recommended by Associate Editor J. Tang. The authors are with The Netherlands Organization for Applied Scientific Research (TNO) Observation Systems, The Hague 2597 AK, The Netherlands (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TSMCC.2011.2135344

Fig. 1. Our framework: outputs of various sensors (JDL level 0) are processed to the higher abstraction level of observables (JDL level 1). Examples of observables are taken from our earlier work, e.g., trajectory analysis, pose estimation, color of clothing, carrying an attribute, behavior recognition, and group dynamics [3]. These observables provide information to the situation estimator (JDL level 2). Specific situations are associated with threats. Situation assessment is the focus of this paper.

Our objective is a system to alert for potential threats. To that end, the system assesses the current situation in order to determine whether a threat may occur. Our approach is to assess the situation from multiple object assessments. These two layers are adopted from the well-known JDL data fusion model and its extensions [2]. Object assessments are based on sensors and intelligence. This is illustrated in Fig. 1.
The output of the sensors requires automatic processing up to the level of meaningful and robust information, which we will refer to as observables. Intelligence is also taken as an observable. The observables are object assessments and range from camera observables (e.g., a person carries some object) to geo-observables (e.g., a person has a short interaction with another person) to intelligence observables (e.g., the suspect is a person with a gray shirt and black pants). Based on the observables, the situation is assessed. For instance, the observables that a suspect person hands over an object to another person who mixes into the crowd would result in the situation assessment that somebody is carrying a suspect object.
In this paper, we discuss methods to exploit observables for threat assessment. We motivate our choices for specific pattern recognition methods. In the experiments, the added value of the implemented method is demonstrated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework, where the setup is such that conclusions can be drawn concerning: 1) the performance and robustness of an architecture to assess threats; and 2) the architecture's dependence on the underlying observables in terms of their false positive and negative rates.
A. Previous Work
Many observables have been introduced, ranging from detecting specific objects such as cars [4] to detecting aggression [5], and from generic features of human walking patterns [6] to highly specific human acts such as shaking hands and kissing [7]. In this paper, we focus on properties and behavior of humans that can be detected and recognized in video. The goal of our work is to assess situations beyond the level of isolated behaviors, based on multiple observables.
Previously, other researchers have also considered situation assessment based on multiple observables, e.g., the combination of visual and audio features to improve the detection of aggression [8]. With an appropriate combination scheme, the discriminative power of situation assessment can be improved by using multiple observables. Situation assessment is a challenge, as the situations under investigation can range from instantaneous (e.g., one person gives an object to another person) to longer periods (e.g., a pickpocket waiting for the best moment to steal someone's belongings). In recent literature, researchers tend to focus on either instantaneous events [9] or on complex, long-term person–person or person–object interactions [10]. In this paper, we propose a generic scheme to model and classify both types of situations.
In this paper, we consider only situations that are known
a priori. This requires examples from which a system can learn.
Interestingly, recent work has shown that it is possible to detect
abnormal situations at run time [11]. Such methods are not
within the scope of this paper.
Another observation from reviewing recent work is that scientific communities have released descriptions and datasets of
behaviors of crowds and individuals. These datasets are very interesting and include recordings of single observables to experiment with, e.g., the PETS and TRECVID benchmarks. However,
such communities have not yet focused on short- and long-term
sequential scenarios. As a consequence, no scenarios are available for antiterrorism and crowd-control applications.
B. Contributions in this Paper
The contribution of this paper is fourfold.
1) Relevant scenarios for antiterrorism and crowd control.
We discuss how expert knowledge of incidents can be
used. This is crucial, as for the particular incidents related
to terrorism, escalation of crowd behavior, and suspect
people and objects, few or no data will be available.
2) A tool to generate a multitude of variations of the scenarios. The variations include statistical and systematic error
sources.
3) A framework to evaluate both 1) the performance of the situation assessment as a whole and 2) its dependence on the underlying observables. The experiments are performed using a conditional random field (CRF). The outcomes of the evaluation give insight into 1) whether more (complementary) observables are required to assess situations successfully; and 2) whether the robustness of observables needs to be improved in order to be useful for situation assessment.
4) Given the scenarios and their variations and the framework, we evaluate a system for situation assessment that was implemented for this paper. The system, described in Section III, is based on a CRF that takes observables as input and estimates situations as output. We demonstrate in Section IV that discriminative power can be achieved by combining generic observables.
This paper is organized as follows. In Section II, we discuss the scenarios and the tool to generate their variations.
In Section III, we propose the CRF that learns the best relation between observables and the assessment of situations. In
Section IV, the experimental setup is discussed. In Section V,
the CRF is evaluated for the scenarios and their variations. In
addition, the CRF is evaluated against a real-world dataset. Section VI concludes this paper.
II. SCENARIOS
In this section, we define the scenarios to which our system (see Section III) is applied and on which the experiments are based (see Section IV). A scenario is interpreted as a sequence of states. A state is a description of the situation at a particular time, for instance, "a group of people is agitated." Note that an expert may be consulted to associate each state with a level of escalation or alarm, to bridge the gap to the decision maker. For each subsequent period, a number of observables are observed. Observables are observed object properties, for instance, "a person enters the scene." A scenario is specified by the sequence of states, their duration, and the observables that are observed for each subsequent period.
By the previous definition, the observables may vary from
time to time. Different sets of observables may be observed for
the same state at different periods. There is no direct coupling
between states and observables. In Section III, we select a pattern recognition method to learn the best probabilistic relation
between states and observables. In Section IV, we evaluate how
well this learned relation predicts the states for unseen variations
of the scenarios.
Observables may be chosen such that they relate to specific persons or groups of persons that are of particular interest, for instance, "a suspect person enters the scene" or, even more specifically, "John enters the scene." In this paper, we consider observables that are generic, e.g., "a person enters the scene," "a suspect person is in the scene," etc. Interestingly, in Section IV, we demonstrate that discriminative power can be achieved by combining generic observables.
A. Observables
We have selected the following observables: empty square, one person, two to five persons, many people, people flux +, people flux −, sudden large people flux, tracks toward hotspot, tracks in hotspot, group formation, group moves, group on collision, individual is avoided by other people, individual's head orientation varies, individual carries object, individual has wild gestures, somebody wears clothes related to a suspect person. The observables have been adopted from interviews with security professionals and from a training at the police academy in The Netherlands
on "search, detect, and react" (based on an
Israeli security training). The observables are illustrated in
Fig. 1.
B. Antiterrorism and Crowd-Control Scenarios
Five scenarios for antiterrorism and crowd control are proposed: square is revolting, aggressive political speaker, extreme
demonstration, neutral, fight between a few people.
The situation at a given time is formalized by one of the
following states: neutral, group with agitator, other people are
avoiding the situation, group is coordinating something, group
is agitated, group riot.
The scenarios are defined by the format: {[state_1, [observable_{1,1}, observable_{1,2}, ..., observable_{1,N}], length_1], ..., [state_T, ...]}. The variable "length" refers to the length of the period of the current state. The number of observables may vary per period. These formal specifications will be illustrated at the end of this section.
First, we present the five scenarios in natural language.
1) Square is revolting: market square, many people moving randomly, two people are loitering, group assembles
just outside the market square, one group member is agitated, the group starts moving, the loitering people join
the group, another group starts to move, one of the group
members carries something, people are avoiding the two
groups, the two groups confront each other, fight, people
fleeing.
2) Aggressive political speaker: somebody starts to speak in
the midst of people, people are moving around, people
start to listen, people are walking away.
3) Extreme demonstration: people are avoiding the group that
is demonstrating, the atmosphere is tense, little riot, the
demonstrating group is moving.
4) Fight between a few people: some people are agitated, one
troublemaker is starting a fight, small fight, people are
avoiding the fight, fight stops.
5) Default: neutral situation, people moving, many people
enter buildings, a small group is walking by, square is
getting more crowded, another group walks by.
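As a rough illustration of the scenario format introduced above, a minimal sketch of how such a specification could be encoded is given below. The state names and observables follow Sections II-A and II-B, but the particular observables chosen per state and the period lengths are hypothetical; this is not the authors' exact specification.

```python
# One illustrative scenario in the format {[state_1, [observables], length_1], ...}.
scenario_square_is_revolting = [
    # (state, active observables during this period, period length in timesteps)
    ("neutral", ["many people", "people flux +"], 20),
    ("group with agitator", ["group formation", "individual has wild gestures"], 15),
    ("group is coordinating something", ["group moves", "individual carries object"], 10),
    ("group is agitated", ["group on collision", "tracks toward hotspot"], 10),
    ("group riot", ["sudden large people flux", "tracks in hotspot"], 15),
]

def expand(scenario, all_observables):
    """Unroll a scenario into a per-timestep binary observable matrix X and state labels y."""
    X, y = [], []
    for state, active, length in scenario:
        row = [1 if o in active else 0 for o in all_observables]
        X.extend([row.copy() for _ in range(length)])
        y.extend([state] * length)
    return X, y
```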
C. Scenario Generation Including Statistical and Systematic Errors
The scenarios are transformed into datasets. The inputs are the observables (i.e., binary values), and the ground truth is the state per timestep. Four types of noise can be added to the observables.
1) Statistical errors: random false positives and negatives.
2) Systematic errors: fade in/out: false positives are added in
front of active observables, or at the end.
3) Systematic errors: ambiguity: false positives are added
for a particular observable, and false negatives are added
randomly.
4) Systematic errors: clutter: for each state, a state-dependent
clutter effect is generated, which results in more false
positives and negatives when more clutter is apparent.
Fig. 2. Example of increasing statistical errors for the scenario “square is
revolting.” On the horizontal axis, the timesteps are shown. The vertical axis
indicates the observables (each row represents one observable). (a) FPR =
FNR = 0%. (b) FPR = FNR = 10%. (c) FPR = FNR = 20%.
Fig. 3. Example of a nontrivial relation between (a) observables and (b) states. The relation between observables and states is not obvious and is probabilistic in nature. Learning the probabilistic relation between observables and states is the objective of this section and the experiments.
All types of noise will be expressed in terms of false positive
ratio (FPR) and false negative ratio (FNR) in the experiments.
Fig. 2 illustrates the effect of various FPRs and FNRs on the
datasets.
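As an illustration of the statistical-error case (random false positives and negatives), a minimal sketch of how such noise could be injected into a binary observable matrix is given below; the systematic cases (fade in/out, ambiguity, and clutter) follow the same pattern with structured instead of uniform flips. The function name and the use of NumPy are our own; the paper's generator was a purpose-built tool.

```python
import numpy as np

def add_statistical_errors(X, fpr, fnr, rng=None):
    """Flip entries of a binary observable matrix X (timesteps x observables):
    inactive entries become false positives with probability fpr,
    active entries become false negatives with probability fnr."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=int)
    noisy = X.copy()
    flips_fp = (X == 0) & (rng.random(X.shape) < fpr)   # false positives
    flips_fn = (X == 1) & (rng.random(X.shape) < fnr)   # false negatives
    noisy[flips_fp] = 1
    noisy[flips_fn] = 0
    return noisy

# Example: contaminate a generated scenario with FPR = FNR = 15%.
# X_clean, y = expand(scenario_square_is_revolting, all_observables)  # see the sketch in Section II-B
# X_noisy = add_statistical_errors(X_clean, fpr=0.15, fnr=0.15)
```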
III. FROM OBSERVABLES TO SITUATION ASSESSMENT
Recall from the previous section that the observables may be
noisy. In this section, we exploit a CRF to learn the best probabilistic relation between states and observables.
To assess the current state of the situation, the observables are
simply considered altogether. The situation assessment is based
on a “bag of observables.” More specifically, in this paper, we
interpret situation assessment as a mapping problem from the
“bag of observables” to a situation state.
The observables are perceptual concepts. The situation states
are implicit concepts that are not directly observed from the data.
Fig. 3 makes this clear. There is no obvious relation between
the observables and states. The situation state can be interpreted
as a hidden parameter of the observables. This makes the state
prediction nontrivial. In Section V, we evaluate whether the
proposed CRF is a good technique to model the probabilistic
relations between observables and situation states.
A. Conditional Random Field
A CRF [12] is a type of discriminative probabilistic model that is most often used to label or parse sequential data, such as natural language text or biological sequences. CRFs are a more general form of hidden Markov models (HMMs) [13] and relax the Markov assumption over the sequential data that is characteristic of the HMM. This allows a CRF model to use information from time-separated data [12].
We made the observation that situations in real life evolve
in certain sequential orders. The evolution over time of the
scenario states resembles the kind of probabilistic sequences
of natural language. These sequences can be modeled by an undirected graphical model in which each vertex represents a state. The edges between the vertices can be understood as a dependence between states. There are two main ways to model such graphical dependences: the HMM and the CRF.

Fig. 4. Example of a CRF. (a) Weights for the observables-to-current-state potential. (b) Weights of the previous-state-to-current-state potential. (c) Cumulative probabilities. (d) Estimated state and ground truth.
We choose the CRF over the HMM because it has advantages that are important to our problem. These advantages are due to the fact that the CRF is a discriminative model instead of a generative model [14].
A discriminative model models the conditional probability P(Y|X). A generative model is more widely applicable, as it can be used to calculate the joint probability P(X, Y) and all marginals of the joint, for example, P(X) or P(Y). A discriminative model is limited to predicting the state Y given a certain sequence of observations X. When fitting a generative model, one approximates both P(X) and P(Y|X), instead of merely P(Y|X) in the case of the discriminative model. This results in a number of advantages.
1) A discriminative model is independent of P(X). This means that estimating P(X) for rarely occurring states is not a problem, which results in better accuracy for rare scenarios.
2) It can be trained in one setting and tested in a different setting with another P(X), since it is independent of P(X) and of dependences within P(X).
3) A discriminative model has better predictive performance for P(Y), since it is trained on classification with P(Y|X) instead of modeling the joint probability. It does not need additional effort to fit the data with P(X).
Usage of the CRF consists of two phases: the data-fitting phase and the inference phase. In the data-fitting phase, the model parameters are fitted to a training dataset. The best fit is the maximum of the (log) likelihood of these parameters given the training dataset. The likelihood is a multidimensional concave function, and its optimum gives the best fit of the parameters. In effect, two matrices are learned, which are used by the CRF to classify newly observed observables into the next state: one to capture the dependences between observables and states, and the other to capture the dependences from the previous state to the next state. The observable–state matrix also captures the relative importance of each observable for each state.
In the inference phase, the learned model is used to estimate P(Y) for a given X. When estimating P(Y), the forward-backward algorithm [13] is used to refine the estimate: it finds the most probable path in time over the states in P(Y) and smooths the probabilities of P(Y).
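To make the two learned matrices and the inference step concrete, a minimal NumPy sketch of linear-chain CRF inference with forward-backward smoothing is given below. The function and matrix names are ours and the values are illustrative; the paper itself used the Schmidt MATLAB CRF toolkit [15].

```python
import numpy as np

def crf_state_marginals(X, W_obs, W_trans):
    """Forward-backward smoothing for a linear-chain CRF.
    X:       (T, n_obs) binary observable matrix of one sequence.
    W_obs:   (n_states, n_obs) observable-to-state weights.
    W_trans: (n_states, n_states) previous-state-to-current-state weights.
    Returns the (T, n_states) smoothed state probabilities."""
    node = np.exp(X @ W_obs.T)            # per-timestep state potentials
    trans = np.exp(W_trans)               # transition potentials
    T, S = node.shape
    alpha = np.zeros((T, S)); beta = np.ones((T, S))
    alpha[0] = node[0] / node[0].sum()
    for t in range(1, T):                 # forward pass
        a = node[t] * (alpha[t - 1] @ trans)
        alpha[t] = a / a.sum()
    for t in range(T - 2, -1, -1):        # backward pass
        b = trans @ (node[t + 1] * beta[t + 1])
        beta[t] = b / b.sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# The estimated state at each timestep is the one with the highest smoothed
# probability, as in Fig. 4(c) and (d):
# states_hat = crf_state_marginals(X, W_obs, W_trans).argmax(axis=1)
```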
An example of a CRF is illustrated in Fig. 4. This example is used in Section V-B, where the precise states and
observables (in a real-world experiment) are further explained. For this illustration, the naming of states and observables is not yet important; here, we want to discuss the underlying mechanism. We illustrate the weighting and potentials of observables and the estimation of states. The weights of the observables are shown in Fig. 4(a). The weights of the previous state are shown in Fig. 4(b). Together, they influence the estimation of the current state. Positive weights excite the corresponding state, whereas negative weights inhibit that state. This is exemplified by the cumulative probabilities in Fig. 4(c). The state with the highest probability becomes the estimated state [see Fig. 4(d)].
IV. EXPERIMENTAL SETUP
We model the situation state as the hidden state of the CRF,
and the observables are the observations of the hidden state.
A. Conditional Random Field Evaluation Objectives
We evaluate how well the relation between states and observables can be modeled by a CRF. We are interested in the
following properties of the CRF.
1) Discriminative power: How well does the CRF predict the
situation states for unseen variations of the scenarios? This
is an interesting question, as the observables are generic
and, hence, not tailored to the specific scenarios and their
states.
2) Robustness: How sensitive is the CRF with respect to
statistical and systematic errors?
3) Capacity to generalize: How many examples of a given
set of scenarios are required to obtain a robust CRF?
B. Evaluation Tool
To test the discriminative power of the CRF for situations, we
adapted the Schmidt MATLAB CRF implementation [15]. We
implemented a scenario generator to generate the scenarios (see
Section II-B). In this tool, we also added the capability to include
the four noise modes (see Section II-C) to the observables of
the scenarios in order to measure the performance of the CRF
approach against various noise models.
C. Datasets
We test each noise variable with increasing amounts of noise. The datasets are generated from the scenarios in Section II and are subsequently contaminated with increasing levels of noise. The types of noise are statistical (random false positives/negatives) and systematic (fade ins/outs, ambiguity, and clutter). All types of noise are expressed in terms of false positive and negative ratios in the experiments.
D. Cross Validation
For each noise setting, 48 repetitions were generated.
Each repetition contains the entire set of five scenarios (see
Section II-B). For each noise parameter setting, a 12-fold cross
validation will be performed. Each test fold contains four repetitions and is tested against a training set that contains N randomly
selected repetitions out of the 48 repetitions that are generated for the current parameter setting. The averaged results over the folds are the CRF performance results for the selected parameter setting.

Fig. 5. Results for the experiment with 16, 8, 4, 2, and 1 scenario variations to learn from. Results are shown for nine fixed samples of FPR and FNR. For each FPR/FNR sample, we indicate how the CRF classification accuracy drops as a consequence of fewer learning examples. This is shown by each bar, where the accuracy drops from red (almost 100%) to green (just above 50%). This effect is worse for higher FPR/FNR (e.g., the bar in the upper-right corner). A good balance between the number of examples and the performance of the CRF is obtained for eight examples (see the second value from the left of each colored bar).
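A minimal sketch of this cross-validation procedure is given below, assuming 48 generated repetitions indexed 0-47 and a training-set size N; the fold layout follows the description above, while the helper names are ours and the training repetitions are drawn here from outside the test fold.

```python
import random

def twelve_fold_cv(repetitions, train_size, train_and_eval, seed=0):
    """repetitions: list of 48 generated repetitions (each holding all five scenarios).
    For each of the 12 folds: 4 repetitions form the test set, and train_size
    repetitions are drawn at random from the remainder to train the CRF."""
    rng = random.Random(seed)
    scores = []
    for fold in range(12):
        test = repetitions[4 * fold : 4 * (fold + 1)]
        pool = repetitions[: 4 * fold] + repetitions[4 * (fold + 1):]
        train = rng.sample(pool, train_size)
        scores.append(train_and_eval(train, test))   # e.g., mean state accuracy
    return sum(scores) / len(scores)                 # averaged result over the folds
```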
E. Training the Conditional Random Field
The CRF is trained at each test fold on the randomly selected training set. The data fitting, i.e., the optimization of the log-likelihood, is performed by the L-BFGS optimization algorithm [16]. This quasi-Newton algorithm is a standard choice for large multidimensional optimization problems such as fitting a CRF [17]. The optimization ends when the average error on the training set is smaller than 10^-5 or after 500 iterations.
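The paper uses the Schmidt MATLAB CRF toolkit [15] with L-BFGS fitting; as a rough Python analogue (not the authors' implementation), a linear-chain CRF with L-BFGS training could be set up as sketched below. The feature encoding helper and the use of sklearn-crfsuite are our own assumptions.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def to_crf_features(X, observable_names):
    """Encode a (T x n_obs) binary observable matrix as per-timestep feature dicts."""
    return [{name: float(v) for name, v in zip(observable_names, row)} for row in X]

# X_train / y_train: lists of sequences (one per scenario repetition), where each
# sequence is a list of feature dicts / a list of state labels.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",     # quasi-Newton fitting of the log-likelihood, cf. [16]
    max_iterations=500,    # same iteration cap as in the paper
)
# crf.fit(X_train, y_train)
# y_pred = crf.predict(X_test)
```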
F. Evaluation Measures
The results of the CRF are presented as confusion matrices. Following conventions, each confusion matrix contains on the vertical axis the ground-truth situation states and on the horizontal axis the states predicted by the CRF. Hence, a diagonal with high prediction accuracies is aimed for.
We expect that for higher FPRs and FNRs, the predicted states become unreliable, which will cause the confusion matrices to have lower values on the diagonal. Each increasing state index indicates an increasing alarm level. In the case of errors (i.e., wrongly predicted states), we hope that the errors are not distributed among all states but remain confined to neighboring states, such that the perturbing effect on wrongly predicted alarm levels remains limited.
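A minimal sketch of how such a row-normalized confusion matrix over the situation states could be computed; the function is ours and not part of the paper's tooling.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, states):
    """Rows: ground-truth states; columns: states predicted by the CRF.
    Each row is normalized so the diagonal holds the per-state prediction accuracy."""
    idx = {s: i for i, s in enumerate(states)}
    cm = np.zeros((len(states), len(states)))
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```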
Fig. 6. Situation assessment: results. Experiment: only one scenario variation to learn from, for various types of noise (see the captions directly under the figure) and the degree of noise (see the false positive and negative ratios). Confusion matrices for each of the six situation states are displayed for various FPRs and FNRs. Up to an FPR or FNR of 15%, the CRF is surprisingly accurate in assessing situations. (a) Statistical errors. (b) Systematic: fade ins. (c) Systematic: fade outs. (d) Systematic: fade ins and outs. (e) Systematic: ambiguity. (f) Systematic: clutter.
V. RESULTS
First, we focus on experiments on synthetic data where we
can control the sources and amount of noise (see Section V-A).
Second, we show experimental results on real-world data (see
Section V-B).
A. Synthetic Data
The results in this section are used to evaluate the capacity
to generalize (see Section V-A1), the discriminative power (see
Section V-A2), and robustness (see Section V-A2).
1) Effect of the Size of the Training Set: Our first objective is
to find a good balance between the number of learning examples
to train the CRF and the performance of the CRF. If the number
of learning examples is low, then fewer examples have to be
collected, and training takes less time. However, with fewer
learning examples, the performance of the CRF will be lower.
To evaluate this tradeoff in detail, we investigate the results of
the CRF approach for increasing FPR and FNR levels.
Fig. 5 shows the accuracy of the CRF for various sizes of the
training set, which has been decreased from 16 to smaller sets
of 8, 4, 2, and 1 training example(s). The figure shows at each
FPR (horizontal coordinate) and FNR (vertical) a bar of values,
one for each size of the training set. One value represents the
CRF’s accuracy of the situation assessment, where the accuracy
has been averaged equally over the situation states.
With 16 examples and errors up to 30%, the average accuracy
of the CRF is 81%. With fewer errors, up to 15%, the accuracy is
94%. With only one example, the accuracy drops, respectively,
to 50% (was 81%) and 63% (was 94%). A good tradeoff seems to
be eight examples, which achieves only slightly lower accuracy
(i.e., 4% lower than with 16 examples) while using only half
of the examples. Situations can be assessed with a precision of
90% at a false positive and negative rate of 15% using only eight
learning examples.
2) Effect of Statistical and Systematic Errors on the CRF: Our second objective is to find out where the errors of the CRF (wrongly classified states) occur and how they are distributed over the states. To that end, we need a more detailed figure. Rather than displaying a single value that represents the accuracy averaged over the states, we display the confusion matrix for a given size of the training set. As an extreme test case, we display results for a training set of one example only. In addition, the results are split into the various noise sources that we identified earlier, each producing one subfigure [see Fig. 6(a)-(f)]. The captions indicate the error sources: there are four error sources, where the fade in/out source is separated into three related but different cases, which results in six subfigures in total. The line around each of the confusion matrices indicates the average prediction accuracy, which has been weighted equally over all states. Fig. 6 altogether is meant to give detailed insight into how the error sources and the levels of FPR and FNR deteriorate the results.
Up to an FPR or FNR of 15% for any type of noise, the CRF is surprisingly accurate in assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. Interestingly, the CRF is not very sensitive to fading effects or ambiguity (we
introduced correlations) within the observables. On the contrary, the CRF is sensitive to random noise and to clutter (state-specific levels of random noise). The clutter case is the most difficult, as the noise is different for each state. As a consequence, the noise increases and decreases over time. The CRF is least robust to noise levels that vary over time.

Fig. 7. Video descriptors to capture the crowd dynamics in a compact way. This figure shows 12 out of the 104 descriptors.

TABLE I
RESULTS FOR THE CLASSIFICATION OF CROWD DYNAMICS BY A CRF
B. Real-World Data
In this section, we consider an experiment that is based on real data. The experiment is relevant for security and safety applications. We have recorded and annotated 4 h of video data at the second largest train station in The Netherlands. The video data include many and diverse behavioral patterns of the crowd, of which a limited set has been classified by a CRF. For this real-world experiment, we have adopted state-of-the-art video features that capture local structure and temporal dynamics. Given these features, we adopt a quantization scheme to translate the many high-dimensional video features into a fixed number of low-dimensional observables. Based on these spatiotemporal observables, various crowd dynamics are classified by the CRF. The goal is to classify three types of crowd dynamics: "normal" (10-100 people), "abandoned" (0-10 people), and "rush" (many people moving, to go to or leave the tram platforms). In the recorded video, these three event classes are correspondingly labeled by hand.
1) Describing the Crowd: Video Features: To describe the crowd, we aim to capture the temporal dynamics in the crowd (the general movement of, e.g., heads, shoulders, and legs). We tailor existing video features for this task. We start with optical flow [18], which is computed at each pixel of the video images and yields at each pixel a magnitude and an orientation. We summarize these flows in spatial bins to reduce the amount of data. We choose 104 fixed bins that are distributed equally over the image.
Fig. 8. Examples of errors for each type of crowd dynamics, where the CRF misclassified the state of the crowd (shown in parentheses). (a) "Normal" ("Rush"). (b) "Abandoned" ("Normal"). (c) "Rush" ("Normal").
The 104 spatial bins are obtained as follows.
The video is of HD quality; therefore, each image frame is 1080 × 1440 pixels. We aim to sample approximately every 100 pixels. Discounting the border areas, the grid becomes 8 × 13, which yields 104 spatial bins. Next, in each spatial bin, the flow vectors are binned into eight orientations and six magnitudes. This approach is inspired by other work, e.g., the scale-invariant feature transform [19], where discriminative yet compact image features were extracted from images by summarizing many pixel values into a few orientations and magnitudes. For adequate continuous weighting of the optical flows, the contribution of each flow vector to each bin is determined by a kernel-based weight. With our approach, 104 × 8 × 6 values, i.e., 104 descriptors of length 48, are obtained per video frame. This procedure captures the dynamics of the crowd in a compact way. An illustration is depicted in Fig. 7.
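A minimal sketch of this per-frame descriptor computation, under several assumptions of ours: Farneback optical flow (OpenCV) as a stand-in for [18], hard assignment to orientation/magnitude bins instead of the kernel-based weighting described above, and a hypothetical magnitude cap max_mag.

```python
import cv2
import numpy as np

N_ORI, N_MAG = 8, 6            # orientation and magnitude bins (descriptor length 48)
GRID_ROWS, GRID_COLS = 8, 13   # 104 spatial bins over the image

def crowd_descriptors(prev_gray, curr_gray, max_mag=20.0):
    """Per-frame crowd descriptors: one 48-bin orientation/magnitude histogram of
    optical flow for each of the 104 spatial bins."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = mag.shape
    ori_bin = np.minimum((ang / (2 * np.pi) * N_ORI).astype(int), N_ORI - 1)
    mag_bin = np.minimum((mag / max_mag * N_MAG).astype(int), N_MAG - 1)
    rows = np.minimum((np.arange(h) * GRID_ROWS) // h, GRID_ROWS - 1)
    cols = np.minimum((np.arange(w) * GRID_COLS) // w, GRID_COLS - 1)
    cell = (rows[:, None] * GRID_COLS + cols[None, :]).ravel()   # spatial bin per pixel
    flat_bin = (ori_bin * N_MAG + mag_bin).ravel()               # flow bin per pixel
    desc = np.zeros((GRID_ROWS * GRID_COLS, N_ORI * N_MAG))
    np.add.at(desc, (cell, flat_bin), 1)                         # accumulate histograms
    return desc                                                  # shape (104, 48)
```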
2) Transforming the Video Features Into Observables: We experimentally established that the CRF is able to learn the three types of crowd dynamics when the input is bounded by a maximum of 40 observables. Recall that the video features yield 104 × 48 = 4992 values, i.e., much more than the maximum of 40 observables. Hence, a translation is needed that reduces the video features to approximately 40 observables. To do so, we adopt a method of quantization: visual codebooks [20]. In a visual codebook, the descriptors are assigned to a smaller selection of representative descriptors (which are also called primitives). This yields a histogram, of which the length is determined by the number of primitives. We choose eight primitives, which results in a histogram of length 8. Next, we discretize each bin (of the eight-bin histogram) into four levels, to obtain 8 × 4 binary values. Recall that the CRF is especially suited to learning from binary values. In this translation procedure, the 104 multivalued video features are translated into 32 binary values. These 32 binary values are the observables, on which the CRF is trained and tested.
3) Results: Classification of Crowd Dynamics: All examples of the three classes are taken from the 4 h of video. These examples are subsampled at 2 frames/s. This is a choice to reduce the amount of stored data and a limitation posed by processing speed. We argue that 2 frames/s is sufficiently fast for awareness of crowd dynamics, yet the variation between samples is significant. For each frame, the video features are computed and translated into observables. Together with the classification of observables into the next state, this procedure can be done at a sampling rate of 2 frames/s. The trained CRF, with its weights for the observables and state transitions, has been shown in Fig. 4 and was discussed in Section III.
For the training of the CRF, the transitions between states are essential. In our real-world dataset (i.e., the examples of the classes), we found that the number of transitions is limited. This poses a challenge to make distinct yet rich training and test sets. We choose to solve this as follows. We perform a fourfold cross validation. We subsample every Nth encoded video frame to be included in the Nth fold. This way, the folds are disjoint and different enough, as samples are at minimum 2 s apart (given the sampling of 2 frames/s and four folds). That is, the scene is changing rapidly. Going from one video frame to the next, with 2 s in between, gives an enormous variation in the images. People have walked significantly, new people have entered, other people have exited the scene, and people have moved closer to or farther from other people. Following conventions, we train on one fold and test on another. This experiment is repeated four times.
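A one-line sketch of this frame-interleaved fold assignment (the variable names are ours):

```python
# frames: chronologically ordered list of encoded video frames (sampled at 2 frames/s)
folds = [frames[k::4] for k in range(4)]  # every fourth frame goes to the same fold
```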
The results are summarized by the confusion matrix in Table I. Interestingly, the CRF classifies some states more accurately than others. The CRF successfully classifies the crowd dynamics with up to 80% accuracy.
Examples of errors are shown in Fig. 8. An error that occurs regularly is a badly timed state transition of the estimation. Usually, the transition is to the right state but is too early or too late. This is illustrated in Fig. 8(b) and (c). Sometimes there are "true misclassifications"; this is illustrated in Fig. 8(a).
VI. CONCLUSION
We have proposed scenarios that are relevant for antiterrorism and crowd control. Given the scenarios and their variations, which include various kinds of errors, we have assessed situations based on observables with the objective of recognizing threats.
We have chosen the CRF as a method to probabilistically learn the relation between observables and states (as the hidden parameter). A good size for the training set was shown to be eight examples for each of the five scenarios, which achieves only slightly lower accuracy (i.e., 4% lower) than with 16 examples while using only half of the examples. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples.
We have investigated the extreme test case of using a training set of only one example to find out where the CRF becomes inaccurate. Up to an FPR or FNR of 15% for any type of noise (random, fade in/out, ambiguity, and clutter), the CRF is surprisingly accurate in assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. The CRF is not very sensitive to fading effects or ambiguity but is sensitive to random noise and to clutter (i.e., state-specific levels of random noise).
In a real-world experiment at a large train station, we have
classified various types of crowd dynamics. Using simple video
features of shape and motion, we have proposed a scheme to
translate such features into observables that can be classified by
a CRF. The CRF successfully classifies the crowd dynamics with up to 80% accuracy.
Overall, we conclude this paper with the interesting finding
that discriminative power can be achieved by combining multiple, generic, and simple observables.
ACKNOWLEDGMENT
The authors would like to thank Dr. R. den Hollander for providing the camera-based observables and their characteristics. They are also grateful to Dr. K. Schutte for useful discussions on the inference of the situation based on observables.
REFERENCES
[1] M. M. Kokar, "Situation awareness: Issues and challenges," in Proc. Int. Conf. Inf. Fusion, 2004, pp. 533–534.
[2] J. Llinas, C. Bowman, G. Rogova, and A. Steinberg, "Revisiting the JDL data fusion model II," in Proc. Int. Conf. Inf. Fusion, 2004, pp. 1218–1230.
[3] G. J. Burghouts, B. Broek, B. G. Alefs, E. den Breejen, and K. Schutte, "Automated indicators for behavior interpretation," in Proc. Int. Conf. Crime Detect. Prevent., 2007, pp. 1–6.
[4] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1475–1490, Nov. 2004.
[5] A. Datta, "Person-on-person violence detection in video data," in Proc. Int. Conf. Pattern Recognit., 2002, pp. 433–438.
[6] D. M. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[8] W. Zajdel, D. Krijnders, T. Andringa, and D. Gavrila, "CASSANDRA: Audio-video sensor fusion for aggression detection," in Proc. Int. Conf. Adv. Video Signal Based Surveillance, 2007, pp. 200–205.
[9] L. Duan, D. Xu, I. W. Tsang, and J. Luo, "Visual event recognition in videos by learning from web data," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1959–1966.
[10] D. Küttel, M. Breitenstein, L. van Gool, and V. Ferrari, "What's going on? Discovering spatio-temporal dependencies in dynamic scenes," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1951–1958.
[11] T. Xiang and S. Gong, "Video behaviour profiling for anomaly detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 893–908, May 2008.
[12] A. McCallum and C. Sutton, "An introduction to conditional random fields for relational learning," in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. Cambridge, MA: MIT Press, 2006. [Online]. Available: http://www.cs.umass.edu/mccallum/papers/crf-tutorial.pdf
[13] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[14] I. Ulusoy and C. Bishop, "Generative versus discriminative methods for object recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2005, vol. 2, pp. 258–265.
[15] M. Schmidt, "CRF toolkit in MATLAB," 2008. [Online]. Available: http://people.cs.ubc.ca/schmidtm/Software/
[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[17] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proc. HLT-NAACL, 2003, pp. 213–220. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9849
[18] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, 1981.
[19] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., 1999, pp. 1–8.
[20] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. Int. Conf. Comput. Vis., 2005, pp. 604–610.
Gertjan J. Burghouts received the Ph.D. degree
from the University of Amsterdam, Amsterdam, The
Netherlands, in 2007 on the topic of visual recognition of objects and their motion, in realistic scenes
with varying conditions.
He is currently a Lead Research Scientist of Visual Pattern Recognition with The Netherlands Organization for Applied Scientific Research (TNO), The
Hague, The Netherlands. He studied artificial intelligence at the University of Twente during 1997–2002
with a specialization in pattern analysis and human–
machine interaction. Since 2007, he has been the Principal Investigator of automated understanding of human behavior based on sensory perception. He is
the Principal Investigator of a DARPA project named CORTEX (2.3M), about
recognition of events and behaviors. He has written papers on this topic in
internationally renowned journals, e.g., the IEEE TRANSACTIONS ON IMAGE
PROCESSING, the Computer Vision and Image Understanding, the International
Journal of Computer Vision, and International Conference on Crime Detection
and Prevention. His work has been cited more than 200 times since 2005.
Dr. Burghouts received an award from the Netherlands Association of Engineers for the best innovative project in 2007.
Jan-Willem Marck received the M.Sc. degree in artificial intelligence from the University of Groningen,
Groningen, The Netherlands, in 2006, with a specialization in autonomous systems.
He is currently a Research Scientist of Artificial Intelligence with the Department of Distributed Sensor
Systems, The Netherlands Organization for Applied
Scientific Research (TNO), The Hague, The Netherlands. Since 2006, he has been a Research Scientist
on various topics, e.g., sensor information fusion, relevance of information, human–machine interaction
and human behavior classification based on state estimation methods using sensory data. He has (co-)authored a number of papers and managed projects on
these topics.
661
662
663
664
665
666
667
668
669
670
671
672
673
674 Q2
675
676
677
678
QUERIES
Q1: Author: Please verify the affiliation of the authors as typeset.
Q2. Author: Please verify the current location of author “Jan-Willen Marck.”
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS
Reasoning About Threats: From Observables
to Situation Assessment
1
2
Gertjan J. Burghouts and Jan-Willem Marck
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Abstract—We propose a mechanism to assess threats that are
based on observables. Observables are properties of persons, i.e.,
their behavior and interaction with other persons and objects. We
consider observables that can be extracted from sensor signals and
intelligence. In this paper, we discuss situation assessment that is
based on observables for threat assessment. In the experiments, the
assessment is evaluated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within
an evaluation framework, where the setup is such that conclusions
can be drawn concerning: 1) the accuracy and robustness of an
architecture to assess situations with respect to threats; and 2) the
architecture’s dependence of the underlying observables in terms
of their false positive and negative rates. One of the interesting conclusions is that discriminative assessment of threatening situations
can be achieved by combining generic observables. Situations can
be assessed with a precision of 90% at a false positive and negative
rate of 15% using only eight learning examples. In a real-world
experiment at a large train station, we have classified various types
of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into
observables that can be classified by a conditional random field
(CRF). The implemented CRF shows to classify successfully the
crowd dynamics up to 80% accuracy.
27
28
29
Index Terms—Architecture, evaluation framework, information
processing, observables, situation understanding, threat recognition.
I. INTRODUCTION
30
31
32
33
34
35
36
37
38
39
40
41
42
43
Q1
1
ITUATION understanding is relevant to the field of security (i.e., robbery and vandalism), public safety (i.e.,
aggression and riots), and health care (i.e., incidents with
elderly people). Technology is making its entrance in solutions
for these domains. More recently, research institutes and technology providers have focused also on the domain of antiterrorism, aiming for solutions that detect threats in an early stage [1].
To detecting a threat at an early stage is important, as it enables
security professionals to mitigate the situation. In this paper,
we focus on technology to recognize the stages that build up to
potential threats.
Our objective is a system to alert for potential threats. To
that end, the system assesses the current situation in order to
S
Manuscript received December 30, 2009; revised September 20, 2010 and
January 25, 2011; accepted March 19, 2011. This paper was recommended by
Associate Editor J. Tang.
The authors are with The Netherlands Organization for Applied Scientific
Research (TNO) Observation Systems, The Hague 2597 AK, The Netherlands
(e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCC.2011.2135344
Fig. 1. Our framework: outputs of various sensors (JDL level 0) are processed
to the higher abstraction level of observables (JDL level 1). Examples of observables are taken from our earlier work, e.g., trajectory analysis, pose estimation,
color of clothing, carrying an attribute, behavior recognition, and group dynamics [3]. These observables provide information to the situation estimator (JDL
level 2). Specific situations are associated with threats. Situation assessment is
the focus of this paper.
determine whether a threat may occur. Our approach is to assess
the situation from multiple object assessments. These two layers
are adopted from the well-known JDL data fusion model and
its extensions [2]. Object assessments are based on sensors and
intelligence. This is illustrated in Fig. 1.
The output of sensors requires automatic processing up to the
level of meaningful and robust information, which we will refer
to as observables. Intelligence is also taken as an observable.
The observables are object assessments and range from camera
observables (i.e., person carries some object) to geo-observables
(i.e, the person has a short interaction with another person) to
intelligence observables (the suspect is a person with a gray
shirt and black pants). Based on the observables, the situation
is assessed. For instance, the observables that a suspect person
hands over an object to another person who is mixing in the
crowd would result in the situation assessment that somebody
is carrying a suspect object.
In this paper, we discuss the methods to exploit observables
for threat assessment. We motivate our choices for specific pattern recognition methods. In the experiments, the added value
of the implemented method is demonstrated for scenarios that
are relevant to antiterrorism and crowd control. The experiments
are performed within an evaluation framework, where the setup
is such that conclusions can be drawn concerning: 1) the performance and robustness of an architecture to assess threats; and
2) the architecture’s dependence of the underlying observables
in terms of their false positive and negative rates.
1094-6977/$26.00 © 2011 IEEE
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
2
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS
A. Previous Work
Many observables have been introduced, which range from
detecting specific objects such as cars [4] to detecting aggression
[5], from generic features of human walking patterns [6] or
highly specific human acts such as shaking hands and kissing [7].
In this paper, we focus on properties and behavior of humans that
can be detected and recognized in video. The goal of our work
is to assess situations beyond the level of isolated behaviors,
which is based on multiple observables.
Previously, other researchers have also considered situation
assessment that is based on multiple observables, e.g., the combination of visual and audio features to improve the detection
of aggression [8]. With an appropriate combination scheme, the
discriminative power of situation assessment can be improved
by using multiple observables. Situation assessment is a challenge, as the situations under investigation can range from instantaneous (e.g., one person gives an object to another person)
to longer periods (e.g., a pickpocket is waiting the best moment
to steal someone’s belongings). In recent literature, researchers
tend to focus on either instantaneous events [9] or on complex,
long-term person–person or person–object interactions [10]. In
this paper, we propose a generic scheme to model and classify
both types of situations.
In this paper, we consider only situations that are known
a priori. This requires examples from which a system can learn.
Interestingly, recent work has shown that it is possible to detect
abnormal situations at run time [11]. Such methods are not
within the scope of this paper.
Another observation from reviewing recent work is that scientific communities have released descriptions and datasets of
behaviors of crowds and individuals. These datasets are very interesting and include recordings of single observables to experiment with, e.g., the PETS and TRECVID benchmarks. However,
such communities have not yet focused on short- and long-term
sequential scenarios. As a consequence, no scenarios are available for antiterrorism and crowd-control applications.
B. Contributions in this Paper
The contribution in this paper is fourfold.
1) Relevant scenarios for antiterrorism and crowd control.
We discuss how expert knowledge of incidents can be
used. This is crucial, as for the particular incidents related
to terrorism, escalation of crowd behavior, and suspect
people and objects, few or no data will be available.
2) A tool to generate a multitude of variations of the scenarios. The variations include statistical and systematic error
sources.
3) A framework to evaluate both 1) the performance of the
situation assessment as a whole and 2) its dependence of
underlying observables. The experiments are performed
by the usage of a conditional random field (CRF). The
outcomes of the evaluation give insight in 1) whether
more (complementary) observables are required to assess
situations successfully; and 2) whether the robustness of
observables needs to be improved in order to be useful for
situation assessment.
4) Given the scenarios and their variations and the framework, we evaluate a system for situation assessment that
was implemented for this paper. The system is described in
Section III that is based on a CRF that takes observables as
input and estimates situations as output. We demonstrate
in Section IV that discriminative power can be achieved
by combining generic observables.
This paper is organized as follows. In Section II, we discuss the scenarios and the tool to generate their variations.
In Section III, we propose the CRF that learns the best relation between observables and the assessment of situations. In
Section IV, the experimental setup is discussed. In Section V,
the CRF is evaluated for the scenarios and their variations. In
addition, the CRF is evaluated against a real-world dataset. Section VI concludes this paper.
140
II. SCENARIOS
141
In this section, we define scenarios to which our system (see
Section III) is applied and on which the experiments will be
based (see Section IV). A scenario is interpreted as a sequence
of states. The state is a description of the situation at a particular
time. That is, “a group of people is agitated.” Note that an
expert may be consulted to associate each state with a level of
escalation or alarm, to bridge the gap to the decision maker. For
each subsequent period, a number of observables are observed.
Observables are observed object properties. That is, “a person
enters the scene.” A scenario is specified by the sequence of
states, their duration, and the observables that are observed for
each subsequent period.
By the previous definition, the observables may vary from
time to time. Different sets of observables may be observed for
the same state at different periods. There is no direct coupling
between states and observables. In Section III, we select a pattern recognition method to learn the best probabilistic relation
between states and observables. In Section IV, we evaluate how
well this learned relation predicts the states for unseen variations
of the scenarios.
Observables may be chosen such that they relate to specific
persons or groups of persons that are of particular interest. That
is, “a suspect person enters the scene” or even more specific
“John enters the scene.” In this paper, we consider observables
that are generic, e.g., “a person enters the scene,” “a suspect
person is in the scene,” etc. Interestingly, in Section IV, we
demonstrate that discriminative power can be achieved by combining generic observables.
A. Observables
We have selected the following observables: empty square,
one person, two-five persons, many people, people flux +, people flux −, sudden large people flux, tracks toward hotspot,
tracks in hotspot, group formation, group moves, group on collision, individual is avoided by other people, individual’s head
orientation varies, individual carries object, individual has wild
gestures, somebody wears clothes related to a suspect person.
The observables have been adopted from interviews with security professionals and from a training course in The Netherlands
(police academy) on “search, detect, and react” (based on an
Israeli security training). The observables are illustrated in
Fig. 1.
B. Antiterrorism and Crowd-Control Scenarios
Five scenarios for antiterrorism and crowd control are proposed: square is revolting, aggressive political speaker, extreme
demonstration, neutral, fight between a few people.
The situation at a given time is formalized by one of the
following states: neutral, group with agitator, other people are
avoiding the situation, group is coordinating something, group
is agitated, group riot.
The scenarios are defined by the format: {[state1, [observable1,1, observable1,2, . . ., observable1,N], length1], . . ., [stateT, . . .]}. The variable “length” refers to the length of
the period of the current state. The number of observables may
vary per period. These formal specifications will be illustrated
at the end of this section.
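For illustration only, the following minimal Python sketch shows how such a specification could be encoded and unrolled into per-timestep ground truth; the particular states, observables, and lengths are hypothetical placeholders, not one of the five scenarios defined below.

# Sketch (not the authors' tooling): a scenario as (state, observables, length) triples,
# unrolled into one (state, observables) entry per timestep.
scenario = [
    ("neutral", ["many people", "people flux +"], 10),
    ("group with agitator", ["group formation", "individual has wild gestures"], 6),
    ("group riot", ["sudden large people flux", "group on collision"], 8),
]

def unroll(scenario):
    """Expand (state, observables, length) triples into per-timestep ground truth."""
    timeline = []
    for state, observables, length in scenario:
        timeline.extend((state, observables) for _ in range(length))
    return timeline

timeline = unroll(scenario)
print(len(timeline), timeline[0])  # 24 timesteps; first entry: ('neutral', [...])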
First, we present the five scenarios in natural language.
1) Square is revolting: market square, many people moving randomly, two people are loitering, group assembles
just outside the market square, one group member is agitated, the group starts moving, the loitering people join
the group, another group starts to move, one of the group
members carries something, people are avoiding the two
groups, the two groups confront each other, fight, people
fleeing.
2) Aggressive political speaker: somebody starts to speak in
the midst of people, people are moving around, people
start to listen, people are walking away.
3) Extreme demonstration: people are avoiding the group that
is demonstrating, the atmosphere is tense, little riot, the
demonstrating group is moving.
4) Fight between a few people: some people are agitated, one
troublemaker is starting a fight, small fight, people are
avoiding the fight, fight stops.
5) Default: neutral situation, people moving, many people
enter buildings, a small group is walking by, square is
getting more crowded, another group walks by.
C. Scenario Generation Including Statistical and Systematic
Errors
The scenarios are transformed into datasets. The inputs are
the observables (i.e., binary values) and the ground truth is the
state per timestep. Four types of noise can be added to the
observables.
1) Statistical errors: random false positives and negatives.
2) Systematic errors: fade in/out: false positives are added in
front of active observables, or at the end.
3) Systematic errors: ambiguity: false positives are added
for a particular observable, and false negatives are added
randomly.
4) Systematic errors: clutter: for each state, a state-dependent
clutter effect is generated, which results in more false
positives and negatives when more clutter is apparent.
Fig. 2. Example of increasing statistical errors for the scenario “square is
revolting.” On the horizontal axis, the timesteps are shown. The vertical axis
indicates the observables (each row represents one observable). (a) FPR =
FNR = 0%. (b) FPR = FNR = 10%. (c) FPR = FNR = 20%.
Fig. 3. Example of nontrivial relation between (a) observables and (b) states.
The relation between observables and states is not obvious and is probabilistic in
nature. Learning the probabilistic relation between observables and states is the
objective of this section and the experiments.
All types of noise will be expressed in terms of false positive
ratio (FPR) and false negative ratio (FNR) in the experiments.
Fig. 2 illustrates the effect of various FPRs and FNRs on the
datasets.
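As a rough illustration of the statistical error mode only, the sketch below flips binary observables at a given FPR and FNR; the systematic modes (fade in/out, ambiguity, and clutter) would instead choose which entries to flip in a structured way. This is an assumption-laden stand-in, not the generator used in our tool.

import numpy as np

def add_statistical_noise(observables, fpr, fnr, rng):
    """Randomly flip binary observables: 0 -> 1 with probability FPR (false
    positives) and 1 -> 0 with probability FNR (false negatives)."""
    noisy = observables.copy()
    flips = rng.random(observables.shape)
    noisy[(observables == 0) & (flips < fpr)] = 1  # false positives
    noisy[(observables == 1) & (flips < fnr)] = 0  # false negatives
    return noisy

rng = np.random.default_rng(0)
clean = rng.integers(0, 2, size=(60, 17))          # timesteps x observables
noisy = add_statistical_noise(clean, fpr=0.15, fnr=0.15, rng=rng)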
III. FROM OBSERVABLES TO SITUATION ASSESSMENT
Recall from the previous section that the observables may be
noisy. In this section, we exploit a CRF to learn the best probabilistic relation between states and observables.
To assess the current state of the situation, the observables are
simply considered altogether. The situation assessment is based
on a “bag of observables.” More specifically, in this paper, we
interpret situation assessment as a mapping problem from the
“bag of observables” to a situation state.
The observables are perceptual concepts. The situation states
are implicit concepts that are not directly observed from the data.
Fig. 3 makes this clear. There is no obvious relation between
the observables and states. The situation state can be interpreted
as a hidden parameter of the observables. This makes the state
prediction nontrivial. In Section V, we evaluate whether the
proposed CRF is a good technique to model the probabilistic
relations between observables and situation states.
A. Conditional Random Field
A CRF [12] is a type of discriminative probabilistic model that is most often used to label or parse sequential data, such as natural language text or biological sequences. CRFs are a more general form of hidden Markov models (HMMs [13]) and relax the Markov assumption over the sequential data that is characteristic of the HMM. This allows a CRF model to use information from time-separated data [12].
We observe that situations in real life evolve
in certain sequential orders. The evolution over time of the
scenario states resembles the kind of probabilistic sequences
Fig. 4. Example of a CRF. (a) Weights for observables to current state potential. (b) Weights of the previous state to current state potential. (c) Cumulative
probabilities. (d) Estimated state and ground truth.
of natural language. These sequences can be modeled by an
undirected graphical model in which each vertex represents a
state. The edges between the vertices can be understood as a
dependence between states. There are two main ways to model
such graphical dependences: HMM and CRF.
We choose the CRF over the HMM because it has several advantages that are important to our problem. These advantages are due to the fact that the CRF model is a discriminative model instead of a generative model [14].
A discriminative model models the conditional probability P (Y |X). A generative model is more widely applicable, as it can be used to calculate the joint probability P (X, Y ) and all marginals of the joint, for example, P (X) or P (Y ). A discriminative model is limited to predicting the state Y given a certain sequence of observations X. When fitting a generative model, one approximates both P (X) and P (Y |X) instead of merely P (Y |X) in the case of the discriminative model. This results in a number of advantages.
1) A discriminative model is independent of P (X). This
means that estimating P (X) for rarely occurring states
is not a problem. This results in better accuracy for rare
scenarios.
2) It can be trained in one setting and tested in a different setting with another P (X), since it is independent of P (X)
and dependences within P (X).
3) A discriminative model has better predictive performance
of P (Y ), since it is trained on classification of P (Y |X),
instead of modeling the joint probability. It does not spend additional effort on fitting P (X).
Usage of the CRF consists of two phases: the data-fitting
phase and the inference phase. In the data-fitting phase, the
model parameters are fitted to a training dataset. The best fit is
the maximum of the (log) likelihood of these parameters
given the training dataset. The likelihood is a multidimensional
concave function. The optimum of this function gives the best fit
to the parameters. In effect, two matrices are learned, which are
used by the CRF to classify newly observed observables into the
next state: one to capture the dependences between observables
and states, and the other to capture the dependences from the
previous state to the next state. The observable-state matrix also
captures the relative importance of each observable for each
state.
In the inference phase, the learned model is used to estimate
P (Y ) from a given X. When estimating P (Y ), the forward–backward algorithm [13] is used to refine the estimation of P (Y ). The forward–backward algorithm finds the most probable
path in time over the states in P (Y ) [13] and smooths the
probabilities of P (Y ).
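To make the mechanism concrete, the following sketch computes per-timestep potentials from an observable-to-state weight matrix and a previous-to-current state weight matrix, and smooths the state probabilities with a forward–backward pass. The matrices are random placeholders, not the weights learned in our experiments.

import numpy as np

def crf_marginals(X, W_obs, W_trans):
    """Forward-backward smoothing over a linear chain.
    X: (T, n_obs) binary observables; W_obs: (n_states, n_obs) observable-to-state
    weights; W_trans: (n_states, n_states) previous-to-current state weights.
    Returns (T, n_states) smoothed state probabilities."""
    T, n_states = X.shape[0], W_obs.shape[0]
    emit = np.exp(X @ W_obs.T)                 # unnormalized per-timestep potentials
    trans = np.exp(W_trans)
    alpha = np.zeros((T, n_states))
    beta = np.ones((T, n_states))
    alpha[0] = emit[0] / emit[0].sum()
    for t in range(1, T):                      # forward pass
        a = (alpha[t - 1] @ trans) * emit[t]
        alpha[t] = a / a.sum()
    for t in range(T - 2, -1, -1):             # backward pass
        b = trans @ (beta[t + 1] * emit[t + 1])
        beta[t] = b / b.sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 17))          # 20 timesteps, 17 generic observables
W_obs = rng.normal(size=(6, 17))               # six situation states (placeholder weights)
W_trans = 2.0 * np.eye(6)                      # favor staying in the same state
estimated_states = crf_marginals(X, W_obs, W_trans).argmax(axis=1)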
An example of a CRF is illustrated in Fig. 4. This example is used in Section V-B, where the precise states and
observables (in a real-world experiment) are further explained.
For the next illustration, the naming of states and observables is
not yet important; here, we want to discuss the underlying mechanism. We illustrate the weighting and potentials of observables
and the estimation of states. The weights of the observables
are shown in Fig. 4(a). The weights of the
previous state are shown in Fig. 4(b). Together, they influence
the estimation of the current state. Positive weights excite the
corresponding state, whereas negative weights inhibit that state.
This is exemplified by the cumulative probabilities in Fig. 4(c).
The state with highest probability becomes the estimated state
[see Fig. 4(d)].
IV. EXPERIMENTAL SETUP
We model the situation state as the hidden state of the CRF,
and the observables are the observations of the hidden state.
A. Conditional Random Field Evaluation Objectives
We evaluate how well the relation between states and observables can be modeled by a CRF. We are interested in the
following properties of the CRF.
1) Discriminative power: How well does the CRF predict the
situation states for unseen variations of the scenarios? This
is an interesting question, as the observables are generic
and, hence, not tailored to the specific scenarios and their
states.
2) Robustness: How sensitive is the CRF with respect to
statistical and systematic errors?
3) Capacity to generalize: How many examples of a given
set of scenarios are required to obtain a robust CRF?
B. Evaluation Tool
To test the discriminative power of the CRF for situations, we
adapted the Schmidt MATLAB CRF implementation [15]. We
implemented a scenario generator to generate the scenarios (see
Section II-B). In this tool, we also added the capability to include
the four noise modes (see Section II-C) to the observables of
the scenarios in order to measure the performance of the CRF
approach against various noise models.
C. Datasets
We test each noise variable with increasing amounts of noise. The
datasets are generated from the scenarios in Section II and are,
subsequently, contaminated with increasing levels of noise. The
types of noise are statistical (random false positives/negatives)
and systematic (fade ins/outs, ambiguity, and clutter). All types
of noise will be expressed in terms of false positive and negative
ratios in the experiments.
D. Cross Validation
For each noise setting, 48 repetitions were generated.
Each repetition contains the entire set of five scenarios (see
Section II-B). For each noise parameter setting, a 12-fold cross
validation will be performed. Each test fold contains four repetitions and is tested against a training set that contains N randomly
Fig. 5. Results for the experiment with 16, 8, 4, 2, and 1 scenario variations
to learn from. Results are shown for nine fixed samples of FPR and FNR. For
each FPR/FNR sample, we indicate how the CRF classification accuracy drops
as a consequence of fewer learning examples. This is shown by each bar, where
the accuracy drops from red (almost 100%) to green (just above 50%). This
effect is worse for higher FPR/FNR (e.g., the bar in the upper-right corner). A good balance between the number of examples and the performance of the CRF is obtained for eight examples (see the second value from the left of each colored bar).
selected repetitions out of the 48 that are generated for the current parameter setting. The averaged results from the
folds are the CRF performance results for the selected parameter
setting.
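A minimal sketch of this cross-validation loop is given below; train_and_eval is a placeholder for fitting the CRF on the training repetitions and scoring it on the test repetitions.

import numpy as np

def cross_validate(repetitions, n_train, train_and_eval, n_folds=12, seed=0):
    """12-fold cross validation over 48 generated repetitions: each fold holds
    out 4 repetitions for testing and trains on n_train repetitions drawn at
    random from the remaining ones."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(repetitions))
    scores = []
    for fold in range(n_folds):
        test_idx = idx[4 * fold: 4 * (fold + 1)]
        train_idx = rng.choice(np.setdiff1d(idx, test_idx), size=n_train, replace=False)
        score = train_and_eval([repetitions[i] for i in train_idx],
                               [repetitions[i] for i in test_idx])
        scores.append(score)
    return float(np.mean(scores))   # averaged over the folds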
E. Training the Conditional Random Field
The CRF will be trained at each test fold on the randomly
selected training set. The data fitting, i.e., the optimization of the log-likelihood, is performed by the L-BFGS algorithm [16]. This quasi-Newton optimization algorithm is a standard choice for large multidimensional optimization problems such as a CRF [17]. The optimization ends when the average error on the training set is smaller than 10⁻⁵ or after 500
iterations.
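For readers who want to reproduce this step without the MATLAB toolkit, a hedged sketch is given below using an off-the-shelf L-BFGS routine; neg_log_likelihood_and_grad is a placeholder for the CRF objective and its gradient.

import numpy as np
from scipy.optimize import minimize

def fit_crf_weights(neg_log_likelihood_and_grad, n_params, max_iter=500, tol=1e-5):
    """Fit the CRF parameters by minimizing the negative log-likelihood with
    L-BFGS; the callable must return (objective value, gradient)."""
    theta0 = np.zeros(n_params)
    result = minimize(neg_log_likelihood_and_grad, theta0,
                      method="L-BFGS-B", jac=True,
                      options={"maxiter": max_iter, "ftol": tol})
    # the flattened result holds the observable-to-state and state-to-state weights
    return result.x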
F. Evaluation Measures
The results of the CRF are presented as confusion matrices.
Following conventions, each confusion matrix contains on the
vertical axis the ground-truth situation states and on the horizontal axis the states predicted by the CRF. Hence, a diagonal with high prediction accuracies is aimed for.
We expect that for higher FPRs and FNRs, the predicted states
become unreliable, which will cause the confusion matrices
to have lower values on the diagonal. Each increasing state
index indicates an increasing alarm level. In the case of errors
(i.e., wrongly predicted states), we hope that the errors are not
distributed among all states but remain confined to subsequent
states, such that the perturbing effect on wrongly predicted alarm
levels remains limited.
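A small sketch of how such a confusion matrix could be computed is shown below; it mirrors the presentation of the results and is not tied to our evaluation scripts.

import numpy as np

def confusion_matrix(true_states, predicted_states, n_states):
    """Rows: ground-truth states; columns: states predicted by the CRF.
    Rows are normalized, so the diagonal holds the per-state accuracies."""
    cm = np.zeros((n_states, n_states))
    for t, p in zip(true_states, predicted_states):
        cm[t, p] += 1
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

true = [0, 0, 1, 2, 2, 3, 4, 5, 5]        # placeholder alarm-level indices
pred = [0, 1, 1, 2, 2, 3, 4, 4, 5]
print(np.round(confusion_matrix(true, pred, 6).diagonal(), 2))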
Fig. 6. Situation assessment: results. Experiment: only one scenario variation to learn from, for various types of noise (see captions directly under the figure) and
the degree of noise (see false positive and negative ratios). Confusion matrices for each of the six situation states are displayed for various FPRs and FNRs. Up to
FPR or FNR of 15%, the CRF is surprisingly accurate at assessing situations. (a) Statistical errors. (b) Systematic: fade ins. (c) Systematic: fade outs. (d) Systematic:
fade ins and outs. (e) Systematic: ambiguity. (f) Systematic: clutter.
V. RESULTS
First, we focus on experiments on synthetic data where we
can control the sources and amount of noise (see Section V-A).
Second, we show experimental results on real-world data (see
Section V-B).
A. Synthetic Data
The results in this section are used to evaluate the capacity
to generalize (see Section V-A1), the discriminative power (see
Section V-A2), and robustness (see Section V-A2).
1) Effect of the Size of the Training Set: Our first objective is
to find a good balance between the number of learning examples
to train the CRF and the performance of the CRF. If the number
of learning examples is low, then fewer examples have to be
collected, and training takes less time. However, with fewer
learning examples, the performance of the CRF will be lower.
To evaluate this tradeoff in detail, we investigate the results of
the CRF approach for increasing FPR and FNR levels.
Fig. 5 shows the accuracy of the CRF for various sizes of the
training set, which has been decreased from 16 to smaller sets
of 8, 4, 2, and 1 training example(s). The figure shows at each
FPR (horizontal coordinate) and FNR (vertical) a bar of values,
one for each size of the training set. One value represents the
CRF’s accuracy of the situation assessment, where the accuracy
has been averaged equally over the situation states.
With 16 examples and errors up to 30%, the average accuracy
of the CRF is 81%. With fewer errors, up to 15%, the accuracy is
94%. With only one example, the accuracy drops, respectively,
to 50% (was 81%) and 63% (was 94%). A good tradeoff seems to
be eight examples, which achieves only slightly lower accuracy
(i.e., 4% lower than with 16 examples) while using only half
of the examples. Situations can be assessed with a precision of
90% at a false positive and negative rate of 15% using only eight
learning examples.
2) Effect of Statistical and Systematic Errors on the CRF:
Our second objective is to find out where the errors of the CRF
(wrongly classified states) occur and how they are distributed
over the states. To that end, we need a more detailed figure.
Rather than displaying a single value that represents the accuracy
as averaged over the states, we display the confusion matrix for a
given size of the training set. As an extreme test case, we display
results for a training set of one example only. In addition, the
results are split into the various noise sources that we identified
earlier, each of which produces one subfigure [see Fig. 6(a)–(f)]. The
captions indicate the error sources: There are four error sources,
where the fade in/out source is separated into three related but
different cases, which results in six subfigures in total. The line
around each of the confusion matrices indicates the average
prediction accuracy, which has been weighted equally over all
states. Altogether, Fig. 6 is meant to give detailed insight into
how error sources and the levels of FPR and FNR deteriorate
the results.
Up to FPR or FNR of 15% for any type of noise, the CRF
is surprisingly accurate at assessing situations. The accuracy starts
to deteriorate above 15% FPR or FNR or both. Interestingly,
the CRF is not very sensitive to fading effects or ambiguity (we
Fig. 7. Video descriptors to capture the crowd dynamics in a compact way.
This figure shows 12 out of the 104 descriptors.
TABLE I
RESULTS FOR THE CLASSIFICATION OF CROWD DYNAMICS BY A CRF
introduced correlations) within the observables. In contrast,
the CRF is sensitive to random noise and to clutter (state-specific
levels of random noise). The clutter case is most difficult, as the
noise is different for each state. As a consequence, the noise
increases and decreases over time. The CRF is least robust to
noise levels that vary over time.
B. Real-World Data
In this section, we consider an experiment that is based on
real data. The experiment is relevant for security and safety
applications. We have recorded and annotated 4-h video data
at the second largest train station in The Netherlands. The
video data includes many and diverse behavioral patterns of
the crowd, of which a limited set have been classified by a CRF.
For this real-world experiment, we have adopted state-of-the-art
video features that capture local structure and temporal dynamics. Given these features, we adopt a quantization scheme to
translate the many and high-dimensional video features into a
fixed number of low-dimensional observables. Based on these
spatiotemporal observables, various crowd dynamics are classified by the CRF. The goal is to classify three types of crowd
dynamics: “normal” (10–100 people), “abandoned” (0–10 people), “rush” (many people moving, to go to or leave the tram
platforms). In the recorded video, these three event classes are
correspondingly labeled by hand.
1) Describing the Crowd: Video Features: To describe the
crowd, we aim to capture the temporal dynamics in the crowd
(the general movement of, e.g., heads, shoulders, and legs). We
tailor existing video features for this task. We start with optical flow [18], which is computed at each pixel in the video
images and yields at each pixel a magnitude and an orientation.
We summarize these flows in spatial bins to reduce the amount
of data. We choose 104 fixed bins that are distributed equally
Fig. 8. Examples of errors for each type of crowd dynamics, where the CRF misclassified the state of the crowd (between parentheses). (a) “Normal” (“Rush”).
(b) “Abandoned” (“Normal”). (c) “Rush” (“Normal”).
over the image. The 104 spatial bins are obtained as follows.
The video is of HD quality; therefore, each image frame is 1080 × 1440 pixels. We aim to sample approximately every 100 pixels. Discounting the border areas, the grid becomes 8 × 13,
which yields 104 spatial bins. Next, in each spatial bin, the flow
vectors are binned into eight orientations and six magnitudes.
This approach is inspired by other work, e.g., scale-invariant
feature transform [19], where discriminative yet compact image features were extracted from images by summarizing many
pixel values into few orientations and magnitudes. For adequate
continuous weighting of the optical flows, the contribution of
each flow vector to each bin is determined by a kernel-based
weight. With our approach, 104 × 8 × 6 values, i.e., 104 descriptors of
length 48 are obtained per video frame. This procedure captures
the dynamics of the crowd in a compact way. An illustration is
depicted in Fig. 7.
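A rough sketch of this descriptor computation is given below. It assumes dense Farneback optical flow from OpenCV, eight orientation bins, and six magnitude bins (matching the stated descriptor length of 48); the hard binning and the magnitude thresholds are simplified placeholders for the kernel-based weighting described above.

import cv2
import numpy as np

def crowd_descriptors(prev_gray, gray, grid=(8, 13), n_orient=8, n_mag=6):
    """Per-frame crowd descriptors: dense optical flow, summarized per cell of
    an 8 x 13 grid into an orientation x magnitude histogram (8 x 6 = 48 values
    per cell, 104 cells)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = gray.shape
    ys = np.linspace(0, h, grid[0] + 1, dtype=int)
    xs = np.linspace(0, w, grid[1] + 1, dtype=int)
    mag_edges = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # placeholder magnitude bin edges
    descriptors = np.zeros((grid[0] * grid[1], n_orient * n_mag))
    for i in range(grid[0]):
        for j in range(grid[1]):
            m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            a = ang[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            o_bin = np.floor(a / (2 * np.pi) * n_orient).astype(int) % n_orient
            m_bin = np.searchsorted(mag_edges, m)      # 0 .. n_mag-1
            hist = np.zeros((n_orient, n_mag))
            np.add.at(hist, (o_bin, m_bin), 1)         # hard binning instead of kernel weights
            descriptors[i * grid[1] + j] = hist.ravel()
    return descriptors                                  # shape (104, 48)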
2) Transforming the Video Features Into Observables: We
experimentally established that the CRF is able to learn the three
types of crowd dynamics when the input is bounded at a maximum of
40 observables. Recall that the video features yield 104 × 48 =
4992 values, i.e., much higher than the maximum of 40 observables. Hence, a translation is needed that reduces the video
features into approximately 40 observables. To do so, we adopt
a method of quantization: visual codebooks [20]. In a visual
codebook, the descriptors are assigned to a smaller selection
of representative descriptors (which are also called primitives).
This yields a histogram, of which the length is determined by
the number of primitives. We choose eight primitives, which results
in a histogram of length 8. Next, we discretize each bin (of the
eight-bin histogram) into four levels, to obtain 8 × 4 binary
values. Recall that the CRF is especially suited to learn from
binary values. In this translation procedure, the 104 multivalued
video features are translated into 32 binary values. These 32
binary values are the observables, on which the CRF is trained
and tested.
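The sketch below illustrates this quantization under the assumption of a k-means codebook with eight primitives and a simple threshold-based discretization into four levels; the thresholds are placeholders, not the values used in our experiments.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, n_primitives=8, seed=0):
    """Learn 8 representative descriptors (primitives) from training frames."""
    return KMeans(n_clusters=n_primitives, random_state=seed, n_init=10).fit(train_descriptors)

def frame_to_observables(descriptors, codebook, n_levels=4):
    """104 descriptors -> 8-bin codebook histogram -> 8 x 4 = 32 binary observables."""
    assignments = codebook.predict(descriptors)                      # one primitive per descriptor
    hist = np.bincount(assignments, minlength=codebook.n_clusters) / len(descriptors)
    thresholds = np.array([0.05, 0.15, 0.30])                        # placeholder level boundaries
    levels = np.digitize(hist, thresholds)                           # values 0..3
    observables = np.zeros((codebook.n_clusters, n_levels), dtype=int)
    observables[np.arange(codebook.n_clusters), levels] = 1          # one-hot level per bin
    return observables.ravel()                                       # length-32 binary vector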
3) Results: Classification of Crowd Dynamics: All examples
of the three classes are taken from the 4-h videos. These examples are subsampled at 2 frames/s. This choice reduces the amount of stored data and reflects a limitation posed by processing speed. We argue that 2 frames/s is sufficiently fast for awareness of crowd dynamics, yet the variation between samples is
significant. For each frame, the video features are computed and
translated into observables. Together with the classification of
observables into a next state, this procedure can be done at a
sampling rate of 2 frames/s. The trained CRF, with its weights
for the observables and state transitions, has been shown in
Fig. 4 and was discussed in Section III.
For the training of the CRF, the transitions between states
are essential. In our real-world dataset (i.e., the examples of
the classes), we found that the number of transitions is limited.
This poses a challenge to make distinct yet rich training and test
sets. We choose to solve this as follows. We perform a fourfold cross validation. We subsample every Nth encoded video frame to be included in the Nth fold. This way, the folds are disjoint and different enough, as they are at minimum 2 s apart (given the sampling of 2 frames/s and four folds). That is, the scene is
changing rapidly. Going from one video frame to the next, with
2 s in between, gives an enormous variation in the images. People
have walked significantly, new people have entered, other people
have exited the scene, and people have moved closer to or farther apart from other people. Following conventions, we train on one fold
and test on another. This experiment is repeated four times.
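A minimal sketch of this interleaved fold assignment is given below: every Nth encoded frame (at 2 frames/s) goes to the Nth fold, so consecutive samples within one fold are 2 s apart.

def interleaved_folds(frames, n_folds=4):
    """Assign every Nth encoded video frame to the Nth fold."""
    return [frames[i::n_folds] for i in range(n_folds)]

frames = list(range(40))                                  # placeholder frame indices
folds = interleaved_folds(frames)
# train on one fold and test on another, repeated four times
pairs = [(folds[k], folds[(k + 1) % 4]) for k in range(4)]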
The results are summarized by the confusion matrix in
Table I. Interestingly, the CRF classifies some states more accurately than others. The CRF has been shown to classify the crowd dynamics successfully with up to 80% accuracy.
Examples of errors are shown in Fig. 8. An error that occurs
regularly is a badly timed state transition of the estimation.
Usually, the transition is to the right state but is too early or too
late. This is illustrated in Fig. 8(b) and (c). Sometimes there are
“true misclassifications;” this is illustrated in Fig. 8(a).
VI. CONCLUSION
We have proposed scenarios that are relevant for antiterrorism
and crowd control. Given the scenarios, and their variations
including various kinds of errors, we have assessed situations
that are based on observables with the objective to recognize
threats.
We have chosen the CRF as a method to learn the probabilistic relation between observables and states (the hidden
parameter). A good size for the training set was shown to be
eight examples for each of the five scenarios, which achieves
only slightly lower accuracy (i.e., 4%) than with 16 examples
while using only half of the examples. Situations can be assessed
with a precision of 90% at a false positive and negative rate of
15% using only eight learning examples.
We have investigated the extreme test case of using a training
set of only one example to find out where the CRF becomes
inaccurate. Up to FPR or FNR of 15% for any type of noise
(random, fade in/out, ambiguity, and clutter), the CRF is surprisingly accurate at assessing situations. The accuracy starts to
deteriorate above 15% FPR or FNR or both. The CRF is not
very sensitive to fading effects or ambiguity but is sensitive to
random noise and to clutter (i.e., state-specific levels of random
noise).
In a real-world experiment at a large train station, we have
classified various types of crowd dynamics. Using simple video
features of shape and motion, we have proposed a scheme to
translate such features into observables that can be classified by
a CRF. The CRF has been shown to classify the crowd dynamics successfully with up to 80% accuracy.
Overall, we conclude this paper with the interesting finding
that discriminative power can be achieved by combining multiple, generic, and simple observables.
ACKNOWLEDGMENT
The authors would like to thank Dr. R. den Hollander for providing the camera-based observables and their characteristics. They are also grateful to Dr. K. Schutte for useful discussions on the inference of situations based on observables.
REFERENCES
[1] M. M. Kokar, “Situation awareness: Issues and challenges,” in Proc. Int.
Conf. Inf. Fusion, 2004, pp. 533–534.
[2] J. Llinas, C. Bowman, G. Rogova, and A. Steinberg, “Revisiting the JDL
data fusion model II,” in Proc. Int. Conf. Inf. Fusion, 2004, pp. 1218–1230.
[3] G. J. Burghouts, B. Broek, B. G. Alefs, E. den Breejen, and K. Schutte,
“Automated indicators for behavior interpretation,” in Proc. Int. Conf.
Crime Detect. Prevent., 2007, pp. 1–6.
[4] S. Agarwal, A. Awan, and D. Roth, “Learning to detect objects in images
via a sparse, part-based representation,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 26, no. 11, pp. 1475–1490, Nov. 2004.
[5] A. Datta, “Person-on-person violence detection in video data,” in Proc.
Int. Conf. Pattern Recognit., 2002, pp. 433–438.
[6] D. M. Gavrila, “The visual analysis of human movement: A survey,”
Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic
human actions from movies,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern
Recognit., 2008, pp. 1–8.
[8] W. Zajdel, D. Krijnders, T. Andringa, and D. Gavrila, “CASSANDRA:
Audio-video sensor fusion for aggression detection,” in Proc. Int. Conf.
Adv. Video Signal Based Surveillance, 2007, pp. 200–205.
[9] L. Duan, D. Xu, I. W. Tsang, and J. Luo, “Visual event recognition in
videos by learning from web data,” in Proc. IEEE Int. Conf. Comput. Vis.
Pattern Recognit., 2010, pp. 1959–1966.
[10] D. Küttel, M. Breitenstein, L. van Gool, and V. Ferrari, “What’s going on?
discovering spatio-temporal dependencies in dynamic scenes,” in Proc.
IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1951–1958.
[11] T. Xiang and S. Gong, “Video behaviour profiling for anomaly detection,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 893–908, May
2008.
[12] A. McCallum and C. Sutton, “An introduction to conditional random fields for relational learning,” in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. Cambridge, MA: MIT Press, 2006 [Online]. Available: http://www.cs.umass.edu/mccallum/papers/crf-tutorial.pdf
[13] L. Rabiner, “A tutorial on hidden Markov models and selected applications
in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[14] I. Ulusoy and C. Bishop, “Generative versus discriminative methods for
object recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2005, vol. 2, pp. 258–265.
[15] M. Schmidt. (2008). “CRF toolkit in MATLAB,” [Online]. Available:
http://people.cs.ubc.ca/schmidtm/Software/
[16] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large
scale optimization,” Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[17] F. Sha and F. Pereira, “Shallow parsing with conditional random
fields,” in Proc. HLT-NAACL, 2003, pp. 213–220 [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9849
[18] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell.,
vol. 17, nos. 1–3, pp. 185–203, 1981.
[19] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Proc. Int. Conf. Comput. Vis., 1999, pp. 1–8.
[20] F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” in
Proc. Int. Conf. Comput. Vis., 2005, pp. 604–610.
Gertjan J. Burghouts received the Ph.D. degree
from the University of Amsterdam, Amsterdam, The
Netherlands, in 2007, on the topic of visual recognition of objects and their motion in realistic scenes
with varying conditions.
He is currently a Lead Research Scientist of Visual Pattern Recognition with The Netherlands Organization for Applied Scientific Research (TNO), The
Hague, The Netherlands. He studied artificial intelligence at the University of Twente during 1997–2002
with a specialization in pattern analysis and human–
machine interaction. Since 2007, he has been the Principal Investigator of automated understanding of human behavior based on sensory perception. He is
the Principal Investigator of a DARPA project named CORTEX (2.3M), about
recognition of events and behaviors. He has written papers on this topic in
internationally renowned journals, e.g., the IEEE TRANSACTIONS ON IMAGE
PROCESSING, the Computer Vision and Image Understanding, the International
Journal of Computer Vision, and International Conference on Crime Detection
and Prevention. His work has been cited more than 200 times since 2005.
Dr. Burghouts received an award from the Netherlands Association of Engineers for the best innovative project in 2007.
Jan-Willem Marck received the M.Sc. degree in artificial intelligence from the University of Groningen,
Groningen, The Netherlands, in 2006, with a specialization in autonomous systems.
He is currently a Research Scientist of Artificial Intelligence with the Department of Distributed Sensor
Systems, The Netherlands Organization for Applied
Scientific Research (TNO), The Hague, The Netherlands. Since 2006, he has been a Research Scientist
on various topics, e.g., sensor information fusion, relevance of information, human–machine interaction,
and human behavior classification based on state estimation methods using sensory data. He has (co-)authored a number of papers and managed projects on
these topics.