Proceedings of the 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013)

Unusual Event Detection using Sparse Spatio-Temporal Features and Bag of Words Model

Balakrishna Mandadi
EEE Department
Indian Institute of Technology Guwahati
Guwahati, India
[email protected]

Dr. Amit Sethi
EEE Department
Indian Institute of Technology Guwahati
Guwahati, India
[email protected]
Abstract— We present a system for unusual event detection in single fixed-camera surveillance video. Instead of taking a binary or multi-class supervised learning approach, we take a one-class classification approach, assuming that the training dataset contains only usual events. The videos are modeled using a bag of words model for documents, where the words are prototypical sparse spatio-temporal feature descriptors extracted along moving objects in the scene of observation. We learn a probabilistic model of the training data as a corpus of documents, each containing a certain probabilistic mixture of latent topics, using the Latent Dirichlet Allocation framework. In this framework, topics are in turn modeled as probabilistic mixtures of words. Unusual events are video clips that deviate probabilistically by more than a threshold from the distribution of the usual events. Our results indicate the potential to learn usual events from a few examples, reliable flagging of unusual events, and sufficient speed for practical applications.

Keywords—Automated surveillance; unusual/abnormal/rare event detection; latent Dirichlet allocation

I. INTRODUCTION

The ubiquitous presence of surveillance cameras, due to the increased threat to public life and property, calls for automatic methods to detect abnormal activities. Most present-day surveillance systems have no automation to detect illegal or unusual activities, and it is expensive and difficult to appoint a human observer to monitor the activities for each and every system. Hence the need for automatic methods is growing. Unusual activity detection is a part of automated surveillance that acts as a pre-warning, to be further examined by security personnel for higher-level decisions.

The notion of an unusual event is difficult to define because such events are unpredictable and even change with the scene of observation. For example, loitering is a usual activity in public parks but unusual in airports. In general, unusual activities can be defined as those whose occurrence is rare or unexpected in a given scenario. Given a long video sequence, events that occur repeatedly over time are considered usual. With myriad possibilities of events in a video scene, it is quite difficult to label all types of events and classify them using supervised learning. Furthermore, since examples of unusual events will be few, if present at all, the learning would have to be based on highly unbalanced classes. Treating the problem as one-class classification is therefore a better choice.

The problem of detecting unusual events can be approached in different ways. Tracking-based methods [1] are a good choice in constrained environments where only a limited set of activities is possible. These methods use the tracks of moving objects to model activities based on the speed, size, direction and silhouettes of objects along these tracks [2]. By tracking, the activity of one object can be separated from other co-occurring activities, and complex motion patterns can be easily modeled using the extracted trajectories. However, these methods are sensitive to tracking errors: explicit tracking of objects in crowded scenes is very complex and is easily prone to errors due to frequent occlusions.

The other kind of approach [1] directly uses motion feature vectors instead of tracks to describe video clips. In these approaches, simple low-level visual features such as motion histograms are extracted as the feature set. As there is no detection and tracking, a particular activity cannot be separated from other simultaneously co-occurring activities; for example, the motion of a pedestrian cannot be separated from that of a car without tracking. To alleviate this problem, mixture models are used to learn the various co-occurring activities in the scene. We take the same kind of approach, without resorting to explicit tracking.

Zhong and Shi [3] were among the first to use the "hard to describe" but "easy to verify" notion of detecting unusual events from video sequences. By slicing a long video sequence into small segments and representing them with simple motion features, they clustered the video segments using importance feature signal extraction. Unusual video segments are those with small inter-cluster similarity. This approach is well suited to finding interesting events in a large pool of video but may not be able to detect unusual events over a live video stream. Despite its offline nature, the method has become popular due to its simple approach and inspired later methods to use powerful statistical models to make it online. Wang et al. [1] and Varadarajan et al. [4] used natural language processing models such as Latent Dirichlet Allocation (LDA) [5] and Probabilistic Latent Semantic Analysis (pLSA) [6] to model activities by treating every video segment as a document and quantized optical flow features as the vocabulary. Our approach to detecting unusual events resembles that of [1], but we use a different set of features to represent a video and different abnormality measures to detect unusual events. As motion information from optical flow alone may not model the various activities efficiently, we use quantized spatio-temporal corner descriptors, which capture both the shape and the motion information of moving objects, to model the various activities over the scene of observation. To model the activities we also use the LDA model as in [1], but with different measures of abnormality instead of the likelihood measure, which can remain high for an unusual activity co-occurring with a usual activity, which is undesirable. We detect unusual events by measuring the deviation of a new test clip's inferred parameters from those of the model using probabilistic distances such as the Bhattacharyya distance and the Kullback-Leibler (KL) divergence.

The activity model can be supervised or unsupervised, but it is usually very difficult to obtain good training examples of unusual events in crowded environments for supervised models. To address the unavailability of unusual examples, we model the activities seen during training using unsupervised learning. In general, we learn a probability distribution over the training examples, and mark test cases that are improbable under this distribution as unusual. The probability distribution should be flexible and loose enough not to raise false alarms for usual activities in the test examples, while not being so loose that it misses obviously unusual cases.

The rest of the paper is organized as follows. Section II gives the general mathematical formulation of the problem. Section III discusses the video features and representation. Section IV discusses the generative model, LDA. Section V gives experimental results, followed by a conclusion.
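To preview the formulation developed in Section II, the following schematic Python sketch shows the one-class scheme described above: fit a model to feature histograms of usual training clips and flag test clips whose deviation from the model exceeds a threshold. The function names (extract_histogram, fit_model, deviation) are hypothetical placeholders for the components developed in Sections III and IV, not code from the paper.

```python
def detect_unusual(train_clips, test_clips, extract_histogram, fit_model, deviation):
    """Generic one-class unusual event detection skeleton (illustrative only)."""
    # Represent each clip by a bag-of-words histogram (Section III)
    train_feats = [extract_histogram(c) for c in train_clips]

    # Fit a generative model (e.g. LDA, Section IV) to the usual training clips only
    model = fit_model(train_feats)

    # Threshold: maximum deviation observed on the (usual) training clips
    d_th = max(deviation(f, model) for f in train_feats)

    # Flag test clips that deviate more than the threshold
    return [deviation(extract_histogram(c), model) > d_th for c in test_clips]
```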
II. MATHEMATICAL FORMULATION

Let V(x, y, t) represent a small video clip of T frames, which may contain a non-separable set of activities (e.g., a pedestrian walking while a car is moving on the road), where x, y and t are the spatial and temporal coordinates respectively. Let V = {V_1, V_2, V_3, \ldots, V_N} be the training dataset, obtained by slicing a long video sequence into overlapping or non-overlapping equal-length segments. The dataset V can be labeled (the label being usual or unusual) or unlabeled, depending on whether a supervised or an unsupervised approach is used to detect a given video event as unusual. Feature representation of a video clip can be done in many ways (we use a histogram of sparse feature descriptors to represent a video). In general, let λ_{V_i}(l) be the feature representation of video clip V_i, i = 1, \ldots, N, where l is the feature parameter. The original dataset V can then be replaced by the feature dataset

V \equiv \lambda_V = \{\lambda_{V_1}(l), \lambda_{V_2}(l), \ldots, \lambda_{V_N}(l)\}    (1)

To learn the various sets of activities or motion patterns, a probabilistic model is generally fitted over the training data. Let M_v(l; θ) be the generative model that is to be fitted over the training video dataset λ_V, where θ is the parameter vector learned while fitting the model to the dataset. As θ is tuned to the dataset λ_V, M_v(l; θ) captures certain beliefs about the environment V.

Now, given a new test video clip V_s of the same duration and its feature representation λ_{V_s}(l), its deviation D_s from the model is calculated using an appropriate probabilistic distance (PD) measure:

D_s = PD\big(\lambda_{V_s}(l),\, M_v(l; \theta)\big)    (2)

The new test clip V_s is flagged as unusual if D_s > D_th, a threshold predetermined from the training sequence.

III. VIDEO REPRESENTATION

With the success of sparse spatio-temporal (ST) corner descriptors for action recognition, we use the same set of features to represent a video clip. Different algorithms are available for detecting spatio-temporal interest points; the most popular among them are Harris3D [7] and the periodic detector [8]. Harris3D, developed by Laptev and Lindeberg, is the extension of the popular Harris corner detection algorithm [9] to videos. The periodic detector proposed by Dollar et al. [8] uses linear filters along the temporal direction to detect interest points. Because of its robustness and simplicity in our experiments, we use the periodic detector of Dollar et al. [8] to represent a video clip. (The source code for the periodic detector was downloaded from the authors' homepage.)

Figure 1: (a) One of the frames from the 5 second test video clips. (b) The 3D interest points detected by the periodic detector at different time instances, projected onto the frame shown in (a).

For any feature detection algorithm, a response function is usually evaluated at each location. The response function for the periodic detector uses separable 1D temporal Gabor filters and is of the form

R(x, y, t) = (J * g(\sigma) * h_{ev})^2 + (J * g(\sigma) * h_{od})^2    (3)

where R is the response function and J is the image intensity at location (x, y, t).
Here,

g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}

is the spatial smoothing kernel, and h_{ev} and h_{od} are a quadrature pair of temporal Gabor filters of the form

h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2} \quad \text{and} \quad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}

with ω = 4/τ. The Gabor filters are band-pass in nature.

The response function is evaluated at every possible location along the three dimensions, and the locations that fire a high response are the spatio-temporal corners. Figure 1 shows the ST corners or interest points detected in a small video clip, projected onto a single frame; the ST features correspond to corner points of moving objects in the scene. To efficiently represent a video clip using the obtained ST corners, cuboids of a certain size are extracted around these interest points [8]. We empirically determined 9x9x17 to be a good cuboid size for the tested scenes; in general, it will depend on the size of the interesting objects and their pace. To obtain the shape and motion information of moving objects, the extracted cuboids are represented using a feature vector of 3D gradients.

Vocabulary: To build the vocabulary for the bag of words model, the obtained descriptors are quantized using k-means clustering. The cluster centers are the prototype words, which act as the vocabulary. The number of clusters k depends on the different types of descriptors in the scene of observation and can be determined empirically. To allow for new unseen descriptors, which may be part of unusual events during testing, a counter prototype is defined for each prototype feature: descriptors that are nearest to a prototype center but farther away than the farthest descriptor of that cluster belong to the counter prototype. The location of moving objects is also important for unusual activity detection. To capture location information, the frame is divided into H horizontal and V vertical patches, and the count of detected corner points within each patch is appended to the vocabulary histogram. The total number of possible words in the vocabulary is therefore 2k + HV. Our feature representation λ_V of a video clip is the above histogram of prototype descriptors present in the clip. By treating every video clip as a document and the prototype feature descriptors as the vocabulary, we model the activities using the LDA model explained below.

IV. LATENT DIRICHLET ALLOCATION

LDA [5] is a probabilistic generative model successfully used in document processing to extract the semantic meaning of documents. A document is modeled as an unordered collection of words drawn from a defined vocabulary. Only the counts of individual words matter in a bag of words model, while their order is neglected; the histogram of words captures their co-occurrence to model different topics. In the case of videos, the documents are video clips and the vocabulary is the set of prototype descriptors defined in the previous section. LDA assumes that a document (video clip) is generated from a random mixture of K latent topics (atomic activities). Topics are sets of co-occurring words that are captured from the corpus of documents during training. LDA is a fully generative model which does not suffer from the overfitting problem of its predecessor pLSA [10, 11].

The LDA model can be represented graphically as shown in Figure 2. There are three levels to the representation. The parameters α and β are corpus-level parameters, assumed to be sampled only once in the process of generating a corpus. The variables θ_d are document-level parameters, sampled once per document. Finally, the variables z_dn and w_dn are word-level variables, sampled once for every word in each document.

Figure 2: Graphical representation of the LDA model as in [5] (the shaded node represents the observed variable).

As LDA is a widely accepted model, we directly take some of the equations and notation required for our problem; for a complete explanation, refer to [5]. The multinomial parameter of topics over a document, θ, is considered random with a known Dirichlet distribution given by

p(\theta \mid \alpha) = \frac{\Gamma\big(\sum_{i=1}^{k} \alpha_i\big)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

where Γ(x) is the Gamma function, \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt, defined for x not equal to zero or a negative integer.

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topic labels z, and a set of N words w is given by

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)    (4)

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document (video clip):

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta    (5)

Taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus D:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)\, d\theta_d    (6)
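To make the generative story behind (4)–(6) concrete, the following minimal Python sketch samples one synthetic "video document" from an LDA model. The vocabulary size, number of topics, and Dirichlet parameters below are illustrative placeholders, not values estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 20      # number of latent topics (atomic activities), illustrative
V = 1048    # vocabulary size, e.g. 2k + HV prototype words
N = 300     # number of words (quantized descriptors) in one clip

alpha = np.full(K, 0.5)                    # corpus-level Dirichlet prior over topics
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions (K x V)

# Generative process of one document (video clip), following Eq. (4):
theta = rng.dirichlet(alpha)               # topic mixture theta ~ Dir(alpha)
z = rng.choice(K, size=N, p=theta)         # topic label z_n ~ Multinomial(theta)
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])  # word w_n ~ Multinomial(beta_{z_n})

# Bag-of-words histogram actually used as the clip's feature representation
hist = np.bincount(w, minlength=V)
```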
The LDA model is described above; its parameters α and β must be learned from a training corpus (a set of related documents). The inference problem is to compute the posterior distribution of the hidden variables given a document:

p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}    (7)

The denominator of (7), given by (5), is intractable due to the coupling between θ and β. As the posterior distribution is intractable for exact inference, the authors of [5] proposed a variational Bayes approximation for inference by relaxing the link between the coupled parameters. The general idea is to approximate the complex posterior with a simple free variational distribution by minimizing the difference between them. The free variational posterior is given by

q(\theta, z \mid \gamma, \varphi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \varphi_n)    (8)

where γ and (φ_1, φ_2, \ldots, φ_N) are the free variational parameters, estimated under the constraint of minimizing the difference between the true posterior (7) and the variational posterior (8). A tight lower bound on the log likelihood translates directly into the following optimization problem [5]:

(\gamma^*, \varphi^*) = \arg\min_{(\gamma, \varphi)} D\big(q(\theta, z \mid \gamma, \varphi)\, \|\, p(\theta, z \mid w, \alpha, \beta)\big)    (9)

where D(· || ·) is the Kullback–Leibler divergence. The optimum values of the free variational parameters can be found by equating the first derivatives to zero:

\varphi_{ni} \propto \beta_{i w_n} \exp\{E_q[\log(\theta_i \mid \gamma)]\}    (10)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \varphi_{ni}    (11)

E_q[\log(\theta_i \mid \gamma)] = \psi(\gamma_i) - \psi\big(\sum_{j=1}^{k} \gamma_j\big)    (12)

where ψ is the first derivative of the log Γ function. Equations (10) and (11) are coupled and are solved recursively using an expectation maximization (EM) procedure. The model parameters α and β are learned using the variational EM algorithm as explained in [5]. Our aim is to infer the topic mixture weights γ_ts present in a new test document (video clip) by iterating (10)–(12). This variational mixture parameter is compared to the model parameter α to determine the deviation between the two.

Measures of abnormality: As both γ_ts and α are parameters of Dirichlet distributions, statistical distance measures such as the Kullback-Leibler (KL) divergence and the Bhattacharyya distance can be used to measure the deviation between the two. The KL divergence between two Dirichlet distributions [12] with parameters α_a and α_b is given by

D_{KL}(\alpha_a, \alpha_b) = \int_X p_a(X; \alpha_a) \ln \frac{p_a(X; \alpha_a)}{p_b(X; \alpha_b)}\, dX    (13)

= \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \alpha_{ai}\big) - \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \alpha_{bi}\big) + \sum_{i=1}^{k} \big[\ln \Gamma(\alpha_{bi}) - \ln \Gamma(\alpha_{ai})\big] + \sum_{i=1}^{k} (\alpha_{ai} - \alpha_{bi}) \big[\psi(\alpha_{ai}) - \psi\big(\textstyle\sum_{j=1}^{k} \alpha_{aj}\big)\big]    (14)

The Bhattacharyya distance between two Dirichlet distributions [13] with parameters α_a and α_b is given by

D_{BC}(\alpha_a, \alpha_b) = -\ln \Big\{ \int_X \sqrt{p_a(X; \alpha_a)\, p_b(X; \alpha_b)}\, dX \Big\}    (15)

= \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \tfrac{\alpha_{ai} + \alpha_{bi}}{2}\big) - \sum_{i=1}^{k} \ln \Gamma\big(\tfrac{\alpha_{ai} + \alpha_{bi}}{2}\big) + \tfrac{1}{2} \sum_{i=1}^{k} \big[\ln \Gamma(\alpha_{ai}) + \ln \Gamma(\alpha_{bi})\big] - \tfrac{1}{2} \big[\ln \Gamma\big(\textstyle\sum_{i=1}^{k} \alpha_{ai}\big) + \ln \Gamma\big(\textstyle\sum_{i=1}^{k} \alpha_{bi}\big)\big]    (16)

V. EXPERIMENTAL RESULTS

As there is no proper benchmark available as of now, we collected our own videos from our campus. We provide results using both optical flow features and spatio-temporal features. The video dataset consists of a 45 minute video, of which 35 minutes are used for training and contain the general activities that are seen daily. To test for unusual events, an acted video was captured which contains unusual activities such as driving in the wrong direction and entering the restricted lawn area. A small part of the normal video and the acted video are kept aside for testing purposes. The video sequence is sliced into small non-overlapping segments of 5 seconds duration, which are treated as documents. After slicing into short video segments, a total of 370 video segments for training and 158 video segments for testing are obtained. Of the 158 test video clips, the first 58 contain daily-seen usual activities and the remaining segments contain unusual activities. Each video segment is represented as a histogram of quantized optical flow features [1] and as a histogram of quantized ST corner descriptors as explained in Section III. The results are discussed below.

Optical flow as vocabulary: The optical flow [14] vocabulary for the bag of words model is created as explained in [1], with added bins for magnitude. The frame of size 240x320 is divided into 24x32 spatial patches, each patch containing 10x10 pixels. By dividing the magnitude and direction of the optical flow over a patch into 4 bins each, a 16-bin histogram per patch is computed by counting the number of pixels belonging to each bin. All the 16-bin per-patch histograms are concatenated to represent a video segment (document). The total vocabulary size is equal to the length of the concatenated histogram, i.e. 24x32x16 = 12,288. The number of topics K is empirically determined to be 20. The optical flow histograms of the training video segments are used to learn the parameters of the model. Given a new video clip, the topic distribution present in the clip is inferred from the model by iterating (10), (11) and (12). The deviation of the new test clip's topic distribution from the model parameter is measured using the KL divergence given by (14).
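The abnormality scores in (13)–(16) reduce to simple functions of the Dirichlet parameters. The sketch below is a minimal Python/SciPy illustration of the closed forms (14) and (16), assuming the test clip's inferred parameter gamma_ts and the corpus parameter alpha are available as positive vectors of length K; the variable names and example values are ours, not the paper's.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """KL divergence between Dir(a) and Dir(b), Eq. (14)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            + np.sum(gammaln(b) - gammaln(a))
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

def dirichlet_bhattacharyya(a, b):
    """Bhattacharyya distance between Dir(a) and Dir(b), Eq. (16)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = (a + b) / 2.0
    return (gammaln(m.sum()) - np.sum(gammaln(m))
            + 0.5 * np.sum(gammaln(a) + gammaln(b))
            - 0.5 * (gammaln(a.sum()) + gammaln(b.sum())))

# Example: deviation of a hypothetical inferred topic mixture from the learned prior
alpha = np.full(20, 0.8)                       # hypothetical corpus-level parameter (K = 20)
gamma_ts = alpha + 5.0 * np.random.rand(20)    # hypothetical inferred parameter for a test clip
D_s = dirichlet_kl(gamma_ts, alpha)
```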
As is clear from the KL divergence curve in Figure 3, the differentiation between unusual clips (59-158) and usual clips (1-58) is not very distinguishable, because the noisy optical flow motion features represent the video insufficiently. This problem is alleviated by representing a video using sparse ST corner descriptors, as shown in Figure 4. Figure 3 also shows some of the frames from the unusual video segments corresponding to high deviation from the learned model.

Figure 3: (a) KL divergence abnormality measure for the test video clips using optical flow as vocabulary. (b) Frames corresponding to high KL divergence. (1, 4, 5, 6) People moving in the restricted lawn area. (2, 3) Bicycles moving in the wrong direction. (Bounding boxes are hand labeled to visualize the unusual activities in the scene.)

ST features as vocabulary: The vocabulary for the ST corner descriptors is built as explained in Section III. The vocabulary is generated by quantizing the ST descriptors using k-means clustering with the empirically chosen k = 500. For location information, the frame is divided into 6x8 patches and the ST features within each patch are counted. Hence the total vocabulary size is 2k + HV = 1,048, which is much smaller than that of the optical flow representation. The number of topics K in the mixture model is manually selected as 20 (increasing the value of K results in repetitive atomic activities or mixtures of other already learned activities, which are redundant). The topic distribution of a given test clip is inferred as explained in Section IV. Instead of using a likelihood measure of abnormality, which may remain high for an unusual activity co-occurring with a usual activity, we use the KL divergence and Bhattacharyya distance as measures of abnormality, given by (14) and (16). The differentiation between unusual video clips and usual clips (1-58) is clearly evident in Figure 4. The Bhattacharyya distance and KL divergence curves are almost similar except for their numerical range. The threshold is calculated as the maximum distance of the training clips from the model parameter.

Figure 4: (a) KL divergence curves and (b) Bhattacharyya distance for the test data sets using the ST feature representation. (c) Some of the frames corresponding to unusual video (high divergence): (1, 3, 5) motion in the restricted lawn area; (2, 6) driving in the wrong direction; (4) unusual hand movement of the observer, recorded while capturing the videos. (Bounding boxes are hand labeled to visualize the unusual activity.)

Some of the frames of unusual video segments corresponding to temporally local maxima of abnormality are shown in Figure 4. Most of the detected unusual activities involve driving in the wrong direction and entering restricted lawn areas, which are not seen during training.

Due to the limited training video used for the experiments, a few patterns that are expected to be usual are detected as unusual; these are labeled as false alarms (FA) in Figure 4. The false alarms correspond to a car reversing, and hence moving in the wrong direction, and taking a U-turn, events that are not widely seen during training; they are shown in Figure 5. The present implementation has a latency of 5 seconds, i.e. there is a delay of 5 seconds to compute and determine an unusual event over a live video stream. The latency can easily be reduced by taking overlapping clips.
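The ST-feature document representation used above (k = 500 prototypes plus counter prototypes and a 6x8 spatial grid, giving 2k + HV = 1,048 words) can be sketched as follows. This is a minimal illustration under our assumptions (e.g. the counter-prototype test uses each cluster's maximum training distance), with hypothetical variable names rather than the authors' code.

```python
import numpy as np

def build_clip_histogram(descriptors, points_xy, centers, max_dist,
                         frame_hw=(240, 320), grid_hv=(6, 8)):
    """Bag-of-words histogram for one clip: k prototypes, k counter prototypes, HV location bins."""
    k = centers.shape[0]
    H, W = frame_hw
    gh, gw = grid_hv
    hist = np.zeros(2 * k + gh * gw)

    for d, (x, y) in zip(descriptors, points_xy):
        dists = np.linalg.norm(centers - d, axis=1)
        i = int(np.argmin(dists))
        # Descriptors farther than the farthest training member of the nearest
        # cluster fall into that cluster's counter prototype.
        hist[i if dists[i] <= max_dist[i] else k + i] += 1
        # Location bins: count interest points per spatial patch
        r = min(int(y * gh / H), gh - 1)
        c = min(int(x * gw / W), gw - 1)
        hist[2 * k + r * gw + c] += 1

    return hist
```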
To allow for spatial localization, we also tried dividing the frame into patches and learning a separate LDA model for each patch, but the results were not satisfactory because a patch cannot accommodate a whole activity, so everything appears new.

Figure 5: (a) Frames corresponding to normal video events but detected as unusual (false alarms). (b) Frames corresponding to unusual events but detected as usual (missed detections).

VI. CONCLUSION

In one-class learning of activities over the scene of observation, video representation and generative model selection are the primary considerations. Previous methods used optical flow features to represent a video and language models to describe the scene. Motion features from optical flow alone may not be sufficient to model the scene everywhere, especially in outdoor scenes where optical flow gives erroneous results. To alleviate this problem, we used a robust representation of video clips based on sparse interest points, which have been successful in action recognition. To learn the various co-occurring activities, a popular bag of words generative model, LDA, is used. Experimental results show that ST features performed well compared to optical flow features for detecting unusual events in outdoor scenarios.

The future scope for this approach lies in the areas of video representation and online learning of the model parameters. Recently, deep learning techniques [15] have become popular for learning informative features from video without hand-designed features. To allow continuous updating of the model to capture a changing environment, a recently proposed online version of the LDA model [16] can be readily used here.

REFERENCES

[1] X. Wang, X. Ma, and W. E. L. Grimson, "Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 539–555, 2009.
[2] C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 747–757, 2000.
[3] H. Zhong, J. Shi, and M. Visontai, "Detecting Unusual Activity in Video," in CVPR (2), 2004, pp. 819–826.
[4] J. Varadarajan and J.-M. Odobez, "Topic Models for Scene Analysis and Abnormality Detection," in 9th International Workshop on Visual Surveillance, 2009.
[5] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[6] T. Hofmann, "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization," 2000.
[7] I. Laptev and T. Lindeberg, "Space-time Interest Points," in ICCV, 2003, pp. 432–439.
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," in VS-PETS, 2005.
[9] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. of Fourth Alvey Vision Conference, 1988, pp. 147–151.
[10] M. Steyvers and T. Griffiths, "Probabilistic topic models," Handbook of Latent Semantic Analysis, vol. 427, pp. 424–440, 2007.
[11] H. Shan, A. Banerjee, and N. C. Oza, "Discriminative Mixed-Membership Models," in Proceedings of the Ninth IEEE International Conference on Data Mining, 2009, pp. 466–475.
[12] W. D. Penny, "Kullback-Liebler Divergences of Normal, Gamma, Dirichlet and Wishart Densities," Wellcome Department of Cognitive Neurology, 2001.
[13] T. W. Rauber, A. Conci, T. Braun, and K. Berns, "Bhattacharyya probabilistic distance of the Dirichlet density and its application to Split-and-Merge image segmentation," in 15th International Conference on Systems, Signals and Image Processing, 2008, pp. 145–148.
[14] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," 1981, pp. 674–679.
[15] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3361–3368.
[16] M. D. Hoffman, D. M. Blei, and F. Bach, "Online learning for latent Dirichlet allocation," in NIPS, 2010.