Lecture 18: Gaussian Mixture Models and Expectation Maximization

March, 2016
Introduction to Data Science: Lecture 6
Dr. Lev Faivishevsky
Agenda
• Clustering
– Hierarchical
– K-means
– GMM
• Anomaly Detection
• Change Detection
Clustering
• Partition unlabeled examples into disjoint
subsets (clusters), such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
• Discover new categories in an unsupervised
manner (no sample category labels
provided).
Clustering Example
[Figure: scatter plot of unlabeled example points falling into a few natural groups]
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of
unlabeled examples.
[Dendrogram example: animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering.
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start
with each example in its own cluster and
iteratively combine them to form larger and
larger clusters.
• Divisive (partitional, top-down) methods separate all
examples immediately into clusters.
Direct Clustering Method
• Direct clustering methods require a
specification of the number of clusters, k,
desired.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined
automatically by explicitly generating
clusterings for multiple values of k and
choosing the best result according to a
clustering evaluation function.
Hierarchical Agglomerative Clustering
(HAC)
• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two
clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj
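As a concrete illustration of the HAC procedure above, here is a minimal Python sketch using SciPy's hierarchical clustering. The toy data, the complete-linkage choice, and the cut into two flat clusters are assumptions for illustration only.

```python
# Minimal sketch of hierarchical agglomerative clustering (HAC) with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),      # toy data: two loose groups
               rng.normal(5, 1, (20, 2))])

# Repeatedly merge the two most similar clusters until one cluster remains;
# the merge history is returned as an (n-1) x 4 linkage matrix (a binary tree).
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree to recover a flat clustering, e.g. into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```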
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute similarity of two clusters each
possibly containing multiple instances?
– Single Link: Similarity of two most similar members.
– Complete Link: Similarity of two least similar members.
– Group Average: Average similarity between members.
Single Link Agglomerative Clustering
• Use maximum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Can result in “straggly” (long and thin) clusters
due to chaining effect.
– Appropriate in some domains, such as clustering
islands.
Single Link Example
Complete Link Agglomerative
Clustering
• Use minimum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Makes more “tight,” spherical clusters that are
typically preferable.
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to
compute the similarity of all pairs of n individual
instances, which is O(n²).
• In each of the subsequent n−2 merging
iterations, it must compute the distance
between the most recently created cluster
and all other existing clusters.
• In order to maintain an overall O(n²)
performance, computing similarity to each
other cluster must be done in constant time.
Computing Cluster Similarity
• After merging ci and cj, the similarity of the
resulting cluster to any other cluster, ck, can be
computed by:
– Single Link:
$\mathrm{sim}((c_i \cup c_j), c_k) = \max\big(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\big)$
– Complete Link:
$\mathrm{sim}((c_i \cup c_j), c_k) = \min\big(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\big)$
Group Average Agglomerative
Clustering
• Use average similarity across all pairs within the
merged cluster to measure the similarity of two
clusters.
• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged
cluster instead of unordered pairs between the two
clusters to encourage tight clusters.
Computing Group Average Similarity
• Assume cosine similarity and normalized
vectors with unit length.
• Always maintain sum of vectors in each
cluster.
• Compute similarity of clusters in constant
time:
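The constant-time formula itself did not survive extraction. As a sketch of the standard group-average update (assuming unit-length vectors and a maintained vector sum $s(c) = \sum_{x \in c} x$ per cluster):

$$\mathrm{sim}(c_i, c_j) \;=\; \frac{\big(s(c_i)+s(c_j)\big)\cdot\big(s(c_i)+s(c_j)\big) \;-\; \big(|c_i|+|c_j|\big)}{\big(|c_i|+|c_j|\big)\,\big(|c_i|+|c_j|-1\big)}$$

The subtracted term removes the self-similarities (each equal to 1 for unit vectors), leaving the average over ordered pairs of distinct members.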
Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Randomly choose k instances as seeds, one per
cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different
clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed
number of iterations.
Distances: Ordinal and Categorical Variables
• Ordinal variables can be forced to lie within (0, 1) and then a
quantitative metric can be applied.
• For categorical variables, distances must be specified by the user
between each pair of categories.
• Often a weighted sum is used:
$D(x_i, x_j) = \sum_{l=1}^{p} w_l\, d(x_{il}, x_{jl}), \qquad \sum_{l=1}^{p} w_l = 1,\ w_l \ge 0.$
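A small Python sketch of this weighted-sum distance; the example features, the per-feature distance functions, and the weights are hypothetical choices for illustration.

```python
# Hypothetical sketch of a weighted per-feature distance for mixed data types.
def weighted_distance(x_i, x_j, dists, weights):
    """D(x_i, x_j) = sum_l w_l * d_l(x_il, x_jl), with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    return sum(w * d(a, b) for w, d, a, b in zip(weights, dists, x_i, x_j))

# Example: one ordinal feature rescaled to (0, 1) and one categorical feature.
ordinal_d = lambda a, b: abs(a - b)                   # quantitative metric on (0, 1)
categorical_d = lambda a, b: 0.0 if a == b else 1.0   # user-specified category distance
print(weighted_distance((0.25, "red"), (0.75, "blue"),
                        dists=[ordinal_d, categorical_d],
                        weights=[0.5, 0.5]))          # -> 0.75
```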
K-means Overview
• An unsupervised clustering algorithm
• “K” stands for the number of clusters; it is typically a user input to the algorithm, though some criteria can be used to automatically estimate K
• It is an approximation to an NP-hard combinatorial optimization problem
• The K-means algorithm is iterative in nature
• It converges; however, only a local minimum is obtained
• Works only for numerical data
• Easy to implement
K-means: Setup
• x1,…, xN are data points or vectors of observations
• Each observation (vector xi) will be assigned to one and only one cluster
• C(i) denotes cluster number for the ith observation
• Dissimilarity measure: Euclidean distance metric
• K-means minimizes within-cluster point scatter:
$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} \lVert x_i - x_j \rVert^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2$
where mk is the mean vector of the kth cluster and Nk is the number of observations in the kth cluster.
K-means Algorithm
• For a given cluster assignment C of the data points, compute
the cluster means mk:
$m_k = \dfrac{\sum_{i:\,C(i)=k} x_i}{N_k}, \qquad k = 1, \ldots, K.$
• For a current set of cluster means, assign each observation
as:
$C(i) = \underset{1 \le k \le K}{\arg\min}\ \lVert x_i - m_k \rVert^2, \qquad i = 1, \ldots, N$
• Iterate above two steps until convergence
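A minimal NumPy sketch of these two alternating steps; the random initialization, the convergence test, and the toy data are assumptions for illustration, not part of the lecture.

```python
# Minimal NumPy sketch of the K-means assignment/update iteration described above.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]       # seed means with K data points
    for _ in range(n_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        C = d2.argmin(axis=1)
        # Update step: m_k = mean of points currently assigned to cluster k
        new_m = np.array([X[C == k].mean(axis=0) if np.any(C == k) else m[k]
                          for k in range(K)])
        if np.allclose(new_m, m):                           # converged (local minimum)
            break
        m = new_m
    return C, m

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels, means = kmeans(X, K=2)
```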
K-means clustering example
K-means: Example 2, Steps 1–5
Distance Metric: Euclidean Distance
[Figure sequence (Steps 1–5): data points plotted as expression in condition 1 vs. expression in condition 2; the centroids k1, k2, k3 are repositioned and point assignments are updated at each step until the clustering stabilizes]
K-means: summary
• Algorithmically, very simple to implement
• K-means converges, but it finds a local minimum of the cost
function
• Works only for numerical observations
• K is a user input
• Outliers can cause considerable trouble for K-means
The Problem
• You have data that you believe is drawn from
n populations
• You want to identify parameters for each
population
• You don’t know anything about the
populations a priori
– Except you believe that they’re Gaussian…
Gaussian Mixture Models
• Rather than identifying clusters by “nearest”
centroids
• Fit a Set of k Gaussians to the data
• Maximum Likelihood over a mixture model
GMM example
Mixture Models
• Formally, a mixture model is the weighted sum
of a number of pdfs, where the weights are
determined by a distribution.
Gaussian Mixture Models
• GMM: the weighted sum of a number of
Gaussians, where the weights are determined
by a distribution:
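The formulas on these two slides were lost in extraction. As a sketch of the standard forms, a mixture of K component densities p_k with mixing weights π_k, and its Gaussian special case, can be written:

$$p(x) \;=\; \sum_{k=1}^{K} \pi_k\, p_k(x), \qquad \sum_{k=1}^{K} \pi_k = 1,\ \ \pi_k \ge 0$$

$$p(x) \;=\; \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$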
Graphical Models
with unobserved variables
• What if you have variables in a Graphical
model that are never observed?
– Latent Variables
• Training latent variable models is an
unsupervised learning application
[Figure: example graphical model with latent states (amused, uncomfortable) and observed variables (laughing, sweating)]
Latent Variable HMMs
• We can cluster sequences using an HMM with
unobserved state variables
• We will train latent variable models using
Expectation Maximization
Expectation Maximization
• Both the training of GMMs and Graphical
Models with latent variables can be
accomplished using Expectation Maximization
– Step 1: Expectation (E-step)
• Evaluate the “responsibilities” of each cluster with the
current parameters
– Step 2: Maximization (M-step)
• Re-estimate parameters using the existing
“responsibilities”
• Similar to k-means training.
Latent Variable Representation
• We can represent a GMM involving a latent
variable
• What does this give us?
GMM data and Latent variables
One last bit
• We have representations of the joint p(x,z) and
the marginal, p(x)…
• The conditional of p(z|x) can be derived using
Bayes rule.
– The responsibility that a mixture component takes for
explaining an observation x.
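As a sketch of the standard result (Bayes' rule applied to the latent indicator z), the responsibility of component k for an observation x is:

$$\gamma(z_k) \;\equiv\; p(z_k = 1 \mid x) \;=\; \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$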
Maximum Likelihood over a GMM
• As usual: Identify a likelihood function
• And set partials to zero…
Maximum Likelihood of a GMM
• Optimization of means.
Maximum Likelihood of a GMM
• Optimization of covariance
Maximum Likelihood of a GMM
• Optimization of mixing term
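The resulting stationary-point equations were not preserved here. As a sketch of the standard maximum-likelihood solutions, with responsibilities γ(z_nk) and effective counts N_k = Σ_n γ(z_nk):

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf T}, \qquad
\pi_k = \frac{N_k}{N}$$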
MLE of a GMM
EM for GMMs
• Initialize the parameters
– Evaluate the log likelihood
• Expectation-step: Evaluate the responsibilities
• Maximization-step: Re-estimate Parameters
– Evaluate the log likelihood
– Check for convergence
EM for GMMs
• E-step: Evaluate the Responsibilities
EM for GMMs
• M-Step: Re-estimate Parameters
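Putting the E- and M-steps together, here is a hedged NumPy sketch of the EM loop for a GMM. The initialization, the small regularization term added to the covariances, and the stopping rule are assumptions, not part of the lecture.

```python
# Sketch of EM for a Gaussian mixture model with full covariances.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]             # initialize means at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Evaluate the log likelihood and check for convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```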
Visual example of EM
Potential Problems
• Incorrect number of Mixture Components
• Singularities
Incorrect Number of Gaussians
Incorrect Number of Gaussians
Singularities
• A minority of the data can have a
disproportionate effect on the model
likelihood.
• For example…
GMM example
Relationship to K-means
• K-means makes hard decisions.
– Each data point gets assigned to a single cluster.
• GMM/EM makes soft decisions.
– Each data point can yield a posterior p(z|x)
• Soft K-means is a special case of EM.
Soft K-Means as GMM/EM
• Assume equal covariance matrices for every
mixture component:
• Likelihood:
• Responsibilities:
• As epsilon approaches zero, the responsibility of
the nearest mean approaches one and all others approach
zero (a hard assignment).
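The likelihood and responsibility formulas referenced above are missing from the extracted slide. As a sketch following the standard treatment, with every component covariance fixed to εI:

$$p(x \mid \mu_k, \varepsilon) = \frac{1}{(2\pi\varepsilon)^{D/2}} \exp\!\Big(-\frac{\lVert x-\mu_k\rVert^2}{2\varepsilon}\Big), \qquad
\gamma(z_{nk}) = \frac{\pi_k \exp\!\big(-\lVert x_n-\mu_k\rVert^2 / 2\varepsilon\big)}{\sum_j \pi_j \exp\!\big(-\lVert x_n-\mu_j\rVert^2 / 2\varepsilon\big)}$$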
Soft K-Means as GMM/EM
• Overall log likelihood as epsilon approaches
zero:
• In this limit, the expected complete-data log likelihood
reduces to the (negative) within-cluster distortion, i.e., the
K-means objective
• Note: only the means are re-estimated in soft
K-means.
– The covariance matrices are all tied.
General form of EM
• Given a joint distribution over observed and
latent variables:
• Want to maximize:
1. Initialize parameters
2. E Step: Evaluate:
3. M-Step: Re-estimate parameters (based on expectation of
complete-data log likelihood)
4. Check for convergence of params or likelihood
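Written out (a sketch of the standard formulation, with observed data X, latent variables Z, and parameters θ): the quantity maximized is ln p(X | θ), the E-step evaluates p(Z | X, θ^old), and the M-step maximizes the expected complete-data log likelihood:

$$Q(\theta, \theta^{\mathrm{old}}) = \sum_{Z} p(Z \mid X, \theta^{\mathrm{old}})\, \ln p(X, Z \mid \theta), \qquad
\theta^{\mathrm{new}} = \arg\max_{\theta}\, Q(\theta, \theta^{\mathrm{old}})$$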
Anomaly Detection
Explored Methods
• Change detection
– KNN-based Kullback-Leibler divergence
– Compared with Kolmogorov-Smirnov (1D)
• Anomaly detection
– One Class SVM
– Compared with Mahalanobis distance
Anomaly detection
• Single sample detection
• Outlier wrt baseline behavior
• Techniques
– Quantify usual behavior (train)
• “Multidimensional distribution”
– Measure probability for current point (test)
– Declare ‘outlier’ if p < threshold
• Methods used
– One class SVM
– Mahalanobis distance
Mahalanobis distance
• Data are assumed N(µ, Σ)
• Fit (µ, Σ) from data (train)
• Fine-tune threshold
– Use validation set
• Detect outlier x with distance > threshold
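A small Python sketch of this recipe; the 98th-percentile threshold (targeting roughly a 2% false-alarm rate) and the synthetic data are assumptions for illustration.

```python
# Sketch of Mahalanobis-distance anomaly detection: fit (mu, Sigma) on training
# data, tune a threshold on a validation set, flag distant test points.
import numpy as np

def fit(train):
    mu = train.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(train.T))
    return mu, Sigma_inv

def mahalanobis(X, mu, Sigma_inv):
    Xc = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", Xc, Sigma_inv, Xc))

train, val, test = np.random.randn(1000, 5), np.random.randn(200, 5), np.random.randn(50, 5)
mu, Sigma_inv = fit(train)
threshold = np.quantile(mahalanobis(val, mu, Sigma_inv), 0.98)   # ~2% false alarms
is_outlier = mahalanobis(test, mu, Sigma_inv) > threshold
```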
One class SVM
• A recasting of the ordinary binary SVM
• State-of-the-art novelty detection
• Finds the smallest-volume sphere with (1−ν) of the data inside
• Prob(outlier) = ν
• ν enters explicitly into the SVM target function
– Robustness
– Control of the False Alarm rate
• Optionally fine-tune the threshold
– Use validation set
– Define precise location of the decision surface ρ
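A minimal scikit-learn sketch of one-class SVM novelty detection; the RBF kernel, ν = 0.02, and the synthetic data are assumptions for illustration.

```python
# Sketch of one-class SVM novelty detection with scikit-learn.
import numpy as np
from sklearn.svm import OneClassSVM

train = np.random.randn(1000, 20)                 # baseline ("usual") behavior
test = np.vstack([np.random.randn(50, 20),        # normal points
                  np.random.randn(50, 20) + 1.0]) # shifted (anomalous) points

clf = OneClassSVM(kernel="rbf", nu=0.02, gamma="scale").fit(train)
pred = clf.predict(test)                          # +1 = inlier, -1 = outlier
scores = clf.decision_function(test)              # signed distance to the decision surface
```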
Multidim. anomaly detection, 10K runs, N(0,I(D))

Test                            | FA tuned | Actual FA | Detection Rate
SVM, 20D, Σ = Σ + I(20)         | 0.02     | 0.015     | 0.596
Mahalanobis, 20D, Σ = Σ + I(20) | 0.02     | 0.016     | 0.601
SVM, 2D, Σ = Σ + I(2)           | 0.02     | 0.028     | 0.144
Mahalanobis, 2D, Σ = Σ + I(2)   | 0.02     | 0.023     | 0.150
SVM, 2D, µ = µ + 1              | 0.02     | 0.022     | 0.112
Mahalanobis, 2D, µ = µ + 1      | 0.02     | 0.022     | 0.131
SVM, 20D, µ = µ + 1             | 0.02     | 0.017     | 0.613
Mahalanobis, 20D, µ = µ + 1     | 0.02     | 0.017     | 0.622
SVM, 2D, ρ = ρ + 0.9            | 0.02     | 0.018     | 0.038
Mahalanobis, 2D, ρ = ρ + 0.9    | 0.02     | 0.020     | 0.041
Keystroke – Real world finger typing timings
• Real world data of finger typing timings
– Same 10-letter password is repetitively typed
– 51 human subjects
– 8 daily sessions per each human
– 50 repetitions in each daily session
– Each typing is characterized by 20 timings of key up – key pressed
• Overall each human is represented by 400 samples * 20 sensors
• Dataset applicable to
– Anomaly detection
– Change detection
– Knowledge extraction (multiclass classification)
• Full description and some R implementations at
– http://www.cs.cmu.edu/~keystroke/
Performance comparison on real data
Method             | Anomaly detection rate (tuned for FA 0.05) | Actual False Alarm rate (tuned for FA 0.05) | Anomaly detection rate (tuned for FA 0.02) | Actual False Alarm rate (tuned for FA 0.02)
SVM One Class, 20D | 0.59 ± 0.281  | 0.050 ± 0.068 | 0.448 ± 0.304 | 0.024 ± 0.035
Mahalanobis, 20D   | 0.55 ± 0.295  | 0.062 ± 0.074 | 0.464 ± 0.307 | 0.045 ± 0.065
SVM One Class, 2D  | 0.441 ± 0.230 | 0.054 ± 0.069 | 0.319 ± 0.257 | 0.034 ± 0.058
Mahalanobis, 2D    | 0.446 ± 0.226 | 0.077 ± 0.077 | 0.362 ± 0.241 | 0.055 ± 0.073
SVM performance is preferable:
• Better Control in False Alarm rate
• Higher Detection rate
Change detection
• Consistent change in system behavior
• Different distributions in past and future
• Techniques
– Quantify distributions in past and future
– Measure the distance between distributions
– Detect change if distance higher than threshold
• One dimensional case
– Kolmogorov-Smirnov test
– Avoids explicit estimation of distribution
– Score is distribution-independent
• Fine tuning of threshold may be avoided
Kolmogorov Smirnov Test
• Quantifies the difference between the empirical distributions of two samples from a 1D continuous r.v.
• Measures the maximal difference between the empirical CDF curves
• Returns a p-value for the hypothesis that the two samples come from the same distribution
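A minimal SciPy sketch of two-sample KS change detection between a past and a future window; the window sizes and the alarm level are illustrative assumptions.

```python
# Sketch of two-sample Kolmogorov-Smirnov change detection with SciPy.
import numpy as np
from scipy.stats import ks_2samp

past = np.random.normal(0.0, 1.0, 50)     # "past" window
future = np.random.normal(1.0, 1.0, 10)   # "future" window with a mean shift

stat, p_value = ks_2samp(past, future)    # stat = max CDF gap; p-value under H0: same distribution
change_detected = p_value < 0.01
```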
Multivariate change detection
• Techniques:
– Quantify distributions in past and future
– Measure the distance between distributions
– Detect change if distance higher than threshold
[Figure: points t1–t9 plotted in the Temperature vs. Pressure plane]
Score by KNN estimator of Kullback-Leibler divergence
• KNN avoids multidimensional distribution estimation
• For each point in cloud P, calculate the nearest neighbor distance in clouds P (ρ) and Q (ν)
– ρ: Past-to-Past nearest neighbor distance
– ν: Past-to-Future nearest neighbor distance
• Compute Score = D(Past || Future) + D(Future || Past)
[Figure: past and future windows along the time axis around t = 0]
Faivishevsky, “Information Theoretic Multivariate Change Detection for Multisensory Information Processing in Internet of Things”, ICASSP 2016
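A hedged Python sketch of the symmetric KNN KL score described above; it uses a standard k-nearest-neighbor Kullback-Leibler estimator, and the exact estimator used in the cited paper may differ in its details.

```python
# Sketch of a k-NN Kullback-Leibler divergence estimate and the symmetric change score.
import numpy as np
from scipy.spatial import cKDTree

def knn_kl(P, Q, k=8):
    """k-NN estimate of D(P || Q) for samples P (n x d) and Q (m x d), k >= 2."""
    n, d = P.shape
    m = Q.shape[0]
    # Distance from each point of P to its k-th nearest neighbour within P (skip self-match)
    rho = cKDTree(P).query(P, k=k + 1)[0][:, -1]
    # Distance from each point of P to its k-th nearest neighbour in Q
    nu = cKDTree(Q).query(P, k=k)[0][:, -1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

def change_score(past, future, k=8):
    # Symmetric score: D(Past || Future) + D(Future || Past)
    return knn_kl(past, future, k) + knn_kl(future, past, k)

past = np.random.randn(50, 2)
future = np.random.randn(50, 2) + 1.0
print(change_score(past, future))
```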
Information Theoretic Multivariate Change Detection algorithm
Train:
– Threshold on KNN KL Past vs. Future for a predefined alarm rate f
Test:
– Check whether KNN KL Past vs. Future > Threshold
[Figure: histogram of scores over windows; the threshold splits the score distribution into mass 1−f below and f above]
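A small sketch of this train/test thresholding; calibrating the threshold as the (1 − f) quantile of scores on change-free training windows is an assumption about how the predefined alarm rate f is realized.

```python
# Sketch of threshold calibration for a target alarm rate f.
import numpy as np

def calibrate_threshold(train_scores, alarm_rate_f):
    # Pick the threshold so roughly a fraction f of change-free windows exceed it
    return np.quantile(train_scores, 1.0 - alarm_rate_f)

def detect(test_score, threshold):
    return test_score > threshold

# Usage with the change_score() sketch above (hypothetical window lists):
# threshold = calibrate_threshold([change_score(p, q) for p, q in train_windows], 0.01)
# alarm = detect(change_score(past_window, future_window), threshold)
```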
Change detection comparison, 10K runs, N(0,1)

Test                 | Window      | FA tuned | Actual FA | Detection Rate
KL, detect µ = µ + 1 | 50,10 (k=8) | 0.001    | 0.0009    | 0.139
KS, detect µ = µ + 1 | 50,10       | 0.001    | 0.0004    | 0.131
KL, detect µ = µ + 2 | 50,10 (k=8) | 0.001    | 0.0006    | 0.921
KS, detect µ = µ + 2 | 50,10       | 0.001    | 0.0004    | 0.871
KL, detect µ = µ + 1 | 30,10 (k=5) | 0.01     | 0.013     | 0.31
KS, detect µ = µ + 1 | 30,10       | 0.01     | 0.005     | 0.326
KL, detect µ = µ + 1 | 30,10 (k=8) | 0.01     | 0.006     | 0.343
KL, detect µ = µ + 2 | 30,10 (k=8) | 0.01     | 0.008     | 0.960
KS, detect µ = µ + 2 | 30,10       | 0.01     | 0.006     | 0.94
KL, detect σ = σ + 1 | 30,10 (k=8) | 0.01     | 0.012     | 0.063
KS, detect σ = σ + 1 | 30,10       | 0.01     | 0.007     | 0.015
KL, detect σ = σ + 2 | 30,10 (k=8) | 0.01     | 0.011     | 0.132
KS, detect σ = σ + 2 | 30,10       | 0.01     | 0.004     | 0.025
KL, detect σ = σ + 3 | 30,10 (k=8) | 0.01     | 0.009     | 0.264
KS, detect σ = σ + 3 | 30,10       | 0.01     | 0.006     | 0.036
KL and KS perform similarly for Δµ detection; KL is better for Δσ, and KL offers better control of the false alarm rate.
Multidimensional change detection,
N(0,I(D))
Test                   | Window      | FA tuned | Actual FA | Detection Rate
KL, 20D, Σ = Σ + I(20) | 30,10 (k=8) | 0.01     | 0.010     | 0.560
KL, 2D, Σ = Σ + I(2)   | 30,10 (k=8) | 0.01     | 0.007     | 0.067
KL, 20D, µ = µ + 1     | 30,10 (k=8) | 0.01     | 0.010     | 1.000
KL, 2D, µ = µ + 1      | 30,10 (k=8) | 0.01     | 0.007     | 0.638
KL, 2D, ρ = ρ + 0.9    | 30,10 (k=8) | 0.01     | 0.02      | 0.35
KNN KL leverages multidimensional information to
detect changes that cannot be detected by one-dimensional methods:
1. Small changes in µ
2. Small changes in Σ
3. Changes in ρ
Application of Keystrokes data to change
detection
1. Use a session of consecutive typing samples (20 timings each) from one human as the start
2. Append a session of consecutive typing samples from another human
3. Check whether
– the change detection method detects the splice point
– a false alarm is raised
4. Repeat 1–3 to get substantial statistics
KNN KL Change Detection
performance on real data
Method                 | Window Size | Change Detection rate (tuned for FA 0.01) | Actual False Alarm rate (tuned for FA 0.01) | K statistics
KNN KL Divergence, 20D | 10          | 0.974 ± 0.056 | 0.029 ± 0.064 | 2
KNN KL Divergence, 20D | 4           | 0.761 ± 0.175 | 0.019 ± 0.020 | 2
KNN KL Divergence, 2D  | 10          | 0.704 ± 0.184 | 0.019 ± 0.032 | 2
KNN KL Divergence, 2D  | 4           | 0.489 ± 0.077 | 0.016 ± 0.021 | 2
Thank you!