Unsupervised Learning

CSLT ML Summer Seminar (9)
Unsupervised Learning
Dong Wang
Some slides are from Bengio and LeCun's NIPS 2015 tutorial
Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical model
• Unsupervised learning in neural model
Unsupervised learning
• Learning the intrinsic structure in data, without labels or targets.
• Some unsupervised learning tasks
• Dimension reduction
• Feature extraction
• Visualization
Why unsupervised learning?
• It is one way the human brain demonstrates its capabilities
• It can exploit big data
• It helps identify causal factors (disentanglement)
• It helps discover relations
• It helps alleviate the curse of dimensionality
• It helps learn features
• It helps pre-train supervised models
Ways of unsupervised learning
• Clustering: separate data into different areas
• Divide and conquer
• Conditional probability rather than marginal probability
• Coarse factor analysis
• Manifold learning: discover low-dimensional space where the data concentrate
• Dimension reduction
• visualization
• Factor analysis and graphical models
• Linear Gaussians
• Dynamic Bayesian models
• Hierarchical Bayesian models
• Energy models
• Neural model and deep learning
• Hopfield net and SOM
• RBM
• Autoencoder and S2S
Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical model
• Unsupervised learning in neural model
Clustering
• Centroid clustering (Partitioning algorithms): Construct various partitions and then evaluate them by some criterion
• Connectivity-based clustering (Hierarchical algorithms): Create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based clustering: based on connectivity and density functions
• Model-based clustering: A model is hypothesized for each cluster, and the idea is to find the best fit of that model to the data
www.cs.bu.edu/fac/gkollios/ada01/LectNotes/Clustering2.ppt
Centroid-based clustering (Partitioning Algorithms)
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
K-means
• Given k, the k-means algorithm is implemented in 4 steps:
• 1. Partition the objects into k nonempty subsets
• 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
• 3. Assign each object to the cluster with the nearest seed point
• 4. Go back to Step 2; stop when there are no more new assignments
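As an illustration (not part of the original slides), a minimal NumPy sketch of these four steps might look as follows; the initialization and stopping rule are simplified choices.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: initial seeds
    for _ in range(n_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute seed points as the centroids (means) of the clusters
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when assignments no longer move the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids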
Connectivity-based clustering (hierarchical clustering)
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Diagram: AGNES merges objects a, b, c, d, e bottom-up over steps 0-4 (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde); DIANA splits abcde top-down in the reverse order.]
Agglomerative Nesting
• Agglomerative, bottom-up approach
• Merge the nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
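A short sketch of agglomerative clustering with SciPy (an illustrative choice of library and linkage; the slides do not prescribe one):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D data
Z = linkage(X, method='average')                 # merge least-dissimilar clusters bottom-up
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters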
Divisive Analysis
• Top-down approach
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
Correlation clustering
• Correlation is represented as a graph G(V, E)
• Cluster the data such that the agreement is maximized
• Agreement: the sum of positive edge weights within clusters plus the sum of the absolute values of negative edge weights across clusters
Density-based clustering
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered as both density-based and grid-based
• It partitions each dimension into the same number of equal-length intervals
• It partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
• A cluster is a maximal set of connected dense units within a subspace
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
50
age
60
DBSCAN
• Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, 1996
• A point p with sufficiently many neighbours within a predefined radius is a core point; all points within that radius are reachable from p
• If p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it
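A hypothetical usage sketch with scikit-learn's DBSCAN; eps plays the role of the radius and min_samples the "sufficient neighbours" threshold for core points:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05)             # two interleaving half-circles
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # label -1 marks noise points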
Model-based clustering
• Assume data are generated from K probability distributions
• Gaussian mixture model
• A soft or probabilistic version of k-means clustering
• EM algorithm
• Probabilistic representation: learn P(X) and P(X|c)
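For example (an illustrative scikit-learn sketch, not from the slides), a Gaussian mixture fitted by EM gives both soft and hard assignments:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3)
gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)  # EM training
soft = gmm.predict_proba(X)   # soft (probabilistic) assignments, unlike hard k-means
hard = gmm.predict(X)         # MAP cluster labels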
Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical model
• Unsupervised learning in neural model
Manifold learning
• Manifold: a subspace where the data concentrate; related to dimension reduction
• Linear manifold learning
• PCA, MDS
• Non-linear manifold learning
• ISOMAP
• SOM
• Local linear embedding (LLE)
• t-SNE
K. Q. Weinberger and L. K. Saul (2004). Unsupervised learning of image manifolds by semidefinite programming, CVPR 2004.
PCA
• A Gaussian linear model
• A simple graphical model
• Find a global low-dimensional space with which recovery of the data is best
• PCs are the eigenvectors of the covariance matrix
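A minimal NumPy sketch of this view of PCA (illustrative code; the centering and sorting conventions are assumptions):

import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigen decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    W = eigvecs[:, order[:n_components]]    # principal components
    return Xc @ W                           # low-dimensional projection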
Multidimensional scaling
• Also known as Principal Coordinates Analysis
• Takes an input matrix giving dissimilarities between pairs of items
• Outputs a coordinate matrix whose configuration minimizes a loss function, e.g., stress (see below)
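One common form of the stress (Kruskal's stress-1, written out here for completeness; $d_{ij}$ are the input dissimilarities and $y_i$ the output coordinates):

$$\mathrm{Stress}(y_1,\dots,y_n) = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \lVert y_i - y_j\rVert\right)^2}{\sum_{i<j} d_{ij}^2}}$$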
Linear model is not sufficient
• Both PCA and MDS are globally linear, relying on eigen decomposition: the covariance matrix for PCA and the centered proximity matrix for MDS
Manifold Learning-Dimensionality Reduction
Nonlinear manifold
• Most data lie on nonlinear manifolds
ISOMAP
Joshua B. Tenenbaum, Vin de Silva, John C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 2000.
SOM
• Define a lattice
• Each node in the lattice is associated with a weight vector
• For each input, find the node whose weight is closest to it
• Update that node and its neighbours!
https://en.wikipedia.org/wiki/Self-organizing_map
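A minimal SOM update sketch (hypothetical lattice size, learning rate, and neighbourhood width; real implementations usually decay lr and sigma over time):

import numpy as np

def som_train(X, rows=10, cols=10, n_iter=1000, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((rows, cols, X.shape[1]))   # one weight vector per lattice node
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'), axis=-1)
    for _ in range(n_iter):
        x = X[rng.integers(len(X))]
        # find the best-matching unit: the node whose weight is closest to the input
        bmu = np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(), (rows, cols))
        # update the BMU and its lattice neighbours, weighted by distance on the lattice
        h = np.exp(-np.linalg.norm(grid - np.array(bmu), axis=2) ** 2 / (2 * sigma ** 2))
        W += lr * h[..., None] * (x - W)
    return W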
Local linear embedding
• For each data point x, find a weight vector that can reconstruct x from its neighbouring points
• In the low-dimensional space, find Y that minimizes the reconstruction error
• Use an eigen decomposition to solve the problem
Sam T. Roweis and Lawrence K. Saul, 2000, Locally Linear Embedding
http://www.pami.sjtu.edu.cn/people/xzj/introducelle.htm
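An illustrative call through scikit-learn, which performs the neighbour reconstruction and the eigen decomposition internally (X is assumed to be an (n_samples, n_features) array):

from sklearn.manifold import LocallyLinearEmbedding

Y = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)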
Spectral embedding
http://snap.stanford.edu/class/cs224wreadings/ng01spectralcluster.pdf
t-SNE
• Define a distribution on the original space
• Define a distribution on the projection space
• Make the distributions in the two spaces close to each other
• Solves the crowding problem
• Gaussian assumed
• Can be extended to other distributions, e.g., the von Mises-Fisher distribution
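A hypothetical usage sketch with scikit-learn's t-SNE (Gaussian similarities in the original space, Student-t in the projection space); the perplexity is an illustrative choice:

from sklearn.manifold import TSNE

Y = TSNE(n_components=2, perplexity=30).fit_transform(X)   # X: (n_samples, n_features)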
t-SNE
Some comparisons
http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html
Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical model
• Unsupervised learning in neural model
What is clustering in a graphical model?
• It is a simple latent model with a discrete variable z
• The simplest form is the GMM, but it can be any rich distribution
• More complex structures can be used
[Plate diagram: prior φ → hidden variable z → observed x, with Gaussian parameters μ, Σ; plate over N observations.]
What is manifold learning in a graphical model?
• It is nothing more than finding a limited set of explanatory factors!
• x = Wz + μ + ε, with z ~ N(0, I) and ε ~ N(0, σ²I)
• Recall PCA?
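Marginalizing out z gives the usual probabilistic-PCA/factor-analysis form (a standard result, added here for completeness):

$$p(x) = \int p(x \mid z)\, p(z)\, dz = \mathcal{N}\!\left(x \mid \mu,\; W W^\top + \sigma^2 I\right)$$

As $\sigma^2 \to 0$, the maximum-likelihood W spans the principal subspace, which is the usual link back to PCA.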
But a graphical model is much more than that
• It is possible to design more complex structures or conditional probabilities to obtain nonlinear manifolds, e.g., by making the latent variable input-dependent
• It is possible to construct dynamic structures to generate sequential data (e.g., HMM, LDS)
• It is possible to construct multi-level structures to represent and discover high-level factors
LDA: A typical generation process
• If a model can describe the (complex) data generation process, what else do you still want?
• And, it can incorporate your knowledge.
DBN/DBM: An undirected generative model
http://www.cs.toronto.edu/~hinton/adi/index.htm
http://www.cs.nyu.edu/~gwtaylor/publications/nips2006mhmublv/videos/1walking.gif
Rethinking the graphical model
• Graphical models do not care about labels; everything is a variable
• Graphical models are typically generative, or descriptive. They are naturally unsupervised, and everything is unsupervised
• Do we have supervision then? Yes: structure, priors, conditionals...
• Classification can be conducted by inference, or by likelihood ratio
• It is a vast framework for unsupervised learning
Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical model
• Unsupervised learning in neural model
Unsupervised learning is auxiliary in NN
• For neural models that target classification
• Pre-training
• Information disentanglement
• Layer-wise abstraction
• For neural models that work as generative models
• Describe the coding and regeneration
• Describe the dynamic properties
P(x) as information
Representation learning
• More abstract, more invariant
• Handles different modalities
• Easy to transfer; deals with one-shot learning and zero-shot learning
P(X) as generative model
Variational Auto-encoder
Denoising AE
Learn manifold!
Guillaume Alain and Yoshua Bengio, What Regularized Auto-Encoders Learn from the Data-Generating Distribution
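A minimal denoising autoencoder sketch in PyTorch (hypothetical architecture, noise level, and hyper-parameters); the corrupt-and-reconstruct objective is what lets the model trace the data manifold:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x, noise_std=0.3):
        x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
        return self.decoder(self.encoder(x_noisy))      # reconstruct from the corrupted input

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                                 # a toy batch
loss = nn.functional.mse_loss(model(x), x)              # compare to the clean input
opt.zero_grad(); loss.backward(); opt.step()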
Differences between Bayesian and neural models in unsupervised learning
• A Bayesian model requires some structural knowledge; a neural model relies on pure data
• Some randomness should be defined in a neural model, otherwise it is less capable of manifold learning. It could be on the input, or on the hidden units.
• Neural models leverage deep non-linearity to map a complex distribution to a simple one; a Bayesian model stays with the complex distribution.
• Neural models rely on numerical optimization, while Bayesian models rely on approximate inference and EM. Bayesian model training seems fine: it is slow, but should reach a good model more easily. Neural model training can be sped up, but may underfit.
Wrap up
• Many unsupervised learning methods: PCA, ICA, sparse coding, NMF, factor analysis, RBM, GMM, HMM, AE, ...
• There is a lot of unsupervised learning in clustering, manifold learning, metric learning, etc. These methods are important, but should find more support from probabilistic theory.
• We recommend two frameworks for unsupervised learning:
• The graphical model perspective, where unsupervised learning is a way of probabilistic structure optimization.
• The deep representation learning perspective, where underlying factors are disentangled by layer-wise abstraction with various constraints (knowledge, priors, ...).
• Unsupervised learning is crucially important for successful AI.