CSLT ML Summer Seminar (9): Unsupervised Learning
Dong Wang
Some slides are from Bengio and LeCun's NIPS 2015 tutorial.

Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical models
• Unsupervised learning in neural models

Unsupervised learning
• Learning the intrinsic structure of data, without labels or targets.
• Some unsupervised learning tasks:
  • Dimension reduction
  • Feature extraction
  • Visualization

Why unsupervised learning?
• It is a way in which the human brain demonstrates its capabilities
• It can use big data
• It helps identify causal factors (disentanglement)
• It helps discover relations
• It helps alleviate the curse of dimensionality
• It helps learn features
• It helps pre-train supervised models

Ways of unsupervised learning
• Clustering: separate data into different areas
  • Divide and conquer
  • Conditional probability rather than marginal probability
  • Coarse factor analysis
• Manifold learning: discover a low-dimensional space in which the data concentrate
  • Dimension reduction
  • Visualization
• Factor analysis and graphical models
  • Linear Gaussian models
  • Dynamic Bayesian models
  • Hierarchical Bayesian models
  • Energy models
• Neural models and deep learning
  • Hopfield nets and SOM
  • RBM
  • Autoencoders and S2S

Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical models
• Unsupervised learning in neural models

Clustering
• Centroid clustering (partitioning algorithms): construct various partitions and then evaluate them by some criterion
• Connectivity-based clustering (hierarchical algorithms): create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based clustering: based on connectivity and density functions
• Model-based clustering: a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data
www.cs.bu.edu/fac/gkollios/ada01/LectNotes/Clustering2.ppt

Centroid-based clustering (partitioning algorithms)
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: the k-means and k-medoids algorithms
    • k-means (MacQueen '67): each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

K-means
• Given k, the k-means algorithm is implemented in 4 steps (see the sketch below):
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no new assignments.
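As a minimal illustration of the four steps above, here is a NumPy sketch of k-means; the helper name `kmeans` and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: seed the centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5.0])
labels, centroids = kmeans(X, k=2)
```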
Connectivity-based clustering (hierarchical clustering)
• Use the distance matrix as the clustering criterion.
• This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: agglomerative clustering (AGNES) merges a, b, c, d, e bottom-up into abcde; divisive clustering (DIANA) splits them top-down in the reverse order.]

Agglomerative Nesting (AGNES)
• Agglomerative, bottom-up approach
• Merge the nodes that have the least dissimilarity
• Proceed in a non-descending fashion
• Eventually all nodes belong to the same cluster

Divisive Analysis (DIANA)
• Top-down approach
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

Correlation clustering
• The correlation is represented as a graph G(V, E)
• Cluster the data such that the agreement is maximized
• Agreement: the sum of positive weights within clusters plus the sum of the absolute values of negative weights across clusters

Density-based clustering: CLIQUE
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD '98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered as both density-based and grid-based
  • It partitions each dimension into the same number of equal-length intervals
  • It partitions an m-dimensional data space into non-overlapping rectangular units
  • A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  • A cluster is a maximal set of connected dense units within a subspace
[Figure: dense units in the (age, salary) and (age, vacation) subspaces; density threshold = 3.]

DBSCAN
• Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, 1996
• A point p with sufficiently many neighbours within a predefined radius is a core point; all points within that radius are reachable from p.
• If p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it.

Model-based clustering
• Assume the data are generated from K probability distributions
• Gaussian mixture model (GMM)
  • A soft, or probabilistic, version of k-means clustering
  • Trained with the EM algorithm
  • Probabilistic representation: learn P(X) and P(X|c)

Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical models
• Unsupervised learning in neural models

Manifold learning
• Manifold: the subspace in which the data concentrate; related to dimension reduction.
• Linear manifold learning
  • PCA, MDS
• Non-linear manifold learning
  • ISOMAP
  • SOM
  • Local linear embedding (LLE)
  • t-SNE
K. Q. Weinberger and L. K. Saul, Unsupervised learning of image manifolds by semidefinite programming, CVPR 2004.

PCA
• A Gaussian linear model; a simple graphical model
• Finds a global low-dimensional space from which reconstruction of the data is best
• The principal components are the eigenvectors of the covariance matrix

Multidimensional scaling (MDS)
• Also known as Principal Coordinates Analysis
• Takes an input matrix giving the dissimilarities between pairs of items
• Outputs a coordinate matrix whose configuration minimizes a loss function, e.g., the stress (the mismatch between the input dissimilarities and the distances in the output configuration)

Linear models are not sufficient
• Both PCA and MDS are globally linear and rely on an eigendecomposition: of the covariance matrix for PCA, and of the centered proximity matrix for MDS.

Nonlinear manifolds
• Most data lie on nonlinear manifolds

ISOMAP
• Build a neighbourhood graph, approximate geodesic distances by shortest paths in the graph, then embed with MDS.
Joshua B. Tenenbaum, Vin de Silva, John C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, 2000.
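To make the linear-versus-nonlinear contrast concrete, here is a small sketch that computes PCA directly from the eigenvectors of the covariance matrix and compares it with scikit-learn's Isomap on a swiss-roll data set; the data set and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A classic nonlinear manifold: a 2-D sheet rolled up in 3-D
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# --- Linear reduction: PCA as eigenvectors of the covariance matrix ---
Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
W = eigvecs[:, order[:2]]                # top-2 principal directions
X_pca = Xc @ W                           # global linear projection

# --- Nonlinear reduction: ISOMAP (geodesic distances + MDS) ---
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# PCA can only flatten the roll along global linear directions and mixes
# distant parts of the sheet; ISOMAP approximately "unrolls" it because it
# preserves graph (geodesic) distances.
print(X_pca.shape, X_iso.shape)  # (1000, 2) (1000, 2)
```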
SOM (self-organizing map)
• Define a lattice
• Each node in the lattice is associated with a weight vector
• For each input, find the node that it is closest to
• Update that node and its neighbours!
https://en.wikipedia.org/wiki/Self-organizing_map

Local linear embedding (LLE)
• For each data point x, find a weight vector that can reconstruct x from its neighbouring points.
• In the low-dimensional space, find Y that minimizes the reconstruction error under the same weights.
• An eigendecomposition is used to solve the problem.
Sam T. Roweis and Lawrence K. Saul, Locally linear embedding, 2000.
http://www.pami.sjtu.edu.cn/people/xzj/introducelle.htm

Spectral embedding
http://snap.stanford.edu/class/cs224wreadings/ng01spectralcluster.pdf

t-SNE
• Define a distribution over pairs of points in the original space
• Define a distribution over pairs of points in the projection space
• Make the distributions in the two spaces close to each other
• Solves the crowding problem
• Gaussian distributions are assumed
• Can be extended to other distributions, e.g., the von Mises-Fisher distribution.

Some comparisons
http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html

Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical models
• Unsupervised learning in neural models

What is clustering in a graphical model?
• It is a simple latent-variable model with a discrete variable z
• The simplest form is the GMM, but it can be any richer distribution
• More complex structures can be used
[Figure: plate diagram with prior φ, hidden variable z, parameters μ, Σ, and N observations x.]

What is manifold learning in a graphical model?
• It is nothing more than finding a limited set of explanatory factors!
• x = Wz + μ + ε, with z ~ N(0, I) and ε ~ N(0, σ²I)
• Recall PCA? But graphical models offer much more than that:
  • It is possible to design more complex structures or conditional probabilities to obtain nonlinear manifolds, e.g., by making the latent variable input dependent
  • It is possible to construct dynamic structures to generate sequential data (e.g., HMM, LDS)
  • It is possible to construct multi-level structures to represent and discover high-level factors

LDA: a typical generation process
• If a model can describe the (complex) data generation process, what else do you still want?
• And it can incorporate your knowledge.

DBN/DBM: undirected generation
http://www.cs.toronto.edu/~hinton/adi/index.htm
http://www.cs.nyu.edu/~gwtaylor/publications/nips2006mhmublv/videos/1walking.gif

Rethinking graphical models
• Graphical models do not care about labels; everything is a variable
• Graphical models are typically generative, or descriptive. They are naturally unsupervised, and everything is unsupervised
• Do we have supervision then? Yes: structure, priors, conditionals...
• Classification can be conducted by inference, or by likelihood ratio
• It is a vast framework for unsupervised learning

Content
• Introduction
• Clustering
• Manifold learning
• Unsupervised learning in graphical models
• Unsupervised learning in neural models

Unsupervised learning is auxiliary in neural networks
• For neural models that target classification:
  • Pre-training
  • Information disentanglement
  • Layer-wise abstraction
• For neural models that work as generative models:
  • Describe the coding and regeneration
  • Describe the dynamic properties

P(x) as information: representation learning
• More abstract, more invariant
• Handles different modalities
• Easy to transfer; deals with one-shot learning and zero-shot learning

P(X) as a generative model
• Variational autoencoder
• Denoising AE: learns the manifold! (a sketch follows)
Guillaume Alain and Yoshua Bengio, What Regularized Auto-Encoders Learn from the Data Generating Distribution.
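A minimal PyTorch sketch of a denoising autoencoder, assuming 784-dimensional inputs (e.g., flattened MNIST digits) and hypothetical layer sizes; the model is trained to reconstruct the clean x from a noise-corrupted copy, which is the mechanism the Alain and Bengio paper analyses.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim_in=784, dim_hidden=64):
        super().__init__()
        # Encoder maps the (corrupted) input to a low-dimensional code
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                     nn.Linear(256, dim_hidden), nn.ReLU())
        # Decoder maps the code back to the input space
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, 256), nn.ReLU(),
                                     nn.Linear(256, dim_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, x, noise_std=0.3):
    # Corrupt the input, but ask the network to reconstruct the clean x:
    # the reconstruction has to point back toward the data manifold.
    x_noisy = x + noise_std * torch.randn_like(x)
    loss = nn.functional.mse_loss(model(x_noisy), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = DenoisingAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)  # stand-in batch; replace with real data
print(train_step(model, optimizer, x))
```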
Differences between Bayesian and neural models in unsupervised learning
• Bayesian models require some structural knowledge; neural models rely on pure data.
• Some randomness should be defined in a neural model, otherwise it is less capable of manifold learning. It can be placed on the input or on the hidden units.
• Neural models leverage deep non-linearity to map a complex distribution to a simple one; the Bayesian model remains complex.
• Neural models rely on numerical optimization, while Bayesian models rely on approximate inference and EM. Bayesian model training seems fine: it is slow, but should reach a good model more easily. Neural model training can be sped up, but may underfit.

Wrap up
• There are many unsupervised learning methods: PCA, ICA, sparse coding, NMF, factor analysis, RBM, GMM, HMM, AE, ...
• There is a lot of unsupervised learning in clustering, manifold learning, metric learning, etc. These methods are important, but should find more support from probabilistic theory.
• We recommend two frameworks for unsupervised learning:
  • The graphical model perspective, where unsupervised learning is a way of probabilistic structure optimization.
  • The deep representation learning perspective, where underlying factors are disentangled by layer-wise abstraction with various constraints (knowledge, priors, ...).
• Unsupervised learning is most important for successful AI.