
Phoneme learning (continued)
Computational Cognitive Science 2011, Lecture 9

Last time: Intro to phonemes
Phonemes: the particular sounds of speech (e.g., d, th, sh, ee). Each language has a different set of phonemes, and one of the tasks of learning is to figure out which ones are important in your language.

Last time: Learning phonemes
People seem to learn on the basis of distributional information. Basic idea: expose people to distributions of sounds, then test them on their ability to hear the different contrasts.
[Figure: abstract idea vs. actual vowel phonemes]

Last time: k-means clustering
A simple algorithm for learning clusters given raw data. Converges quickly to a local maximum.

Last time: Problems with k-means
Final result is dependent on the starting position
Can't do "soft" assignment; bad when clusters overlap
Can't handle different sizes of clusters
Can't handle elongated clusters

Today: Beyond k-means
1. Soft category assignments: soft k-means, and problems with this
2. Mixture of Gaussians, and problems with this
3. A more complex model of phonetic learning

A simple fix of k-means: soft k-means
One problem with k-means is that category assignments are not soft. Points near borders should arguably affect the means of all nearby clusters.

Soft k-means clustering: Pseudocode
Initialization:
    Set each mean to a random value
    Initialise "previous" responsibility matrix rprev
    Set up initial current responsibility matrix rcurr
While rprev ~= rcurr
    Assignment step:
        Each data point is assigned to each of the means probabilistically,
        as a function of its distance from the mean
    Update step:
        Recalculate the means
End
In the original (hard) k-means, the "responsibility" matrix rcurr captured a binary assignment of datapoints to clusters. Now rcurr captures a "soft" assignment of datapoints to clusters: each data point is assigned to each of the means probabilistically, as a function of its distance from the mean,

$r_k^{(n)} = \frac{\exp(-\beta\, d(m^{(k)}, x^{(n)}))}{\sum_{k'} \exp(-\beta\, d(m^{(k')}, x^{(n)}))}$

where $m^{(k)}$ is the mean of the kth cluster, $x^{(n)}$ is the nth data point, and $d$ is the distance between them.
β governs the "stiffness" of the assignments.

Soft k-means clustering: The code
(File softkmeanscluster.m; identical to kmeanscluster.m except for the assignment step.)

Soft k-means often performs sensibly.
[Figure: correct clusters vs. soft k-means output on fakedatasetmedium.mat with β = 7]

As β → ∞, soft k-means turns into hard k-means clustering.
[Figure: soft k-means output for β = 1, 4, 7, 10, 20, 50]
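The course file softkmeanscluster.m is not reproduced in these notes. As a rough illustration of the pseudocode above, here is a minimal Python/NumPy sketch of soft k-means; the function name, the use of squared Euclidean distance, and the convergence test are my own choices rather than anything specified in the lecture.

```python
import numpy as np

def soft_kmeans(X, K, beta, n_iter=100, seed=0):
    """Soft k-means. X: (N, D) data matrix; K: number of clusters;
    beta: stiffness of the soft assignments."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialisation: set each mean to a randomly chosen data point
    means = X[rng.choice(N, size=K, replace=False)].copy()

    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from each point to each mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        # Responsibilities: r[n, k] proportional to exp(-beta * d2[n, k]);
        # subtract the row minimum before exponentiating for numerical stability
        r = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
        r /= r.sum(axis=1, keepdims=True)
        # Update step: each mean is the responsibility-weighted average of the data
        new_means = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.allclose(new_means, means):
            means = new_means
            break
        means = new_means
    return means, r
```

Because the responsibilities are a softmax of negatively scaled distances, increasing beta sharpens them towards 0/1 values, which is the β → ∞ collapse to hard k-means shown above.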
… But it does not solve some of the problems: it still can't handle different sizes of clusters.
[Figure: correct clusters vs. soft k-means output for β = 1, 7, 15, 50]
… And it still can't handle elongated clusters.
[Figure: correct clusters vs. soft k-means output for β = 1, 7, 15, 50]
What's going on?
Take a step back first: how are data (e.g., phonemes) probably generated? By some sort of underlying process which imposes a distribution over data points. K-means clustering is implicitly making some basic assumptions about the nature of that distribution: the distance metric is the same in every direction and for every cluster. As a result, k-means assumes that all clusters are the same size and symmetric (circular), so clusters of unequal size and elongated clusters are both impossible to fit well.

The fix
Change the algorithm to reflect the assumption that the data are generated by Gaussians (normal distributions), and allow the Gaussians to have different variances in different directions. This is called the mixture of Gaussians. The algorithm for calculating the best set of Gaussians is sometimes called the EM algorithm, named for the two steps involved.

Extending the algorithm: Mixture of Gaussians with EM

Assignment step (also called the E-step, or Expectation step). The responsibilities are:

$r_k^{(n)} = \frac{w_k \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(m_i^{(k)} - x_i^{(n)})^2}{2\sigma_k^2}\right)}{\sum_{k'} w_{k'} \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi}\,\sigma_{k'}} \exp\!\left(-\frac{(m_i^{(k')} - x_i^{(n)})^2}{2\sigma_{k'}^2}\right)}$

where $w_k$ is the weight for cluster k, $\sigma_k$ is the standard deviation for cluster k, $m_i^{(k)}$ is the mean of cluster k along dimension i, $x_i^{(n)}$ is data point n at dimension i, and I is the total number of dimensions. The term inside the product is just the equation for a Gaussian. Note that this equation assumes that the standard deviation is the same along all dimensions; it will not be able to handle elongated clusters. A new version that allows a different standard deviation along each dimension replaces $\sigma_k$ with a per-dimension $\sigma_{k,i}$. Note that this is exactly the same as calculating the likelihood of that point under the Gaussian (normal) distribution with parameters w, m and σ.

Update step (also called the M-step, or Maximisation step). This time, in addition to the mean, we need to update the standard deviations and weights.

Mean: $m_i^{(k)} = \frac{\sum_n r_k^{(n)}\, x_i^{(n)}}{\sum_n r_k^{(n)}}$. This is just the mean of all of the points in the cluster, weighted by the proportion of their likelihood ("responsibility") taken care of by that cluster.

Variance: $\sigma_{k,i}^2 = \frac{\sum_n r_k^{(n)}\,(x_i^{(n)} - m_i^{(k)})^2}{\sum_n r_k^{(n)}}$. This is just the average variance of all of the points in the cluster, weighted by the proportion of their likelihood ("responsibility") taken care of by that cluster.

Weight: $w_k = \frac{R_k}{\sum_{k'} R_{k'}}$, where $R_k = \sum_n r_k^{(n)}$. This is just the sum of all of the responsibilities in that cluster (normalised), so clusters that account for more points have higher weight.

Mixture of Gaussians: Pseudocode
Initialization:
    Set each mean, standard deviation, and weight to a random value
    Initialise "previous" responsibility matrix rprev
    Set up initial current responsibility matrix rcurr
While rprev ~= rcurr
    Assignment step (E-step):
        Calculate the likelihood of each data point in each cluster,
        assuming the cluster is a Gaussian with the current mean,
        standard deviation, and weight
    Update step (M-step):
        Recalculate the means
        Recalculate the standard deviations
        Recalculate the weights
End
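As with soft k-means, no course code is reproduced here; the following is a minimal Python/NumPy sketch of the E-step and M-step above, assuming axis-aligned Gaussians (a separate standard deviation along each dimension). The function name, the fixed iteration count, and the variance floor are my own choices; the floor corresponds to the "kludge" discussed below for keeping variances away from zero.

```python
import numpy as np

def mog_em(X, K, n_iter=200, var_floor=1e-3, seed=0):
    """EM for a mixture of axis-aligned Gaussians.
    X: (N, D) data; K: number of clusters; var_floor: lower bound on variances."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    means = X[rng.choice(N, size=K, replace=False)].copy()        # (K, D)
    var = np.maximum(np.tile(X.var(axis=0), (K, 1)), var_floor)   # (K, D)
    w = np.full(K, 1.0 / K)                                       # mixing weights

    for _ in range(n_iter):
        # E-step: log density of each point under each axis-aligned Gaussian
        diff = X[:, None, :] - means[None, :, :]                  # (N, K, D)
        log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)
        log_r = np.log(w)[None] + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)                 # stabilise before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)                         # responsibilities, shape (N, K)

        # M-step: responsibility-weighted means, variances, and weights
        Rk = r.sum(axis=0)                                        # total responsibility per cluster
        means = (r.T @ X) / Rk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        var = (r[:, :, None] * diff ** 2).sum(axis=0) / Rk[:, None]
        var = np.maximum(var, var_floor)                          # the "kludge": variances never reach zero
        w = Rk / N
    return means, var, w, r
```

Without the var_floor line, a component can lock onto a single point and drive its variance towards zero, which is exactly the infinite-likelihood pathology described below.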
The pseudocode above corresponds to the "version 3" algorithm on page 304 of MacKay (see readings).

How does this do?
[Figure: correct clusters vs. model output on fakedataseteasy.mat]
[Figure: correct clusters vs. model output on largeandsmall.mat]
[Figure: correct clusters vs. model output on elongated.mat]

How does this do on phonemes?
[Figure: correct categories vs. model output on phonemedataset.mat]
The model does this when any likelihood becomes zero (or infinite): in trying to account for the two points at the bottom, it sits on one of them and sets its variance to zero, resulting in infinite likelihood. This can be fixed by removing those two troublesome points, or by putting in a "kludge" that prevents the variance from going below some small constant term (e.g., 0.001).

Good things about mixture of Gaussians
As with k-means, convergence to a local maximum is fast and guaranteed.
Performance is considerably better than k-means: it can fit asymmetric clusters of unequal variance.
It can handle soft assignment.
It is interpretable probabilistically, in terms of maximising the likelihood of the dataset assuming the clusters are Gaussian.

Bad things about mixture of Gaussians
As with k-means, it is not guaranteed to converge to a global maximum and is still sensitive to initial conditions (you can especially see this if you set the initial variances too low or too high).
As with k-means, you have to tell it how many clusters there are.
It occasionally shows pathological behaviour in which (a) one cluster has infinitely small variance, or (b) all means are the same and all points are shared among all clusters.
Making it properly Bayesian (i.e., setting a prior on the variance) may help.

Improving it even more: further refinements to the model

Desiderata
We'd like it not to show pathological behaviour.
It would be nice if it could figure out how many categories there are.
It would also be good if it could tell us something about human phoneme perception/learning.
Now, presenting a model that does all of that, so you can see what the current state of the art is.

Overview of the model
There is interesting psychology in how phonetic learning and word learning can help each other. The model has a phonetic learning component and a word learning component; technical details will only be given for the phonetic learning component.

Model of phonetic learning (Feldman & Griffiths, 2009)
Builds on the mixture of Gaussians, but with three main differences:
1. Unlike EM, which finds a single best solution, it integrates over all solutions (we will learn computational techniques for doing this later in the course).
2. It sets a prior over the means, (co)variances, and weights.
3. It learns how many categories would be appropriate (by setting a special kind of prior over the number of categories; we will learn how to do this later in the course too).
Speech sounds are assumed to be produced by selecting a phonetic category c from the set of possible categories; the sound is a sample from the Gaussian associated with that category. Categories differ in their means μc, covariances Σc, and the number of speech sounds in the category, ν (basically the weight).

Calculating the likelihood (Feldman & Griffiths, 2009)
Calculate p(x|c) by integrating over all possible means and covariances, and including a prior on which means and covariances are likely a priori:

$p(x \mid c) = \int\!\!\int \underbrace{p(x \mid \mu_c, \Sigma_c)}_{\text{likelihood}}\; \underbrace{p(\mu_c, \Sigma_c)}_{\text{priors}}\, d\mu_c\, d\Sigma_c$

The priors are chosen because these distributions make the math work nicely.

Calculating the prior over the number of categories (Feldman & Griffiths, 2009)
Calculate p(c) using a special prior (which we will see later in more detail) that basically assigns data points to categories in proportion to the size of the existing category. This is a "rich get richer" prior; α is a free parameter, and $n_c$ is the number of items in category c.
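The formula itself is not reproduced in these notes, but a prior of this form can be sketched in a few lines of Python. This is an illustrative "rich get richer" parameterisation (existing categories chosen in proportion to $n_c$, a new category in proportion to α) and may differ in detail from the exact formulation used by Feldman & Griffiths (2009).

```python
import numpy as np

def rich_get_richer_prior(counts, alpha):
    """Prior probability of assigning the next data point to each existing
    category (proportional to its size n_c) or to a brand-new category
    (proportional to the free parameter alpha)."""
    counts = np.asarray(counts, dtype=float)   # sizes n_c of the existing categories
    total = counts.sum() + alpha
    p_existing = counts / total                # one probability per existing category
    p_new = alpha / total                      # probability of opening a new category
    return p_existing, p_new

# Example: categories of sizes 5, 2 and 1 with alpha = 1
# gives p_existing = [5/9, 2/9, 1/9] and p_new = 1/9
p_existing, p_new = rich_get_richer_prior([5, 2, 1], alpha=1.0)
```

Larger categories attract new points more strongly, and the α term leaves some probability for a new category, which is what lets the model infer how many categories there are rather than being told.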
Results (Feldman & Griffiths, 2009)
This creates a model that does better than our simple mixture of Gaussians with EM model did.
[Figure: correct categories vs. model output]

But it's still not great. Idea: people have additional information, namely what words exist in the language. Suppose you know that b and p are distinct phonetic categories, but aren't sure about a and i. Hearing two separate words, ball and pill, helps: it may imply that a and i are distinct.

Add this knowledge into the model (Feldman & Griffiths, 2009)
Now the model receives sets of speech sounds (i.e., words) and has to figure out which of them are distinct words in the lexicon.

              Hindi     English
  [g]obi      Word 1    Word 1
  [k]obi      Word 2    Word 1
  [kh]obi     Word 3    Word 2

(In Hindi these three sounds begin three distinct words; in English the [k]/[kh] difference does not distinguish words.)

… skipping details of how this is done …

The model that does so significantly improves its ability to identify the correct phoneme categories.
[Figure: correct categories vs. phonetic-only model vs. phonetic + words model]

Implications for cognitive science:
1. You can get reasonably far with sensible distributional learning.
2. Learning is nevertheless massively improved when distributional learning is combined with other information (e.g., about the words in the language) which is also learned.

Take-home points
Soft k-means clustering allows for probabilistic assignment of points to clusters, but does not solve the problems k-means has on datasets with clusters of unequal size or variance.
Mixture of Gaussians is an extension of this that assumes each cluster is a Gaussian with a different mean and variance. It solves those problems, but still occasionally shows pathological behaviour caused by infinite likelihood on one point.
These problems might be addressed to some extent with priors on the means and variances.
The Feldman & Griffiths model does this, and also integrates over clusterings rather than finding a single best one. This improves performance. They also combine it with a word learning model and improve it even more, which suggests that a human learner might similarly use such information.

References
K-means clustering and mixture of Gaussians:
MacKay, D. (2003). Information theory, inference, and learning algorithms. Chapter 22.
Phonetic learning model:
Feldman, N., & Griffiths, T. (2009). Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society.