Population Stratification with Limited Data

Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou

The Problem

Given: 2n samples from two hidden distributions P1 and P2, with unknown labels.
- Each sample (individual) consists of k features with 0/1 values.
- Population P1: feature f is 1 with probability p1f.
- Population P2: feature f is 1 with probability p2f.
- The feature probabilities are unknown.

Goal: Classify each individual correctly, for most inputs.

Applications

- Preprocessing step in statistical analysis. Example: to analyze the factors that cause a complex disease such as cancer, first cluster the samples into populations, then apply the statistical analysis.
- Collaborative filtering. A feature can be "likes Star Wars or not"; cluster users into types using the features.

Our Results

Some separation between the two distributions is needed. Measures of separation (distance between the means):
- γ = (L1 distance between the means) / k
- γ2 = (L2² distance between the means) / k

Results:
- An optimization function and a poly-time algorithm for γk = Ω(√(k log n)).
- An optimization function alone for γk = Ω(log n).

This talk: the optimization function and the poly-time algorithm for γk = Ω(√(k log n)).

Example: P1: for each feature f, p1f = ½. P2: for each feature f, p2f = ½ + √(log n)/√k. This separation is information-theoretically optimal: there exist two distributions with this separation and constant overlap in probability mass.

Optimization Function

What measure should we optimize to get the correct clustering? We need a robust measure that works for small separations.

A Robust Measure

Find the best balanced partition (S, S') such that f(S, S') = Σ_f |Nf(S) − Nf(S')| is maximum, where Nf(S) and Nf(S') are the numbers of individuals with feature f in S and S', respectively.

Theorem: Optimizing this measure yields the correct partition w.h.p. if γk = Ω(√(k log n)).

Proof sketch: how does the optimal partition behave?
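As a concrete illustration, the objective Σ_f |Nf(S) − Nf(S')| over a balanced partition can be computed as follows (a minimal sketch; the array layout, function name, and toy parameters are our own, not from the talk):

```python
import numpy as np

def partition_objective(X, side):
    """Sum over features f of |Nf(S) - Nf(S')| for the partition
    S = {i : side[i]}, S' = {i : not side[i]}."""
    nf_S = X[side].sum(axis=0)    # per-feature 1-counts inside S
    nf_Sp = X[~side].sum(axis=0)  # per-feature 1-counts inside S'
    return int(np.abs(nf_S - nf_Sp).sum())

# Toy instance with two planted populations that differ on every feature.
rng = np.random.default_rng(0)
n, k = 200, 50
X = np.vstack([(rng.random((n // 2, k)) < 0.5),
               (rng.random((n // 2, k)) < 0.8)]).astype(int)

planted = np.arange(n) < n // 2            # the true (planted) partition
random_part = rng.permutation(n) < n // 2  # a random balanced partition
```

On such an instance the planted partition scores far above a typical random balanced one, which is exactly the gap the theorem exploits.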
- For the planted partition P: E[f(P)] = γkn + Θ(k√n).
- For any fixed balanced partition: E[f] = Θ(k√n).
- Concentration: Pr[ |f(P) − E[f(P)]| > n√k ] ≤ 2^(−n), and likewise Pr[ |f(Q) − E[f(Q)]| > n√k ] ≤ 2^(−n) for any fixed balanced partition Q.

Hence, by a union bound, the partition with the optimal value of f among the partitions close to P dominates all the partitions far from P w.h.p. under the separation conditions.

An Algorithm

How can we find the partition that optimizes this measure?

Theorem: There is an algorithm that finds the correct partition when γk = Ω(√k log² n). Running time: O(nk log² n).

Algorithm:
1. Divide the individuals into two sets, A and B.
2. Start with a random partition of A.
3. Iterate log n times: classify B using the current partition of A and a proximity score, and then classify A using the current partition of B in the same way.

A random partition has (½ + Ω(1/√n))-imbalance, and each iteration produces a partition with more imbalance.

Classification Score

Our score, for each feature f:
- If Nf(S) > Nf(S'), add 1 to the score if f is present, else subtract 1.
- If Nf(S) < Nf(S'), add 1 to the score if f is absent, else subtract 1.

Classify: individuals with score above the median go to S; individuals with score below the median go to S'.

Classification

Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].

Lemma: If the current partition has (½ + c)-imbalance, the next iteration produces the correct partition under our separation conditions.
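The score-and-reclassify iteration just described might look like this in code (an illustrative sketch under our own naming and toy parameters, not the authors' implementation; for simplicity it reuses all k features in every round, whereas the analysis draws a fresh feature set per round for independence):

```python
import numpy as np

def classify_by_score(X_ref, side_ref, X_new):
    """Classify the individuals in X_new using the current partition
    (side_ref) of a reference set X_ref.

    Per feature f: if Nf(S) > Nf(S'), presence of f scores +1 and
    absence -1; if Nf(S) < Nf(S'), the signs are flipped.  Individuals
    above the median total score go to S, the rest to S'."""
    nf_S = X_ref[side_ref].sum(axis=0)
    nf_Sp = X_ref[~side_ref].sum(axis=0)
    sign = np.where(nf_S > nf_Sp, 1, -1)           # per-feature vote direction
    scores = ((2 * X_new - 1) * sign).sum(axis=1)  # +1/-1 agreement per feature
    return scores > np.median(scores)

# Toy instance matching the talk's example: p1f = 1/2, p2f = 1/2 + sqrt(log n / k).
rng = np.random.default_rng(1)
n, k = 400, 256
delta = np.sqrt(np.log(n) / k)
labels = np.arange(n) < n // 2                     # True = population 1
X = (rng.random((n, k)) < np.where(labels[:, None], 0.5, 0.5 + delta)).astype(int)

# Split the individuals into A and B, start from a random partition of A,
# and alternate classification rounds as in the algorithm above.
A, B = np.arange(0, n, 2), np.arange(1, n, 2)
side_A = rng.permutation(len(A)) < len(A) // 2
for _ in range(int(np.log2(n))):                   # ~log n boosting rounds
    side_B = classify_by_score(X[A], side_A, X[B])
    side_A = classify_by_score(X[B], side_B, X[A])

# Accuracy up to swapping the two sides' names.
agree = (side_A == labels[A]).mean()
accuracy = max(agree, 1 - agree)
```

The small initial imbalance of the random partition is what the lemmas amplify: each round roughly doubles the imbalance until the partition is essentially correct.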
Θ(log n) rounds are needed to reach the correct partition, and a fresh set of features is used in each round to keep the rounds independent.

Proof sketch (first lemma): Let X and Y be the scores of individuals from population 1 and population 2; each is distributed approximately as Bin(k, ½), with a gap G between their means. When the current partition has (½ + ε)-imbalance, G = Θ(ε γ2 k √n); initially, with a random partition (ε = Θ(1/√n)), G = Θ(γ2 k) = Ω(log n) under the separation conditions. Then

Pr[correct classification] = ½ + Θ(G/√k) ≥ ½ + 2ε

[from the separation conditions], so the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].

Proof sketch (second lemma): With a (½ + c)-imbalanced partition, G = Θ(γ2 k √n), and all but a 1/poly(n) fraction of the individuals are correctly classified, which gives the correct partition under our separation conditions.

Related Work

Learning mixtures of Gaussians [D99]; the best performance is achieved by spectral algorithms [VW02, AM05, KSV05]. Our algorithm matches the bounds of [VW02] for two clusters, yet it is not a spectral algorithm!

Open Questions

- How can our algorithm be extended to work for multiple clusters?
- What is the relationship between our algorithm and spectral algorithms? Our algorithm matches the spectral algorithms of [M01] for two-way graph partitioning. Can it do better?

Thank You!