Population Stratification with Limited Data
By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou

The Problem
  Given: 2n samples from two hidden distributions P1 and P2, with unknown labels
  Each sample (individual) has k features with 0/1 values
  Population P1: feature f is 1 w.p. p1f
  Population P2: feature f is 1 w.p. p2f
  The feature probabilities are unknown
  Goal: classify each individual correctly for most inputs

Applications
  Preprocessing step in statistical analysis:
    Analyze the factors that cause a complex disease, such as cancer
    Cluster the samples into populations, then apply statistical analysis
  Collaborative filtering:
    A feature can be "likes Star Wars or not"
    Cluster users into types using the features

Our Results
  Need some separation between the distributions!
  Measure of separation: distance between means
    L1 separation = L1 distance between means / k
    L2² separation = L2² distance between means / k
  Optimization function and poly-time algorithm: (L1 separation) · k = Ω(√k log n)
  Optimization function: (L2² separation) · k = Ω(log n)
  This talk: optimization function and poly-time algorithm, (L1 separation) · k = Ω(√k log n)
  Example:
    P1: for each feature f, p1f = ½
    P2: for each feature f, p2f = ½ + √(log n)/√k
  Information-theoretically optimal: there exist two distributions with this separation and constant overlap in probability mass

Optimization Function
  What measure should we optimize to get the correct clustering?
  Need a robust measure that works for small separations

A Robust Measure
  Find the best balanced partition (S, S') such that Σ_f |Nf(S) - Nf(S')| is maximum
  Nf(S), Nf(S'): number of individuals with feature f in S, S'
  Theorem: Optimizing this measure provides the correct partition w.h.p. if (L1 separation) · k = Ω(√k log n)

Proof Sketch
  How does the optimal partition behave?
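As a rough illustration of that question, the Python sketch below evaluates the robust measure Σ_f |Nf(S) - Nf(S')| on the true partition and on a random balanced partition of synthetic data. The helper names (sample_population, partition_score, compare_partitions) and the parameter values (n = 100 per population, k = 200, a per-feature gap of 0.2) are assumptions chosen for the demonstration and do not come from the talk.

```python
import random


def sample_population(n, probs, rng):
    """Draw n individuals; feature f is 1 with probability probs[f]."""
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(n)]


def partition_score(side_s, side_t):
    """Robust measure: sum over features f of |N_f(S) - N_f(S')|."""
    k = len(side_s[0])
    score = 0
    for f in range(k):
        n_s = sum(ind[f] for ind in side_s)  # individuals in S with feature f
        n_t = sum(ind[f] for ind in side_t)  # individuals in S' with feature f
        score += abs(n_s - n_t)
    return score


def compare_partitions(n=100, k=200, gap=0.2, seed=0):
    """Score the true partition and one random balanced partition."""
    rng = random.Random(seed)
    pop1 = sample_population(n, [0.5] * k, rng)        # P1: p1f = 1/2
    pop2 = sample_population(n, [0.5 + gap] * k, rng)  # P2: p2f = 1/2 + gap
    true_score = partition_score(pop1, pop2)
    mixed = pop1 + pop2
    rng.shuffle(mixed)
    random_score = partition_score(mixed[:n], mixed[n:])
    return true_score, random_score


if __name__ == "__main__":
    true_score, random_score = compare_partitions()
    print("true partition:  ", true_score)
    print("random partition:", random_score)
```

With a gap this large the true partition should score well above the random one. The talk concerns much smaller gaps, where the difference is harder to see on a single draw.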
  (I) Correct partition P: E[f(P)] = (L1 separation) · k n + k√n
  (II) Any other partition: E[f(any partition)] = k√n
  In both cases: Pr[ |f(P) - E[f]| > n√k ] ≤ 2^-n
  The partition with the optimal value of f in (I) dominates all the partitions in (II) w.h.p. under the separation conditions

An Algorithm
  How can we find the partition which optimizes this measure?
  Theorem: There exists an algorithm which finds the correct partition when (L1 separation) · k = Ω(√k log2 n)
  Running time: O(nk log2 n)

An Algorithm
  1. Divide individuals into two sets, A and B
  2. Start with a random partition of A
  3. Iterate log n times:
     1. Classify B using the current partition of A and a proximity score
     2. Do the same for A, using the current partition of B
  A random partition has (1/2 + 1/√n)-imbalance
  Each iteration produces a partition with more imbalance

Classification Score
  Our score: for each feature f,
    if Nf(S) > Nf(S'), add 1 to the score if f is present, else subtract 1
    if Nf(S) < Nf(S'), add 1 to the score if f is absent, else subtract 1
  Classify:
    Individuals above the median score: S
    Individuals below the median score: S'

Classification
  Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c]
  Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions
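The iteration and the classification score described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, ties at the median are broken by sort order, features with equal counts on both sides are skipped, the number of rounds is taken as ceil(log2 n), and the same block of fresh features is used for both classification steps within a round.

```python
import math
import random


def proximity_scores(targets, side_s, side_t, features):
    """Score each target individual against the partition (S, S')."""
    scores = [0] * len(targets)
    for f in features:
        n_s = sum(ind[f] for ind in side_s)
        n_t = sum(ind[f] for ind in side_t)
        if n_s == n_t:
            continue  # assumption: a tied feature contributes nothing
        direction = 1 if n_s > n_t else -1
        for i, ind in enumerate(targets):
            # +1 if the individual agrees with the majority side S, else -1
            scores[i] += direction if ind[f] == 1 else -direction
    return scores


def median_split(scores):
    """Label the top half by score True (S) and the rest False (S')."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [False] * len(scores)
    for i in order[: len(scores) // 2]:
        labels[i] = True
    return labels


def classify(targets, reference, ref_labels, features):
    """Classify targets using the current partition of the reference set."""
    side_s = [x for x, lab in zip(reference, ref_labels) if lab]
    side_t = [x for x, lab in zip(reference, ref_labels) if not lab]
    return median_split(proximity_scores(targets, side_s, side_t, features))


def stratify(data, rng):
    """Return one boolean label (S or S') per individual in data."""
    n, k = len(data), len(data[0])
    idx = list(range(n))
    rng.shuffle(idx)
    a_idx, b_idx = idx[: n // 2], idx[n // 2:]   # step 1: sets A and B
    group_a = [data[i] for i in a_idx]
    group_b = [data[i] for i in b_idx]
    half = len(group_a) // 2
    labels_a = [True] * half + [False] * (len(group_a) - half)
    rng.shuffle(labels_a)                        # step 2: random partition of A
    rounds = max(1, math.ceil(math.log2(n)))     # step 3: about log n rounds
    chunk = max(1, k // rounds)
    labels_b = [False] * len(group_b)
    for r in range(rounds):
        # fresh block of features for this round (wraps around if k is small)
        features = [(r * chunk + j) % k for j in range(chunk)]
        labels_b = classify(group_b, group_a, labels_a, features)
        labels_a = classify(group_a, group_b, labels_b, features)
    labels = [False] * n
    for i, lab in zip(a_idx, labels_a):
        labels[i] = lab
    for i, lab in zip(b_idx, labels_b):
        labels[i] = lab
    return labels


def demo(seed=0):
    """Accuracy on synthetic data, up to swapping the two labels."""
    rng = random.Random(seed)
    n_per_pop, k = 100, 400
    data = [[int(rng.random() < 0.2) for _ in range(k)] for _ in range(n_per_pop)]
    data += [[int(rng.random() < 0.8) for _ in range(k)] for _ in range(n_per_pop)]
    truth = [False] * n_per_pop + [True] * n_per_pop
    labels = stratify(data, rng)
    agree = sum(l == t for l, t in zip(labels, truth)) / len(truth)
    return max(agree, 1 - agree)
```

The synthetic populations in demo are deliberately far apart (feature probabilities 0.2 versus 0.8), so the sketch only illustrates the mechanics of the iteration. Because the median split forces a balanced partition of A and of B, a few individuals are necessarily misclassified whenever the random A/B split does not contain exactly half of each population.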
  log n rounds are needed to get the correct partition
  Use a fresh set of features in each round to get independence

Proof Sketch
  Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c]
  [Figure: Population 1 and Population 2, with G marked; X, Y ≈ Bin(k, ½)]
  G = ( 2 k√n)
  Initially: G ≈ (log n)
  Pr[Correct Classification] = ½ + Ga/√k /(½ + ½) > ½ + 2ε  [from separation conditions]

Proof Sketch
  Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition under our separation conditions
  [Figure: Population 1 and Population 2, with G marked]
  G = ( 2 k√n)
  All but a 1/poly(n) fraction is correctly classified

Related Work
  Learning mixtures of Gaussians [D99]
  Best performance by spectral algorithms [VW02, AM05, KSV05]
  Our algorithm:
    Matches the bounds in [VW02] for two clusters
    Not a spectral algorithm!

Open Questions
  How can our algorithm be extended to work for multiple clusters?
  What is the relationship between our algorithm and spectral algorithms?
  Matches spectral algorithms of [M01] for two-way graph partitioning
  Can our algorithm do better?

Thank You!