
Population Stratification with Limited Data

Kamalika Chaudhuri, Eran Halperin, Satish Rao, and Shuheng Zhou
The Problem

Given:
- Samples from two hidden distributions, P1 and P2
- Unknown labels

Each sample/individual has k features with 0/1 values:
- Population P1: feature f is 1 with probability p1f
- Population P2: feature f is 1 with probability p2f
- The feature probabilities are unknown
The Problem

Given:
- 2n samples from two hidden distributions, P1 and P2
- Unknown labels

Goal: classify each individual correctly, for most inputs
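To make the model concrete, here is a minimal sketch of the sampling process; the sizes n and k and the probability vectors are hypothetical stand-ins, since the slides leave them unspecified.

```python
import numpy as np

# Minimal sketch of the data model: two product distributions over {0,1}^k.
rng = np.random.default_rng(0)
n, k = 100, 50                       # hypothetical: n samples per population
p1 = rng.uniform(0, 1, size=k)       # unknown feature probabilities of P1
p2 = rng.uniform(0, 1, size=k)       # unknown feature probabilities of P2

# 2n individuals; feature f of a P_i individual is 1 with probability p_if.
X = np.vstack([rng.random((n, k)) < p1,
               rng.random((n, k)) < p2]).astype(int)
labels = np.repeat([0, 1], n)        # hidden from the algorithm
```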
Applications

Preprocessing step in statistical analysis:
- Analyze the factors that cause a complex disease, such as cancer
- Cluster the samples into populations, then apply the statistical analysis

Collaborative filtering:
- A feature can be "likes Star Wars or not"
- Cluster users into types using the features
Our Results

Need some separation between the distributions!

Measure of separation: distance between means
- γ = (L1 distance between means) / k
- σ² = (squared L2 distance between means) / k

Our results:
- Optimization function and poly-time algorithm: γk = Ω(√(k log n))
- Optimization function alone: σ²k = Ω(log n)
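A small sketch of the two measures, assuming mu1 and mu2 are the k-dimensional mean vectors (p1f)f and (p2f)f; the symbols γ and σ² are my reading of the slide's garbled notation.

```python
import numpy as np

# Separation measures between the two population means (a sketch).
def separations(mu1, mu2):
    k = len(mu1)
    gamma = np.abs(mu1 - mu2).sum() / k    # L1 distance between means, over k
    sigma2 = ((mu1 - mu2) ** 2).sum() / k  # squared L2 distance, over k
    return gamma, sigma2
```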
Our Results

This talk:
- Optimization function and poly-time algorithm: γk = Ω(√(k log n))

Example:
- P1: for each feature f, p1f = ½
- P2: for each feature f, p2f = ½ + √(log n / k)

Information-theoretically optimal:
- There exist two distributions with this separation and constant overlap in probability mass
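As a sanity check on the reconstructed bounds (the algebra here is mine, not the slides'), the example sits exactly at both thresholds:

\[
\gamma k = \sum_f |p_{2f} - p_{1f}| = k\sqrt{\frac{\log n}{k}} = \sqrt{k \log n},
\qquad
\sigma^2 k = \sum_f (p_{2f} - p_{1f})^2 = k \cdot \frac{\log n}{k} = \log n.
\]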
Optimization Function

What measure should we optimize to get the correct clustering?

We need a robust measure, one that works for small separations.
A Robust Measure

Find the best balanced partition (S, S') such that
  Σf |Nf(S) − Nf(S')|
is maximized.

Nf(S), Nf(S'): number of individuals with feature f in S, S'

Theorem: Optimizing this measure yields the correct partition w.h.p. if
  γk = Ω(√(k log n))
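The objective is easy to state in code; a sketch, with S and S' given as index arrays into the sample matrix X from the earlier sketch:

```python
import numpy as np

# The measure to maximize over balanced partitions (S, S').
def robust_measure(X, S, Sprime):
    Nf_S = X[S].sum(axis=0)        # N_f(S):  individuals in S with feature f
    Nf_Sp = X[Sprime].sum(axis=0)  # N_f(S'): individuals in S' with feature f
    return int(np.abs(Nf_S - Nf_Sp).sum())
```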
Proof Sketch

How does the optimal partition behave?
- (I) E[f(P)] = γkn + k√n for the correct partition P
- (II) E[f(Q)] = k√n for any other fixed partition Q
- Pr[ |f(P) − E[f(P)]| > n√k ] ≤ 2^−n
- Pr[ |f(Q) − E[f(Q)]| > n√k ] ≤ 2^−n
- By a union bound over all partitions, the partition with the optimal value of f in (I) dominates all the partitions in (II) w.h.p. under the separation conditions
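One way to see the domination step, under the notation reconstructed above (the constants are mine):

\[
E[f(P)] - E[f(Q)] = \gamma k n,
\qquad
\gamma k = \Omega\big(\sqrt{k \log n}\big)
\;\Rightarrow\;
\gamma k n = \Omega\big(n \sqrt{k \log n}\big) \gg n\sqrt{k},
\]

so even after each side deviates by its worst-case n√k, f(P) still exceeds f(Q), and the union bound over partitions finishes the claim.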
An Algorithm

How can we find the partition that optimizes this measure?

Theorem: There exists an algorithm that finds the correct partition when
  γk = Ω(√k log² n)
Running time: O(nk log² n)
An Algorithm

Algorithm:
1. Divide the individuals into two sets, A and B
2. Start with a random partition of A
3. Iterate log n times:
   a. Classify B using the current partition of A and a proximity score
   b. Do the same for A, using the current partition of B
An Algorithm

- Random partition: (½ + 1/√n) imbalance
- Iterate: classify B using the current partition of A and a score, and vice versa
- Each iteration produces a partition with more imbalance
Classification Score

Our score: for each feature f,
- If Nf(S) > Nf(S'): add 1 to the score if f is present, else subtract 1
- If Nf(S) < Nf(S'): add 1 to the score if f is absent, else subtract 1

Classify:
- Individuals above the median score: S
- Individuals below the median score: S'
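Putting the score and the iteration together, here is a runnable sketch of the whole procedure; the tie-breaking, the exact number of rounds, and the way features are split across rounds are my assumptions, not the paper's specification.

```python
import numpy as np

def score(X, S, Sprime, feats):
    # Majority direction of each feature under the current partition:
    # +1 if f occurs more often in S, -1 if more often in S', 0 on ties.
    direction = np.sign(X[S][:, feats].sum(axis=0)
                        - X[Sprime][:, feats].sum(axis=0))
    # A present feature adds the direction, an absent one subtracts it,
    # matching the +/-1 rule on the previous slide.
    return (2 * X[:, feats] - 1) @ direction

def classify(X, S, Sprime, targets, feats):
    # Split `targets` at the median score: above the median goes to S's side.
    s = score(X, S, Sprime, feats)[targets]
    above = s > np.median(s)
    return targets[above], targets[~above]

def stratify(X, rng):
    m, k = X.shape                                # m = 2n individuals
    rounds = max(1, int(np.log2(m)))
    # Fresh block of features each round, for independence across rounds.
    blocks = np.array_split(rng.permutation(k), rounds)
    halves = rng.permutation(m)
    A, B = halves[: m // 2], halves[m // 2:]
    SA, SA2 = A[: len(A) // 2], A[len(A) // 2:]   # random partition of A
    for feats in blocks:
        SB, SB2 = classify(X, SA, SA2, B, feats)  # classify B from A's partition
        SA, SA2 = classify(X, SB, SB2, A, feats)  # and vice versa
    return np.concatenate([SA, SB]), np.concatenate([SA2, SB2])

# Hypothetical instance matching the example slide:
# p1f = 1/2, p2f = 1/2 + sqrt(log n / k).
rng = np.random.default_rng(0)
n, k = 500, 400
delta = np.sqrt(np.log(n) / k)
X = np.vstack([rng.random((n, k)) < 0.5,
               rng.random((n, k)) < 0.5 + delta]).astype(int)
side1, side2 = stratify(X, rng)
```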

Classification

Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].

Lemma: If the current partition has (½ + c)-imbalance, the next iteration produces the correct partition under our separation conditions.

Θ(log n) rounds are needed to reach the correct partition.

Use a fresh set of features in each round to get independence.
Proof Sketch

Lemma: If the current partition has (½ + ε)-imbalance, the next iteration produces a partition with (½ + 2ε)-imbalance [for ε < c].

Gap between the two populations' score distributions: G = Θ(ε σ² k √n)
Initially (random partition, ε ≈ 1/√n): G ≈ Θ(log n)

[Figure: score distributions of Population 1 and Population 2, each roughly Bin(k, ½), with means separated by the gap G]
Proof Sketch

Lemma (continued): with gap G = Θ(ε σ² k √n),

Pr[correct classification] = ½ + Θ(G/√k) > ½ + 2ε   [from the separation conditions]

[Figure: the two score distributions again, separated by the gap G]
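Reading the two displayed formulas together (my reconstruction of the garbled symbols):

\[
\frac{G}{\sqrt{k}}
= \Theta\!\left(\frac{\epsilon\, \sigma^2 k \sqrt{n}}{\sqrt{k}}\right)
= \Theta\big(\epsilon\, \sigma^2 \sqrt{kn}\big),
\qquad\text{so}\qquad
\Pr[\text{correct}] = \tfrac12 + \Theta\big(\epsilon\, \sigma^2 \sqrt{kn}\big) \ge \tfrac12 + 2\epsilon
\]

whenever σ²√(kn) exceeds a fixed constant, which is what the separation conditions are invoked for.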
Proof Sketch

Lemma: If the current partition has (½ + c)-imbalance, the next iteration produces the correct partition under our separation conditions.

With ε = c, the gap G = Θ(ε σ² k √n) is large enough that all but a 1/poly(n) fraction of the individuals are correctly classified.

[Figure: the two score distributions, now far apart]
Related Work

Learning mixtures of Gaussians [D99]:
- Best performance by spectral algorithms [VW02, AM05, KSV05]

Our algorithm:
- Matches the bounds in [VW02] for two clusters
- Not a spectral algorithm!

Open Questions

How can we extend our algorithm to work for multiple clusters?

What is the relationship between our algorithm and spectral algorithms?
- It matches the spectral algorithms of [M01] for two-way graph partitioning
- Can our algorithm do better?

Thank You!