
Clustering Data Streams
A presentation by George Toderici
Talk Outline
1. Goals of the paper
2. Notation reminder
3. Clustering with Little Memory
4. Data Stream Model
5. Clustering with Data Streams
6. Lower Bounds and Deterministic Algorithms
7. Conclusion
Goals of the paper
Since the k-median problem is NP-hard, the paper develops approximation algorithms under the following constraints:
• Minimize memory usage
• Minimize CPU usage
• Work both in general metric spaces and in the special case of Euclidean space
Notation Reminder
O(g(n)) – upper bound: running time grows at most as fast as g(n)
Ω(g(n)) – lower bound: running time grows at least as fast as g(n)
o(g(n)) – asymptotically negligible relative to g(n)
Θ(g(n)) – tight bound: both upper- and lower-bounded by g(n)
Soft-O: $\tilde{O}(g(n)) = O(g(n) \log^k g(n))$ for some constant k
Paper-specific Notation
• $c_{ij}$ is the distance between points i and j
• $d_i$ is the number of points assigned to median i
• NOTE: Do not confuse c and d. Presumably the distance is written $c_{ij}$ because it is treated as a "cost"; calling it "d" (for "distance") would have been more intuitive, but d is already taken.
Clustering with little memory
Algorithm SmallSpace(S):
1) Divide S into l disjoint pieces X1, …, Xl
2) For each Xi, find O(k) centers in it; assign each point to its closest center
3) Let X' be the set of O(lk) centers obtained, where each center is weighted by the number of points assigned to it
4) Cluster X' to find k centers
SmallSpace (2)
[Figure: the stream is split into chunks (Chunk 1, Chunk 2, Chunk 3, …); each chunk is clustered in main memory, and the resulting centers are clustered down to k.]
SmallSpace analysis
• Since we want to use as little memory as possible, l must be chosen so that both each partition of S and the weighted instance X' fit in main memory. However, if S is very large, no such l may exist.
• We will use this algorithm as a starting point and improve it until it satisfies all the requirements.
Theorem 1
Given an instance of the k-median problem with a solution of cost C, where the medians may not belong to the set of input points, there exists a solution of cost 2C in which all the medians belong to the set of input points (the requirement in the metric-space version of the problem).
Theorem 1 Proof
• Let m be a median of the unconstrained solution, and let p be the input point closest to m (the point labeled (4) in the original figure).
• By the triangle inequality, the distance from any point i served by m to p is bounded: $c_{ip} \le c_{im} + c_{mp} \le 2c_{im}$, since $c_{mp} \le c_{mi}$ by the choice of p.
• Therefore, replacing each median by its nearest input point at most doubles the cost of the unconstrained median clustering (worst case).
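Spelling out the summation makes the factor of 2 explicit. Here m(i) denotes i's median in the unconstrained solution and p(m) the input point closest to m (notation introduced here for the summation):

```latex
\sum_{i} c_{i\,p(m(i))}
\;\le\; \sum_{i} \bigl( c_{i\,m(i)} + c_{m(i)\,p(m(i))} \bigr)
\;\le\; 2 \sum_{i} c_{i\,m(i)} \;=\; 2C
```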
Theorem 2
Consider a set of n points partitioned into disjoint sets X1, …, Xl. The sum of the optimum k-median solution values on the l sets is at most twice the cost of the optimum k-median solution for all n points.
Theorem 2 Proof
• This is Theorem 1 applied piecewise: within each set Xi, the medians of the global optimum act as medians that "may not belong" to Xi.
• Apply Theorem 1 to each of the l sets and sum: the total cost is at most twice the cost obtained when medians need not be part of the data.
Theorem 3 (SmallSpace step 2)
If the sum of the costs of the l optimum k-median solutions for X1, …, Xl is C, and C* is the cost of the optimum k-median solution on the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X'.
Theorem 3 Proof (1)
• Let i' be a point in X' (a median produced by SmallSpace)
• Let ℓ(i') be the point to which i' is assigned in the optimum continuous solution on X', and let $d_{i'}$ be the number of points assigned to i'
• Then the cost of X' is
$$\sum_{i'} d_{i'} \, c_{i'\,\ell(i')}$$
Theorem 3 Proof (2)
• Let i be a point in the set S, and let i'(i) be the median in X' to which it was assigned by SmallSpace.
• Then the cost of X' can be rewritten as a sum over the original points:
$$\sum_{i} c_{i'(i)\,\ell(i'(i))}$$
• Let μ(i) be the median assigned to i in the optimal continuous solution on S
Theorem 3 Proof (3)
• Because ℓ is optimal for X', the cost of X' is no more than the cost of assigning each i'(i) to μ(i); by the triangle inequality,
$$\sum_{i} c_{i'(i)\,\mu(i)} \;\le\; \sum_{i} \bigl( c_{i\,i'(i)} + c_{i\,\mu(i)} \bigr)$$
• The last sum is C + C* in the continuous case, hence at most 2(C + C*) in the metric-space case (by Theorem 1)
• [Reminder: C is the sum of the costs of the l optimum k-median solutions for X1, …, Xl, and C* is the cost of the optimum k-median solution on the entire set S]
Theorem 4 (SmallSpace steps 2, 4)
• Modify step 2 to use an (a, b)-bicriteria approximation algorithm, which outputs at most ak medians with cost at most b times the optimal k-median cost
• Modify step 4 to run a c-approximation algorithm
Theorem 4: The algorithm SmallSpace then has an approximation factor of 2c(1 + 2b) + 2b [not proven here]
SmallerSpace
Algorithm SmallerSpace(S, i):
1) Divide S into l disjoint pieces X1, …, Xl
2) For each Xi, find O(k) centers in it; assign each point to its closest center
3) Let X' be the O(lk) centers obtained in (2), where each center is weighted by the number of points assigned to it
4) Call SmallerSpace(X', i−1)
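A minimal recursive sketch, reusing the toy kmedian_local_search from the SmallSpace sketch (again an assumption standing in for the paper's subroutines); the base case performs the final clustering down to k medians:

```python
def smaller_space(points, weights, k, i, l):
    # Base case: cluster whatever remains down to k medians.
    if i == 0 or len(points) <= l * k:
        return kmedian_local_search(points, weights, k)
    centers, new_weights = [], []
    for j in range(l):  # l disjoint pieces
        piece, pw = points[j::l], weights[j::l]
        meds = list(dict.fromkeys(kmedian_local_search(piece, pw, k)))
        totals = dict.fromkeys(meds, 0)
        for p, w in zip(piece, pw):  # weight of a center = sum of assigned weights
            totals[min(meds, key=lambda m: abs(p - m))] += w
        centers.extend(meds)
        new_weights.extend(totals[m] for m in meds)
    # Recurse on the weighted instance X' with one level fewer.
    return smaller_space(centers, new_weights, k, i - 1, l)
```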
SmallerSpace (2)
[Figure: a tree of divide-and-conquer levels, each node reducing its input to k centers.]
• A small factor is lost in the approximation with each level of divide and conquer
• In general, if $|\text{Memory}| = n^{\varepsilon}$, we need $1/\varepsilon$ levels, for an approximation factor of $2^{O(1/\varepsilon)}$
• If $n = 10^{12}$ and $M = 10^{6}$, the regular 2-level algorithm suffices
• If $n = 10^{12}$ and $M = 10^{3}$, we need 4 levels, for an approximation factor of $2^{4}$
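The arithmetic behind the two examples, writing the memory budget as $M = n^{\varepsilon}$:

```latex
\varepsilon = \frac{\log M}{\log n}, \qquad
\text{levels} = \frac{1}{\varepsilon} = \frac{\log n}{\log M}:
\qquad \frac{12}{6} = 2 \text{ levels}, \qquad
\frac{12}{3} = 4 \text{ levels} \;\Rightarrow\; \text{factor } 2^{4}
```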
SmallerSpace Analysis
Theorem 5: For a constant i, SmallerSpace(S, i) gives a constant-factor approximation to the k-median problem.
Proof: The approximation factor at level j satisfies $A_j = 2A_{j-1}(2b+1) + 2b$ (Theorems 2 and 4), which has the solution $A_j = O\bigl((2(2b+1))^{j}\bigr)$; this is O(1) when j is constant.
SmallerSpace Analysis (2)
• However, since all intermediate medians X' must be stored in memory, the number of subsets l that we can partition S into is limited.
• In fact we need lk ≤ M (where M is the memory size), and such an l may not exist.
Data stream model
• Data stream: an ordered sequence of points x1, …, xi, …, xn
• Algorithm performance is measured as the number of passes over the data, given the constraints of available memory
• Usually the number of points is so large that it is impossible to fit all of them in memory
• Usually, once a point has been read, it is very expensive to read it again; most algorithms assume the data will not be available for a second pass
Data Stream Algorithm
1) Read the first m points; use a bicriteria algorithm to reduce them to O(k) (e.g., 2k) points. Weight each intermediate median by the number of points assigned to it. (Depending on the algorithm used, this takes $O(m^2)$ or $O(mk)$ time.)
2) Repeat (1) until we have seen $m^2/(2k)$ of the original data points
3) Cluster these m first-level medians into 2k second-level medians
Data Stream Algorithm (2)
4) Maintain at most m level-i medians; on seeing m of them, generate 2k level-(i+1) medians, with the weight of each new median being the sum of the weights of the intermediate medians assigned to it
5) When we have seen all data points, or when we decide to stop, cluster all intermediate medians into k final medians
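A compact sketch of this one-pass hierarchy, again leaning on the toy kmedian_local_search from the SmallSpace sketch in place of the paper's bicriteria subroutine (assumes m > 2k so each reduction shrinks its level):

```python
def stream_kmedian(stream, k, m):
    levels = [[]]  # levels[i] holds (point, weight) pairs

    def reduce_to(pairs, nmed):
        # Cluster a weighted buffer down to nmed weighted medians.
        pts = [p for p, _ in pairs]
        ws = [w for _, w in pairs]
        meds = list(dict.fromkeys(kmedian_local_search(pts, ws, nmed)))
        agg = dict.fromkeys(meds, 0)
        for p, w in pairs:  # new weight = sum of assigned weights
            agg[min(meds, key=lambda md: abs(p - md))] += w
        return list(agg.items())

    for x in stream:
        levels[0].append((x, 1))
        i = 0
        while len(levels[i]) >= m:  # cascade full levels upward
            meds = reduce_to(levels[i], 2 * k)
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(meds)
            i += 1
    # Step 5: cluster all surviving intermediate medians into k finals.
    leftovers = [pw for lvl in levels for pw in lvl]
    return [p for p, _ in reduce_to(leftovers, k)]
```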
Data Stream Algorithm (3)
[Figure: blocks of m points are each reduced to 2k level-1 medians; every m medians at level i are reduced to 2k medians at level i+1, and the final step clusters the surviving medians down to k.]
Data Stream Algorithm Analysis
• The algorithm requires $O(\log(n/m)/\log(m/k))$ levels
• If k is much smaller than m, and $m = O(n^{\varepsilon})$ for $\varepsilon < 1$:
• $O(n^{\varepsilon})$ space
• $O(n^{1+\varepsilon})$ running time
• up to an $O(2^{1/\varepsilon})$ approximation factor (a constant-factor approximation)
Randomized Algorithm
1) Draw a sample of size $s = (nk)^{1/2}$
2) Find k medians from these s points using a primal-dual algorithm
3) Assign each of the original points to its closest median
4) Collect the n/s points with the largest assignment distance
5) Find k medians from among these n/s points
6) At this point we have 2k medians
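A sketch of the sampling structure in Python, with the toy kmedian_local_search once more standing in for the primal-dual algorithm (an assumption; only steps 1–6 above are from the paper):

```python
import math
import random

def randomized_2k_medians(points, k):
    n = len(points)
    s = max(k, math.isqrt(n * k))          # sample size (nk)^(1/2)
    sample = random.sample(points, min(s, n))
    # Steps 1-2: cluster the sample to k medians.
    first = kmedian_local_search(sample, [1] * len(sample), k)
    # Step 3: order points by distance to their closest median.
    ranked = sorted(points, key=lambda p: min(abs(p - m) for m in first))
    # Step 4: take the n/s worst-served points.
    worst = ranked[-max(k, n // s):]
    # Step 5: cluster those to k more medians; step 6: return all 2k.
    second = kmedian_local_search(worst, [1] * len(worst), k)
    return first + second
```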
Randomized Algorithm Analysis
• The algorithm gives an O(1) approximation with 2k medians, with constant probability
• O(log n) passes are needed for a high-probability result
• $\tilde{O}(nk)$ time and space
• Space can be improved to $O((nk)^{1/2})$
Full Algorithm
1) Read the first O(M/k) points, then use the randomized algorithm to find 2k intermediate median points
2) Use a local search algorithm to cluster O(M) intermediate medians of level i into 2k medians of level i+1
3) Use the primal-dual algorithm to cluster the final O(k) medians into k medians
Full Algorithm (2)
• The full algorithm is still one pass (we call the randomized algorithm only once per input set)
• Step 1 takes $\tilde{O}(nk)$ time
• Step 2 takes $O(nk)$ time
• Therefore, the total cost is $\tilde{O}(nk)$
Lower Bounds
• Consider a clustering instance where the distance between two points is 0 if they belong to the same cluster and 1 otherwise
• An algorithm is not constant-factor unless it discovers a clustering of cost 0
• Finding such a clustering is equivalent to the following: in a complete k-partite graph G for some k, find the k-partition of the vertices of G into independent sets
• The best algorithm for that requires Ω(nk) queries, which therefore lower-bounds any constant-factor clustering algorithm
Deterministic Algorithms: A1
1) Partition the n original points into p1 subsets
2) Apply the primal-dual algorithm to each subset ($O(an^2)$ for each)
3) Apply it again to the p1·k weighted points obtained in (2) to get the final k medians
A1: Details
• If we choose the number of subsets to be $p_1 = (n/k)^{2/3}$, we get:
• $O(n^{4/3}k^{2/3})$ runtime and space
• a $4c^2 + 4c$ approximation factor by Theorem 4, where c is the approximation factor of the primal-dual algorithm
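The choice of $p_1$ comes from balancing the two phases, assuming a quadratic clustering cost per batch (a worked derivation, not from the slide):

```latex
p_1 \cdot O\!\left(\left(\tfrac{n}{p_1}\right)^{2}\right)
= O\!\left((p_1 k)^{2}\right)
\;\Rightarrow\; p_1^{3} = \left(\tfrac{n}{k}\right)^{2}
\;\Rightarrow\; p_1 = \left(\tfrac{n}{k}\right)^{2/3},
\qquad
\text{cost} = O\!\left(\tfrac{n^{2}}{p_1}\right) = O\!\left(n^{4/3} k^{2/3}\right)
```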
Deterministic Algorithms: A2
1) Split the dataset into p2 partitions
2) Apply A1 to each of them
3) Apply A1 to all the intermediate medians from (2)
A2: Details
• If we choose the number of subsets to be $p_2 = (n/k)^{4/5}$ in order to minimize the running time, we get:
• $O(n^{16/15}k^{14/15})$ runtime and space
• We can see a trend!
Deterministic Algorithm
• Create algorithm Ai that calls Ai−1 on pi partitions
• Then the complexity in both time and space of this algorithm is:
$$O\!\left(a^{2^{i}}\; n^{1+\frac{1}{2^{2i}-1}}\; k^{1-\frac{1}{2^{2i}-1}}\right)$$
(consistent with A1 at i = 1 and A2 at i = 2 above)
Deterministic Algorithm (2)
• The approximation factor grows with i; however:
$$c_i \le 4c_{i-1}^{2} + 4c_{i-1}$$
• We can set $i = \Theta(\log \log \log n)$ in order to make the exponent of n in the running time approach 1.
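A two-line helper makes the growth of the factor tangible:

```python
def approx_factor(c, i):
    # Unroll the recurrence c_i <= 4*c_{i-1}^2 + 4*c_{i-1}
    # starting from a base approximation factor c.
    for _ in range(i):
        c = 4 * c * c + 4 * c
    return c

# approx_factor(2, 3) is about 2.3e7: doubly exponential in i,
# which is why i must stay very small (Theta(log log log n)).
```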
Deterministic Algorithm (3)
• This gives an algorithm running in $\tilde{O}(nk)$ space and time.
Conclusion
• We have presented a variety of algorithms optimized for the problem of clustering in systems where the amount of data is huge
• All the algorithms presented are approximations to the k-median problem
References
1) Eric W. Weisstein. "Complete k-Partite Graph." From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/Completek-PartiteGraph.html
2) http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt