Parallel Mining of Closed
Sequential Patterns
Shengnan Cong, Jiawei Han, David Padua
Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05), Chicago, Illinois, USA, 2005
Advisor: Jia-Ling Koh
Speaker: Chun-Wei Hsieh
Introduction
Numerous applications:
– DNA sequences, analysis of web logs, customer shopping sequences, XML query access patterns, …
Closed sequential patterns:
– carry the complete information of all frequent sequential patterns
– are more compact
Many applications are time-critical and involve huge volumes of data.
Sequential Algorithm: BIDE
Step 1: Identify the frequent 1-sequences
Step 2: Project the dataset along each
frequent 1-sequence
Step 3: Mine each resulting projected dataset
Sequential Algorithm: BIDE
The projected dataset for the sequence AB is {C, CB, C, BCA}.
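To make the projection step concrete, here is a minimal C++ sketch. The source database appears only in a figure that is not preserved in this transcript; the sketch assumes the running example from the BIDE paper, {CAABC, ABCB, CABC, ABBCA}, which yields exactly this projected dataset for the prefix AB.

#include <iostream>
#include <string>
#include <vector>

// Return the suffix of `seq` after the first (leftmost) occurrence of the
// non-empty `prefix` as a subsequence, or "" if `prefix` does not occur.
std::string projectOne(const std::string& seq, const std::string& prefix) {
    size_t p = 0;                        // next prefix symbol to match
    for (size_t i = 0; i < seq.size(); ++i)
        if (seq[i] == prefix[p] && ++p == prefix.size())
            return seq.substr(i + 1);    // suffix after the matched prefix
    return "";                           // prefix not contained in seq
}

int main() {
    std::vector<std::string> db = {"CAABC", "ABCB", "CABC", "ABBCA"};
    for (const auto& s : db)
        std::cout << projectOne(s, "AB") << '\n';  // prints C, CB, C, BCA
}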
Task Decomposition
1. Each processor counts the occurrences of 1-sequences in a different part of the dataset. A global add reduction is executed to obtain the overall counts (see the sketch after this list).
2. Build pseudoprojections. This is done in parallel by assigning a different part of the dataset to each processor. The pseudoprojections are communicated to all processors via an all-to-all broadcast.
3. Dynamic scheduling distributes the processing of the projections across processors.
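A minimal sketch of step 1, assuming MPI (the paper's cluster setting) and an alphabet of alphabetSize distinct items; it would run between MPI_Init and MPI_Finalize. countGlobal is an illustrative name, not the authors' code.

#include <mpi.h>
#include <vector>

// Count, over this processor's slice of the database, how many sequences
// contain each item, then combine the counts with a global add reduction.
std::vector<long> countGlobal(const std::vector<std::vector<int>>& mySlice,
                              int alphabetSize) {
    std::vector<long> local(alphabetSize, 0), global(alphabetSize, 0);
    for (const auto& seq : mySlice) {
        std::vector<bool> seen(alphabetSize, false);  // count each sequence once
        for (int item : seq)
            if (!seen[item]) { seen[item] = true; ++local[item]; }
    }
    MPI_Allreduce(local.data(), global.data(), alphabetSize,
                  MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    return global;  // every processor now holds the overall supports
}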
Task Decomposition
In the second step, it is more efficient to implement the
broadcast using a virtual ring structure.
Assume there are N processors. Processor K:
– receives packages only from Processor ((K-1) mod N)
– sends packages only to Processor ((K+1) mod N)
The broadcast needs (N-1) send-receive steps and consumes no more than 0.5% of the mining time.
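A minimal sketch of the virtual-ring broadcast, assuming MPI and, for simplicity, fixed-size pseudoprojection blocks (real blocks vary in size and would need a preceding size exchange). ringBroadcast and blockSize are illustrative names, not the authors' code.

#include <mpi.h>
#include <vector>

// blocks[k] will hold processor k's pseudoprojection block; on entry only
// blocks[rank] (this processor's own block) is filled. After nprocs-1
// send-receive steps, every processor holds all nprocs blocks.
void ringBroadcast(std::vector<std::vector<char>>& blocks,
                   int rank, int nprocs, int blockSize) {
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;
    for (int step = 0; step < nprocs - 1; ++step) {
        int sendIdx = (rank - step + nprocs) % nprocs;      // block to forward
        int recvIdx = (rank - step - 1 + nprocs) % nprocs;  // block to receive
        blocks[recvIdx].resize(blockSize);
        MPI_Sendrecv(blocks[sendIdx].data(), blockSize, MPI_CHAR, right, 0,
                     blocks[recvIdx].data(), blockSize, MPI_CHAR, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}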
Task Scheduling
1. A master processor maintains a queue of pseudoprojection identifiers. Each of the other processors is initially assigned a projection.
2. After mining a projection, a processor sends a request
to the master processor for another projection.
3. This process continues until the queue of projections
is empty.
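A minimal sketch of this master/worker protocol, assuming MPI with rank 0 as the master; for simplicity the workers simply start by requesting work rather than receiving an initial assignment. The names (master, worker, mineProjection) are illustrative, not the authors' code.

#include <mpi.h>
#include <queue>

const int DONE = -1, TAG_REQ = 1, TAG_WORK = 2;

// Rank 0: answer each request with the next projection id, or DONE once
// the queue is empty; stop after every worker has been sent DONE.
void master(int nprocs, std::queue<int>& projections) {
    int finished = 0;
    while (finished < nprocs - 1) {
        int dummy;
        MPI_Status st;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                 MPI_COMM_WORLD, &st);
        int next = DONE;
        if (!projections.empty()) { next = projections.front(); projections.pop(); }
        else ++finished;
        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
    }
}

// Other ranks: request projections one at a time until told to stop.
void worker() {
    for (;;) {
        int req = 0, id;
        MPI_Send(&req, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
        MPI_Recv(&id, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (id == DONE) break;
        // mineProjection(id);  // hypothetical: run BIDE on projection `id`
    }
}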
Task Scheduling
If the largest subtask takes 25% of the total mining time,
the best possible speedup is only 4 regardless of the
number of processors available.
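The bound follows because the parallel time can never be shorter than the largest indivisible subtask. Writing T_total for the sequential mining time and T_i for the subtasks:

\[
\text{speedup} = \frac{T_{\text{total}}}{T_{\text{parallel}}}
\le \frac{T_{\text{total}}}{\max_i T_i}
= \frac{1}{0.25} = 4 .
\]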
To improve the dynamic scheduling, the approach is to identify which projections require a long mining time and to decompose them into smaller subtasks.
Relative Mining Time Estimation
Random sampling:
– selects a random subset of the projections
– is not accurate if the overhead is kept small
Selective sampling:
– uses every sequence of the projections
– discards the infrequent 1-sequences and the last L frequent 1-sequences, where L = a given fraction t × the average length of the sequences in the dataset
Selective sampling
For example,
– assume the 1-sequences and their supports are (A : 4), (B : 4), (C : 4), (D : 3), (E : 3), (F : 3), (G : 1)
– the support threshold = 4
– the average length of the sequences in the dataset = 4
– suppose t = 75%, so L = 4 × 0.75 = 3
Given the sequence AABCACDCFDB, selective sampling first removes the infrequent items D and F, yielding AABCACCB, and then drops the last L = 3 frequent items, reducing the sequence to AABCA.
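A minimal sketch of this reduction, assuming items are single characters and supports are given in a map; selectiveSample is an illustrative name, not the authors' code.

#include <iostream>
#include <map>
#include <string>

// Keep only the frequent items of `seq`, then drop the last L of them.
std::string selectiveSample(const std::string& seq,
                            const std::map<char,int>& support,
                            int minSup, int L) {
    std::string kept;
    for (char c : seq)
        if (support.at(c) >= minSup) kept += c;   // drop infrequent items
    if ((int)kept.size() <= L) return "";
    return kept.substr(0, kept.size() - L);       // drop the last L items
}

int main() {
    std::map<char,int> sup = {{'A',4},{'B',4},{'C',4},{'D',3},
                              {'E',3},{'F',3},{'G',1}};
    // With minSup = 4 and L = 3: AABCACDCFDB -> AABCACCB -> AABCA
    std::cout << selectiveSample("AABCACDCFDB", sup, 4, 3) << '\n';
}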
Relative Mining Time Estimation
[Figure slide: chart not preserved in this transcript.]
Par-CSP Algorithm
[Figure slide: the algorithm overview figure is not preserved in this transcript.]
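Since the figure is lost, here is a hedged outline, in comment form, of how the preceding slides fit together; parCSP and the step descriptions are illustrative, not the authors' code.

// Illustrative outline only; see the sketches on the preceding slides.
void parCSP(int rank, int nprocs) {
    // Step 1: count 1-sequences locally; combine with a global add
    //         reduction (MPI_Allreduce) to find the frequent ones.
    // Step 2: build pseudoprojections for the local part of the dataset;
    //         exchange them with all processors via the virtual ring.
    // Step 3: estimate relative mining times with selective sampling and
    //         decompose the projections expected to take longest.
    // Step 4: run master/worker dynamic scheduling; each worker mines its
    //         assigned projections with BIDE until the queue is empty.
}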
Experiments
Cluster: 64 nodes
OS: Red Hat Linux 7.2
CPU: 1 GHz Intel Pentium III
RAM: 1 GB
Compiler: GNU g++ 2.96
Experiments
• Synthetic dataset: IBM dataset generator
• Real dataset: Gazelle, a web click-stream dataset
Experiments
[Figure slides: experimental result charts not preserved in this transcript.]