Improvement of Apriori Algorithm in Log Mining
2003.5.21
Junghee Lee([email protected])
Jaeho Lee([email protected])
Information and Communications University, Korea
Introduction

Log mining
  There may be associations among users' requests to a web server, and we can find patterns among the requested pages by analyzing the web server's log data.

Association rule
  A rule expressing how likely particular items (here, requested pages) are to occur together.
  X => Y, where X ∩ Y = Ø

Apriori algorithm
  The drawback of the Apriori algorithm: it ignores each pattern's weight.
Apriori Algorithm

Algorithm
  A frequent set
    A set whose support is no less than some user-specified minimum support (denoted Fk, where k is the size of the set)
  A candidate set
    A potentially frequent set (denoted Ck, where k is the size of the set)
  Step
    Pass 1
      1. Generate the candidate sets in C1
      2. Save the frequent sets in F1
Apriori Algorithm

Algorithm (cont.)
  Pass k
    for (k = 1; Fk != { }; k++) do {
        Ck+1 = the candidate sets generated from Fk
        for each transaction t do
            for each candidate c in Ck+1 contained in t do
                c.counter++
        Fk+1 = the candidates in Ck+1 whose support meets the required support
    }
    The result = F1 ∪ F2 ∪ F3 ∪ ... ∪ Fk
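
The level-wise loop above can be sketched in Python as follows. This is a minimal illustration rather than the authors' Java implementation: the function name `apriori`, the use of `frozenset` for site sets, and the inclusive threshold (`>=`) are assumptions made for this sketch, and the input is the four-transaction example used in these slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {site set: support rate in %} for every frequent site set."""
    # Classic Apriori only cares whether a site was visited, not how often.
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def support(candidate):
        # Share of transactions that contain every site of the candidate.
        return 100.0 * sum(candidate <= b for b in baskets) / n

    # Pass 1: C1 holds every single site.
    candidates = {frozenset([site]) for b in baskets for site in b}
    frequent = {}
    k = 1
    while candidates:
        # Fk: candidates whose support meets the required support rate.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Ck+1: unions of two frequent k-sets that have k+1 sites and
        # whose k-subsets are all frequent.
        k += 1
        candidates = {
            a | b
            for a in level for b in level
            if len(a | b) == k
            and all(frozenset(sub) in level
                    for sub in combinations(a | b, k - 1))
        }
    return frequent

visits = [list("AEAEAA"), list("BEDEB"), ["A"], list("CDECD")]
result = apriori(visits, min_support=50)
for site_set, rate in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(site_set), rate)
```

The loop stops when no candidate of the next size can be formed, which corresponds to the `Fk != { }` condition in the pseudocode.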
Apriori Algorithm

Given the transaction table shown below

Database
  ID   Visited sites
  1    {A, E, A, E, A, A}
  2    {B, E, D, E, B}
  3    {A}
  4    {C, D, E, C, D}

Transaction table
  ID   A  B  C  D  E
  1    1  0  0  0  1
  2    0  1  0  1  1
  3    1  0  0  0  0
  4    0  0  1  1  1

Required support rate: r (%)
1: visited
0: unvisited
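
As a small illustration of how the visited-site lists map onto the 0/1 transaction table, the sketch below rebuilds that table in Python. The names `database`, `to_binary_row`, and `transaction_table` are hypothetical and chosen only for this example.

```python
# Visit lists per ID, copied from the Database table.
database = {
    1: ["A", "E", "A", "E", "A", "A"],
    2: ["B", "E", "D", "E", "B"],
    3: ["A"],
    4: ["C", "D", "E", "C", "D"],
}

# Column order of the transaction table: every site seen in the log.
sites = sorted({s for visits in database.values() for s in visits})

def to_binary_row(visits, sites):
    """1 if the site was visited at least once, 0 otherwise."""
    visited = set(visits)
    return [1 if s in visited else 0 for s in sites]

transaction_table = {tid: to_binary_row(v, sites) for tid, v in database.items()}

print("ID", *sites)
for tid, row in transaction_table.items():
    print(tid, *row)
```

Repeated visits collapse to a single 1 here, which is the information loss the later slides address.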
Apriori Algorithm

After Pass 1 (Required support rate: 50%)

Candidate sets (C1)
  Site set   Support rate (%)
  {A}        50
  {B}        25
  {C}        25
  {D}        50
  {E}        75

Frequent sets (F1)
  Site set   Support rate (%)
  {A}        50
  {D}        50
  {E}        75
Apriori Algorithm

After Pass 2 (Required support rate: 50%)

Candidate sets (C2)
  Site set   Support rate (%)
  {A, D}     0
  {A, E}     25
  {D, E}     50

Frequent sets (F2)
  Site set   Support rate (%)
  {D, E}     50

Results
  {A}, {D}, {E}, {D, E}
The drawback of Apriori Algorithm

  It ignores the number of visits (weighted frequency) and considers only whether a site was visited or not.
  It does not consider the total number of sites.
Modified Apriori Algorithm

Apply weighted frequency
  Count the visits to each site.
  More visits mean a higher rate.

Algorithm
  Pass 1
    1. Generate the candidate sets in C1
    2. Save the frequent sets in F1
Modified Apriori Algorithm

Algorithm
  Pass k
    for (k = 1; Fk != { }; k++) do {
        Ck+1 = the candidate sets generated from Fk
        for each transaction t do
            for each candidate c in Ck+1 contained in t do
                c.counter = c.counter + the visit count of c in t
        Rate = Support rate + Added rate
        Fk+1 = the candidates in Ck+1 whose rate meets the required support rate
    }
    The result = F1 ∪ F2 ∪ F3 ∪ ... ∪ Fk
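
A possible Python reading of the weighted loop is sketched below. Several points are assumptions rather than statements from the slides: the visit count of a multi-site set within one transaction is taken as the smallest count among its sites (this reproduces the <si,ci> values shown for the pairs in the example), the added rate is rounded to a whole percent as in the example tables, and the names `modified_apriori` and `rate` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def modified_apriori(transactions, min_rate):
    """Return {site set: support rate + added rate in %} for the kept sets."""
    counts = [Counter(t) for t in transactions]        # visits per site per ID
    n = len(counts)
    total = sum(sum(c.values()) for c in counts)       # total visit count
    num_sites = len({s for c in counts for s in c})    # S: number of sites

    def rate(candidate):
        hits = [c for c in counts if all(c[s] > 0 for s in candidate)]
        s_i = len(hits)                                # IDs visiting the whole set
        # Assumption: a set's count in one transaction is its smallest site count.
        c_i = sum(min(c[s] for s in candidate) for c in hits)
        support = 100.0 * s_i / n
        # Assumption: added rate rounded to a whole percent, as in the tables.
        added = round(100.0 * (c_i / total) * (s_i / num_sites))
        return support + added

    candidates = {frozenset([s]) for c in counts for s in c}
    kept = {}
    k = 1
    while candidates:
        level = {c: rate(c) for c in candidates}
        level = {c: r for c, r in level.items() if r >= min_rate}
        kept.update(level)
        k += 1
        candidates = {
            a | b
            for a in level for b in level
            if len(a | b) == k
            and all(frozenset(sub) in level
                    for sub in combinations(a | b, k - 1))
        }
    return kept

visits = [list("AEAEAA"), list("BEDEB"), ["A"], list("CDECD")]
weighted = modified_apriori(visits, min_rate=50)
for site_set, r in sorted(weighted.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(site_set), r)
```

Under the min-count assumption, both the number of supporting IDs and the visit count can only shrink when a site is added to a set, so pruning by subsets stays valid in this sketch.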
Modified Apriori Algorithm

Given the transaction table shown below

Database
  ID   Visited sites
  1    {A, E, A, E, A, A}
  2    {B, E, D, E, B}
  3    {A}
  4    {C, D, E, C, D}

Transaction table
  ID   A  B  C  D  E
  1    4  0  0  0  2
  2    0  2  0  1  2
  3    1  0  0  0  0
  4    0  0  2  2  1

Required support rate: r (%)
number: the count of visits
Total counts: 17
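
The count-based table and the total of 17 can be reproduced with a short Python sketch. The names `database`, `to_count_row`, `count_table`, and `pairs` are hypothetical; `pairs` holds, for each site, the number of IDs that visited it and its total visit count.

```python
from collections import Counter

# Visit lists per ID, copied from the Database table.
database = {
    1: ["A", "E", "A", "E", "A", "A"],
    2: ["B", "E", "D", "E", "B"],
    3: ["A"],
    4: ["C", "D", "E", "C", "D"],
}
sites = ["A", "B", "C", "D", "E"]

def to_count_row(visits, sites):
    """Number of visits to each site within one transaction."""
    c = Counter(visits)
    return [c[s] for s in sites]

count_table = {tid: to_count_row(v, sites) for tid, v in database.items()}
total_counts = sum(sum(row) for row in count_table.values())

# Per site: (number of IDs with at least one visit, total visit count).
pairs = {
    s: (sum(1 for row in count_table.values() if row[j] > 0),
        sum(row[j] for row in count_table.values()))
    for j, s in enumerate(sites)
}

print(count_table)
print(total_counts)
print(pairs)
```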

Modified Apriori Algorithm

After Pass 1 (Required support rate: 50%)

Candidate sets (C1)
  Site set   <si,ci>   Support rate (%)   Added rate (%)   Sum (%)
  {A}        <2,5>     50                 12               62
  {B}        <1,2>     25                 2                27
  {C}        <1,2>     25                 2                27
  {D}        <2,3>     50                 7                57
  {E}        <3,5>     75                 18               93

Frequent sets (F1)
  Site set   Rate (%)
  {A}        62
  {D}        57
  {E}        93

<si,ci>: si is the number of IDs that visited site i, and ci is the visit count of site i (i ∈ site set).

Added rate
  \[ AR_i = \frac{c_i}{\sum c_i} \times \frac{s_i}{S} \times 100\,(\%) \]
  e.g.) \[ AR_A = \frac{5}{17} \times \frac{2}{5} \times 100\,(\%) \approx 12\,(\%) \]
  S: the number of sites
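
To check the Pass 1 figures against the added-rate formula, the sketch below recomputes the added rate and the sum for each site. The function name `added_rate` and the dictionary names are hypothetical, and rounding to a whole percent is an assumption made to match the values shown in the table.

```python
def added_rate(c_i, s_i, total_counts, num_sites):
    """AR_i = (c_i / total visit count) * (s_i / number of sites) * 100."""
    return 100.0 * (c_i / total_counts) * (s_i / num_sites)

TOTAL_COUNTS = 17   # total visit count from the transaction table
NUM_SITES = 5       # S: sites A to E

# <s_i, c_i> and support rate (%) per site, copied from the Pass 1 table.
pairs = {"A": (2, 5), "B": (1, 2), "C": (1, 2), "D": (2, 3), "E": (3, 5)}
support = {"A": 50, "B": 25, "C": 25, "D": 50, "E": 75}

pass1 = {}
for site, (s_i, c_i) in pairs.items():
    ar = round(added_rate(c_i, s_i, TOTAL_COUNTS, NUM_SITES))
    pass1[site] = (ar, support[site] + ar)   # (added rate, sum)

for site, (ar, total) in pass1.items():
    print(site, ar, total)
```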
Modified Apriori Algorithm

After Pass 2 (Required support rate: 50%)

Candidate sets (C2)
  Site set   <si,ci>   Support rate (%)   Added rate (%)   Sum (%)
  {A, D}     <0,0>     0                  0                0
  {A, E}     <1,2>     25                 2                27
  {D, E}     <2,2>     50                 5                55

Frequent sets (F2)
  Site set   Rate (%)
  {D, E}     55

Results
  {A}, {D}, {E}, {D, E}

The results of both algorithms are the same at a required rate of 50%; however, if the required rate is 55%, the results differ.
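
The figures for {D, E} can be worked through by hand. The slides do not state how the visit count of a pair is obtained; the derivation below assumes it is the smaller of the two site counts in each supporting transaction (ID 2 has D = 1 and E = 2, ID 4 has D = 2 and E = 1), which reproduces the <2,2> shown for this pair.

```latex
\begin{align*}
c_{\{D,E\}} &= \min(1,2) + \min(2,1) = 2 \\
AR_{\{D,E\}} &= \frac{2}{17} \times \frac{2}{5} \times 100\,(\%) \approx 4.7\,(\%) \approx 5\,(\%) \\
\text{Rate}_{\{D,E\}} &= 50 + 5 = 55\,(\%)
\end{align*}
```

Read against a required rate of 55%, the original tables leave only {E} (75%), since {A}, {D}, and {D, E} all sit at 50%. With the weighted rates, {A} (62%), {D} (57%), and {E} (93%) remain. {D, E} reaches exactly 55% only with the rounded added rate, because the unrounded sum is about 54.7%.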
Implementation

Log prefetch
  Using shell programming
  Input:
    File formats of the sites (e.g. html, htm, php)
    Required support rate
    Optionally, the period or year of the log file to use
  Output:
    Config file (the total number of sites, the total number of IPs that visited the sites, the required support rate, the list of sites)
    Transaction table
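
The slides say this step was written as a shell program and do not show it. The sketch below is only a Python stand-in for the same idea, under several assumptions: log lines follow the Common Log Format, the client IP serves as the transaction ID, and the request path (without query string) serves as the site. The names `LOG_LINE` and `build_transactions` are hypothetical, and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed Common Log Format:
# host ident user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d{3} \S+'
)

def build_transactions(lines, extensions=("html", "htm", "php")):
    """Group requested pages by client IP, keeping only the given file formats."""
    visits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue                      # skip lines that do not parse
        ip, path = m.group(1), m.group(2)
        page = path.split("?", 1)[0]      # drop any query string
        if page.rsplit(".", 1)[-1].lower() in extensions:
            visits[ip].append(page)
    return dict(visits)

# Made-up sample lines, not taken from a real server log.
sample = [
    '10.0.0.1 - - [21/May/2003:10:00:00 +0900] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [21/May/2003:10:00:05 +0900] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [21/May/2003:10:01:00 +0900] "GET /news.php?id=3 HTTP/1.0" 200 2048',
    '10.0.0.1 - - [21/May/2003:10:02:00 +0900] "GET /index.html HTTP/1.0" 200 1043',
]
transactions = build_transactions(sample)
print(transactions)
```

Each value in the result is a visit list of the kind shown in the Database table, so per-site counts can be derived from it directly.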
Implementation

Algorithm
  Using Java
  Input:
    Config file
    Transaction table
  Output:
    Output file containing the result set
Conclusion

Comparison of the two algorithms
  Complexity
    Apriori Algorithm: O(max(n, m))
    Modified Apriori Algorithm: O(max(n, m))
  Accuracy
Future work

  We do not yet consider the counts of refreshed (reloaded) sites.
  Develop a more refined formula for calculating the added rate.
  Apply the algorithm to caching the source of web sites.
Output