Improvement of Apriori Algorithm in Log Mining

2003.5.21
Junghee Lee, Jaeho Lee
Information and Communications University, Korea

Introduction

- Log mining
  - Association rule
    - There may be associations among users' requests to a web server, and we can find patterns between requested pages by analyzing the web server's log data.
    - A rule expresses the probability that particular items occur together.
    - X => Y, where X ∩ Y = Ø
- Apriori algorithm
- The drawback of the Apriori algorithm
  - It ignores each pattern's weight.

Apriori Algorithm

Algorithm
- A frequent set: a set whose support is at least the user-specified minimum support (denoted Fk, where k is the size of the set)
- A candidate set: a potentially frequent set (denoted Ck, where k is the size of the set)

Steps
- Pass 1
  1. Generate the candidate sets in C1
  2. Save the frequent sets in F1
- Pass k

    for (k = 1; Fk != {}; k++) {
        Ck+1 = the candidate sets generated from Fk
        for each transaction t:
            for each candidate c in Ck+1 contained in t:
                c.counter++
        Fk+1 = the candidates in Ck+1 that satisfy the minimum support
    }
    result = F1 ∪ F2 ∪ F3 ∪ ... ∪ Fk

Example

Given the database and transaction table shown below:

Database
  ID  Visited sites
  1   {A, E, A, E, A, A}
  2   {B, E, D, E, B}
  3   {A}
  4   {C, D, E, C, D}

Transaction table (1: visited, 0: unvisited)
  ID  A  B  C  D  E
  1   1  0  0  0  1
  2   0  1  0  1  1
  3   1  0  0  0  0
  4   0  0  1  1  1

Required support rate: r (%)

After Pass 1 (required support rate: 50%)

Candidate sets C1
  Site set  Support rate (%)
  {A}       50
  {B}       25
  {C}       25
  {D}       50
  {E}       75

Frequent sets F1
  Site set  Support rate (%)
  {A}       50
  {D}       50
  {E}       75

After Pass 2 (required support rate: 50%)

Candidate sets C2
  Site set  Support rate (%)
  {A, D}    0
  {A, E}    25
  {D, E}    50

Frequent sets F2
  Site set  Support rate (%)
  {D, E}    50

Results: {A}, {D}, {E}, {D, E}

The Drawback of Apriori Algorithm

- It ignores the number of visits (weighted frequency) and considers only whether a site was visited or not.
- It does not consider the total number of sites.

Modified Apriori Algorithm

- Apply weighted frequency
  - Count the visits to each site
  - More visits mean a higher rate.

Algorithm
- Pass 1
  1. Generate the candidate sets in C1
  2. Save the frequent sets in F1
- Pass k

    for (k = 1; Fk != {}; k++) {
        Ck+1 = the candidate sets generated from Fk
        for each transaction t:
            Ck+1.counter = sum of the visit counts
        Rate = Support rate + Added rate
        Fk+1 = the candidates in Ck+1 whose rate satisfies the required rate
    }
    result = F1 ∪ F2 ∪ F3 ∪ ... ∪ Fk

Example

Given the database and transaction table shown below:

Database
  ID  Visited sites
  1   {A, E, A, E, A, A}
  2   {B, E, D, E, B}
  3   {A}
  4   {C, D, E, C, D}

Transaction table (each number is the count of visits)
  ID  A  B  C  D  E
  1   4  0  0  0  2
  2   0  2  0  1  2
  3   1  0  0  0  0
  4   0  0  2  2  1

Required support rate: r (%)
Total counts: 17

After Pass 1 (required support rate: 50%)

Candidate sets C1
  Site set  <si,ci>  Support rate (%)  Added rate (%)  Sum (%)
  {A}       <2,5>    50                12              62
  {B}       <1,2>    25                2               27
  {C}       <1,2>    25                2               27
  {D}       <2,3>    50                7               57
  {E}       <3,5>    75                18              93

Frequent sets F1
  Site set  Rate (%)
  {A}       62
  {D}       57
  {E}       93

<si,ci>: si is the number of IDs that visit site i, and ci is the visit count of site i (i ∈ site set).

Added rate:

$$AR_i = \frac{c_i \times s_i}{\left(\sum_j c_j\right) \times S} \times 100\ (\%)$$

where \sum_j c_j is the total count of visits (17 in this example) and S is the number of sites.

e.g.)

$$AR_A = \frac{5 \times 2}{17 \times 5} \times 100\ (\%) \approx 12\ (\%)$$

After Pass 2 (required support rate: 50%)

Candidate sets C2
  Site set  <si,ci>  Support rate (%)  Added rate (%)  Sum (%)
  {A, D}    <0,0>    0                 0               0
  {A, E}    <1,2>    25                2               27
  {D, E}    <2,2>    50                5               55

Frequent sets F2
  Site set  Rate (%)
  {D, E}    55

Results: {A}, {D}, {E}, {D, E}

The results of the two algorithms are the same at a required rate of 50%; however, if the required rate is 55%, the results differ.

Implementation

Log prefetch
- Using shell programming
- Input:
  - File formats of sites (e.g. html, htm, php)
  - Required support rate
  - Optionally, the period or year of the log file to use
- Output:
  - Config file (the total number of sites, the total number of IPs that visit sites, the required support rate, the list of sites)
  - Transaction table

Algorithm
- Using Java
- Input:
  - Config file
  - Transaction table
- Output:
  - An output file containing the result set

Conclusion

- Comparison of the two algorithms
  - Complexity
    - Apriori algorithm: O(max(n, m))
    - Modified Apriori algorithm: O(max(n, m))
  - Accuracy
- Future work
  - We do not yet consider the count of refreshed sites.
  - Make a more refined formula to calculate the added rate.
  - Apply it to caching the source of a web site.

Output
[Figure not preserved]
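
The worked example in these slides can be reproduced with a short script. The following is a minimal sketch in Python, not the authors' Java implementation. The names `TRANSACTIONS`, `stats`, `rate` and `apriori` are hypothetical. Three assumptions are made that the slides do not state outright: the added rate is taken as (c × s) / (total counts × S) × 100 and rounded to a whole percent, which matches the tabulated values; for a set with more than one site, each transaction contributes the smallest of its per-site visit counts to c, which matches the <1,2> and <2,2> entries of the Pass 2 table; and a set counts as frequent when its rate is at least the required rate. Candidate generation uses a plain join of frequent sets with no pruning step.

```python
from itertools import combinations

# Visit counts per transaction, copied from the slides' example table.
TRANSACTIONS = [
    {"A": 4, "E": 2},
    {"B": 2, "D": 1, "E": 2},
    {"A": 1},
    {"C": 2, "D": 2, "E": 1},
]
SITES = ["A", "B", "C", "D", "E"]


def stats(itemset, transactions):
    """Return (s, c): transactions containing every site, and the visit count.

    Assumption: for a multi-site set, a transaction contributes the minimum
    of its per-site counts to c.
    """
    s = c = 0
    for t in transactions:
        if all(site in t for site in itemset):
            s += 1
            c += min(t[site] for site in itemset)
    return s, c


def rate(itemset, transactions, n_sites, weighted):
    """Support rate in percent, plus the rounded added rate when weighted."""
    s, c = stats(itemset, transactions)
    support = 100 * s / len(transactions)
    if not weighted:
        return support
    total = sum(sum(t.values()) for t in transactions)  # 17 in the example
    added = round(100 * c * s / (total * n_sites))      # assumed rounding
    return support + added


def apriori(transactions, sites, min_rate, weighted=False):
    """Level-wise search; returns {frozenset: rate} for every frequent set."""
    result = {}
    candidates = [frozenset([s]) for s in sites]
    k = 1
    while candidates:
        frequent = {}
        for cand in candidates:
            r = rate(cand, transactions, len(sites), weighted)
            if r >= min_rate:
                frequent[cand] = r
        result.update(frequent)
        # Join step: unions of two frequent k-sets that have k + 1 members.
        k += 1
        candidates = sorted(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == k},
            key=sorted,
        )
    return result
```

Calling `apriori(TRANSACTIONS, SITES, 50)` and `apriori(TRANSACTIONS, SITES, 50, weighted=True)` both yield {A}, {D}, {E} and {D, E}. At a required rate of 55, the unweighted call keeps only {E}, while the weighted call still keeps all four sets, which is the difference the slides point out.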