January 2017, Jun-Ki Min, Korea University of Technology and Education (KOREATECH)

Models of uncertain data
◦ Discrete model: given a set of uncertain objects 𝔻, an object U ∈ 𝔻 is modeled as a set of instances U = {u1, u2, ..., u|U|}, where each instance ui is associated with an existence probability P(ui)
◦ Continuous model: an uncertain object U ∈ 𝔻 is modeled as an uncertainty region U.R with its probability density function U.f(⋅) (please refer to the paper for details)

An example of the discrete model
◦ Object A: a1 = ⟨10,40⟩ with P(a1) = 0.5, a2 = ⟨75,10⟩ with P(a2) = 0.4
◦ Object B: b1 = ⟨55,20⟩ with P(b1) = 0.2, b2 = ⟨65,30⟩ with P(b2) = 0.2
◦ Object C: c1 = ⟨95,60⟩ with P(c1) = 0.8, c2 = ⟨80,70⟩ with P(c2) = 0.1
◦ Object D: d1 = ⟨5,80⟩ with P(d1) = 0.4, d2 = ⟨90,25⟩ with P(d2) = 0.5

Skyline
◦ Given a set of d-dimensional points {p1, p2, ..., pn} with pi = ⟨pi(1), ..., pi(d)⟩, the skyline is the set of all points that are not dominated by any other point
◦ A point pi dominates another point pj, denoted by pi ≺ pj, if pi(k) ≤ pj(k) for all dimensions 1 ≤ k ≤ d and pi(k) < pj(k) in at least one dimension k
◦ Example (laptops Dell, Acer, Samsung and Asus plotted by price and weight): the skyline is {Samsung, Asus}

Skyline probability
◦ Suppose the only instances dominating an instance ui of an object U are v1 and v2 of another object V; ui is a skyline instance when ui exists and neither v1 nor v2 exists:
  Prob(ui is a skyline) = Prob(ui exists) × Prob(not(v1 exists) and not(v2 exists))
                        = P(ui) × Prob(not(v1 exists or v2 exists))
                        = P(ui) × (1 − (P(v1) + P(v2)))
◦ In general, the skyline probability of an instance ui is
  Psky(ui) = P(ui) × ∏_{V∈𝔻, V≠U} (1 − Σ_{vj∈V, vj≺ui} P(vj))
◦ The skyline probability of an object U is Psky(U) = Σ_{ui∈U} Psky(ui)
◦ The probabilistic skyline is the set of all objects U in 𝔻 such that Psky(U) ≥ Tp

Related work
◦ Serial probabilistic skyline algorithms
  [Pei, Jiang, Lin, Yuan: VLDB 2007]
  [Atallah, Qi: PODS 2009]
  [Zhang, Lin, Zhang, Wang, Zhu, Yu: ICDE 2009]
  [Böhm, Fiedler, Oswald, Plant, Wackersreuther: CIKM 2009]
◦ Parallel probabilistic skyline algorithms
  PSMR [Ding, Wang, Xin, Yuan: BigData 2013] considers a special case of the discrete model in which each object has a single instance
  Our algorithm can process not only the discrete model with multiple instances but also the continuous model
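The definitions above can be checked directly on the example objects A–D. The following is a minimal Python sketch of the discrete-model skyline probability; the data and the formula are from the slides, while the function and variable names are my own:

```python
# A minimal sketch of the skyline-probability definitions on the slides'
# discrete-model example. The data and the formula
#   Psky(ui) = P(ui) * prod_{V != U} (1 - sum_{vj in V, vj < ui} P(vj))
# are from the slides; function and variable names are my own.

DB = {
    "A": [((10, 40), 0.5), ((75, 10), 0.4)],
    "B": [((55, 20), 0.2), ((65, 30), 0.2)],
    "C": [((95, 60), 0.8), ((80, 70), 0.1)],
    "D": [((5, 80), 0.4), ((90, 25), 0.5)],
}

def dominates(p, q):
    """p dominates q: no worse in every dimension, better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def sky_prob_instance(db, owner, point, prob):
    result = prob
    for name, instances in db.items():
        if name == owner:
            continue  # other instances of U itself never count against ui
        # instances of one object are mutually exclusive, hence the sum
        result *= 1.0 - sum(q for v, q in instances if dominates(v, point))
    return result

def sky_prob_object(db, name):
    return sum(sky_prob_instance(db, name, v, q) for v, q in db[name])

Tp = 0.5
psky = {name: round(sky_prob_object(DB, name), 6) for name in DB}
print(psky)                                     # {'A': 0.9, 'B': 0.4, 'C': 0.03, 'D': 0.64}
print([n for n, p in psky.items() if p >= Tp])  # ['A', 'D']
```

With Tp = 0.5 this already yields the probabilistic skyline {A, D} that the parallel algorithms below recompute.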
PS-BR-MR: a baseline with random partitioning
◦ Two MapReduce phases
◦ Local skyline probability phase
  Map: split 𝔻 into disjoint partitions P1, ..., Pm and generate every partition-pair (Pi, Pj) with 1 ≤ i ≤ j ≤ m
  Reduce: given a partition-pair (Pi, Pj), for every instance u of an object U in Pi (or Pj), the local skyline probability of u with respect to the objects in Pj (or Pi) is computed
◦ Global skyline phase
  Map: do nothing
  Reduce: computes the skyline probability of U from the local skyline probabilities of U's instances generated at the previous reduce phase

Running example with P1 = {A, B} and P2 = {C, D}
◦ The local skyline probability of an instance ui with respect to a partition Pk is
  PLS(ui, k) = ∏_{V∈Pk, V≠U} (1 − Σ_{vj∈V, vj≺ui} P(vj))
◦ Map/Shuffle: the objects of P1 and P2 are sent to the partition-pair keys (1,1), (1,2) and (2,2), and each reducer emits values of the form (u, P(u), PLS(u, k))
◦ Reduce for key (1,1): no instance of A dominates an instance of B or vice versa, so a1, a2, b1 and b2 all keep PLS(·, 1) = 1.0
◦ Reduce for key (1,2): a1, a2, b1 and b2 dominate c1 and c2, so PLS(c1, 1) = PLS(c2, 1) = (1 − 0.9) × (1 − 0.4) = 0.1 × 0.6; a2 and b1 dominate d2, so PLS(d2, 1) = (1 − 0.4) × (1 − 0.2) = 0.6 × 0.8; no instance of C or D dominates an instance of A or B, so their PLS values with respect to P2 remain 1.0
◦ Reduce for key (2,2): within P2, d2 dominates c1, so PLS(c1, 2) = 1 − 0.5 = 0.5, while c2, d1 and d2 keep PLS(·, 2) = 1.0
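Under the example split P1 = {A, B} and P2 = {C, D}, the partition-pair scheme can be simulated in plain Python; multiplying the per-partition PLS factors recovers the same Psky values as the direct definition. The pair generation and PLS(u, k) follow the slides; the helper names are my own:

```python
# A simulation of PS-BR-MR's partition-pair scheme in plain Python, under
# the example split P1 = {A, B}, P2 = {C, D}. The pair generation and
# PLS(u, k) follow the slides; helper names are my own.
from collections import defaultdict
from itertools import combinations_with_replacement

P = {
    1: {"A": [((10, 40), 0.5), ((75, 10), 0.4)],
        "B": [((55, 20), 0.2), ((65, 30), 0.2)]},
    2: {"C": [((95, 60), 0.8), ((80, 70), 0.1)],
        "D": [((5, 80), 0.4), ((90, 25), 0.5)]},
}

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pls(point, partition, owner):
    """Local skyline probability factor of an instance w.r.t. one partition."""
    out = 1.0
    for name, insts in partition.items():
        if name != owner:
            out *= 1.0 - sum(q for v, q in insts if dominates(v, point))
    return out

# Phase 1: every partition-pair (i, j) with i <= j contributes local factors.
local = defaultdict(list)  # (object, instance index) -> list of PLS factors
for i, j in combinations_with_replacement(sorted(P), 2):
    sides = [(i, j)] if i == j else [(i, j), (j, i)]
    for a, b in sides:
        for name, insts in P[a].items():
            for idx, (v, q) in enumerate(insts):
                local[(name, idx)].append(pls(v, P[b], name))

# Phase 2: Psky(ui) = P(ui) * prod_k PLS(ui, k); Psky(U) sums its instances.
objects = {**P[1], **P[2]}
psky = defaultdict(float)
for (name, idx), factors in local.items():
    p = objects[name][idx][1]
    for f in factors:
        p *= f
    psky[name] += p

print({n: round(v, 6) for n, v in sorted(psky.items())})
# {'A': 0.9, 'B': 0.4, 'C': 0.03, 'D': 0.64}
```

Each instance receives exactly one PLS factor per partition, so the product over pairs equals the full product over all other objects.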
Global skyline phase on the example
◦ Since ∏_{k=1,...,m} PLS(ui, k) = ∏_{V∈𝔻, V≠U} (1 − Σ_{vj∈V, vj≺ui} P(vj)), we have
  Psky(ui) = P(ui) × ∏_{k=1,...,m} PLS(ui, k) and Psky(U) = Σ_{ui∈U} Psky(ui)
◦ Reduce output: Psky(A) = 0.9, Psky(B) = 0.4, Psky(C) = 0.03, Psky(D) = 0.64
◦ For C: Psky(c1) = 0.8 × (0.06 × 0.5) = 0.024 and Psky(c2) = 0.1 × (0.06 × 1.0) = 0.006; thus Psky(C) = 0.024 + 0.006 = 0.03

Drawbacks of PS-BR-MR
◦ Two MapReduce phases: PS-BR-MR distributes each object to every partition-pair, and it needs an additional aggregation phase to compute the skyline probability of each object by combining the local skyline probabilities of its instances from multiple partitions
◦ No early filtering: even though the skyline probability Psky(u) of every instance u of U is less than Tp, we cannot prune U, since Psky(U) = Σ_{ui∈U} Psky(ui) could still be at least Tp

PS-QPF-MR overview
◦ PS-QPF-MR consists of two phases
  Build a quadtree using a sample (without MapReduce) to split the data into partitions
  Compute the probabilistic skyline for each partition independently in parallel by using MapReduce
◦ We devised three filtering techniques to reduce the number of dominance checks
◦ We developed optimization techniques for reducing memory usage, reducing network overhead and balancing workloads

Quadtree partitioning
◦ A quadtree subdivides the d-dimensional space recursively into sub-regions [Finkel and Bentley: Acta Informatica 1974]
  Internal nodes have exactly 2^d children
  Each leaf node holds at most a predefined number of points ρ
◦ We build the quadtree using sample objects: for example, assume that a1, b1, b2 and c2 are sampled and the maximum number of instances in a leaf node is ρ = 2; one split of the [0,100] × [0,100] space yields the leaf nodes node(00), node(01), node(10) and node(11)
◦ To reduce the number of dominance checks between instances, we apply the three filtering techniques before the objects are distributed to the leaf nodes
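As a rough illustration of the sample-based build, here is a point-region quadtree sketch with leaf capacity ρ and 2^d children per split; the class and method names are my own, not from the paper:

```python
# A sketch of the sample-based quadtree build (point-region quadtree with
# leaf capacity rho and 2^d children); class and method names are my own.

class QuadTree:
    def __init__(self, low, high, rho):
        self.low, self.high, self.rho = low, high, rho
        self.points, self.children = [], None   # a leaf until it splits

    def insert(self, p):
        if self.children is not None:
            self._child(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.rho:
            self._split()

    def _split(self):
        d = len(self.low)
        mid = [(l + h) / 2 for l, h in zip(self.low, self.high)]
        self.children = []
        for mask in range(2 ** d):              # exactly 2^d children
            lo = [mid[k] if mask >> k & 1 else self.low[k] for k in range(d)]
            hi = [self.high[k] if mask >> k & 1 else mid[k] for k in range(d)]
            self.children.append(QuadTree(lo, hi, self.rho))
        pts, self.points = self.points, []
        for p in pts:                           # redistribute to children
            self._child(p).insert(p)

    def _child(self, p):
        d = len(self.low)
        mid = [(l + h) / 2 for l, h in zip(self.low, self.high)]
        mask = sum(1 << k for k in range(d) if p[k] >= mid[k])
        return self.children[mask]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# The slides' sample a1, b1, b2, c2 with rho = 2: one split, four leaves
tree = QuadTree([0, 0], [100, 100], rho=2)
for p in [(10, 40), (55, 20), (65, 30), (80, 70)]:
    tree.insert(p)
print(len(tree.leaves()))   # 4
```

Inserting the third sampled point exceeds ρ = 2 at the root, which triggers the single split into node(00), node(01), node(10) and node(11).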
Three filtering techniques
◦ Upper-bound filtering: compute an upper bound of the skyline probability of an object by using the quadtree
◦ Zero-probability filtering: instances with zero skyline probability are removed
◦ Dominance-power filtering: maintain a small number of objects with high dominating power to check whether another object is a probabilistic skyline candidate or not

Upper-bound filtering
◦ Consider a leaf node n: every instance dominating n.min also dominates the instances in n
◦ When we build a quadtree with a sample S, for each leaf node n we compute the probability Pup(n, S) that the instances in S dominating n.min do not exist; this upper-bounds the corresponding factors of the exact skyline probability, because unsampled instances can only shrink them further
◦ For example, if the leaf node n containing z1 and z2 has the sampled dominators x2 and w1 (x1 and w2 are not sampled):
  Psky(z1) = P(z1) × (1 − P(x1) − P(x2)) × (1 − P(w1) − P(w2)) ≤ P(z1) × (1 − P(x2)) × (1 − P(w1))
  Psky(z2) = P(z2) × (1 − P(x2)) × (1 − P(w1) − P(w2)) ≤ P(z2) × (1 − P(x2)) × (1 − P(w1))
◦ For each object U, the upper bound of Psky(U) is the sum of the upper bounds of the skyline probabilities of its instances
◦ If the upper bound is less than Tp, we do not compute the exact skyline probability of the object, since it cannot be a probabilistic skyline object
  Example: if Psky(z1) ≤ 0.3 and Psky(z2) ≤ 0.2, then Psky(Z) = Psky(z1) + Psky(z2) ≤ 0.3 + 0.2 = 0.5

Zero-probability filtering
◦ A special case of upper-bound filtering: if Pup(n, S) = 0, we remove every instance in the leaf node n
◦ For example, Psky(y1) = P(y1) × (1 − P(x1) − P(x2)) × (1 − P(w1)); suppose X = {x1, x2} ⊆ S with P(x1) + P(x2) = 1, then Pup(n, S) = 0
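A small sketch of how Pup(n, S) and the object-level bound could be computed. With the slides' sample {a1, b1, b2, c2}, only a1 dominates node(11).min = (50, 50), so Pup(node(11), S) = 0.5 and object C, whose summed instance bound is 0.9 × 0.5 = 0.45 < Tp = 0.5, is pruned. The helper names are my own:

```python
# A sketch of upper-bound filtering: Pup(n, S) is the probability that the
# sampled instances dominating n.min do not exist; an object is pruned when
# the sum of its instance bounds P(ui) * Pup(leaf(ui), S) is below Tp.
# Helper names are my own.

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def p_up(n_min, sample):
    """sample: object name -> sampled (point, prob) instances."""
    bound = 1.0
    for insts in sample.values():   # instances of one object are exclusive
        bound *= 1.0 - sum(q for v, q in insts if dominates(v, n_min))
    return bound

def can_prune(instances, leaf_of, pup, tp):
    upper = sum(q * pup[leaf_of(v)] for v, q in instances)
    return upper < tp

sample = {"A": [((10, 40), 0.5)],
          "B": [((55, 20), 0.2), ((65, 30), 0.2)],
          "C": [((80, 70), 0.1)]}
pup11 = p_up((50, 50), sample)            # only a1 dominates (50, 50)
C = [((95, 60), 0.8), ((80, 70), 0.1)]    # both instances lie in node(11)
print(pup11, can_prune(C, lambda v: "11", {"11": pup11}, tp=0.5))  # 0.5 True
```

This is consistent with the exact value computed earlier, Psky(C) = 0.03 < 0.5.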
◦ Then Psky(y1) = 0 since (1 − P(x1) − P(x2)) = 0, and thus Psky(Y) = Psky(y1) + Psky(y2) = Psky(y2)
◦ Removing y1 does not affect the instances it dominates: Psky(z1) = P(z1) × (1 − P(x1) − P(x2)) × (1 − P(y1)) × (1 − P(w1)) = P(z1) × (1 − P(x1) − P(x2)) × (1 − P(w1)) = 0, since by transitivity x1 and x2 also dominate everything y1 dominates
◦ Hence we eliminate the instances in a leaf node n whenever Pup(n, S) = 0; in the example with P(x1) = 0.6, P(x2) = 0.4, P(y1) = 0.2 and P(z1) = 0.8, the instances y1 and z1 are removed

Dominance-power filtering
◦ The basic idea is similar to upper-bound filtering: for an object U and any set F ⊆ 𝔻,
  Psky(U) ≤ Σ_{ui∈U} P(ui) × ∏_{V∈F, V≠U} (1 − Σ_{vj∈V, vj≺ui} P(vj))
◦ An object with a higher existence probability yields a tighter upper bound, and an object with a larger dominating area may dominate more instances, so we keep objects with high dominance power in F
◦ We maintain the dominating object set F per mapper, dynamically: in MapReduce, a mapper takes a set of objects, called a chunk, and calls the map function once per object
◦ To maintain the top-K objects with the highest dominance power as F, we utilize a min-heap: in the map function for an object U, after applying the three filtering techniques, U is put into the min-heap if DP(U) > DP(min-heap.root) or |min-heap| < K

Avoiding an extra aggregation phase
◦ If the skyline probabilities of the same object's instances are computed on different machines, we need an extra MapReduce phase that sums the skyline probabilities of its instances (e.g. Psky(A) = Psky(a1) + Psky(a2))
◦ To compute the probabilistic skyline without an extra MapReduce phase, we allocate all instances of each object to a single partition
◦ We distribute objects based on the leaf nodes of the quadtree: in each partition, the skyline probabilities of the objects whose max points are in the partition's leaf node are computed
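The top-K maintenance of F described above can be sketched with Python's heapq. The DP() scores fed in here are stand-in numbers (I reuse the example's Psky values); the paper's dominance-power measure itself is not reproduced:

```python
# A sketch of keeping the dominance set F as the top-K objects via a
# min-heap, mirroring the mapper-side maintenance described above.
# The scores are stand-ins, not the paper's dominance-power definition.
import heapq
import itertools

tiebreak = itertools.count()  # keeps the heap from comparing object payloads

def update_F(heap, obj, dp, K):
    """Retain the K objects with the highest dominance power seen so far."""
    if len(heap) < K:
        heapq.heappush(heap, (dp, next(tiebreak), obj))
    elif dp > heap[0][0]:                 # beats the current weakest member
        heapq.heapreplace(heap, (dp, next(tiebreak), obj))

heap, K = [], 2
for name, dp in [("A", 0.9), ("B", 0.4), ("C", 0.03), ("D", 0.64)]:
    update_F(heap, name, dp, K)
print(sorted(obj for _, _, obj in heap))  # ['A', 'D']
```

Because the heap root is always the weakest member of F, each map call costs only O(log K) when an object is admitted.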
Max-point based allocation
◦ The max point of an object U is ⟨max_{ui∈U} ui(1), ..., max_{ui∈U} ui(d)⟩
◦ The M-list of a partition is the list of instances of the objects whose max points are in the partition's leaf node; for example, the max point of A = {a1, a2} is ⟨75, 40⟩, which lies in node(10)
  node(00): none
  node(01): none
  node(10): a1, a2, b1, b2
  node(11): c1, c2, d1, d2
◦ We also require other instances to compute skyline probabilities: to compute the skyline probability of c2, we require a1, a2, b1 and b2

Weak dominance between leaf nodes
◦ To identify those instances, we utilize a spatial relationship between leaf nodes
◦ Definition: a leaf node n1 weakly dominates a leaf node n2 if n1.min(k) < n2.max(k) for every k = 1, ..., d, where n.min (n.max) is the closest (farthest) corner of the leaf node n from the origin
◦ Lemma: if a leaf node n1 does not weakly dominate a leaf node n2, no instance in n1 dominates any instance of the objects allocated to n2
  Example: since node(01) does not weakly dominate node(10), d1 does not dominate any instance of A or B
◦ Based on the lemma, if n1 weakly dominates n2, an instance in n1 may dominate an instance of an object allocated to n2; note that a leaf node weakly dominates itself (e.g. node(10) weakly dominates node(10))
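The weak-dominance test and the resulting lists can be reproduced for the example quadtree. Deriving the R-list as the instances lying in leaves that weakly dominate a node, minus the node's own M-list, is my reading of the slides; the names are my own:

```python
# A sketch reconstructing the M-lists and R-lists of the example from the
# weak-dominance test; leaf boxes come from one split of [0,100]^2.
leaves = {  # node id -> (min corner, max corner)
    "00": ((0, 0), (50, 50)),   "10": ((50, 0), (100, 50)),
    "01": ((0, 50), (50, 100)), "11": ((50, 50), (100, 100)),
}
objects = {
    "A": [("a1", (10, 40)), ("a2", (75, 10))],
    "B": [("b1", (55, 20)), ("b2", (65, 30))],
    "C": [("c1", (95, 60)), ("c2", (80, 70))],
    "D": [("d1", (5, 80)), ("d2", (90, 25))],
}

def weakly_dominates(n1, n2):
    """n1.min(k) < n2.max(k) for every dimension k."""
    return all(lo < hi for lo, hi in zip(leaves[n1][0], leaves[n2][1]))

def leaf_of(p):
    return next(n for n, (lo, hi) in leaves.items()
                if all(l <= x < h for x, l, h in zip(p, lo, hi)))

mlist = {n: [] for n in leaves}
for name, insts in objects.items():
    maxpt = tuple(max(p[k] for _, p in insts) for k in range(2))
    mlist[leaf_of(maxpt)] += [i for i, _ in insts]

rlist = {n: [i for _, insts in objects.items() for i, p in insts
             if weakly_dominates(leaf_of(p), n) and i not in mlist[n]]
         for n in leaves}

print(mlist["10"], rlist["10"])      # ['a1', 'a2', 'b1', 'b2'] ['d2']
print(rlist["11"])                   # ['a1', 'a2', 'b1', 'b2']
print(weakly_dominates("01", "10"))  # False: d1 cannot dominate A or B
```

The output matches the partition tables of the example: node(10) needs only d2 besides its own M-list, and node(11) needs all instances of A and B.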
R-list
◦ The R-list of a partition is the list of instances required to compute the skyline probabilities of the instances in its M-list
  node(00): M-list none, R-list a1
  node(01): M-list none, R-list a1, d1
  node(10): M-list a1, a2, b1, b2, R-list d2
  node(11): M-list c1, c2, d1, d2, R-list a1, a2, b1, b2
◦ We compute the probabilistic skyline objects in each partition in parallel
  In the partition with node(10): Psky(a1) = 0.5, Psky(a2) = 0.4, Psky(b1) = 0.2, Psky(b2) = 0.2, so Psky(A) = 0.9 and Psky(B) = 0.4
  In the partition with node(11): Psky(c1) = 0.024 and Psky(c2) = 0.006, so Psky(C) = 0.03; Psky(d1) = 0.4 and Psky(d2) = 0.24, so Psky(D) = 0.64
  With Tp = 0.5, the probabilistic skyline objects are A and D
◦ Time complexity = O(|M-list| × (|M-list| + |R-list|)), space complexity = O(|M-list| + |R-list|)

Reducing memory usage
◦ We can reduce the memory usage if all instances in the M-list appear first in the reduce input and all instances in the R-list appear next
◦ We sort the input of the reduce function by using the secondary-sorting functionality provided by the MapReduce framework
◦ The reducer keeps only the M-list in memory; each R-tagged instance is compared with all instances in the M-list and then discarded (e.g. a1 is discarded right after its comparisons)
◦ The space complexity drops from O(|M-list| + |R-list|) to O(|M-list|)

Reducing network overhead
◦ As the number of partitions increases, the network overhead increases; merging partitions can reduce the number of instances transmitted over the network
◦ For example, assuming we can keep 4 instances in main memory, node(00), node(01) and node(11) are merged into one partition with M-list c1, c2, d1, d2 and R-list a1, a2, b1, b2, while node(10) keeps M-list a1, a2, b1, b2 and R-list d2
◦ When we merge partitions, we need to consider the memory constraint (memory usage = O(|M-list|)), since a partition whose M-list is too large cannot fit in main memory
◦ Merging partitions under this constraint is similar to the bin packing problem, which is NP-complete
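A sketch of the secondary-sorted reduce: all M-tagged records arrive before R-tagged ones, the M-list stays resident, and each R instance is applied once and discarded, so only O(|M-list|) instances are held. The record layout and names are my own simplification; the real reducer works on the framework's sorted stream:

```python
# A sketch of the secondary-sorted reduce: M-tagged records arrive first and
# stay resident, each R-tagged record is applied once and discarded. The
# per-object sums mirror the Psky formula; record layout and names are mine.
from collections import defaultdict

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def reduce_partition(sorted_input):
    """sorted_input: (object, instance id, point, prob, tag) records with
    every 'M' record ordered before every 'R' record (secondary sort)."""
    mlist = []                                       # resident: O(|M-list|)
    mass = defaultdict(lambda: defaultdict(float))   # mid -> obj -> sum P(vj)
    for obj, iid, point, prob, tag in sorted_input:
        if tag == "M":
            # a new M instance: check both directions against resident ones
            for mobj, mid, mpoint, mprob in mlist:
                if mobj == obj:
                    continue
                if dominates(point, mpoint):
                    mass[mid][obj] += prob
                if dominates(mpoint, point):
                    mass[iid][mobj] += mprob
            mlist.append((obj, iid, point, prob))
        else:
            # an R instance: apply to the M-list, then let it go
            for mobj, mid, mpoint, _ in mlist:
                if mobj != obj and dominates(point, mpoint):
                    mass[mid][obj] += prob
    psky = defaultdict(float)
    for obj, iid, point, prob in mlist:
        f = prob
        for s in mass[iid].values():
            f *= 1.0 - s
        psky[obj] += f
    return dict(psky)

records = [  # node(11): M-list c1, c2, d1, d2; R-list a1, a2, b1, b2
    ("C", "c1", (95, 60), 0.8, "M"), ("C", "c2", (80, 70), 0.1, "M"),
    ("D", "d1", (5, 80), 0.4, "M"), ("D", "d2", (90, 25), 0.5, "M"),
    ("A", "a1", (10, 40), 0.5, "R"), ("A", "a2", (75, 10), 0.4, "R"),
    ("B", "b1", (55, 20), 0.2, "R"), ("B", "b2", (65, 30), 0.2, "R"),
]
out = reduce_partition(records)
print({n: round(v, 6) for n, v in out.items()})  # {'C': 0.03, 'D': 0.64}
```

Run on the node(11) partition, it reproduces Psky(C) = 0.03 and Psky(D) = 0.64 from the example.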
◦ We developed an approximation algorithm for merging partitions under the memory constraint

Balancing workloads
◦ After merging partitions, the |M-list|s of the partitions are similar, but the sizes of the R-lists may be skewed, and the number of partitions may be less than the number of machines
◦ We therefore balance the workload by splitting R-lists; recall that the time complexity per partition is O(|M-list| × (|M-list| + |R-list|))
◦ Example with 3 machines: node(10) (M-list a1, a2, b1, b2, R-list d2) is kept as is, and the merged partition (M-list c1, c2, d1, d2, R-list a1, a2, b1, b2) is split into Merged1 with R-list a1, a2 and Merged2 with R-list b1, b2, both with M-list c1, c2, d1, d2
◦ We developed an optimal greedy algorithm whose heuristic splits an R-list into equi-sized sub-lists
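One way to read the equi-sized splitting heuristic, sketched under the assumption that we repeatedly halve the costliest partition's R-list until there is one partition per machine; this is my own reading of the idea, not the paper's optimal greedy algorithm:

```python
# A sketch of R-list splitting for load balancing: repeatedly split the
# costliest partition's R-list into equi-sized halves until there are at
# least as many partitions as machines. My own reading, names are mine.
def cost(part):
    m, r = part
    return len(m) * (len(m) + len(r))   # O(|M-list| * (|M-list| + |R-list|))

def balance(partitions, machines):
    parts = [tuple(p) for p in partitions]
    while len(parts) < machines:
        parts.sort(key=cost, reverse=True)
        m, r = parts.pop(0)
        if len(r) < 2:                  # nothing left to split
            parts.insert(0, (m, r))
            break
        half = len(r) // 2
        parts += [(m, r[:half]), (m, r[half:])]  # equi-sized sub-lists
    return parts

parts = balance(
    [(["a1", "a2", "b1", "b2"], ["d2"]),
     (["c1", "c2", "d1", "d2"], ["a1", "a2", "b1", "b2"])],
    machines=3)
for m, r in sorted(parts):
    print(m, r)
# ['a1', 'a2', 'b1', 'b2'] ['d2']
# ['c1', 'c2', 'd1', 'd2'] ['a1', 'a2']
# ['c1', 'c2', 'd1', 'd2'] ['b1', 'b2']
```

On the example this yields exactly the three partitions of the slides: node(10) unchanged, plus Merged1 and Merged2 sharing the M-list c1, c2, d1, d2.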