R-List - DKE

January 2017
Korea University of Technology and Education
Jun-Ki Min

Discrete model
◦ Given a set of uncertain objects 𝔻, an object 𝑈 ∈ 𝔻 is modeled as a set of instances and denoted by 𝑈 = {𝑢1, 𝑢2, …, 𝑢|𝑈|}, where each instance 𝑢𝑖 is associated with an existence probability 𝑃(𝑢𝑖)

Continuous model
◦ An uncertain object 𝑈 ∈ 𝔻 is modeled as an uncertainty region 𝑈.𝑅 with its probability distribution function 𝑈.𝑓(⋅)
◦ Please refer to the paper for details
An Example of the Discrete Model

Obj. | Instance 𝒖𝒊   | 𝑷(𝒖𝒊)
A    | 𝑎1 = 〈10,40〉 | 0.5
A    | 𝑎2 = 〈75,10〉 | 0.4
B    | 𝑏1 = 〈55,20〉 | 0.2
B    | 𝑏2 = 〈65,30〉 | 0.2
C    | 𝑐1 = 〈95,60〉 | 0.8
C    | 𝑐2 = 〈80,70〉 | 0.1
D    | 𝑑1 = 〈5,80〉  | 0.4
D    | 𝑑2 = 〈90,25〉 | 0.5


Given a set of d-dimensional points {𝑝1, 𝑝2, …, 𝑝𝑛} with 𝑝𝑖 = 〈𝑝𝑖(1), …, 𝑝𝑖(𝑑)〉, the skyline is the set of all points that are not dominated by any other point.

A point 𝑝𝑖 = 〈𝑝𝑖(1), …, 𝑝𝑖(𝑑)〉 dominates another point 𝑝𝑗 = 〈𝑝𝑗(1), …, 𝑝𝑗(𝑑)〉 if
 𝑝𝑖(𝑘) ≤ 𝑝𝑗(𝑘) for all dimensions 1 ≤ 𝑘 ≤ 𝑑
 𝑝𝑖(𝑘) < 𝑝𝑗(𝑘) in at least a single dimension 𝑘
We denote it by 𝑝𝑖 ≺ 𝑝𝑗.

(Figure: the laptops Dell, Acer, Samsung and Asus plotted by Price and Weight; the skyline is {Samsung, Asus}.)
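As a sketch, the skyline can be computed with the naive pairwise dominance test below. The laptop coordinates are my assumptions, chosen only to be consistent with the figure (lower price and lower weight are better):

```python
def dominates(p, q):
    # p dominates q: no worse in every dimension, strictly better in at least one
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    # keep every point that no other point dominates
    return [name for name, p in points.items()
            if not any(dominates(q, p) for other, q in points.items() if other != name)]

# hypothetical (price, weight) coordinates consistent with the figure
laptops = {"Samsung": (200, 5), "Acer": (250, 5), "Asus": (250, 4), "Dell": (300, 6)}
print(skyline(laptops))  # → ['Samsung', 'Asus']
```

Here Acer is dominated by Samsung (cheaper, same weight) and Dell is dominated by every other laptop.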
(Figure: an instance 𝑢𝑖 of object 𝑈 together with the instances 𝑣1 and 𝑣2 of another object 𝑉 that dominate 𝑢𝑖.)

When an instance 𝑢𝑖 of an object 𝑈 exists, if 𝑣1 does not exist and 𝑣2 does not exist, 𝑢𝑖 is a skyline instance.

Prob(𝑢𝑖 is a skyline)
= Prob(𝑢𝑖 exists) × Prob(not(𝑣1 exists) and not(𝑣2 exists))
= P(𝑢𝑖) × Prob(not(𝑣1 exists or 𝑣2 exists))
= P(𝑢𝑖) × (1 − (P(𝑣1) + P(𝑣2)))

The probabilistic skyline is the set of all objects 𝑈 in 𝔻 such that 𝑃𝑠𝑘𝑦(𝑈) ≥ 𝑇𝑝.

In general, the skyline probability of an instance 𝑢𝑖 is

𝑃𝑠𝑘𝑦(𝑢𝑖) = 𝑃(𝑢𝑖) × ∏_{𝑉∈𝔻, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗))

• The skyline probability of an object 𝑈 is

𝑃𝑠𝑘𝑦(𝑈) = Σ_{𝑢𝑖∈𝑈} 𝑃𝑠𝑘𝑦(𝑢𝑖)
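As a sketch, these two formulas can be evaluated directly over the discrete-model example (the `dominates` helper is the usual componentwise test; all names are mine):

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

# the discrete-model example: object -> [(instance point, existence probability)]
objects = {
    "A": [((10, 40), 0.5), ((75, 10), 0.4)],
    "B": [((55, 20), 0.2), ((65, 30), 0.2)],
    "C": [((95, 60), 0.8), ((80, 70), 0.1)],
    "D": [((5, 80), 0.4), ((90, 25), 0.5)],
}

def p_sky_instance(u, p_u, owner):
    # Psky(u) = P(u) * prod over other objects V of (1 - sum of P(v) for v in V with v ≺ u)
    prob = p_u
    for name, insts in objects.items():
        if name != owner:
            prob *= 1 - sum(p for v, p in insts if dominates(v, u))
    return prob

def p_sky_object(name):
    # Psky(U) = sum over its instances of Psky(u_i)
    return sum(p_sky_instance(u, p, name) for u, p in objects[name])

for name in objects:
    print(name, round(p_sky_object(name), 4))
# → A 0.9, B 0.4, C 0.03, D 0.64 (one object per line)
```

These values match the example: for instance, 𝑃𝑠𝑘𝑦(𝑐1) = 0.8 × (1 − 0.9) × (1 − 0.4) × (1 − 0.5) = 0.024.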

Serial probabilistic skyline algorithms
◦ [Pei, Jiang, Lin, Yuan: VLDB 2007]
◦ [Atallah, Qi: PODS 2009]
◦ [Zhang, Lin, Zhang, Wang, Zhu, Yu: ICDE 2009]
◦ [Böhm, Fiedler, Oswald, Plant, Wackersreuther: CIKM 2009]

Parallel probabilistic skyline algorithms
◦ PSMR [Ding, Wang, Xin, Yuan: BigData 2013]
 Considers a special case of the discrete model in which each object has a single instance
◦ Our algorithm can process not only the discrete model with multiple instances but also the continuous model


Utilizes random partitioning
Two MapReduce phases
◦ Local skyline probability phase
 Map
  Split 𝔻 into disjoint partitions P1, ..., Pm
  Generate every partition-pair (Pi, Pj) with 1 ≤ i ≤ j ≤ m
 Reduce
  Given a partition-pair (Pi, Pj), for every instance u ∈ U in each partition Pi (or Pj), the local skyline probability of u with respect to the objects in Pj (or Pi) is computed
◦ Global skyline phase
 Map: do nothing
 Reduce: computes the skyline probability of each object U using the local skyline probabilities of U's instances generated in the previous reduce phase
Example: 𝔻 is split into two random partitions P1 = {A, B} and P2 = {C, D}.
(Figure: the instances a1, a2, b1, b2, c1, c2, d1, d2 of the example objects plotted in the 2-dimensional space.)

The local skyline probability of an instance 𝑢𝑖 against a partition 𝑃𝑘 is
∏_{𝑉∈𝑃𝑘, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗))
First phase (local skyline probability phase)

Map/Shuffle: each input record (1, A={a1,a2}), (1, B={b1,b2}), (2, C={c1,c2}), (2, D={d1,d2}) is sent to every partition-pair key containing its partition:
 Key (1,1): A, B
 Key (1,2): A, B, C, D
 Key (2,2): C, D

Reduce: for each partition-pair, the local skyline probability
PLS(𝑢𝑖, k) = ∏_{𝑉∈𝑃𝑘, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗))
of every instance is computed against the objects of the other partition. The output records have the form (u, P(u), PLS(u,k)):

 Key (1,1): (a1, 0.5, 1.0), (a2, 0.4, 1.0), (b1, 0.2, 1.0), (b2, 0.2, 1.0)
 Key (1,2): (a1, 0.5, 1.0), (a2, 0.4, 1.0), (b1, 0.2, 1.0), (b2, 0.2, 1.0),
  (c1, 0.8, 0.1×0.6) and (c2, 0.1, 0.1×0.6) since a1, a2 dominate c1, c2 and b1, b2 dominate c1, c2,
  (d1, 0.4, 1.0),
  (d2, 0.5, 0.6×0.8) since a2 and b1 dominate d2
 Key (2,2): (c1, 0.8, 0.5) since d2 dominates c1, (c2, 0.1, 1.0), (d1, 0.4, 1.0), (d2, 0.5, 1.0)

Second phase (global skyline phase)

Map/Shuffle: the records are grouped by object (keys A, B, C, D).

Reduce: since
𝑃𝑠𝑘𝑦(𝑢𝑖) = 𝑃(𝑢𝑖) × ∏_{𝑉∈𝔻, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗)) = 𝑃(𝑢𝑖) × ∏_{𝑘=1,…,𝑚} PLS(𝑢𝑖, 𝑘)
and 𝑃𝑠𝑘𝑦(𝑈) = Σ_{𝑢𝑖∈𝑈} 𝑃𝑠𝑘𝑦(𝑢𝑖),
the reducer multiplies the local skyline probabilities of each instance and sums over the instances of each object.

 Output: (A, 0.9), (B, 0.4), (C, 0.03), (D, 0.64)
 For C: Psky(c1) = 0.8×(0.06×0.5) = 0.024 and Psky(c2) = 0.1×(0.06×1.0) = 0.006
 Thus, Psky(C) = 0.024 + 0.006 = 0.03
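A minimal single-process sketch of the two PS-BR-MR phases over the example (the partition layout follows the example; function and variable names are mine):

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import prod

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

# partition id -> {object: [(instance point, existence probability)]}
partitions = {
    1: {"A": [((10, 40), 0.5), ((75, 10), 0.4)],
        "B": [((55, 20), 0.2), ((65, 30), 0.2)]},
    2: {"C": [((95, 60), 0.8), ((80, 70), 0.1)],
        "D": [((5, 80), 0.4), ((90, 25), 0.5)]},
}

def local_pls(u, owner, other_objs):
    # PLS of instance u against the objects of one partition
    return prod(1 - sum(p for v, p in insts if dominates(v, u))
                for name, insts in other_objs.items() if name != owner)

# first phase: one reduce call per partition-pair (Pi, Pj), 1 <= i <= j <= m
pls = defaultdict(list)              # (object, instance) -> local skyline probabilities
for i, j in combinations_with_replacement(sorted(partitions), 2):
    sides = [(partitions[i], partitions[j])]
    if i != j:                       # compare both directions when i != j
        sides.append((partitions[j], partitions[i]))
    for mine, other in sides:
        for name, insts in mine.items():
            for u, p in insts:
                pls[(name, u)].append(local_pls(u, name, other))

# second phase: Psky(u) = P(u) * prod_k PLS(u, k); Psky(U) = sum over its instances
inst_prob = {(n, u): p for objs in partitions.values()
             for n, insts in objs.items() for u, p in insts}
psky = defaultdict(float)
for key, factors in pls.items():
    psky[key[0]] += inst_prob[key] * prod(factors)

print({n: round(v, 4) for n, v in sorted(psky.items())})
# → {'A': 0.9, 'B': 0.4, 'C': 0.03, 'D': 0.64}
```

The per-pair factors multiply out to exactly the single-machine formula, since every other object of 𝔻 is met in exactly one partition-pair.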

Two MapReduce phases
◦ PS-BR-MR distributes each object to every partition-pair
◦ We need an additional aggregation phase to compute the skyline probability of each object by summing the local skyline probabilities of its instances in multiple partitions

No early filtering
◦ Even though the skyline probability Psky(u) of every instance u of U is less than Tp, we cannot prune U since Psky(U) could still be at least Tp
 𝑃𝑠𝑘𝑦(𝑈) = Σ_{𝑢𝑖∈𝑈} 𝑃𝑠𝑘𝑦(𝑢𝑖)

PS-QPF-MR consists of two phases
◦ Build a quadtree using a sample (without MapReduce) to
split data into partitions
◦ Compute the probabilistic skyline for each partition
independently in parallel by using MapReduce


We devised three filtering techniques to reduce the
number of dominance checks
We developed optimization techniques
◦ Reducing memory usage
◦ Reducing network overhead
◦ Balancing workloads

Quadtrees subdivide the 𝑑-dimensional space
recursively into sub-regions [Finkel and Bentley:
Acta Informatica 1974]
◦ Internal nodes have exactly 2^𝑑 children
◦ Each leaf node has at most a predefined number of points 𝜌

Build a quadtree by using sample objects
◦ For example, assume that a1, b1, b2 and c2 are sampled and the maximum number of instances in a leaf node is 𝜌 = 2
(Figure: the [0,100]×[0,100] space is split into the four leaf nodes node(00), node(01), node(10) and node(11).)
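A minimal sketch of such a build over 2-d points (the region-splitting scheme and node ids are my assumptions; the sample and 𝜌 = 2 follow the example):

```python
def build_quadtree(points, region, rho=2):
    # region = (x0, y0, x1, y1); a leaf keeps at most rho points
    if len(points) <= rho:
        return {"region": region, "points": points}
    x0, y0, x1, y1 = region
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {"00": (x0, y0, xm, ym), "10": (xm, y0, x1, ym),
             "01": (x0, ym, xm, y1), "11": (xm, ym, x1, y1)}
    children = {}
    for node_id, (qx0, qy0, qx1, qy1) in quads.items():
        # half-open quadrants; points on the top/right boundary are ignored in this sketch
        inside = [(x, y) for x, y in points if qx0 <= x < qx1 and qy0 <= y < qy1]
        children[node_id] = build_quadtree(inside, (qx0, qy0, qx1, qy1), rho)
    return {"region": region, "children": children}

def leaves(node, node_id=""):
    if "points" in node:
        yield node_id, node["points"]
    else:
        for cid, child in node["children"].items():
            yield from leaves(child, cid)

sample = [(10, 40), (55, 20), (65, 30), (80, 70)]   # a1, b1, b2, c2
tree = build_quadtree(sample, (0, 0, 100, 100))
print(dict(leaves(tree)))
# → {'00': [(10, 40)], '10': [(55, 20), (65, 30)], '01': [], '11': [(80, 70)]}
```

With 𝜌 = 2 the four sampled instances overflow the root, so the space is split once, matching the figure's four leaf nodes.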

To reduce the number of dominance checks between instances, we apply three filtering techniques before distributing the objects to the leaf nodes.
◦ Upper-bound filtering
 Compute the upper bound of the skyline probability of an object by using the quadtree
◦ Zero-probability filtering
 Instances with zero skyline probability are removed
◦ Dominance-power filtering
 Maintain a small number of objects with high dominating power to check whether another object is a probabilistic skyline candidate or not

(Figure: the instance y1 is dominated by x1, x2 and w1, but only x2 dominates the min point of y1's leaf node:
Psky(y1) = P(y1) × (1 − P(x1) − P(x2)) × (1 − P(w1)) ≤ P(y1) × (1 − P(x2)))

Consider a leaf node n
◦ Every instance dominating n.min also dominates the instances in n
 Compute the upper bound of the skyline probabilities of the instances in n using the probability that the instances dominating the min point do not exist
 When we build a quadtree with a sample S, for each leaf node n, we compute the probability Pup(n, S) that the instances in S dominating n.min do not exist

(Figure: a leaf node n contains z1 and z2; x2 and w1 dominate the min point of n:
Psky(z1) = P(z1) × (1 − P(x1) − P(x2)) × (1 − P(w1) − P(w2)) ≤ P(z1) × (1 − P(x2)) × (1 − P(w1))
Psky(z2) = P(z2) × (1 − P(x2)) × (1 − P(w1) − P(w2)) ≤ P(z2) × (1 − P(x2)) × (1 − P(w1))
where (1 − P(x2)) × (1 − P(w1)) is the probability that the instances dominating the min point do not exist.)


For each object U, the upper bound of Psky(U)
is the sum of the upper bounds of the
skyline probabilities of its instances
If the upper bound is less than Tp
◦ We do not compute the exact skyline probability
of the object since it is not skyline object
Psky(z1) ≤ 0.3
Psky(z2) ≤ 0.2
Psky(Z)
= Psky(z1) + Psky(z2) ≤ 0.3 + 0.2 = 0.5
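A sketch of upper-bound filtering; the sampled instances, probabilities, leaf min corner, and threshold below are all hypothetical illustration values:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def p_up(n_min, sample):
    # Pup(n, S): probability that no sampled instance dominating n.min exists
    prob = 1.0
    for insts in sample.values():                  # one factor per sampled object
        prob *= 1 - sum(p for v, p in insts if dominates(v, n_min))
    return prob

# hypothetical sample S grouped by object, and a leaf node with min corner (40, 40)
sample = {"X": [((30, 20), 0.5)], "W": [((20, 30), 0.4)]}
pup = p_up((40, 40), sample)                       # (1 - 0.5) * (1 - 0.4) = 0.3

# hypothetical object Z whose instances all lie in that leaf node
z = [((50, 50), 0.8), ((60, 45), 0.4)]
upper_bound = sum(p * pup for _, p in z)           # (0.8 + 0.4) * 0.3 = 0.36
tp = 0.5
print(upper_bound < tp)  # → True: Z is pruned without computing Psky(Z) exactly
```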


A specific case of upper-bound filtering
◦ If Pup(n, S) = 0, we remove every instance in the leaf node n
For example,
◦ Psky (y1) = P(y1)*(1-P(x1)-P(x2))(1-P(w1)).
◦ Suppose X = {x1, x2}  S, where P(x1)+P(x2)=1, then Pup(n,S) = 0.
◦ Then, Psky (y1) = 0.0 since (1-P(x1)-P(x2)) = 0.0
◦ Note that, Psky(Y) = Psky(y1)+Psky(y2) = Psky(y2)
up
◦ In addition,
 Psky (z1) = P(z1)*(1-P(x1)-P(x2)) (1-P(y1))(1-P(w1))

= P(z1)*(1-P(x1)-P(x2))(1-P(w1))
We eliminate the instances in a leaf node n
when P (n, S) = 0.0
◦ We can remove such instances y1 and z1
y2
Leaf node
z n
Object
Instance
Probability
x1
0.6
x2
0.4
Y
y1
0.2
Z
z1
0.8
1
X
n.min
x1
y1
x2
w1

Basic idea is similar to upper-bound filtering.
◦ 𝑃𝑠𝑘𝑦(𝑈) = Σ_{𝑢𝑖∈𝑈} 𝑃(𝑢𝑖) × ∏_{𝑉∈𝔻, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗))
◦ For an object U and a set F ⊆ 𝔻, Σ_{𝑢𝑖∈𝑈} 𝑃(𝑢𝑖) × ∏_{𝑉∈𝐹, 𝑉≠𝑈} (1 − Σ_{𝑣𝑗∈𝑉, 𝑣𝑗≺𝑢𝑖} 𝑃(𝑣𝑗)) is an upper bound of 𝑃𝑠𝑘𝑦(𝑈); if it is less than Tp, U can be pruned
◦ Higher probability → a tighter upper bound is computed
◦ A larger dominating area → more instances may be dominated by the instance

(Figure: object X with instances x1 (P = 0.6) and x2 (P = 0.4) and their dominating areas.)
this instance


We maintain a dominating object set F per each
mapper dynamically.
In MapReduce, a mapper takes a set of objects,
called chunk, and the mapper calls map function
with each object.
◦ To maintain top-K highest dominance power objects as F,
we utilize min-heap.
Mapper
◦ In a map function with an object U,
Min heap
after applying three filtering techniques,
Map (object A)
U is put into min-heap
if DP(U) > DP(min-heap.root)
Map (object B)
or |min-heap| < K
Map (object C)
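A sketch of that maintenance with Python's `heapq` (the dominance-power scores below are hypothetical placeholders for DP(U)):

```python
import heapq

def update_F(heap, obj, dp, K):
    # keep the K objects with the highest dominance power; heap root is the minimum
    if len(heap) < K:
        heapq.heappush(heap, (dp, obj))
    elif dp > heap[0][0]:             # better than the weakest member of F
        heapq.heapreplace(heap, (dp, obj))

F = []                                # min-heap of (DP(U), U) pairs
for obj, dp in [("A", 0.9), ("B", 0.4), ("D", 0.64), ("C", 0.03)]:
    update_F(F, obj, dp, K=2)

print(sorted(F, reverse=True))  # → [(0.9, 'A'), (0.64, 'D')]
```

Each map call costs O(log K) for the heap update, and F never exceeds K objects.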

If the skyline probabilities of the same object's instances are computed in different machines
◦ We need an extra MapReduce phase to compute the skyline probability of an object by summing the skyline probabilities of its instances (e.g., Psky(A) = Psky(a1) + Psky(a2))

To compute the probabilistic skyline without an extra MapReduce phase
◦ We allocate all instances of each object to a single partition

Distributes objects based on the leaf nodes of a quadtree
◦ In each partition, the skyline probabilities of the objects whose max points are in the leaf node of the partition are computed
◦ The max point of an object U is [max_{𝑢𝑖∈𝑈} 𝑢𝑖(1), …, max_{𝑢𝑖∈𝑈} 𝑢𝑖(𝑑)]
(Figure: the max point of the object A = {a1, a2} lies in node(10).)

Partition | M-list
node(00) | None
node(01) | None
node(10) | a1, a2, b1, b2
node(11) | c1, c2, d1, d2

M-list: the list of instances of the objects whose max points are in the node.
• We also require other instances to compute the skyline probabilities
 To compute the skyline probability of c2, we require a1, a2, b1 and b2
 To do so, we utilized a spatial relationship between leaf nodes

Definition: A leaf node n1 weakly dominates a leaf node n2 if n1.min(k) < n2.max(k) for k = 1, ..., d, where n.min (n.max) is the closest (farthest) corner of the leaf node n from the origin.

Lemma: If a leaf node n1 does not weakly dominate a leaf node n2, every instance in n1 does not dominate any instance of the objects allocated to n2.

(Figure: since node(01) does not weakly dominate node(10), d1 does not dominate any instance of the objects A and B allocated to node(10).)

Based on the Lemma, if n1 weakly dominates n2, every instance in n1 may dominate an instance of an object U allocated to n2.
◦ Note that node(10) weakly dominates node(11).

R-list: the list of instances required to compute the skyline probabilities of the instances in the M-list.

Partition | M-list | R-list
node(00) | None | a1
node(01) | None | a1, d1
node(10) | a1, a2, b1, b2 | d2
node(11) | c1, c2, d1, d2 | a1, a2, b1, b2
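A sketch that rebuilds the R-list column of this table from the leaf layout (the leaf regions, instance placement, and object allocation follow the example; the helper names are mine):

```python
def weakly_dominates(n1, n2):
    # n1 weakly dominates n2 if n1.min(k) < n2.max(k) in every dimension k
    return all(m < M for m, M in zip(n1["min"], n2["max"]))

# leaf nodes of the example quadtree over [0,100] x [0,100]
leaves = {
    "00": {"min": (0, 0),   "max": (50, 50)},
    "10": {"min": (50, 0),  "max": (100, 50)},
    "01": {"min": (0, 50),  "max": (50, 100)},
    "11": {"min": (50, 50), "max": (100, 100)},
}
# instance -> (owning object, leaf node containing the instance)
instances = {"a1": ("A", "00"), "a2": ("A", "10"), "b1": ("B", "10"), "b2": ("B", "10"),
             "c1": ("C", "11"), "c2": ("C", "11"), "d1": ("D", "01"), "d2": ("D", "10")}
# object -> leaf node holding its max point (the partition the object is allocated to)
allocated = {"A": "10", "B": "10", "C": "11", "D": "11"}

def r_list(node_id):
    # instances of objects allocated elsewhere whose leaf weakly dominates this node
    return sorted(u for u, (obj, leaf) in instances.items()
                  if allocated[obj] != node_id
                  and weakly_dominates(leaves[leaf], leaves[node_id]))

for n in ["00", "01", "10", "11"]:
    print(f"node({n}): R-list = {r_list(n)}")
# → node(00): ['a1'], node(01): ['a1', 'd1'],
#   node(10): ['d2'], node(11): ['a1', 'a2', 'b1', 'b2']
```

The strict `<` test is what excludes, e.g., node(01) from node(10)'s R-list sources: node(01).min(1) = 50 is not below node(10).max(1) = 50.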

We compute the probabilistic skyline objects in each partition in parallel (with Tp = 0.5).

Partition | M-list | R-list
node(00) | None | a1
node(01) | None | a1, d1
node(10) | a1, a2, b1, b2 | d2
node(11) | c1, c2, d1, d2 | a1, a2, b1, b2

Time complexity = O(|M-list| × (|M-list| + |R-list|))
Space complexity = O(|M-list| + |R-list|)

In the partition with node(10):
 Psky(a1) = 0.5, Psky(a2) = 0.4, Psky(b1) = 0.2, Psky(b2) = 0.2
 Psky(A) = 0.9 and Psky(B) = 0.4; A is a probabilistic skyline object and B is not

In the partition with node(11):
 Psky(c1) = 0.024, Psky(c2) = 0.006, Psky(d1) = 0.4, Psky(d2) = 0.24
 Psky(C) = 0.03 and Psky(D) = 0.64; D is a probabilistic skyline object

Thus, the probabilistic skyline objects are A and D.

We can reduce the memory usage if
◦ All instances in the M-list appear first
◦ All instances in the R-list appear next

We can sort the input of the reduce function by using the secondary sorting functionality provided by the MapReduce framework.

For the partition with node(11) (M-list: c1, c2, d1, d2; R-list: a1, a2, b1, b2), the sorted input is
 c1 "M", c2 "M", d1 "M", d2 "M", a1 "R", a2 "R", b1 "R", b2 "R"
The M-list instances are kept in memory; each R-list instance is compared with all instances in the M-list and then discarded (e.g., a1 is discarded after its comparisons).
Thus, the space complexity drops from O(|M-list| + |R-list|) to O(|M-list|).
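A single-machine sketch of that reduce function: only M-tagged records are buffered, while R-tagged records are streamed and dropped. The record layout and helper names are mine; the data is the node(11) partition:

```python
from collections import defaultdict

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def reduce_partition(records, tp):
    m = []                                   # buffered M-list: O(|M-list|) memory
    for inst, pt, p, obj, tag in records:
        if tag == "M":
            m.append({"pt": pt, "p": p, "obj": obj, "mass": defaultdict(float)})
        else:                                # R record: compare against M, then discard
            for e in m:
                if dominates(pt, e["pt"]):
                    e["mass"][obj] += p
    for e in m:                              # M-list instances of other objects also count
        for f in m:
            if f["obj"] != e["obj"] and dominates(f["pt"], e["pt"]):
                e["mass"][f["obj"]] += f["p"]
    psky = defaultdict(float)
    for e in m:
        prob = e["p"]
        for mass in e["mass"].values():      # one (1 - sum) factor per dominating object
            prob *= 1 - mass
        psky[e["obj"]] += prob
    return {obj: v for obj, v in psky.items() if v >= tp}

# node(11): M-list records come first thanks to the secondary sort
records = [("c1", (95, 60), 0.8, "C", "M"), ("c2", (80, 70), 0.1, "C", "M"),
           ("d1", (5, 80), 0.4, "D", "M"), ("d2", (90, 25), 0.5, "D", "M"),
           ("a1", (10, 40), 0.5, "A", "R"), ("a2", (75, 10), 0.4, "A", "R"),
           ("b1", (55, 20), 0.2, "B", "R"), ("b2", (65, 30), 0.2, "B", "R")]
print({o: round(v, 4) for o, v in reduce_partition(records, tp=0.5).items()})
# → {'D': 0.64}
```

C is computed (Psky(C) = 0.03) but filtered out by Tp = 0.5, matching the example.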


As # of partitions increases, the network overhead increases
Partition merging can reduce the number of instances
transmitted by network
We assume we can keep 4
instances in main memory
Partition
M-list
R-list
node(00)
None
a1
Partition
M-list
R-list
node(01)
None
a1, d1
node(10)
d2
node(10)
a1, a2, b1,
b2
d2
a1, a2, b1,
b2
Similar to the
Merged
c1, c2, d1,
a1, a2, b1,
bin packing
d2
b2
node(11) c1, c2, d1,
a1, a2, b
1,
problem
which
 When
we need to consider the memory
d2 we merge
b2 partition,
is NP-Complete
constraint Memory usage = O(|MCannot fit in main
list|)
memory
We develop an
Partition M-list
R-list approximation
Partition M-list
R-list
algorithmMerged’ a , a , b ,
node(10) a , a , b ,
d
None
1
b2
2
1
2
1
2
1
b2,
c1, c2, d1,

After merging partitions, the |M-list|s of the partitions are similar
◦ But the size of the R-list may be skewed
◦ The number of partitions may be less than the number of machines
◦ We have to balance |R-list| by splitting the R-lists

Time complexity = O(|M-list| × (|M-list| + |R-list|))

When we can use 3 machines, we split the R-list of Merged with an optimal greedy algorithm (greedy heuristic: split the R-list into equi-sized sub-lists):

Partition | M-list | R-list
node(10) | a1, a2, b1, b2 | d2
Merged1 | c1, c2, d1, d2 | a1, a2
Merged2 | c1, c2, d1, d2 | b1, b2
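The equi-sized split heuristic itself is straightforward; a sketch (the function name is mine):

```python
def split_r_list(r_list, n_sublists):
    # split into n_sublists contiguous pieces whose sizes differ by at most one
    size, extra = divmod(len(r_list), n_sublists)
    out, start = [], 0
    for i in range(n_sublists):
        end = start + size + (1 if i < extra else 0)
        out.append(r_list[start:end])
        start = end
    return out

# splitting Merged's R-list over two machines, as in the table above
print(split_r_list(["a1", "a2", "b1", "b2"], 2))  # → [['a1', 'a2'], ['b1', 'b2']]
```

Each sub-list is paired with a copy of the M-list, so the per-machine cost O(|M-list| × (|M-list| + |R-list|/n)) becomes roughly balanced.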