
Probabilistic n-of-N Skyline Computation
over Uncertain Data Streams
Wenjie Zhang
University of New South Wales
Joint work with: Aiping Li (NUDT), Ying Zhang, Muhammad Aamir Cheema, Lijun Chang (UNSW)
Outline
Overview
Algorithms
Experiment
Conclusion
Overview --- skyline

Skyline computation plays a vital role in daily life
Smaller screen (lighter)
Higher CPU speed
Lower price
Overview
Higher star rating
Shorter distance to airport
Lower price
Overview --- skyline
Skyline: candidates of best options in multi-criteria
decision applications.




n-dimensional numeric space D = (D1, …, Dn)
On each dimension, a user preference ≻ is defined
For two points u and v, u dominates v (u ≻ v) if
  ∀ Di (1 ≤ i ≤ n), u.Di ⪰ v.Di
  ∃ Dj (1 ≤ j ≤ n), u.Dj ≻ v.Dj
Skyline: points not dominated by any other point.
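
The dominance test and the skyline definition can be sketched in a few lines of Python. This is a minimal, naive illustration only: it assumes that smaller values are preferred on every dimension, and the hotel data (price, distance to airport) is hypothetical.

```python
from typing import Sequence

Point = Sequence[float]

def dominates(u: Point, v: Point) -> bool:
    """True if u dominates v, assuming smaller is preferred on every dimension."""
    no_worse = all(a <= b for a, b in zip(u, v))        # u is at least as good everywhere
    strictly_better = any(a < b for a, b in zip(u, v))  # and strictly better somewhere
    return no_worse and strictly_better

def skyline(points: Sequence[Point]) -> list:
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(o, p) for o in points)]

# Hypothetical hotels as (price, distance to airport in km).
hotels = [(120, 5.0), (90, 8.0), (150, 4.0), (130, 6.0), (95, 9.0)]
print(skyline(hotels))  # [(120, 5.0), (90, 8.0), (150, 4.0)]
```

Here (130, 6.0) is dominated by (120, 5.0) and (95, 9.0) is dominated by (90, 8.0), so neither is a skyline point.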
Overview --- uncertainty exists
Overview --- streaming

Streaming environment
• Online trading system
• Stock management
• Financial market
• Real estate monitoring
• ……
Overview --- a conceptual example
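[Figure: a stream of elements, each labelled with its arrival order and occurrence probability: 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8), 5 (0.1), 6 (0.5)]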



Elements continuously arrive with occurrence probabilities
Problem: how to continuously compute skylines over a sliding
window of size N (elements)?
Sliding window: N = 5
Overview --- n-of-N



Different users may have different window sizes
How to support different N?
n-of-N model [ICDE 2005, Lin et al]
• Supports any window size n as long as n ≤ N
n-of-N skyline over uncertain streams
Related work
Probabilistic skyline computation
• Probabilistic skyline (VLDB07)
• Probabilistic reverse skyline (SIGMOD08)
• Probabilistic stream skyline (ICDE 09)

Uncertain stream processing
• Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
• Frequent items on uncertain streams (SIGMOD08)
• Top-k queries over uncertain sliding window (VLDB08)
• ……
Models and Problem Definition

Model: DS is a stream of elements; each element a lies in
a d-dimensional space and has an occurrence
probability P(a) ∈ (0, 1]
The skyline probability of an element a is:
$P_{sky}(a) = P(a) \times \prod_{a' \in DS,\ a' \succ a} (1 - P(a'))$

Problem Definition: retrieving elements from the most
recent n (n ≤ N) elements, with skyline probability no
less than a given threshold q
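
As a baseline, the problem definition can be evaluated directly by recomputing every skyline probability over the most recent n elements. The sketch below is a naive illustration, not the algorithm proposed in this talk. It assumes smaller values are preferred on every dimension, and the `Element` type, the function names, and the sample stream are all hypothetical.

```python
from math import prod
from typing import List, NamedTuple, Tuple

class Element(NamedTuple):
    point: Tuple[float, ...]  # coordinates in the d-dimensional space
    prob: float               # occurrence probability P(a), in (0, 1]

def dominates(u, v) -> bool:
    """True if u dominates v, assuming smaller is preferred on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline_probability(a: Element, window: List[Element]) -> float:
    """P(a) times the probability that no dominating element in the window occurs."""
    return a.prob * prod(1.0 - o.prob for o in window if dominates(o.point, a.point))

def n_of_n_skyline(stream: List[Element], n: int, q: float) -> List[Element]:
    """Elements among the most recent n whose skyline probability is at least q."""
    if n < 1:
        raise ValueError("n must be at least 1")
    window = stream[-n:]
    return [a for a in window if skyline_probability(a, window) >= q]

# Hypothetical stream, oldest element first.
stream = [
    Element((1, 1), 0.3),
    Element((2, 2), 0.9),
    Element((3, 1), 0.8),
    Element((1, 3), 0.6),
]
print([a.point for a in n_of_n_skyline(stream, 4, 0.5)])  # [(2, 2), (3, 1)]
print([a.point for a in n_of_n_skyline(stream, 3, 0.5)])  # [(2, 2), (3, 1), (1, 3)]
```

With n = 4 the element (1, 1) dominates all others but has a skyline probability of only 0.3, so it is excluded. With n = 3 it leaves the window, and (1, 3) rises from 0.42 to 0.6.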
Challenges and Contributions

Space efficiency:
• N can be too large to fit in memory
• Space reduction: O(N) to O(ln^{d-1} N)

Time efficiency:
• Elements in the sliding window change continuously
• Naively re-computing with each change is cost-prohibitive
Framework: what to keep ?
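$P_{sky}(a) = P(a) \times P_{old}(a) \times P_{new}(a)$
Pold(a) covers the elements dominating a that arrived before a; Pnew(a) covers those that arrived after a.
Example for element 2:
• Pold(2) = 1 - P(1)
• Pnew(2) = (1 - P(3)) × (1 - P(4))
[Figure: stream elements with occurrence probabilities: 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8), 5 (0.1)]
Window size N: 5; probability threshold q: 0.5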
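
The decomposition for element 2 can be checked numerically. The sketch below uses the occurrence probabilities from the slide. The dominators of element 2 are taken from the two formulas above (element 1 is older; elements 3 and 4 are newer), since the slide does not give coordinates.

```python
from math import prod

# Occurrence probabilities from the slide, keyed by arrival order.
P = {1: 0.1, 2: 0.1, 3: 0.4, 4: 0.8, 5: 0.1}
q = 0.5

# Dominators of element 2, as implied by the slide's formulas (assumption:
# no other element in the window dominates element 2).
older_dominators = [1]
newer_dominators = [3, 4]

p_old = prod(1 - P[i] for i in older_dominators)  # 1 - 0.1
p_new = prod(1 - P[i] for i in newer_dominators)  # (1 - 0.4) * (1 - 0.8)
p_sky = P[2] * p_old * p_new

print(f"Pold(2) = {p_old:.2f}, Pnew(2) = {p_new:.2f}, Psky(2) = {p_sky:.4f}")
print("Pnew(2) >= q:", p_new >= q)
```

Elements 3 and 4 arrived after element 2, so they stay in every window that still contains it. Pnew(2) = 0.12 is below q = 0.5, so element 2 can never reach the threshold.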
Framework: what to keep ?

Candidate set S_{N,q}: elements a with Pnew(a) ≥ q [ICDE09, Zhang et al]
• Correctness:
(1) no missing skyline points
(2) no false hits to determine S_{N,q}
(3) no false positives to determine skyline results
(4) no false negatives to determine skyline results
Note: a probability computed from S_{N,q} alone may not be exact, but it
still decides the threshold test correctly.
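
One way to picture the candidate set is a list that is updated on every arrival: expire elements that leave the window of size N, scale Pnew for every stored element the newcomer dominates, and drop elements whose Pnew falls below q. The sketch below is a naive linear-scan illustration, not the indexed structure used in the talk. The class names and coordinates are hypothetical, and smaller values are assumed to be preferred.

```python
from dataclasses import dataclass
from typing import List, Tuple

def dominates(u, v) -> bool:
    """True if u dominates v, assuming smaller is preferred on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

@dataclass
class Candidate:
    seq: int                   # arrival position in the stream
    point: Tuple[float, ...]
    prob: float                # occurrence probability P(a)
    p_new: float = 1.0         # product of (1 - P(a')) over newer dominators a'

class CandidateSet:
    def __init__(self, N: int, q: float) -> None:
        self.N, self.q = N, q
        self.items: List[Candidate] = []

    def insert(self, seq: int, point: Tuple[float, ...], prob: float) -> None:
        # Expire elements outside the window of the N most recent arrivals.
        self.items = [c for c in self.items if c.seq > seq - self.N]
        # The newcomer lowers Pnew of every stored element it dominates.
        for c in self.items:
            if dominates(point, c.point):
                c.p_new *= 1.0 - prob
        # Elements with Pnew < q can no longer reach the threshold.
        self.items = [c for c in self.items if c.p_new >= self.q]
        self.items.append(Candidate(seq, point, prob))

# Probabilities follow the slides; coordinates are hypothetical and chosen so
# that elements 1, 3 and 4 dominate element 2.
arrivals = [((4, 4), 0.1), ((5, 5), 0.1), ((3, 4.5), 0.4), ((4.5, 3), 0.8), ((1, 9), 0.1)]
cs = CandidateSet(N=5, q=0.5)
for seq, (point, prob) in enumerate(arrivals, start=1):
    cs.insert(seq, point, prob)
seqs_after_5 = [c.seq for c in cs.items]
print(seqs_after_5)  # [1, 3, 4, 5]

cs.insert(6, (6, 6), 0.5)  # element 1 now falls out of the window
seqs_after_6 = [c.seq for c in cs.items]
print(seqs_after_6)  # [3, 4, 5, 6]
```

Element 2 is dropped when element 4 arrives, because its Pnew falls to 0.6 × 0.2 = 0.12, which is below q.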
Space of Candidate Set

Theorem: the candidate set requires poly-logarithmic
space in the average case under uniform
distributions: O(f(q) ln^{d-1} N).
Result Set

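Critical dominance relations
[Figure: stream a1 a2 a3 a4 a5 a6 a7 a8 a9 a10, oldest to newest]
Psky(a8) = (1 - P(a2)) × (1 - P(a3)) × (1 - P(a5)) × (1 - P(a10)) × P(a8)
where Pold(a8) = (1 - P(a2)) × (1 - P(a3)) × (1 - P(a5)) and Pnew(a8) = 1 - P(a10)
a3 critically dominates a8:
a8 is a probabilistic skyline point for the most recent n
elements whenever the window excludes a3, i.e. n ≤ 7 (= 10 - 3)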
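
The idea can be sketched as follows. Start from P(a) × Pnew(a), then fold in the older dominators from youngest to oldest; the first one that pushes the running probability below q is the critical dominator. An n-of-N query then needs only two position comparisons. This is a minimal sketch: the probabilities are hypothetical (the slide gives none), the function names are illustrative, and the window of size n is assumed to hold positions M - n + 1 to M of a stream with M elements.

```python
from typing import List, Optional, Tuple

def critical_dominator(p_a: float, newer: List[float],
                       older: List[Tuple[int, float]], q: float) -> Optional[int]:
    """Position of the youngest older dominator that pushes the probability below q.

    newer: occurrence probabilities of dominators that arrived after a.
    older: (position, probability) of dominators that arrived before a.
    Returns None if a stays at or above q even with all older dominators.
    """
    running = p_a
    for p in newer:
        running *= 1.0 - p
    if running < q:
        raise ValueError("element is not a candidate: P(a) * Pnew(a) < q")
    for pos, p in sorted(older, reverse=True):  # youngest older dominator first
        running *= 1.0 - p
        if running < q:
            return pos
    return None

def in_n_of_n_skyline(n: int, M: int, pos_a: int, crit: Optional[int]) -> bool:
    """a qualifies if it is inside the window and its critical dominator is outside."""
    in_window = pos_a > M - n
    crit_excluded = crit is None or crit <= M - n
    return in_window and crit_excluded

# Hypothetical probabilities for the slide's example (a8 with dominators a2, a3, a5, a10).
q, M = 0.5, 10
crit = critical_dominator(p_a=0.9, newer=[0.1],
                          older=[(2, 0.2), (3, 0.4), (5, 0.1)], q=q)
valid_n = [n for n in range(1, M + 1) if in_n_of_n_skyline(n, M, 8, crit)]
print(crit)     # 3
print(valid_n)  # [3, 4, 5, 6, 7]
```

With these numbers the running probability is 0.81 after a10, 0.729 after a5, and 0.4374 after a3, so a3 is the critical dominator. a8 qualifies for every window that contains a8 but not a3.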
Algorithm



R-tree based indexing
Use a dominance check technique to quickly identify critical dominance relations
Update as the window slides
• Insert
• Delete
Experiment

Data sets:
• Real: stock transactions; 2-d; probability assigned randomly; size: 2 million
• Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2-d to 5-d; size: 2 million
• Default values: p: 0.3; d: 3; N: 1M; spatial distribution: anti-correlated; probability: uniform
Experiment

Algorithms
• q-sky: the algorithm in [ICDE09, Zhang et al] for keeping the candidate set. For an n-of-N query, it naively checks each element in the candidate set.
• pnN: our query processing algorithm, utilizing the critical dominance relation.
• pmnN: our algorithm for continuously maintaining the data structure that supports pnN.
Experiment - pnN
Experiment - scalability
Conclusion and Future Work

Conclusion
• Probabilistic skyline in data streams following the n-of-N model

Future work
• Computation sharing
• More general uncertain model
Thanks!
Framework

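Space required for S_{N,q}:
• S_{N,q} is the minimum information that must be maintained to get
a correct answer.
Without elements 1 and 2: Psky(3) = 0.9 ≥ q
With elements 1 and 2: Psky(3) = 0.9 × (1 - 0.4) × (1 - 0.3) < q
[Figure: stream elements with occurrence probabilities: 1 (0.3), 2 (0.4), 3 (0.9), 4 (0.8)]
Window size N: 4; probability threshold q: 0.5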