Probabilistic n-of-N Skyline Computation over Uncertain Data Streams
Wenjie Zhang, University of New South Wales
Joint work: Aiping Li (NUDT), Ying Zhang, Muhammad Aamir Cheema, Lijun Chang (UNSW)

Outline
  Overview
  Algorithms
  Experiment
  Conclusion

Overview --- skyline
Skyline computation plays a vital role in daily life.
  Smaller screen (lighter); higher CPU speed; lower price

Overview
  Higher star; shorter distance to airport; lower price

Overview --- skyline
Skyline: candidates of best options in multi-criteria decision applications.
  n-dimensional numeric space D = (D1, …, Dn)
  On each dimension, a user preference ≻ is defined.
  For two points u and v, u dominates v (u ≻ v) if
    ∀ Di (1 ≤ i ≤ n), u.Di ≽ v.Di, and
    ∃ Dj (1 ≤ j ≤ n), u.Dj ≻ v.Dj
Skyline: points not dominated by any other point.

Overview --- uncertainty exists

Overview --- streaming
Streaming environment
  Online trading system
  Stock management
  Financial market
  Real estate monitoring
  ……

Overview --- a conceptual example
Elements continuously arrive with occurrence probabilities.
  Element (probability): 2 (0.1), 1 (0.1), 4 (0.8), 6 (0.5), 3 (0.4), 5 (0.1)
  Sliding window: N = 5
Problem: How to continuously compute skylines in a sliding window with size N (elements)?

Overview --- n-of-N
Different users may have different window sizes.
Supporting different N?
n-of-N model [ICDE 2005, Lin et al.]: support any window size n as long as n ≤ N
n-of-N skyline over uncertain streams

Related work
Probabilistic skyline computation
  Probabilistic stream skyline (ICDE 09)
  Probabilistic skyline (VLDB07)
Uncertain stream processing
  Probabilistic reverse skyline (SIGMOD08)
  Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
  Frequent items on uncertain streams (SIGMOD08)
  Top-k queries over uncertain sliding window (VLDB08)
  ……

Models and Problem Definition
Model: DS is a stream of elements; each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1].
The skyline probability of an element a is:
  $P_{sky}(a) = P(a) \times \prod_{a' \in DS,\ a' \succ a} (1 - P(a'))$
Problem definition: retrieve the elements among the most recent n (n ≤ N) elements whose skyline probability is no less than a given threshold q.

Challenges and Contributions
Space efficiency
  N can be too large to fit in memory
  Space reduction: $O(N)$ to $O(\ln^{d-1} N)$
Time efficiency
  Elements in the sliding window continuously change
  Naively re-computing with each change: cost prohibitive

Framework: what to keep?
  $P_{sky}(a) = P(a) \times P_{old}(a) \times P_{new}(a)$
Example (window size N: 5; probability threshold: 0.5)
  Element (probability): 2 (0.1), 1 (0.1), 4 (0.8), 3 (0.4), 5 (0.1)
  $P_{old}(2) = 1 - P(1)$
  $P_{new}(2) = (1 - P(3)) \times (1 - P(4))$

Framework: what to keep?
Candidate set $S_{N,q}$: elements a with $P_{new}(a) \ge q$ [ICDE09, Zhang et al.]
Correctness:
  (1) no missing skyline points
  (2) no false hits when determining $S_{N,q}$
  (3) no false positives when determining skyline results
  (4) no false negatives when determining skyline results
Note: a probability computed from $S_{N,q}$ alone may not be accurate, but it still satisfies the threshold requirement.

Space of Candidate Set
Theorem: for uniform distributions, the candidate set requires poly-logarithmic space in the average case, $O(f(q) \ln^{d-1} N)$.
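The definitions above (dominance, skyline probability, the Pold/Pnew split, and the candidate set) can be illustrated with a small brute-force sketch. This is not the algorithm from the talk; it only restates the formulas in code. The assumptions are ours: smaller values are preferred on every dimension, the window is a list in arrival order (oldest first), and the 2-d coordinates are hypothetical, chosen so that elements 1, 3 and 4 dominate element 2 as in the slide example. Only the probabilities come from the slides.

```python
from math import prod
from typing import NamedTuple


class Element(NamedTuple):
    coords: tuple  # assumption: smaller is better on every dimension
    prob: float    # occurrence probability in (0, 1]


def dominates(u: Element, v: Element) -> bool:
    """u dominates v: no worse on all dimensions, strictly better on one."""
    no_worse = all(x <= y for x, y in zip(u.coords, v.coords))
    better = any(x < y for x, y in zip(u.coords, v.coords))
    return no_worse and better


def p_old(window: list, i: int) -> float:
    """Product of (1 - P(a')) over OLDER elements a' that dominate window[i]."""
    return prod(1 - e.prob for e in window[:i] if dominates(e, window[i]))


def p_new(window: list, i: int) -> float:
    """Product of (1 - P(a')) over NEWER elements a' that dominate window[i]."""
    return prod(1 - e.prob for e in window[i + 1:] if dominates(e, window[i]))


def p_sky(window: list, i: int) -> float:
    """Psky(a) = P(a) * Pold(a) * Pnew(a)."""
    return window[i].prob * p_old(window, i) * p_new(window, i)


def candidate_set(window: list, q: float) -> list:
    """Indices of elements with Pnew(a) >= q (brute force)."""
    return [i for i in range(len(window)) if p_new(window, i) >= q]


# Elements 1..5 in arrival order; coordinates are hypothetical.
window = [
    Element((4, 4), 0.1),  # element 1, dominates element 2
    Element((5, 5), 0.1),  # element 2
    Element((3, 5), 0.4),  # element 3, dominates element 2
    Element((5, 3), 0.8),  # element 4, dominates element 2
    Element((6, 1), 0.1),  # element 5, does not dominate element 2
]

print(p_old(window, 1))          # 1 - P(1), about 0.9
print(p_new(window, 1))          # (1 - P(3)) * (1 - P(4)), about 0.12
print(candidate_set(window, 0.5))  # [0, 2, 3, 4]
```

With threshold 0.5, element 2 (index 1) falls out of the candidate set because its Pnew is about 0.12: the newer elements 3 and 4 that dominate it stay in every window that still contains it, so it can never reach the threshold again.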
Result Set
Critical dominance relations
  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
  $P_{sky}(a_8) = (1 - P(a_2)) \times (1 - P(a_3)) \times (1 - P(a_5)) \times (1 - P(a_{10})) \times P(a_8)$
  (the dominance factors split into Pold and Pnew)
  a3 critically dominates a8: a8 is a probabilistic skyline point for the most recent n elements for any n < 7 (10 – 3)

Algorithm
R-tree based indexing
Use a dominance check technique to quickly identify critical dominance relations.
Update as the window slides
  Insert
  Delete

Experiment
Data sets:
  Real: stock transactions; 2-d; probability assigned randomly; size: 2 million
  Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2d to 5d; size: 2 million
Default values: p: 0.3; d: 3; N: 1M; spatial distribution: anti-correlated; probability: uniform

Experiment
Algorithms
  q-sky: the algorithm in [ICDE09, Zhang et al.] that keeps the candidate set. For an n-of-N query, it naively checks each element in the candidate set.
  pnN: our query processing algorithm, which uses the critical dominance relation.
  pmnN: our algorithm for continuously maintaining the data structure that supports pnN.

Experiment - pnN [figures]

Experiment - scalability [figures]

Conclusion and Future Work
Conclusion
  Probabilistic skyline in data streams following the n-of-N model
Future work
  Computation sharing
  More general uncertain model

Thanks!

Framework
Space required for $S_{N,q}$: $S_{N,q}$ is the minimum information to be maintained to get a correct answer.
Example (window size N: 4; probability threshold q: 0.5)
  Element (probability): 3 (0.9), 2 (0.4), 1 (0.3), 4 (0.8)
  $P_{sky}(3) = 0.9 \ge q$
  $0.9 \times (1 - 0.4) \times (1 - 0.3) < q$
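The numbers in the last example can be checked with a naive n-of-N query that rescans the most recent n elements. This is a brute-force baseline for illustration, not the pnN algorithm, which the talk says relies on critical dominance relations and an R-tree index. The assumptions are ours: smaller values are preferred on every dimension, labels give the arrival order, and the coordinates are hypothetical, chosen so that elements 1 and 2 dominate element 3 while element 4 does not. Reading the two inequalities as "element 3 passes the threshold for a small n but fails for n = N" is also our interpretation.

```python
from math import prod


def dominates(u: tuple, v: tuple) -> bool:
    """u dominates v, assuming smaller is better on every dimension."""
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def n_of_n_skyline(stream: list, n: int, q: float) -> list:
    """Naive n-of-N query: scan the most recent n elements of the stream.

    stream holds (label, coords, prob) tuples in arrival order (oldest first).
    Returns (label, skyline probability) pairs with probability >= q.
    """
    if not 1 <= n <= len(stream):
        raise ValueError("n must satisfy 1 <= n <= len(stream)")
    recent = stream[-n:]
    result = []
    for label, coords, p in recent:
        # Multiply in (1 - P(a')) for every element a' that dominates this one.
        p_sky = p * prod(1 - p2 for _, c2, p2 in recent if dominates(c2, coords))
        if p_sky >= q:
            result.append((label, p_sky))
    return result


# Probabilities from the example; coordinates are hypothetical.
stream = [
    (1, (2, 3), 0.3),  # dominates element 3
    (2, (3, 2), 0.4),  # dominates element 3
    (3, (4, 4), 0.9),
    (4, (5, 1), 0.8),  # does not dominate element 3
]

print([label for label, _ in n_of_n_skyline(stream, 4, 0.5)])  # [4]
print([label for label, _ in n_of_n_skyline(stream, 2, 0.5)])  # [3, 4]
```

For n = 4, element 3 has skyline probability 0.9 × (1 − 0.3) × (1 − 0.4), about 0.378, which is below q = 0.5. For n = 2, elements 1 and 2 are outside the queried window, so element 3 keeps its full 0.9 and is reported. This is why an element can be needed for some window sizes even when it fails the threshold for the full window.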