The Impact of Duality on Data Representation Problems
Panagiotis Karras
HKU, June 14th, 2007

Introduction
• Many data representation problems require the optimization of one parameter under a bound on one or more others.
• Classical approaches treat them directly, producing complicated solutions and sometimes resorting to heuristics.
• The parameters involved have a monotonic relationship.
• Hence, an alternative approach is possible, based on dual problems.

Outline
• Histograms.
• Restricted Haar Wavelet Synopses.
• Unrestricted Haar and Haar+ Synopses.
• l-Diversification in 1D.
• Compact Hierarchical Histograms.

Histograms
• Approximate a data set $[d_1, d_2, \ldots, d_n]$ with $B$ buckets $s_i = [b_i, e_i, v_i]$, so that a maximum-error metric is minimized.
• Classical solution: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005.
  $$E(i, b) = \min_{1 \le j < i} \max\{\, E(j, b-1),\ \mathrm{err}(j+1, i) \,\}$$
  Here $\mathrm{err}(j+1, i)$ is the error of a single bucket covering $d_{j+1}, \ldots, d_i$.
  Complexity: $O(nB \log^2 n)$.
• Recent solutions:
  Buragohain et al. ICDE 2007: [bound in terms of $n$, $\log U$ and $\log B$]
  Guha and Shim TKDE 19(7) 2007: $O(n + B^2 \log^3 n)$, linear for $B \le \sqrt{n / \log^3 n}$ (for $n = 2^{30} = 1{,}073{,}741{,}824$, this means $B \le 199$).
• Solve the error-bounded problem.
  Example with maximum absolute error bound $\varepsilon = 2$:
  4 5 6 2 | 15 17 | 3 6 | 9 12 …  →  [ 4 ] [ 16 ] [ 4.5 ] […
• Generalized to any weighted maximum-error metric.
  Each value $d_i$ defines a tolerance interval $[d_i - \varepsilon / w_i,\ d_i + \varepsilon / w_i]$.
  A bucket is closed when the running intersection of the intervals becomes empty.
  Complexity: $O(n)$.
• Apply to the space-bounded problem.
  Perform binary search in the domain of the error bound $\varepsilon$.
  For an error value $\varepsilon$ requiring space $\tilde{B} \le B$, with actual error $\tilde{\varepsilon} \le \varepsilon$, run an optimality test: run the error-bounded algorithm under the constraint error $< \tilde{\varepsilon}$. If this requires $\tilde{B} > B$ space instead of $\le B$, then the optimal solution has been reached.
  Complexity: $O(n \log \varepsilon^*)$, where $\varepsilon^*$ is the optimal error. Independent of the number of buckets $B$.

Restricted Haar Wavelet Synopses
• Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.
• Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB 2005.
  $$E(i, v, b) = \min\Big\{ \min_{0 \le b' \le b} \max\{ E(i_L, v, b'),\ E(i_R, v, b - b') \},\ \min_{0 \le b' \le b-1} \max\{ E(i_L, v + z_i, b'),\ E(i_R, v - z_i, b - b' - 1) \} \Big\}$$
  Complexity: $O(n^2)$.
  [Figure: example Haar error tree with coefficient and data values]
• Solve the error-bounded problem. Muthukrishnan FSTTCS 2005.
  $$S(i, v) = \min\{\, S(i_L, v) + S(i_R, v),\ S(i_L, v + z_i) + S(i_R, v - z_i) + 1 \,\}$$
  Local search within each of the $n / \log n$ subtrees in the bottom $\log\log n$ Haar tree levels.
  Complexity: $O(n^2 / \log n)$.
• Apply to the space-bounded problem.
  Complexity: $O\big((n^2 / \log n) \log \varepsilon^*\big)$; no significant advantage.

Unrestricted Haar and Haar+ Synopses
• Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.
  [Figure: Haar+ tree over data values d0–d3, with coefficient triads C1, C2, C3]
• Classical solutions:
  Guha and Harb KDD 2005, SODA 2006 (unrestricted Haar):
  $$E(i, v, b) = \min_{z \in S_i^v,\ 0 \le b' \le b - (z \ne 0)} \max\{ E(i_L, v + z, b'),\ E(i_R, v - z, b - b' - (z \ne 0)) \}$$
  Karras and Mamoulis ICDE 2007 (Haar+): $E(i, v, b)$ is the minimum of three cases, one per coefficient of the triad at node $i$:
  $$\min_{z_h \in S_{i,H}^v,\ 0 \le b' \le b - (z_h \ne 0)} \max\{ E(i_L, v + z_h, b'),\ E(i_R, v - z_h, b - b' - (z_h \ne 0)) \}$$
  $$\min_{z_l \in S_{i,L}^v,\ 0 \le b' \le b - (z_l \ne 0)} \max\{ E(i_L, v + z_l, b'),\ E(i_R, v, b - b' - (z_l \ne 0)) \}$$
  $$\min_{z_r \in S_{i,R}^v,\ 0 \le b' \le b - (z_r \ne 0)} \max\{ E(i_L, v, b'),\ E(i_R, v + z_r, b - b' - (z_r \ne 0)) \}$$
  Complexity: $O(R^2\, n \log n \log^2 B)$ time, $O(R B \log(n/B))$ space.
• Solve the error-bounded problem.
  Unrestricted Haar:
  $$S(i, v) = \min_{z \in S_i^v} \{\, S(i_L, v + z) + S(i_R, v - z) + (z \ne 0) \,\}$$
  Haar+: $S(i, v)$ is the minimum of
  $$\min_{z_h \in S_{i,H}^v} \{ S(i_L, v + z_h) + S(i_R, v - z_h) + (z_h \ne 0) \},$$
  $$\min_{z_l \in S_{i,L}^v} \{ S(i_L, v + z_l) + S(i_R, v) + (z_l \ne 0) \},$$
  $$\min_{z_r \in S_{i,R}^v} \{ S(i_L, v) + S(i_R, v + z_r) + (z_r \ne 0) \}.$$
  Complexity: $O(R^2\, n \log n)$ time, $O(R \log n + n)$ space.
• Apply to the space-bounded problem.
  Complexity: $O(R^2\, n\, (\log \varepsilon^* + \log n))$; significant time and space advantage.

l-Diversification in 1D
• Given a database table $T(A_1, A_2, \ldots, A_n)$, a quasi-identifier attribute set $QT$ is a subset of attributes that can reveal the personal identity of records.
• An equivalence class with respect to a quasi-identifier attribute set $QT$ is a set of records indistinguishable in the projection of $T$ on $QT$.
• A database table $T$ with quasi-identifier set $QT$ and sensitive attribute $S$ conforms to the l-diversity property iff each equivalence class in $T$ with respect to $QT$ has at least $l$ well-represented values of $S$ [Machanavajjhala et al. ICDE 2006].
• Utility metric: extent of the equivalence class (group).
• Other parameter: outliers, i.e. records whose quasi-identifier values are suppressed.
• A two-dimensional example.
  [Figure: two panels plotting records by Age (10 to 90) against Postcode (1 to 7), labelled with the sensitive values Lead Poisoning, Flu, Parkinson's, Hyperthyroidism]
• Study the problem in one dimension (a single quasi-identifier).
• A total order exists.
• Similar to histogram construction.
• Polynomially tractable.
  [Figure: sensitive value plotted against the quasi-identifier]
• Groups are consecutive in each sensitive value domain.
• Groups are ordered the same way in each domain.
• Example for $l = 3$.
  [Figure: records r1–r6 placed in sensitive-value domains D1–D4 along the quasi-identifier axis]
• Given an interval $I$ of extent $E$ that includes $c$ items with $m$ different sensitive values, the number of possible boundaries $B_m^c$ and the number of possible groups $C_m^c$ in $I$ are bounded by expressions in $c$ and $m$ that reach $O(2^c)$ and $O(3^c)$, respectively, when $m = c$.
  [Figure: interval of extent E on the quasi-identifier axis, with group extent e]
• Solve the outlier minimization problem.
  [Recurrence for $N(a)$: a minimum over boundaries $b$ preceding $a$, subject to the extent $E(b, a)$ and the privacy condition $PM(b, a)$]
  Complexity: $O(C_m^c\, C_m^w\, n \log n)$ time, $O(B_m^{c+w})$ space.
• Apply to the accuracy maximization problem.
  Complexity: $O(C_m^c\, C_m^w\, n\, (\log x^* + \log n))$ time, with $x^*$ the optimal value of the parameter being searched.
• Apply to the privacy maximization problem.
  Complexity: $O(C_m^c\, C_m^w\, n\, (\log x^* + \log n))$ time, with $x^*$ the optimal value of the parameter being searched.

Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions: Reiss et al. VLDB 2006.
  Complexity: $O(nB \log^2 n \log B)$ time, $O(B \log^2 n + n)$ space.
  [Figure: CHH tree with coefficients c0–c6 over data values d0–d3]
  "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]
• Solve the error-bounded problem. Next-to-bottom level case.
  The two children of node $i$ have tolerance intervals $[a, b]$ and $[c, d]$. $S(i, v)$ is 0, 1 or 2, depending on whether $[a, b] \cap [c, d]$ is empty and on whether the incoming value $v$ falls in $[a, b]$, in $[c, d]$, in both, or in neither. The coefficient $c_i = z$, or the child coefficients $c_{2i}$ and $c_{2i+1}$, are set accordingly.
• Solve the error-bounded problem. General, recursive case.
  Let $s^*_{i_L} = \min_v S(i_L, v)$ and $s^*_{i_R} = \min_v S(i_R, v)$, and define
  $$v \in L_0 \iff S(i_L, v) = s^*_{i_L}, \qquad v \in R_0 \iff S(i_R, v) = s^*_{i_R}.$$
  Then
  $$S(i, v) = s^*_{i_L} + s^*_{i_R} + \begin{cases} 0, & L_0 \cap R_0 \ne \emptyset \ \wedge\ v \in L_0 \cap R_0 \\ 1, & (L_0 \cap R_0 \ne \emptyset \ \wedge\ v \notin L_0 \cap R_0) \ \vee\ (L_0 \cap R_0 = \emptyset \ \wedge\ v \in L_0 \cup R_0) \\ 2, & L_0 \cap R_0 = \emptyset \ \wedge\ v \notin L_0 \cup R_0 \end{cases}$$
  Complexity: $O(n \log^2 n)$ time, $O(n \log n)$ space.
• Apply to the space-bounded problem.
  Complexity: $O(n \log^2 n\, (\log \varepsilon^* + \log n))$. Polynomially tractable.

Conclusions
• Offline data representation problems under constraints are more easily solvable through their counterparts that optimize another parameter.
• Dual-problem-based algorithms are simpler, more scalable, more elegant, and more memory-parsimonious than the direct ones.
• In the CHH case, the dual-problem-based algorithm achieves an optimal solution to the maximum-error longest-prefix-match CHH partitioning problem, which was considered intractable.
• Future: assessment of privacy and CHH solutions.

Related Work
• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998.
• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004.
• M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004.
• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005.
• S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005.
• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005.
• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006.
• F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006.
• A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006.
• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007.

Thank you! Questions?
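
Supplementary sketch: histograms via the dual problem

The histogram slides describe a linear scan that closes a bucket once the tolerance intervals of its values no longer share a common point, and then a binary search over the error bound to meet a bucket budget B. The following is a minimal Python sketch of that idea, not the talk's own implementation. The function names, the `weights` parameter, and the fixed number of bisection steps are assumptions made for illustration; in particular, the sketch stops after a fixed number of halvings rather than running the optimality test mentioned on the slides.

```python
from math import inf
from typing import List, Optional, Sequence, Tuple

Bucket = Tuple[int, int, float]  # (first index, last index, representative value)


def error_bounded_histogram(data: Sequence[float], eps: float,
                            weights: Optional[Sequence[float]] = None) -> List[Bucket]:
    """Greedy scan: fewest buckets whose weighted maximum error is <= eps."""
    if weights is None:
        weights = [1.0] * len(data)
    buckets: List[Bucket] = []
    start, lo, hi = 0, -inf, inf  # running intersection of tolerance intervals
    for i, (d, w) in enumerate(zip(data, weights)):
        new_lo, new_hi = max(lo, d - eps / w), min(hi, d + eps / w)
        if new_lo > new_hi:  # intersection became empty: close the bucket
            buckets.append((start, i - 1, (lo + hi) / 2))
            start, new_lo, new_hi = i, d - eps / w, d + eps / w
        lo, hi = new_lo, new_hi
    if len(data) > 0:
        buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets


def max_error(data: Sequence[float], buckets: Sequence[Bucket],
              weights: Optional[Sequence[float]] = None) -> float:
    """Weighted maximum absolute error of a histogram over the data."""
    if weights is None:
        weights = [1.0] * len(data)
    return max((weights[i] * abs(data[i] - v)
                for b, e, v in buckets for i in range(b, e + 1)), default=0.0)


def space_bounded_histogram(data: Sequence[float], B: int,
                            weights: Optional[Sequence[float]] = None,
                            iterations: int = 60) -> List[Bucket]:
    """Bisection on eps (assumed simplification): smallest bound needing <= B buckets.

    Expects non-empty data, positive weights and B >= 1.
    """
    if weights is None:
        weights = [1.0] * len(data)
    exact = error_bounded_histogram(data, 0.0, weights)
    if len(exact) <= B:
        return exact
    lo, hi = 0.0, max(weights) * (max(data) - min(data))  # hi always fits in one bucket
    best = error_bounded_histogram(data, hi, weights)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        candidate = error_bounded_histogram(data, mid, weights)
        if len(candidate) <= B:
            hi, best = mid, candidate  # feasible: try a tighter bound
        else:
            lo = mid  # infeasible: relax the bound
    return best
```

On the slide's example, `error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2)` yields buckets with values 4.0, 16.0, 4.5 and 10.5. The bisection relies on the bucket count of the greedy scan never increasing as the error bound grows, which is the monotonic relationship the introduction refers to.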
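
Supplementary sketch: minimum-space restricted Haar synopsis

The restricted Haar slides state the error-bounded recurrence S(i, v) = min{ S(i_L, v) + S(i_R, v), S(i_L, v + z_i) + S(i_R, v - z_i) + 1 }. The sketch below evaluates that recurrence directly with memoization, for a maximum absolute error bound. It is an illustration under assumptions, not the subquadratic algorithm cited on the slides: the function names and the heap-style node numbering (children of node i are 2i and 2i+1, leaves are nodes n to 2n-1) are hypothetical choices, and the running time can grow quickly with n.

```python
from functools import lru_cache
from math import inf
from typing import List, Sequence


def haar_coefficients(data: Sequence[float]) -> List[float]:
    """Haar coefficients in heap order: c[0] is the overall average,
    c[i] (i >= 1) is half the difference of the averages under node i."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    avg = [0.0] * (2 * n)
    avg[n:] = [float(d) for d in data]
    coeffs = [0.0] * n
    for i in range(n - 1, 0, -1):  # bottom-up over internal nodes
        avg[i] = (avg[2 * i] + avg[2 * i + 1]) / 2
        coeffs[i] = (avg[2 * i] - avg[2 * i + 1]) / 2
    coeffs[0] = avg[1]
    return coeffs


def min_haar_space(data: Sequence[float], eps: float) -> int:
    """Fewest retained coefficients with maximum absolute error <= eps."""
    n = len(data)
    c = haar_coefficients(data)

    @lru_cache(maxsize=None)
    def S(i: int, v: float) -> float:
        # v is the value reconstructed from the retained ancestors of node i.
        if i >= n:  # leaf: data value d_{i-n}
            return 0 if abs(v - data[i - n]) <= eps else inf
        drop = S(2 * i, v) + S(2 * i + 1, v)                    # discard z_i
        keep = S(2 * i, v + c[i]) + S(2 * i + 1, v - c[i]) + 1  # retain z_i
        return min(drop, keep)

    # The overall average c[0] is itself either dropped or retained.
    return min(S(1, 0.0), S(1, c[0]) + 1)
```

For instance, `haar_coefficients([2, 2, 0, 2, 3, 5, 4, 4])` gives `[2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]`, and `min_haar_space` on the same data needs 5 coefficients for an error bound of 0 and none for a bound of 5. A space-bounded synopsis could then be found by searching over the error bound, as the slides do for every problem in the talk.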