
The Impact of Duality
on Data Representation Problems
Panagiotis Karras
HKU, June 14th, 2007
Introduction
• Many data representation problems require the
optimization of one parameter under a bound on one
or more others.
• Classical approaches treat them in a direct manner,
producing complicated solutions, and sometimes
resorting to heuristics.
• Parameters involved have a monotonic relationship.
• Hence, an alternative approach is possible, based on
dual problems.
Outline
• Histograms
• Restricted Haar Wavelet Synopses
• Unrestricted Haar and Haar+ Synopses
• l-Diversification in 1D
• Compact Hierarchical Histograms
Histograms
• Approximate a data set [d1, d2, …, dn] with B buckets,
si = [bi, ei, vi] so that a maximum-error metric is minimized.
• Classical solution:
  Jagadish et al. VLDB 1998
  Guha et al. VLDB 2004, Guha VLDB 2005
  E[i, b] = min_{1 ≤ j ≤ i} max{ E[j, b − 1], ε(j + 1, i) }
  Complexity: O(nB log² n)
• Recent solutions:
  Buragohain et al. ICDE 2007
  Guha and Shim TKDE 19(7) 2007
  O(n + n log U log(n/B))
  O(n + B² log³ n)  (linear for B ≤ √(n / log³ n); for n = 2^30 = 1,073,741,824, B ≤ 199)
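To make the recurrence concrete, here is a minimal Python sketch of the direct space-bounded DP; names are illustrative, and it omits the log-factor speedups of the cited algorithms, running in O(n²B) time.

    import math

    def optimal_max_abs_error_histogram(data, B):
        """Direct DP for E[i, b] = min_j max(E[j, b-1], eps(j+1, i)):
        E[b][i] is the best max-abs error for covering the first i values with
        b buckets, and eps(j+1, i) = (max - min)/2 over the last bucket.
        O(n^2 B) time; the cited algorithms speed up the inner minimization."""
        n = len(data)
        E = [[math.inf] * (n + 1) for _ in range(B + 1)]
        E[0][0] = 0.0
        for b in range(1, B + 1):
            for i in range(1, n + 1):
                best, lo, hi = math.inf, math.inf, -math.inf
                for j in range(i - 1, b - 2, -1):      # last bucket = data[j .. i-1]
                    lo, hi = min(lo, data[j]), max(hi, data[j])
                    best = min(best, max(E[b - 1][j], (hi - lo) / 2))
                E[b][i] = best
        return E[B][n]

    # optimal_max_abs_error_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4) == 2.0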
Histograms
• Solve the error-bounded problem.
  Maximum absolute error bound ε = 2
  4 5 6 2 15 17 3 6 9 12 …
  [ 4 5 6 2 → 4 ] [ 15 17 → 16 ] [ 3 6 → 4.5 ] [ …
• Generalized to any weighted maximum-error metric.
  Each value d_i defines a tolerance interval [d_i − ε/w_i, d_i + ε/w_i].
  A bucket is closed when the running intersection of the tolerance intervals becomes empty.
  Complexity: O(n)
Histograms
• Apply to the space-bounded problem.
  Perform binary search in the domain of the error bound ε.
  For error values requiring space B̃ ≤ B, with actual error ε* ≤ ε, run an optimality test:
  run the error-bounded algorithm under the constraint error < ε* instead of error ≤ ε*.
  If it requires B̃ > B space, then the optimal solution has been reached.
  Complexity: O(n log ε*), independent of the number of buckets B.
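A sketch of this dual approach, reusing error_bounded_histogram from the previous sketch; as a simplification it binary-searches an explicit candidate set of errors (half-differences of data values) rather than performing the slide's optimality test.

    def space_bounded_histogram(data, B):
        """Dual solution: binary-search the error domain, using the error-bounded
        routine as an oracle. For maximum absolute error the optimal value is a
        half-difference of two data values, so that candidate set is searched here;
        the talk's algorithm avoids materializing it (and adds an optimality test),
        reaching O(n log eps*) time."""
        candidates = sorted({abs(a - b) / 2 for a in data for b in data})
        lo, hi = 0, len(candidates) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if len(error_bounded_histogram(data, candidates[mid])) <= B:
                hi = mid                    # feasible: try a smaller error bound
            else:
                lo = mid + 1                # infeasible: a larger error is needed
        eps_star = candidates[lo]
        return eps_star, error_bounded_histogram(data, eps_star)

    # space_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4)[0] == 2.0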
Restricted Haar Wavelet Synopses
• Select subset of Haar wavelet decomposition coefficients, so
that a maximum-error metric is minimized.
• Classical solution:
Garofalakis and Kumar PODS 2004
Guha VLDB 2005
  E(i, v, b) = min_{0 ≤ b' ≤ b} { max{ E(i_L, v, b'), E(i_R, v, b − b') },
                                  max{ E(i_L, v + z_i, b'), E(i_R, v − z_i, b − b' − 1) } }
  Complexity: O(n²)
  [Figure: example Haar error tree with node coefficients and underlying data values]
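A direct memoized rendering of the E-recurrence above, assuming n is a power of two and the unnormalized Haar error tree; names are illustrative, and it enumerates incoming values and budget splits naively, so it only shows the structure, not the cited algorithms' bookkeeping.

    import math
    from functools import lru_cache

    def haar_error_tree(data):
        """Unnormalized Haar coefficients c[0..n-1]: c[0] is the overall average,
        c[i] (i >= 1) is half the difference of its two children's averages.
        Each data value equals the signed sum of its ancestors' coefficients."""
        n = len(data)
        c = [0.0] * n
        level, pos = list(data), n // 2
        while len(level) > 1:
            half = len(level) // 2
            c[pos:pos + half] = [(level[2*k] - level[2*k+1]) / 2 for k in range(half)]
            level = [(level[2*k] + level[2*k+1]) / 2 for k in range(half)]
            pos //= 2
        c[0] = level[0]
        return c

    def restricted_haar_minmax(data, B):
        """E(i, v, b): minimum achievable max-abs error in the subtree of node i,
        with incoming value v and a budget of b retained coefficients."""
        n = len(data)
        c = haar_error_tree(data)

        @lru_cache(maxsize=None)
        def E(i, v, b):
            if i >= n:                                 # data leaf d[i - n]
                return abs(data[i - n] - v)
            best = math.inf
            for bl in range(b + 1):                    # split the budget b = bl + (b - bl)
                # drop c[i]
                best = min(best, max(E(2*i, v, bl), E(2*i + 1, v, b - bl)))
                # retain c[i] (uses one unit of budget)
                if b - bl - 1 >= 0:
                    best = min(best, max(E(2*i, v + c[i], bl),
                                         E(2*i + 1, v - c[i], b - bl - 1)))
            return best

        # the average coefficient c[0] sits above the root node 1
        return min(E(1, 0.0, B), E(1, c[0], B - 1) if B >= 1 else math.inf)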
Restricted Haar Wavelet Synopses
• Solve the error-bounded problem.
Muthukrishnan FSTTCS 2005
S iL , v   S iR , v ,

S i, v   min 





S
i
,
v

z

S
i
,
v

z

1
i
R
i
 L


  Local search within each of the n / log n subtrees in the bottom log log n Haar tree levels.
  Complexity: O(n² / log n)
• Apply to the space-bounded problem.
  Complexity: O(n² log ε* / log n)
  no significant advantage
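The error-bounded recurrence admits an equally direct sketch, reusing haar_error_tree from the earlier sketch; this is the plain memoized form, not the subquadratic local-search organization of the cited algorithm.

    import math
    from functools import lru_cache

    def restricted_haar_error_bounded(data, eps):
        """S(i, v): minimum number of retained Haar coefficients in the subtree of
        node i so that, with incoming value v, every data value is within eps."""
        n = len(data)
        c = haar_error_tree(data)          # from the earlier sketch

        @lru_cache(maxsize=None)
        def S(i, v):
            if i >= n:                                   # data leaf d[i - n]
                return 0 if abs(data[i - n] - v) <= eps else math.inf
            return min(S(2*i, v) + S(2*i + 1, v),                      # drop c[i]
                       S(2*i, v + c[i]) + S(2*i + 1, v - c[i]) + 1)    # retain c[i]

        return min(S(1, 0.0), S(1, c[0]) + 1)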
Unrestricted Haar and Haar+ Synopses
• Assign arbitrary values to Haar/Haar+ coefficients, so that a
maximum-error metric is minimized.
• Classical solutions:
  Guha and Harb KDD 2005, SODA 2006
  E(i, v, b) = min_{z ∈ S_i^v, 0 ≤ b' ≤ b − |z ≠ 0|} max{ E(i_L, v + z, b'), E(i_R, v − z, b − b' − |z ≠ 0|) }
  Karras and Mamoulis ICDE 2007
  E(i, v, b) = min of
    min_{z_h ∈ S_i^{v,H}, 0 ≤ b' ≤ b − |z_h ≠ 0|} max{ E(i_L, v + z_h, b'), E(i_R, v − z_h, b − b' − |z_h ≠ 0|) },
    min_{z_l ∈ S_i^{v,L}, 0 ≤ b' ≤ b − |z_l ≠ 0|} max{ E(i_L, v + z_l, b'), E(i_R, v, b − b' − |z_l ≠ 0|) },
    min_{z_r ∈ S_i^{v,R}, 0 ≤ b' ≤ b − |z_r ≠ 0|} max{ E(i_L, v, b'), E(i_R, v + z_r, b − b' − |z_r ≠ 0|) }
  Complexity: O(R² n log n log² B) time, O(RB log(n/B)) space
  [Figure: Haar+ tree with triads C1, C2, C3, coefficients c1–c9, over data values d0–d3]
Unrestricted Haar and Haar+ Synopses
• Solve the error-bounded problem.
S i, v   min S iL , v  z   S iR , v  z   z  0
zS
v
i
unrestricted Haar


 S iL , v  z h ,
 
 minv max 
,
z h S i , H
 S iR , v  z h    z h  0  








S
i
,
v

z
,




L
l
S i, v   min  minv max 
,



z l S i , L




S
i
,
v

z

0
R
l











S
i
,
v
,


L
 min max 



v
 z r Si ,R 
 S iR , v  z r    z r  0  


Complexity:

O R 2 n log n

OR log n  n
time
Haar+
space
• Apply to the space-bounded problem.
  Complexity: O(R² n log ε* log n)
  significant time & space advantage
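A brute-force sketch of the unrestricted-Haar recurrence above; the finite candidates set passed in (e.g., 0 plus multiples of a chosen resolution) stands in for the candidate-value sets S_i^v that the cited algorithms bound and prune, so this version is exponential in general and meant only to show the structure.

    import math
    from functools import lru_cache

    def unrestricted_haar_error_bounded(data, eps, candidates):
        """Minimum number of nonzero, arbitrary-valued Haar coefficients achieving
        max-abs error <= eps, with each coefficient restricted to `candidates`
        (which should contain 0 so that a coefficient can be omitted)."""
        n = len(data)

        @lru_cache(maxsize=None)
        def S(i, v):
            if i >= n:                                   # data leaf d[i - n]
                return 0 if abs(data[i - n] - v) <= eps else math.inf
            return min(S(2*i, v + z) + S(2*i + 1, v - z) + (1 if z != 0 else 0)
                       for z in candidates)

        # z0 plays the role of the topmost (average) coefficient
        return min(S(1, z0) + (1 if z0 != 0 else 0) for z0 in candidates)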
l-Diversification in 1D
• Given database table T(A1, A2,…, An), a quasi-identifier attribute
set QT is a subset of attributes which can reveal the personal
identity of records.
• Equivalence class with respect to quasi-identifier attribute set
QT is a set of records indistinguishable in the projection of T on
QT.
• A database table T with quasi-identifier set QT and sensitive
attribute S conforms to the l-diversity property iff each
equivalence class in T with respect to QT has at least l well-represented values of S [Machanavajjhala et al. ICDE 2006].
• Utility metric: Extent of equivalence class (group).
• Other parameter: Outliers, records whose quasi-identifier
values are suppressed.
l-Diversification in 1D
• A two-dimensional example.
[Figure: two Postcode vs. Age plots of the same records; sensitive values: Lead Poisoning, Flu, Parkinson's, Hyperthyroidism]
l-Diversification in 1D
• Study the problem in one dimension (a single quasi-identifier).
• Total order exists.
• Similar to histogram construction.
• Polynomially tractable.
[Figure: records plotted by quasi-identifier (x-axis) and sensitive value (y-axis)]
l-Diversification in 1D
• Groups are consecutive in each sensitive-value domain.
• Groups appear in the same order in each domain.
• Example for l = 3.
[Figure: records r1–r6 grouped across sensitive-value domains D1–D4, plotted by quasi-identifier]
l-Diversification in 1D
• Given interval I of extent E, which includes c items with m different
  sensitive values, the number of possible boundaries/groups in I is:
  B_m^c = O((c/m)^m) for m ≪ c,  O(2^c) for m ≈ c
  C_m^c = O((c/m)^(2m)) for m ≪ c,  O(3^c) for m ≈ c
  [Figure: an interval I of extent E on the quasi-identifier axis, containing items with several sensitive values]
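One way to obtain bounds of this shape, under the assumption that a boundary is fixed by one cut position in each of the m sensitive-value domains (with c_1 + … + c_m = c items in I), and that a group is an ordered pair of such boundaries:

    B_m^c = \prod_{j=1}^{m}(c_j + 1) \le \left(\tfrac{c}{m} + 1\right)^{m} = O\!\left((c/m)^m\right),
    \qquad B_m^c = 2^c \text{ when every } c_j = 1 \ (m \approx c),

    C_m^c = \prod_{j=1}^{m}\binom{c_j + 2}{2} = O\!\left((c/m)^{2m}\right),
    \qquad C_m^c = 3^c \text{ when every } c_j = 1 \ (m \approx c).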
l-Diversification in 1D
• Solve the outlier minimization problem.
N a  

N c   Nc, b 
b a| E b ,a   PM b ,a 
min
c b

  Complexity: O(C_m^c C_m^w n log n) time, O(B_m^c w) space
• Apply to the accuracy maximization problem.
  Complexity: O(C_m^c C_m^w n log E* log n) time
• Apply to the privacy maximization problem.
  Complexity: O(C_m^c C_m^w n log l* log n) time
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions:
  Reiss et al. VLDB 2006
  Complexity: O(nB log² n log B) time, O(B log² n + n) space
  [Figure: CHH tree with internal nodes c0–c6 over data values d0–d3]
The benefit of making node B a bucket (occupied)
node depends on whether node A is a bucket node
– and also on whether node C is a bucket node.
[Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem.
Next-to-bottom level case
  ∀v: S(i, v) ∈ { s_i*, s_i* + 1 }
  S(i, v) = 0,  if [a,b] ∩ [c,d] ≠ ∅ ∧ v ∈ [a,b] ∩ [c,d]
          = 1,  if ([a,b] ∩ [c,d] ≠ ∅ ∧ v ∉ [a,b] ∩ [c,d]) ∨ ([a,b] ∩ [c,d] = ∅ ∧ v ∈ [a,b] ∪ [c,d])
          = 2,  if [a,b] ∩ [c,d] = ∅ ∧ v ∉ [a,b] ∪ [c,d]
  where [a,b] and [c,d] are the tolerance intervals of the data under c_2i and c_2i+1.
  [Figure: the value assignments (z or 0) to c_i, c_2i, c_2i+1 in each case]
Compact Hierarchical Histograms
• Solve the error-bounded problem.
General, recursive case
  v ∈ L0 ⟺ S(i_L, v) = s_iL*
  v ∈ R0 ⟺ S(i_R, v) = s_iR*
  S(i, v) = s_iL* + s_iR*,      if L0 ∩ R0 ≠ ∅ ∧ v ∈ L0 ∩ R0
          = s_iL* + s_iR* + 1,  if (L0 ∩ R0 ≠ ∅ ∧ v ∉ L0 ∩ R0) ∨ (L0 ∩ R0 = ∅ ∧ v ∈ L0 ∪ R0)
          = s_iL* + s_iR* + 2,  if L0 ∩ R0 = ∅ ∧ v ∉ L0 ∪ R0
  Complexity: O( Σ_{l=0}^{log n} 2^l (n/2^l + 1) ) = O(n log n) time, O( Σ_{l=0}^{log n} 2^l ) = O(n) space
• Apply to the space-bounded problem.
  Complexity: O(n log n log ε*)
Polynomially Tractable
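A small helper mirroring the general recursive case above, with L0 and R0 represented as unions of closed intervals; names are illustrative, and the full algorithm maintains these sets bottom-up to reach the stated bounds.

    def chh_cost(v, s_star_L, L0, s_star_R, R0):
        """S(i, v) for an internal CHH node, given each child's minimum cost
        (s*_iL, s*_iR) and the sets L0, R0 of incoming values attaining it,
        each given as a list of closed intervals [(lo, hi), ...]."""
        def intersect(A, B):
            out = []
            for alo, ahi in A:
                for blo, bhi in B:
                    lo, hi = max(alo, blo), min(ahi, bhi)
                    if lo <= hi:
                        out.append((lo, hi))
            return out

        def contains(A, x):
            return any(lo <= x <= hi for lo, hi in A)

        both = intersect(L0, R0)
        if both and contains(both, v):
            return s_star_L + s_star_R          # v attains both children's minima
        if (both and not contains(both, v)) or (not both and (contains(L0, v) or contains(R0, v))):
            return s_star_L + s_star_R + 1      # one more than the children's minima
        return s_star_L + s_star_R + 2          # two more than the children's minima

For a next-to-bottom node, L0 and R0 are the children's tolerance intervals with s* = 0, which recovers the case analysis of the previous slide.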
Conclusions
• Offline data representation problems under constraints are more easily solvable
  through their counterparts optimizing another parameter.
• Dual-problem-based algorithms are simpler, more scalable, more elegant, and more
  memory-parsimonious than the direct ones.
• In the CHH case, the dual-problem-based algorithm
achieves an optimal solution to the maximum-error
longest-prefix-match CHH partitioning problem,
which was considered intractable.
• Future: assessment of privacy and CHH solutions.
Related Work
• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998
• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004
• M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004
• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005
• S. Guha and B. Harb. Wavelet synopses for data streams: minimizing non-Euclidean error. KDD 2005
• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005
• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006
• F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006
• A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006
• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007
Thank you! Questions?