Subadditivity of disclosure rules

UN/ECE Work Session On Statistical Data
Confidentiality (Geneva, 9-11 November 2005)
WP30: Safety rules in statistical
disclosure control for tabular data
Giovanni Merola
Winton Capital Management Ltd
[email protected]
Partially written while at ISTAT and partially supported by
EU project CASC.
G. Merola Winton
Capital Management
1
Plan of the Talk
1. SDC for Magnitude tables;
2. Existing safety rules;
3. Generalised p-rule;
4. Rational estimates;
5. Prior distribution;
6. U-estimates;
7. Comparison on real SBS data;
8. MU-rules;
9. Concluding remarks.
1. SDC for Magnitude Tables
Tables showing the sums of non-negative contributions
in each cell. Example:
Income (£K)   Young    Old   All Ages
Male            200    600        800
Female          150    450        600
All Sexes       350   1050       1400
Contributions to the Old Males cell (total 600), in non-increasing order:
150  130  120  90  50  40  20
z1 ≥ z2 ≥ z3 ≥ z4 ≥ ··· ≥ zn
The total T is published; n is the number of contributions.
1. SDC for Magnitude Tables cont.d
SDC policy:
1. If the categories are confidential, (likely) identification of respondents is disclosure;
2. else only the contributions of (likely) identifiable respondents cannot be disclosed (too precisely);
3. same rule for all cells, else microdata protection.
2. Existing Safety Rules

• Rare respondents are identifiable
  – threshold rule: n > m.
• Respondents with large contributions are identifiable
  – Dominance: (z1+···+zm)/T ≤ k.
• Largest contributor is identifiable, hence second largest must not estimate z1 closely
  – p-rule: [(T−z2) − z1]/z1 > p.
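The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes the dominance condition reads (z1+···+zm)/T ≤ k, and the parameter values m=3, k=0.5 and p=0.1 are chosen only for the example. The cell is the Old Males cell from the earlier slide.

```python
def threshold_safe(z, m):
    """Threshold rule: safe when the number of contributions n exceeds m."""
    return len(z) > m


def dominance_safe(z, m, k):
    """Dominance rule (assumed form): safe when the m largest
    contributions make up at most a share k of the total T."""
    z = sorted(z, reverse=True)
    return sum(z[:m]) / sum(z) <= k


def p_rule_safe(z, p):
    """p-rule: safe when T - z2, the second-largest contributor's estimate
    of z1, exceeds z1 by more than a fraction p. Needs at least two values."""
    z = sorted(z, reverse=True)
    z1, z2 = z[0], z[1]
    T = sum(z)
    return ((T - z2) - z1) / z1 > p


old_males = [150, 130, 120, 90, 50, 40, 20]  # example cell, T = 600

# Illustrative parameter choices (assumptions, not from the slides).
print(threshold_safe(old_males, m=3))
print(dominance_safe(old_males, m=2, k=0.5))
print(p_rule_safe(old_males, p=0.1))
```

With these illustrative parameters the cell passes all three checks; a cell such as (100, 95, 5) would fail the p-rule at p=0.1 because T − z2 = 105 is within 10% of z1.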
3. Generalised p-rule
Includes the existence of groups of respondents.
[Diagram: the ordered contributions z1, z2, z3, z4, ···, zn with total T; the groups t2 and R2,2 are marked.]
• Group with largest sum identifiable;
• group with second largest sum must not estimate largest sum too closely.
3. Generalised p-rule cont.d
Same estimate as p-rule: maximum possible value
$\hat{t}_m = T - R_{m,l}$
• Gen. p-rule: $((T - R_{m,l}) - t_m)/t_m > p$
• With $t_1 = z_1$ and $R_{1,1} = z_2$ this is the p-rule.
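A possible sketch of the generalised p-rule follows. The slides do not define the groups explicitly at this point, so the code assumes that tm is the sum of the m largest contributions and that Rm,l is the sum of the next l largest (which agrees with t1 = z1 and R1,1 = z2). The function name and the value p=0.1 are illustrative.

```python
def gen_p_rule_safe(z, m, l, p):
    """Generalised p-rule under the assumed group definitions:
    t_m  = sum of the m largest contributions (target group),
    R_ml = sum of the next l largest contributions (intruding group).
    Safe when the intruders' estimate T - R_ml exceeds t_m by more than p."""
    z = sorted(z, reverse=True)
    T = sum(z)
    t_m = sum(z[:m])
    R_ml = sum(z[m:m + l])
    t_hat = T - R_ml  # maximum possible value of t_m given R_ml
    return (t_hat - t_m) / t_m > p


z = [150, 130, 120, 90, 50, 40, 20]  # Old Males cell, T = 600

# m = l = 2: t_2 = 280, R_22 = 210, estimate 390, relative error about 0.39
print(gen_p_rule_safe(z, m=2, l=2, p=0.1))

# m = l = 1 reduces to the simple p-rule: (470 - 150) / 150
print(gen_p_rule_safe(z, m=1, l=1, p=0.1))
```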
3. Generalised p-rule cont.d

• If zero contributions are known (external intruder): Dominance rule with k=1/(1+p);
• If no groups: simple p-rule;
• If intruding group formed of (m-1) respondents: threshold rule n>m protects against exact estimation (p=0).
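The first case can be checked with a short derivation. It assumes that "zero contributions known" means the intruding group's sum is 0, so the estimate is T itself; the equivalence holds up to the boundary case of equality.

```latex
% Generalised p-rule with R_{m,l} = 0 (external intruder, assumed reading)
\frac{T - t_m}{t_m} > p
\iff T > (1 + p)\, t_m
\iff \frac{t_m}{T} < \frac{1}{1 + p} = k
```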
Merola, G. M., 2003a. Generalized risk measures for tabular data. Proceedings of
the 54th Session of the International Statistical Institute.
4. Rational Estimates

• An intruder can compute a lower and an upper bound for the value of tm:
$t_m^- \le t_m \le t_m^+$;
• For example, if z2=40 and T=100: 40 = z2 ≤ z1 ≤ T − z2 = 60;
• the bounds are different for different prior knowledge of the intruder.
4. Rational Estimates cont.d
• tm can be estimated by minimising the Mean Square Error for some distribution F(tm):
$\min_{\hat{t}_m} \int (t_m - \hat{t}_m)^2 \, dF(t_m)$;
• by a well-known property, the MSE is minimised by the mean:
$\hat{t}_m = E(t_m)$
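The property invoked here is the usual decomposition of the mean square error; a short worked version, writing μ = E(tm) for brevity (μ is notation introduced only for this note):

```latex
E\big[(t_m - \hat{t}_m)^2\big]
  = E\big[(t_m - \mu)^2\big] + (\mu - \hat{t}_m)^2,
  \qquad \mu = E(t_m)
% The cross term 2(\mu - \hat{t}_m)\,E[t_m - \mu] vanishes.
% The first term does not depend on \hat{t}_m, so the minimum is at \hat{t}_m = \mu.
```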
5. Prior Distribution: Uniform

• The ignorance about the distribution of tm can be modelled with a Uniform distribution:
$t_m \sim U(t_m^-, t_m^+)$
• in this case the mean is simply:
$\hat{t}_m = \frac{t_m^- + t_m^+}{2}$.
• Note: same estimate for any symmetric F.
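As a small worked illustration of the uniform estimate, the sketch below takes the earlier example (z2=40, T=100) and computes the bounds seen by the second-largest contributor and their midpoint. The function names are hypothetical.

```python
def z1_bounds(z2, T):
    """Bounds on z1 for an intruder who is the second-largest contributor:
    z2 <= z1 <= T - z2."""
    return z2, T - z2


def uniform_estimate(lower, upper):
    """Mean of a Uniform(lower, upper) prior, i.e. the midpoint."""
    return (lower + upper) / 2


lower, upper = z1_bounds(z2=40, T=100)
print(lower, upper)                     # 40 60
print(uniform_estimate(lower, upper))   # 50.0
```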
5. Prior Distribution: maximising
The Generalised p-rule can be derived by assuming a prior concentrated on the maximum value:
$F(t_m) = \begin{cases} 1 & t_m = t_m^+ \\ 0 & \text{else} \end{cases}$
• We refer to the Gen. p-rule as the M-rule, and to the rule derived using the Uniform prior as the U-rule.
6. U-estimates
Different prior knowledge of the intruder:
• knows T but not n: $\hat{t}_m = T/2$ (Dominance);
• knows T and n: $\hat{t}_m = (m+n)T/(2n)$;
• knows T and L contributions: $\hat{t}_m = (T - R_{L,m} + m z_{(m+1)})/2$ (Gen. p-rule*);
• knows T, L contributions and n: either as above or $\hat{t}_m = T - R_{L,m} - (n-m-L)\, z_{(m+L)}/2$.
* For m=L=1 the uniform p-rule is the same as uniform dominance.
Merola, G., 2003b. Safety rules in statistical disclosure control for tabular data. Contributi Istat 1, Istituto Nazionale di Statistica, Roma.
6. U-estimates cont.d
Example
C=(970,376,274,253,203,169,161,121,86,62,21,10),
T=2706
Rule      Estimated z1   RelErr
Dom               2706   1.8 (t2/T=0.5)
p-rule            2330   1.4
U-Dom             1353   0.4
U (1:n)           1465   0.51
U(1;1)            1353   0.4
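The figures in this example can be reproduced with a short script. The mapping of rows to formulas is a reading of the slides, not something they state outright: Dom is taken as the estimate T, p-rule as T − z2, U-Dom as T/2, U (1:n) as (m+n)T/(2n) with m=1, and U(1;1) as (T − R + m·z2)/2 with the intruder's sum R assumed equal to z2.

```python
C = [970, 376, 274, 253, 203, 169, 161, 121, 86, 62, 21, 10]
T, n = sum(C), len(C)
z1, z2 = C[0], C[1]
m = 1  # target group size: the single largest contributor

# Estimates of z1 under each rule (row-to-formula mapping is an assumption).
estimates = {
    "Dom": T,                            # maximum value, only T known
    "p-rule": T - z2,                    # maximum value, z2 known
    "U-Dom": T / 2,                      # uniform prior, only T known
    "U (1:n)": (m + n) * T / (2 * n),    # uniform prior, T and n known
    "U(1;1)": (T - z2 + m * z2) / 2,     # uniform prior, T and z2 known
}

for rule, estimate in estimates.items():
    rel_err = abs(estimate - z1) / z1
    print(f"{rule:8s} {estimate:9.2f} {rel_err:6.2f}")
```

Up to rounding, the printed estimates and relative errors agree with the table, which supports reading the garbled operators in the U-estimate formulas as shown in the code.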
7. Comparison on real SBS data
We applied different rules to Italian SBS data: turnover by Region and SIC for the years 1994 and 1997. We considered SIC codes with 2 and 3 digits.
7. Comparison on real SBS data cont.d
Mean relative error for z1
7. Comparison on real SBS data cont.d
Mean relative error for t2
8. U-rules
The values of tm for which $|t_m - \hat{t}_m| / t_m \le p$ are intervals:
• Knowing only T (Dominance):
$\frac{1}{2(1+p)} \le \frac{t_m}{T} \le \frac{1}{2(1-p)}$
• Knowing T and L contributions (gen p-rule):
$\frac{1}{2(1+p)} \le \frac{t_m}{T - R_{L,m} + m z_{(m+1)}} \le \frac{1}{2(1-p)}$
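A minimal sketch of the first case (only T known) is given below. It assumes the interval describes the unsafe values, that is, those where the uniform estimate T/2 falls within a fraction p of tm, and it requires 0 ≤ p < 1. The value p=0.2 and the test value 1500 are purely illustrative.

```python
def u_unsafe_interval(p):
    """Range of t_m / T for which |t_m - T/2| / t_m <= p (assumed unsafe)."""
    return 1 / (2 * (1 + p)), 1 / (2 * (1 - p))


def u_rule_safe(t_m, T, p):
    """U-rule with only T known: safe when t_m / T lies outside the interval."""
    lower, upper = u_unsafe_interval(p)
    return not (lower <= t_m / T <= upper)


p = 0.2  # illustrative protection level
print(u_unsafe_interval(p))         # roughly (0.417, 0.625)
print(u_rule_safe(970, 2706, p))    # 970/2706 is about 0.36, below the interval
print(u_rule_safe(1500, 2706, p))   # 1500/2706 is about 0.55, inside the interval
```

Note that a very dominant contribution (a ratio above the upper end of the interval) also counts as safe under this rule alone, because T/2 then underestimates it by more than p.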
9. MU-rules

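• Assuming both estimating approaches we obtain subadditive rules, analogous to the p-rule but with stricter bounds.
[Diagram: safe and unsafe ranges of tm under each rule, on an axis running from $t_m^-$ to $t_m^+$ with $\hat{t}_m = (t_m^+ + t_m^-)/2$. M-rule: safe and unsafe ranges separated at $(T-R)/(1+p)$. U-rule: unsafe between $\hat{t}_m/(1+p)$ and $\hat{t}_m/(1-p)$, safe on either side. MU-rule: safe and unsafe ranges, with $\hat{t}_m/(1+p)$ and $\hat{t}_m/(1-p)$ marked.]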
9. MU-rules cont.d

• Safety rule when only T known (Dominance):
$\frac{t_m}{T} < \frac{1}{2(1+p)}$
• Safety rule when T and L contributions known (gen p-rule):
$\frac{t_m}{T - R_{L,m} + m z_{(m+1)}} < \frac{1}{2(1+p)}$
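The difference between the U-rule and the MU-rule can be illustrated on a toy case with only T known and m=1. The sketch assumes the MU-rule declares a cell safe only when tm/T is below 1/(2(1+p)), and it uses one consequence of subadditivity: merging two safe cells should not produce an unsafe cell. The cells and p=0.2 are invented for illustration.

```python
def u_rule_safe(t_m, T, p):
    """U-rule, only T known: safe outside [1/(2(1+p)), 1/(2(1-p))]."""
    ratio = t_m / T
    return ratio < 1 / (2 * (1 + p)) or ratio > 1 / (2 * (1 - p))


def mu_rule_safe(t_m, T, p):
    """MU-rule, only T known (assumed form): safe only below 1/(2(1+p))."""
    return t_m / T < 1 / (2 * (1 + p))


p = 0.2
cell_a, cell_b = [100], [100]   # two hypothetical single-contributor cells
merged = cell_a + cell_b        # their union: contributions (100, 100)

for name, cell in [("A", cell_a), ("B", cell_b), ("A+B", merged)]:
    t1, T = max(cell), sum(cell)
    print(name, u_rule_safe(t1, T, p), mu_rule_safe(t1, T, p))
```

Under the U-rule both single-contributor cells count as safe (T/2 misses the value by 50%), yet their union is unsafe, which is the kind of behaviour a subadditive rule excludes. Under the assumed MU-rule the single-contributor cells are already unsafe, so the inconsistency does not arise in this example.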
10. Conclusions






• The assumptions for the existing rules are unrealistic;
• using a simple noninformative distribution gives a much smaller relative error of estimation;
• the corresponding rules are not subadditive;
• joining assumptions leads to stricter rules;
• identifiability of all largest respondents requires these rules;
• different priors can be used.