AN IMPROVED MEASURE OF NETWORK ANONYMITY USING PROFILE FUNCTIONS

A Thesis by Nan Jiang
MS, Xi'an University of Sci. and Tech., China, 2008

Submitted to the Department of Electrical Engineering and Computer Science and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Master of Science

July 2011

© Copyright 2011 by Nan Jiang. All Rights Reserved.

The following faculty members have examined the final copy of this thesis for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Master of Science with a major in Computer Science.

Rajiv Bagai, Committee Chair
Bin Tang, Committee Member
Tianshi Lu, Committee Member

DEDICATION

To my parents and my family, who have supported me throughout my life. Without their help, it would not have been possible to finish this thesis.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Rajiv Bagai, who has made it possible for me to complete this thesis. His support, knowledge, and patience have guided me from the very beginning to the end. I would also like to thank Dr. Bin Tang and Dr. Tianshi Lu for their kind help and for serving as my committee members. I would like to thank Dr. Buma Fridman of the Department of Mathematics and Statistics at Wichita State University for helpful discussions on profile intersections. Finally, I would like to thank my friend Dylan Holmes for his patient help with my English.

ABSTRACT

We present a graphical framework containing certain infinite profiles of probability distributions that result from an attack on an anonymity system. We represent currently popular anonymity metrics within our framework to show that existing metrics base their decisions on just some small piece of the information contained in a distribution.
This explains the counterintuitive, and thus unsatisfactory, anonymity evaluation performed by each of these metrics on carefully constructed examples in the literature. We then propose a new anonymity metric that takes entire profiles into consideration in arriving at the degree of anonymity associated with a probability distribution. The comprehensive approach of our metric results in correct measurement. A detailed comparison of our new metric, especially with the popular metrics based on Shannon entropy, gives the rationale for, and the degree of, disagreement between these approaches.

TABLE OF CONTENTS

Chapter                                                           Page

1 INTRODUCTION . . . 1
  1.1 Thesis Contributions . . . 1
  1.2 Thesis Organization . . . 2
2 PRELIMINARIES AND NOTATIONS . . . 4
  2.1 Multisets . . . 4
  2.2 Probability Distributions . . . 4
  2.3 Base-Profiles . . . 6
3 A COMMON GRAPHICAL FRAMEWORK FOR EXISTING ANONYMITY METRICS . . . 10
  3.1 Anonymity Set Size Metrics . . . 10
  3.2 Metrics Based on Shannon Entropy . . . 10
  3.3 Maximal Probability Metric . . . 12
    3.3.1 Norm-Profiles . . . 13
  3.4 Metrics Based on Rényi Entropies . . . 16
  3.5 Euclidean Distance Metric . . . 17
4 AN ANALYSIS OF EXISTING METRICS . . . 19
  4.1 Base-Profile Contours . . . 21
  4.2 Norm-Profile Contours . . . 25
5 A NEW, GLOBAL ANONYMITY METRIC . . . 30
  5.1 Raw Metric . . . 30
  5.2 Normalized Metric . . . 32
  5.3 Differences from Metrics Based on Shannon Entropy . . . 33
6 CONCLUSIONS . . . 37
REFERENCES . . . 39

LIST OF FIGURES

Figure                                                            Page

2.1 Base-profiles of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n . . . 9
3.1 A graphical representation of the entropy-based metric of Serjantov and Danezis [6] as −B′_D(1) . . . 11
3.2 B_n̄(x) curves, for n = 1, 2, and 3 . . . 13
3.3 Base- and norm-profiles for n̂ and an arbitrary D ∈ ∆n . . . 14
3.4 Range of all Rényi entropies for a distribution D ∈ ∆n . . . 17
3.5 Square of the Euclidean distance metric, shown as an asterisk (∗) . . . 18
4.1 Two distributions with the same normalized Shannon entropy, but significantly different maximal probabilities . . . 21
4.2 Base-profiles intersecting at x > 1: (a) one such intersection, (b) two such intersections . . . 24
4.3 [[∆2]] and loci N_x^α for an arbitrary value of α, and some example values of x . . . 27
4.4 [[∆3]] and its intersections with some loci . . . 28
5.1 Metrics S(D) and R(D) on [[∆2]] . . . 32
5.2 Areas considered for a system's normalized degree of anonymity . . . 33
5.3 Initial base-profile slopes and alternating regions of dominance of distributions D and E . . . 35

CHAPTER 1
INTRODUCTION

A fundamental problem in the area of anonymity systems is to measure the amount of sender anonymity that remains for a particular message sent via a system in the aftermath of an attack. It has become customary to consider attacks that result in probabilities, for each system user, of being the actual sender of that message. Over the years, several metrics have been proposed, ranging from simple ones based on the size of the underlying anonymity set to information-theoretic ones based on entropy. Some surveys of metrics can be found in Diaz [1] and Kelly et al. [2]. While each metric has usually arrived with a more convincing rationale than the ones preceding it, truth be told, none of them has yet been completely satisfactory. For each of these metrics, irksome instances of probability distributions still exist on which the metric's result does not conform to our intuition of the system's level of anonymity.

1.1 Thesis Contributions

The contributions of this thesis are twofold. We first construct a graphical framework suitable for placement of all popular anonymity metrics to date. By placing these metrics in our framework, we illustrate that, in their attempt to measure anonymity, the existing metrics base their decisions on just some finite aspects of a given probability distribution, whose information content is in fact potentially infinite. This provides an understanding, in fact a graphic visualization, of why none of the existing metrics works correctly for all probability distributions.
We then propose a new anonymity metric that arrives at its values after taking into account the entire information content of a probability distribution. The comprehensive approach of our metric results in correct measurement. The results of this thesis have been submitted for publication as Bagai and Jiang [3].

The specific existing metrics analyzed in this thesis are the following:

(a) the anonymity set size metric of Chaum [4];
(b) the reduced anonymity set size metric of Kesdogan, Egner and Büschkes [5];
(c) the Shannon entropy based metric of Serjantov and Danezis [6];
(d) the normalized Shannon entropy based metric of Diaz et al. [7];
(e) the maximal probability metric of Tóth, Hornák and Vajda [8];
(f) the Rényi entropies based metric family of Clauß and Schiffner [9]; and
(g) the Euclidean distance metric of Andersson and Lundin [10].

The metrics based on Shannon entropy (items c and d in the above list) have been the most popular for a number of years. Much of our analysis, and of the comparison with our own metric, is therefore devoted to them.

At the core of our graphical framework are two related functions, over the infinite domain of nonnegative real values, that are induced by a given probability distribution. We call these functions the base-profile and the norm-profile of the distribution. Our analysis shows that all of the above metrics essentially consider some local properties of these profiles, and do not take their entire contours into account. A global consideration of these contours, especially of intersections between them, is, as we reveal, essential for a metric to always produce the correct anonymity measure. The new metric we then propose does exactly that and, in that respect, it is an improved metric.

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 contains mathematical preliminaries and notations used in this thesis. This chapter also introduces base-profiles, one of the two important structures in our graphical framework.
Chapter 3 presents an overview of the existing anonymity metrics and places each of them within our common framework. Norm-profiles, the other important structure in our framework, are introduced in this chapter at an appropriate point. Chapter 4 exposes the inadequacy of the local observations underlying existing metrics by studying the phenomenon of profile intersections, which such observations inevitably ignore. Our new metric, based on global observations of profile contours, is then presented in Chapter 5. An analysis of the similarities and differences between our proposed metric and the existing metrics based on Shannon entropy occupies much of this chapter. Finally, Chapter 6 concludes our results and mentions some directions for future work.

CHAPTER 2
PRELIMINARIES AND NOTATIONS

In this chapter, we give the mathematical preliminaries and notations used in this thesis. Probability distributions are often defined as sequences of nonnegative real values that add up to 1. From the point of view of anonymity metrics, however, the order of the values in such a sequence is not important; only the actual values, along with the number of occurrences of each value, are relevant. We therefore immediately depart from standard practice and define probability distributions as multisets, as that view retains just the amount of information needed for our purpose.

2.1 Multisets

Let a multiset M over any set S be a function M : S → N, where N is the set {0, 1, 2, . . .} of natural numbers. For any σ ∈ S, M(σ) is the frequency of σ in the multiset M. The cardinality of M, denoted |M|, is the sum Σ{M(σ) | σ ∈ S} of the frequencies in M of all members of S. We employ double brackets [[. . .]] as the notation for multisets, in which each member of the underlying set occurs (in any order) exactly as many times as its frequency in the multiset.
For example, if the underlying set is S = {a, b, c}, then [[a, a, a, c]] denotes the multiset of cardinality 4 in which the frequencies of a, b, and c are 3, 0, and 1, respectively. The same multiset is also denoted by [[a, a, c, a]], [[a, c, a, a]], or [[c, a, a, a]].

2.2 Probability Distributions

A probability distribution (or just distribution) on any set U is a multiset D over the closed interval [0, 1] of real numbers, such that:

|U| = |D|, and Σ{σ · D(σ) | σ ∈ [0, 1]} = 1.

In the context of anonymity systems, U is typically a set of users, any of whom may have sent a certain message via an anonymity system, and D is a distribution on U, arrived at by an attacker, containing the probability of each of those users being the originator of that message.

Example 2.2.1 Suppose U contains 6 users. Then, the distribution [[0.5, 0.2, 0.2, 0.1, 0, 0]] assigns a 50% probability to one of those users of being the message originator, 20% each to two other users, 10% to another user, and zero probability to the two remaining users.

We limit ourselves to systems with a finite number of users, thus to distributions with finite cardinalities. For any natural number n ≥ 1, let ∆n be the set of all distributions of cardinality n. It is easily seen that ∆1 has just one distribution, namely [[1]], and that for all n > 1, ∆n is uncountably infinite. Two special distributions in ∆n, namely n̂ and n̄, are defined as follows:

n̂(σ) = n − 1 if σ = 0; 1 if σ = 1; 0 otherwise.
n̄(σ) = n if σ = 1/n; 0 otherwise.

In other words, n̂ is the peak distribution [[1, 0, 0, . . . , 0]], which contains a single 1 and exactly (n − 1) occurrences of 0. This corresponds to the attacker having determined, with full certainty, the actual sender of the underlying message out of n potential senders. On the other hand, n̄ is the even distribution [[1/n, 1/n, . . . , 1/n]], in which the value 1/n occurs n times.
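Viewed this way, distributions are easy to experiment with programmatically. The sketch below is our own illustration, not part of the thesis; all helper names are hypothetical. It uses Python's collections.Counter as the multiset type:

```python
from collections import Counter

def is_distribution(D, n):
    """A multiset over [0, 1] is a distribution of cardinality n when its
    frequencies sum to n and its values, weighted by frequency, sum to 1."""
    return (sum(D.values()) == n
            and abs(sum(s * c for s, c in D.items()) - 1.0) < 1e-9)

def peak(n):
    """The peak distribution n-hat: a single 1 and exactly n - 1 zeros."""
    return Counter({1.0: 1, 0.0: n - 1})

def even(n):
    """The even distribution n-bar: the value 1/n occurring n times."""
    return Counter({1.0 / n: n})

# The distribution of Example 2.2.1; the order of listing is irrelevant.
D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])
```

Because a Counter records only values and frequencies, the listings [[a, a, c, a]] and [[c, a, a, a]] of the same multiset compare equal, exactly as the notation intends.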
This distribution corresponds to the attacker having made no headway in determining the actual sender of the message. In general, an attack results in a distribution D ∈ ∆n that lies somewhere between the two extreme distributions, n̄ (no information gained by the attack) and n̂ (full information gained by the attack). All anonymity metrics considered in this thesis attempt to measure the amount of anonymity left in the system in the aftermath of such an attack.

2.3 Base-Profiles

A central notion in our graphical framework is that of the base-profile of any distribution D ∈ ∆n, which is a function B_D : R → R, where R is the set of all real values. It is defined as:

B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]}.

Example 2.3.1 Let D ∈ ∆6 be the distribution [[0.5, 0.2, 0.2, 0.1, 0, 0]] of Example 2.2.1. Then, the base-profile function of D is:

B_D(x) = 0.5^x · 1 + 0.2^x · 2 + 0.1^x · 1 + 0^x · 2.

Throughout this thesis, we interpret 0^0 as 1. We will be particularly interested in the values of the base-profile function B_D(x) for nonnegative values of x and, among those, mainly for x ≥ 1. An interesting observation is that, in general, this function may be discontinuous at x = 0, as implied by the following proposition, which follows immediately from the definitions so far:

Proposition 2.3.1 For any distribution D ∈ ∆n:
(a) B_D(0) = |D| = n.
(b) lim_{x→0} B_D(x) is the number of nonzero values in D, i.e., Σ{D(σ) | σ ≠ 0}.
(c) B_D(1) = 1, i.e., all base-profiles intersect at x = 1, with value 1.
(d) lim_{x→∞} B_D(x) = 1 if D = n̂, and 0 otherwise.
(e) If D ≠ n̂ and D ≠ n̄, then for all x > 1, B_n̄(x) < B_D(x) < B_n̂(x), i.e., all base-profiles lie between the extreme ones B_n̄(x) and B_n̂(x).

Proof
(a) B_D(0) = Σ{σ^0 · D(σ) | σ ∈ [0, 1]} = Σ{1 · D(σ) | σ ∈ [0, 1]} = |D|, and |D| = n since D ∈ ∆n. Notice that here we make use of our convention 0^0 = 1.
(b) lim_{x→0} B_D(x) = lim_{x→0} Σ{σ^x · D(σ) | σ ∈ [0, 1]} = Σ{D(σ) · lim_{x→0} σ^x | σ ∈ [0, 1]}.
We now split the sum into two parts, for σ = 0 and σ ≠ 0, yielding D(0) · 0 + Σ{D(σ) · σ^0 | σ ≠ 0} = Σ{D(σ) | σ ≠ 0}.
(c) B_D(1) = Σ{σ^1 · D(σ)}, which, by the definition of a distribution, equals 1.
(d) First, suppose D = n̂. Then, by definition, B_n̂(x) = Σ{σ^x · n̂(σ) | σ ∈ [0, 1]}. But n̂(1) = 1, n̂(0) = n − 1, and n̂(σ) = 0 otherwise, so this sum reduces to B_n̂(x) = 1. Evidently, lim_{x→∞} B_n̂(x) = lim_{x→∞} 1 = 1.
Next, suppose D ≠ n̂. Then D(σ) is nonzero only for σ < 1. So, B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]} = Σ{σ^x · D(σ) | σ < 1}, and lim_{x→∞} B_D(x) = lim_{x→∞} Σ{σ^x · D(σ) | σ < 1} = Σ{D(σ) · lim_{x→∞} σ^x | σ < 1}. Since lim_{x→∞} σ^x = 0 for every 0 ≤ σ < 1, the sum vanishes.
(e) Fix x > 1. First, we show that B_D(x) < B_n̂(x). Indeed, B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]} < Σ{σ · D(σ) | σ ∈ [0, 1]} = 1. But as proved in (d), B_n̂(x) = 1; hence, B_D(x) < 1 = B_n̂(x).
Second, we show that B_n̄(x) < B_D(x). The proof uses the method of Lagrange multipliers to show that n̄ minimizes B_D(x) over all distributions D ∈ ∆n. Fix a distribution D = [[p_1, p_2, · · · , p_n]] in ∆n, where the p_i are not necessarily distinct. Define f(D) = p_1^x + · · · + p_n^x, and let g(D) = p_1 + p_2 + · · · + p_n = 1 be the normalization constraint. The Lagrange auxiliary function is L(D, λ) = f(D) + λ · (g(D) − 1), in which p_1, p_2, · · · , p_n and λ are the arguments. Taking the partial derivative of L with respect to each argument and setting it equal to 0, we obtain:

∂L/∂p_1 = x · p_1^{x−1} + λ = 0
∂L/∂p_2 = x · p_2^{x−1} + λ = 0
· · ·
∂L/∂p_n = x · p_n^{x−1} + λ = 0
∂L/∂λ = p_1 + p_2 + · · · + p_n − 1 = 0

This system consists of n + 1 equations.
From the first n equations, we know that, for 1 ≤ i ≤ n,

p_i = (−λ/x)^{1/(x−1)}.  (2.1)

Substituting (2.1) into the last equation of the system gives:

n · (−λ/x)^{1/(x−1)} = 1
(−λ/x)^{1/(x−1)} = 1/n
−λ/x = 1/n^{x−1}
λ = −x/n^{x−1}.

Using (2.1) once more, we can now solve the system:

p_1 = p_2 = · · · = p_n = 1/n, with λ = −x/n^{x−1}.

Hence, D = n̄ minimizes f(D) subject to g(D) = 1.

In the course of proving Proposition 2.3.1(e), we have also proved the following proposition:

Proposition 2.3.2 For all n ≥ 1 and x > 0, B_n̂(x) = 1.

Figure 2.1 shows the base-profile functions of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n. From Proposition 2.3.1, we know that B_D(x) lies between the extreme base-profiles B_n̂(x) and B_n̄(x), and intersects with either one of them at any x > 1 if and only if the profiles are identical, i.e., D is that extreme distribution (n̂ or n̄).

Figure 2.1: Base-profiles of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n.

CHAPTER 3
A COMMON GRAPHICAL FRAMEWORK FOR EXISTING ANONYMITY METRICS

In this chapter, we show that well-known techniques for measuring a system's degree of anonymity, given a distribution D ∈ ∆n resulting from an attack, in essence just look at different local aspects of the base-profile B_D of that distribution.

3.1 Anonymity Set Size Metrics

One of the first anonymity metrics, proposed by Chaum [4], was the anonymity set size, i.e., the size of the set of users that could have sent a particular message. As stated in Proposition 2.3.1, in our framework this value is B_D(0), for any given distribution D. Proposition 2.3.1 also shows that this value is in fact independent of D. Kesdogan, Egner and Büschkes [5] improved upon this metric by defining the anonymity set as containing just those users whose probability in D is nonzero. The size of this set is then used as the measure of anonymity.
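Both set-size metrics are single readings of the base-profile, which is straightforward to evaluate numerically. The following is a minimal sketch of ours, not the thesis's code; the helper names are hypothetical, and Python's 0.0 ** 0 conveniently evaluates to 1.0, matching the convention 0^0 = 1:

```python
from collections import Counter

def base_profile(D, x):
    """B_D(x) = sum over the values sigma in D of sigma**x * D(sigma)."""
    return sum((sigma ** x) * freq for sigma, freq in D.items())

D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])   # Example 2.2.1, n = 6

chaum    = base_profile(D, 0)      # anonymity set size: B_D(0) = n = 6
kesdogan = base_profile(D, 1e-9)   # ~ lim_{x -> 0+} B_D(x) = 4 nonzero values
```

Approximating the one-sided limit by evaluating at a tiny positive x works here because every nonzero σ satisfies σ^x → 1 as x → 0+, while 0^x stays 0 for all x > 0.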
Again, as seen in Proposition 2.3.1, within our framework this value is:

lim_{x→0} B_D(x).

Both of the above metrics ignore the potentially different probabilities contained in D for different members of the anonymity set. The metrics considered in the following sections are well known to provide a more accurate measure of anonymity by taking those probabilities into account.

3.2 Metrics Based on Shannon Entropy

Anonymity metrics based on the Shannon entropy [12] of the discrete random variable with probability distribution D have been popular for a number of years. In their well-known work, Serjantov and Danezis [6] proposed adopting this entropy value, i.e.,

S(D) = −Σ{σ · log(σ) · D(σ) | σ ∈ [0, 1]},

as a measure of anonymity. It is worth mentioning that the base of the logarithm in the above expression is not important, as its choice corresponds only to the choice of a unit of measurement. Also, 0 · log(0) is consistently interpreted as 0. In our framework of base-profiles, the above is simply the negation of the slope of the curve B_D(x) at the value x = 1, as shown below. Since, for any constant σ, (d/dx) σ^x = σ^x · log(σ), the derivative of the base-profile of D is:

B′_D(x) = (d/dx) B_D(x) = Σ{σ^x · log(σ) · D(σ) | σ ∈ [0, 1]}.

By substituting x = 1 in the above, we get:

S(D) = −B′_D(1).

For any distribution D ∈ ∆n, the minimum value of S(D) is −B′_n̂(1) = 0, while its maximum value is −B′_n̄(1), which is easily seen to be log(n). Figure 3.1 graphically depicts this anonymity metric of Serjantov and Danezis [6].

Figure 3.1: A graphical representation of the entropy-based metric of Serjantov and Danezis [6] as −B′_D(1).

A shortcoming of this metric is that it is based only on the nonzero values contained in a distribution. For example, let D1 = [[1/2, 1/2]] and D2 = [[1/2, 1/2, 0]]. Then, S(D1) = S(D2) = log(2). A completely toothless attack that makes no inroads at breaking anonymity will result in D1.
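The identity S(D) = −B′_D(1), and the tie S(D1) = S(D2) just noted, can be checked numerically. The sketch below is our own illustration (base_profile, shannon, slope_at_one, and the finite-difference step h are all our choices, not the thesis's):

```python
import math
from collections import Counter

def base_profile(D, x):
    """B_D(x) = sum of sigma**x * D(sigma) over the values sigma in D."""
    return sum((s ** x) * c for s, c in D.items())

def shannon(D):
    """S(D) = -sum of sigma * log(sigma) * D(sigma), with 0 * log(0) = 0."""
    return -sum(s * math.log(s) * c for s, c in D.items() if s > 0)

def slope_at_one(D, h=1e-6):
    """Central-difference estimate of the base-profile slope B'_D(1)."""
    return (base_profile(D, 1 + h) - base_profile(D, 1 - h)) / (2 * h)

D1 = Counter([0.5, 0.5])        # two users, nothing learned by the attack
D2 = Counter([0.5, 0.5, 0.0])   # three users, one completely ruled out
```

Both distributions give S = log 2, and in each case −B′_D(1), estimated by the central difference, agrees with S(D) to within the finite-difference error.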
On the other hand, an attack that results in D2 has achieved partial success towards its goal by completely ruling out one of the three possible users, thus lowering anonymity. A metric that yields the same anonymity measure in both cases is subject to misinterpretation if the maximum possible measures of the respective systems are not taken into account. In another well-known work, Diaz et al. [7] proposed a similar metric that is additionally normalized by the maximum possible measure, and can therefore be readily used to compare systems with different numbers of users. They proposed using the value of the expression

d(D) = S(D) / log(n)

as the measure of anonymity. Clearly, this is always a real value in the closed interval [0, 1]. As B′_n̄(1) = −log(n), in our framework this normalized metric is essentially the ratio of the slopes of the curves B_D(x) and B_n̄(x) at the value x = 1, i.e.,

d(D) = B′_D(1) / B′_n̄(1).

Figure 3.2 shows the B_n̄(x) curves for the values n = 1, 2, and 3, and graphically illustrates the expected change of the normalization factor, B′_n̄(1), as n increases.

Figure 3.2: B_n̄(x) curves, for n = 1, 2, and 3.

3.3 Maximal Probability Metric

Tóth, Hornák and Vajda [8] argued for adopting the maximal probability contained in a distribution D, i.e.,

MAX(D) = max{σ ∈ [0, 1] | D(σ) ≥ 1},

as an anonymity metric because, from the users' point of view, this worst-case measure may be more important than the average case considered by the above metrics based on Shannon entropy.

3.3.1 Norm-Profiles

In order to place this metric within our framework, we first define the norm-profile of any distribution D as the function N_D : R → R, given by:

N_D(x) = (B_D(x))^{1/x}.

We now show that the maximal probability metric is the limit value of the norm-profile.

Proposition 3.3.1 For any distribution D ∈ ∆n, MAX(D) = lim_{x→∞} N_D(x).

Proof Let MAX(D) = m. By definition,

N_D(x) = (Σ{σ^x · D(σ) | σ ∈ [0, 1]})^{1/x}.
Since m is the largest value in D, and Σ{D(σ) | σ ∈ [0, 1]} = |D| = n, we have that for all x ≥ 1,

N_D(x) ≤ (m^x · n)^{1/x} = m · n^{1/x}.

Thus, lim_{x→∞} N_D(x) ≤ lim_{x→∞} m · n^{1/x} = m. Now, since D(m) ≥ 1, we have that for all x > 0,

m = (m^x)^{1/x} ≤ (m^x · D(m))^{1/x} ≤ N_D(x).

Thus, m ≤ lim_{x→∞} N_D(x), and the proposition holds.

Figure 3.3 shows the above result graphically, along with another important relationship between the two profiles, described below.

Figure 3.3: Base- and norm-profiles for n̂ and an arbitrary D ∈ ∆n.

The following proposition contains some easy to verify, yet important, properties of the two kinds of profiles in our framework:

Proposition 3.3.2 For any distribution D ∈ ∆n:
(a) lim_{x→0} N_D(x) = 1 if D = n̂, and ∞ otherwise.
(b) N_D(1) = 1, i.e., just as base-profiles, all norm-profiles also intersect at x = 1, with value 1.
(c) B′_D(1) = N′_D(1), i.e., the metrics of Section 3.2 based on Shannon entropy can be expressed and viewed in terms of the norm-profile as well:

S(D) = −B′_D(1) = −N′_D(1), and d(D) = B′_D(1)/B′_n̄(1) = N′_D(1)/N′_n̄(1).

Our proof relies on the following:

Lemma 3.3.1 Suppose h(x) = f(x)^{g(x)}, where f(x) is strictly positive. Then

h′(x) = f(x)^{g(x)} · [ g′(x) · ln f(x) + g(x) · f′(x)/f(x) ].

Proof We have that

h(x) = f(x)^{g(x)}
⇐⇒ ln h(x) = g(x) · ln f(x)
⇐⇒ (d/dx) ln h(x) = (d/dx)[ g(x) · ln f(x) ]
⇐⇒ (1/h(x)) · h′(x) = g′(x) · ln f(x) + g(x) · f′(x)/f(x)
⇐⇒ h′(x) = h(x) · [ g′(x) · ln f(x) + g(x) · f′(x)/f(x) ].

Making the substitution h(x) = f(x)^{g(x)} gives the desired result.

Proof (of Proposition 3.3.2)
(a) First, suppose D = n̂. Then B_D(x) = B_n̂(x) = 1 for all x ≠ 0. So, lim_{x→0} N_n̂(x) = lim_{x→0} (B_D(x))^{1/x} = lim_{x→0} 1^{1/x} = 1.
Next, suppose D ≠ n̂, and put k = lim_{x→0} B_D(x). As proved earlier, k is the number of nonzero values in D; if D ≠ n̂, we have k > 1. So, lim_{x→0} N_D(x) = lim_{x→0} (B_D(x))^{1/x} = lim_{x→0} k^{1/x}.
Since 1/x grows arbitrarily large as x → 0+, and since k > 1, the limit is infinite.
(b) As proved earlier, B_D(1) = 1. Hence, since N_D(x) = (B_D(x))^{1/x}, we get N_D(1) = 1, as required.
(c) Put f(x) = B_D(x) and g(x) = 1/x. Then N_D(x) = f(x)^{g(x)} and, by the above lemma,

N′_D(x) = B_D(x)^{1/x} · [ (−1/x²) · ln B_D(x) + B′_D(x)/(x · B_D(x)) ].

Plugging in x = 1 gives

N′_D(1) = B_D(1) · [ −ln B_D(1) + B′_D(1)/B_D(1) ].

As proved earlier, B_D(1) = 1, giving N′_D(1) = B′_D(1).

3.4 Metrics Based on Rényi Entropies

Clauß and Schiffner [9] proposed, as a framework for some anonymity metrics, the parametric family of entropy measures of Rényi [11]. For any distribution D ∈ ∆n, this family is given by:

R_α(D) = (1/(1 − α)) · log( Σ{σ^α · D(σ) | σ ∈ [0, 1]} ),

where α ∈ [0, 1) ∪ (1, ∞) is a real-valued parameter of the family. The maximum value of R_α(D) occurs at α = 0. It is immediately seen that R_0(D) = log(n). Thus,

R_0(D) = −B′_n̄(1) = −N′_n̄(1).

It is well known that Shannon entropy is a special case of Rényi entropies, as α approaches 1, i.e., lim_{α→1} R_α(D) = S(D). A proof of this using l'Hôpital's rule is given in [9]. We therefore have that:

lim_{α→1} R_α(D) = −B′_D(1) = −N′_D(1).

The minimum value of R_α(D) is obtained as α approaches ∞. In order to represent this value in our framework, we first need to extend our functions B_n̄(x) and N_n̄(x), which were defined earlier only for natural numbers n ≥ 1. Now, for any real value µ > 0, we define:

B_µ̄(x) = (1/µ)^x · µ, and N_µ̄(x) = (B_µ̄(x))^{1/x}.

We make two observations about these generalized definitions. First, although the above expression defining B_µ̄(x) can be simplified to 1/µ^{x−1}, we intentionally define it as above to make it easy to see that these definitions coincide with the earlier ones whenever µ is an integer. Second, note that the profile functions were originally defined just for distributions, and µ̄ is a distribution only when µ is a positive integer.
However, the B_µ̄(x) and N_µ̄(x) functions, as defined above, make perfect sense even for non-integral values of µ. Given these two observations, we adopt these definitions, albeit at the expense of a slight abuse of notation. Just as in Proposition 3.3.1, it can be shown that:

MAX(D) = lim_{x→∞} N_µ̄(x), where µ = 1/MAX(D).

Also, N′_µ̄(1) = log(MAX(D)) for this µ. The minimum value of R_α(D), i.e., lim_{α→∞} R_α(D), is shown in Clauß and Schiffner [9] to be −log(MAX(D)). We thus obtain:

lim_{α→∞} R_α(D) = −B′_µ̄(1) = −N′_µ̄(1), with µ = 1/MAX(D).

Figure 3.4 shows the range of all Rényi entropies for D.

Figure 3.4: Range of all Rényi entropies for a distribution D ∈ ∆n.

As is evident from the figure, all entropy measures in this parametric family are negations of the slopes, at the value x = 1, of some norm-profile curves in the vicinity of N_D(x). Shannon entropy, a particular member of this family, is the negation of the slope of N_D(x) itself at x = 1.

3.5 Euclidean Distance Metric

Andersson and Lundin [10] suggested as a metric the Euclidean distance between the distributions D ∈ ∆n and n̄, when these distributions are viewed as points in the space R^n. This distance is given by:

√( Σ{ (σ − 1/n)² · D(σ) | σ ∈ [0, 1] } ).

By algebraic simplification and the definitions of the profiles, the above expression can be seen to equal:

√( B_D(2) − lim_{x→∞} N_n̄(x) ),

where lim_{x→∞} N_n̄(x) = 1/n. The asterisk (∗) in Figure 3.5 graphically shows the argument of the square root in the above expression. It illustrates that this metric depends on just the value of the base-profile B_D(x) at the value x = 2.

Figure 3.5: Square of the Euclidean distance metric, shown as an asterisk (∗).

CHAPTER 4
AN ANALYSIS OF EXISTING METRICS

In the previous chapter, we saw that existing anonymity metrics essentially consider just some local aspects of a distribution's base- and/or norm-profile.
We now show that such local observation is insufficient and that, by not taking entire profile contours into consideration, these metrics end up producing counterintuitive and incorrect evaluations in some cases. We begin by summarizing the formulations, within the framework constructed in Chapter 3, of the existing anonymity metrics for any given distribution D ∈ ∆n.

Anonymity set size metric of Chaum [4]: Number of users in the system, given by: n = B_D(0).

Reduced anonymity set size metric of Kesdogan, Egner and Büschkes [5]: Number of users in the system with a nonzero probability in D, given by: lim_{x→0} B_D(x).

Shannon entropy based metric of Serjantov and Danezis [6]: Negation of the slope of the profiles of D at x = 1, given by: S(D) = −B′_D(1) = −N′_D(1).

Normalized Shannon entropy based metric of Diaz et al. [7]: Ratio of the slopes of the profiles of D and n̄ at x = 1, given by: d(D) = B′_D(1)/B′_n̄(1) = N′_D(1)/N′_n̄(1).

Maximal probability metric of Tóth, Hornák and Vajda [8]: Largest probability in D, given by: MAX(D) = lim_{x→∞} N_D(x).

Rényi entropies based metric family of Clauß and Schiffner [9]: Negations of slopes at x = 1 of profiles close to the profiles of D, given by: R_0(D) = −B′_n̄(1) = −N′_n̄(1); lim_{α→1} R_α(D) = −B′_D(1) = −N′_D(1); and lim_{α→∞} R_α(D) = −B′_µ̄(1) = −N′_µ̄(1), where µ = 1/MAX(D).

Euclidean distance metric of Andersson and Lundin [10]: Euclidean distance between D and n̄, when viewed as points in the space R^n, given by: √( B_D(2) − lim_{x→∞} N_n̄(x) ).

As is evident from the above summary, each of these metrics measures some local property of one or both profiles of the distribution D. From an anonymity level point of view, the information in a distribution is far richer than what can be captured by such a local observation, dictating the need for a metric that measures some global profile property.
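The entire summary above can be exercised on the distribution of Example 2.2.1. The following sketch is our own illustration, not the thesis's code; the helper names, the finite-difference step h, and the stand-in x = 500 for x → ∞ are all arbitrary choices:

```python
import math
from collections import Counter

def B(D, x):
    """Base-profile B_D(x); Python's 0.0**0 is 1.0, as the convention requires."""
    return sum((s ** x) * c for s, c in D.items())

def N(D, x):
    """Norm-profile N_D(x) = B_D(x)**(1/x)."""
    return B(D, x) ** (1.0 / x)

def renyi(D, alpha):
    """Renyi entropy R_alpha(D) = log(B_D(alpha)) / (1 - alpha), alpha != 1."""
    return math.log(B(D, alpha)) / (1.0 - alpha)

D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])   # Example 2.2.1

n = int(B(D, 0))                               # Chaum: anonymity set size
h = 1e-6
S = -(B(D, 1 + h) - B(D, 1 - h)) / (2 * h)     # Serjantov-Danezis: -B'_D(1)
d = S / math.log(n)                            # Diaz et al.: normalized entropy
mx = N(D, 500.0)                               # Toth et al.: ~ lim N_D(x) = MAX(D)
euclid = math.sqrt(B(D, 2) - 1.0 / n)          # Andersson-Lundin distance
r0 = renyi(D, 0.0)                             # Clauss-Schiffner family at alpha = 0
```

Here n = 6, MAX(D) comes out close to 0.5, r0 equals log 6 = −B′_n̄(1), renyi(D, α) for α near 1 approaches S, and euclid matches the direct distance to the even distribution.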
It is for this reason that, for each of these metrics, examples of distributions exist in the literature on which that metric gives counterintuitive results and is out of line with one or more of the other metrics. For instance, consider the following example, adapted from [8].

Example 4.0.1 For a system with n = 101 users, let A and B be the following distributions:

A(σ) = 1 if σ = 0.5; 100 if σ = 0.005; 0 otherwise.
B(σ) = 10 if σ = 0.086518; 91 if σ = 0.001481; 0 otherwise.

Then, d(A) ≈ d(B) ≈ 0.649, whereas MAX(A) = 0.5 > 0.086518 = MAX(B), i.e., the normalized Shannon entropies of A and B are the same, although A exposes a single user to a far greater extent than B does.

The phenomenon highlighted by the above example is depicted in Figure 4.1 where, while N′_A(1) = N′_B(1), the limits lim_{x→∞} N_A(x) and lim_{x→∞} N_B(x) are significantly far apart.

Figure 4.1: Two distributions with the same normalized Shannon entropy, but significantly different maximal probabilities.

In fact, as we saw in Section 3.4 on Rényi entropies, for any given combination of n, s and µ such that log(1/n) ≤ s ≤ log(µ), there exists a distribution D ∈ ∆n for which MAX(D) = µ and B′_D(1) = N′_D(1) = s. In other words, countless distributions usually exist, all with the same Shannon entropy, but over a wide range of maximal probabilities that, as in Example 4.0.1, make us assign to them disturbingly different intuitive anonymity levels. Furthermore, as shown later in Sections 4.1 and 4.2, it is even possible for Shannon entropy to suggest higher anonymity for a distribution that one would intuitively assign a lower reading to (in comparison with another distribution).

4.1 Base-Profile Contours

It is instructive to recapitulate that the base-profiles of all distributions in ∆n lie somewhere between the extreme profiles B_n̂(x) and B_n̄(x).
Loosely speaking, the closer the base-profile B_D(x) of a given distribution D is to B_n(x), the higher the anonymity level that should be associated with D. The same observation holds for norm-profiles as well. We just saw that existing anonymity metrics essentially attempt to measure this closeness of B_D(x) to B_n(x) by observing some local property of one or both profiles of D. We now show that, as profile contours fluctuate with x, such local observation is inadequate. In the next chapter, we will construct a metric that takes the entire profiles of D into account. Such a global consideration of profiles results in a metric that is consistently correct. We begin by studying some fundamental properties of profiles. The following proposition about their contours follows immediately from the definitions:

Proposition 4.1.1 For any distribution D ∈ ∆n:
(a) For all x > 0, B′_D(x) ≤ 0 and N′_D(x) ≤ 0, i.e., both profiles are monotonically nonincreasing.
(b) For all x > 0, B″_D(x) ≥ 0 and N″_D(x) ≥ 0, i.e., the curves of both profiles are concave upwards.

Proof (a) Since B_D(x) = Σ{ σ^x · D(σ) | σ ∈ [0, 1] }, we have B′_D(x) = Σ{ σ^x · ln σ · D(σ) }. For any x > 0 and σ ∈ (0, 1], ln σ is nonpositive, σ^x is positive, and D(σ) is a nonnegative integer; hence the entire sum is nonpositive, so that B_D(x) is monotonically nonincreasing for x > 0. Now suppose 0 < x1 ≤ x2. As shown above, B_D(x1) ≥ B_D(x2). Since x2 > 0, it follows that [B_D(x1)]^(1/x2) ≥ [B_D(x2)]^(1/x2). But since x2 ≥ x1, we also have that [B_D(x1)]^(1/x1) ≥ [B_D(x1)]^(1/x2). Combining these two inequalities gives [B_D(x1)]^(1/x1) ≥ [B_D(x2)]^(1/x2), i.e., N_D(x1) ≥ N_D(x2). Hence, N_D(x) is monotonically nonincreasing for x > 0.

(b) Since B_D(x) = Σ{ σ^x · D(σ) | σ ∈ [0, 1] }, we have B″_D(x) = Σ{ ln²σ · σ^x · D(σ) }, and an argument similar to the one above shows that this sum is nonnegative; hence, B″_D(x) is nonnegative for x > 0, i.e., B′_D(x) is nondecreasing there.
To show that N″_D(x) ≥ 0, it suffices to show that N′_D(x) is nondecreasing, i.e., that N′_D(x2) − N′_D(x1) ≥ 0 for 0 < x1 < x2. By Lemma 3.3.1, we have that

N′_D(x) = N_D(x) · [ B′_D(x)/(x·B_D(x)) − ln B_D(x)/x² ].

Hence,

N′_D(x2) − N′_D(x1) = N_D(x2) · [ B′_D(x2)/(x2·B_D(x2)) − ln B_D(x2)/x2² ] − N_D(x1) · [ B′_D(x1)/(x1·B_D(x1)) − ln B_D(x1)/x1² ].

Because N_D(x) is nonincreasing, it follows that

N′_D(x2) − N′_D(x1) ≥ N_D(x2) · [ B′_D(x2)/(x2·B_D(x2)) − ln B_D(x2)/x2² − B′_D(x1)/(x1·B_D(x1)) + ln B_D(x1)/x1² ].

Now put

α = B′_D(x2)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))   and   β = ln B_D(x1)/x1² − ln B_D(x2)/x2².

We have just shown that N′_D(x2) − N′_D(x1) ≥ N_D(x2)·(α + β). Now, α ≥ 0. Indeed,

α = B′_D(x2)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))
  ≥ B′_D(x1)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))
  = B′_D(x1) · [ 1/(x2·B_D(x2)) − 1/(x1·B_D(x1)) ]
  ≥ 0,

where the second step follows from the fact that B′_D(x), as we have shown, is nondecreasing, and the last step from the fact that B_D(x) is nonincreasing. We also have that β ≥ 0. Indeed,

β = ln B_D(x1)/x1² − ln B_D(x2)/x2²
  ≥ ln B_D(x1)/x2² − ln B_D(x2)/x2²
  ≥ 0,

where the last step follows because B_D(x) is nonincreasing and ln(x) is nondecreasing. Hence, α and β are both nonnegative. Since N_D(x2) is also nonnegative, it follows that N′_D(x2) − N′_D(x1) ≥ N_D(x2)·(α + β) ≥ 0, which completes the proof.

Figure 4.2: Base-profiles intersecting at x > 1: (a) One such intersection, (b) Two such intersections.

We already know that all profile values are 1 at x = 1. An important characteristic of contours with the properties given in Proposition 4.1.1 is that different such contours can intersect each other for values x > 1. In fact, they can intersect multiple times. The following is a simple, yet informative example.
Example 4.1.1 Consider the following distributions: D = [[7/16, 7/16, 1/8]] and E = [[13/24, 11/48, 11/48]]. Then, B_D(x) and B_E(x) intersect at x = 2, i.e., B_D(2) = B_E(2), because:

2·(7/16)² + (1/8)² = (13/24)² + 2·(11/48)².

Figure 4.2(a) depicts such a situation. In general, if D, E ∈ ∆n are distinct distributions, neither of which is one of the extreme distributions n̂ or n, then their base-profiles, B_D(x) and B_E(x), may intersect zero or more times for values of x > 1. Figure 4.2(b) shows a case with two such intersections. Which of the two intersecting profiles is closer to B_n(x) changes at each intersection.

While comparing such distributions, the metrics based on Shannon and Rényi entropies base their final determination of closeness to B_n(x) on just the slopes of the distribution profiles at the x = 1 intersection point. This misses the bigger picture by ignoring the possibility of other intersections at values x > 1, which always alter the closeness relationship. Example 4.1.1, given above, with D = [[7/16, 7/16, 1/8]] and E = [[13/24, 11/48, 11/48]], illustrates this phenomenon. In it, there is one intersection at x = 2, as in the situation shown in Figure 4.2(a). The entropy-based metrics declare E as the distribution with higher anonymity because B′_E(1) is closer to B′_n(1). But it is clear that the attack resulting in E is stronger than the one resulting in D, because the most suspicious user in E is in a class by itself and is exposed more than the two most suspicious users in D. Thus, D should be assigned a higher anonymity. The metric we propose in Chapter 5 correctly achieves this by taking into account all such later intersections, and beyond. The Euclidean distance metric suffers from this tunnel-vision phenomenon too, as its decision is based on just the value of base-profiles at x = 2. For the same example, it assigns the same anonymity level to both D and E.
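The intersection at x = 2 claimed in Example 4.1.1, and the resulting blindness of the Euclidean distance metric, can be verified with exact rational arithmetic. The following check is ours (using Python's `fractions` module), not part of the thesis.

```python
from fractions import Fraction as F

# The two distributions of Example 4.1.1.
D = [F(7, 16), F(7, 16), F(1, 8)]
E = [F(13, 24), F(11, 48), F(11, 48)]

def B(dist, x):
    """Base-profile B_D(x) at an integer x, computed exactly."""
    return sum(p ** x for p in dist)

assert B(D, 1) == B(E, 1) == 1           # every profile passes through (1, 1)
assert B(D, 2) == B(E, 2) == F(51, 128)  # the intersection at x = 2
assert B(D, 3) != B(E, 3)                # the profiles separate again

# Since B_D(2) = B_E(2), the Euclidean distance metric, which depends
# only on the profile value at x = 2, cannot tell D and E apart.
```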
4.2 Norm-Profile Contours

With the exception of their asymptotic values as x → ∞, norm-profiles of distributions have many similarities with base-profiles. First, recall that the slopes at x = 1 of both profiles of any distribution are identical. Also, as stated in Proposition 4.1.1, norm-profiles are monotonically nonincreasing and concave upwards, just like base-profiles. We now make the following important observation:

Proposition 4.2.1 For any distributions D, E ∈ ∆n and x > 0, N_D(x) < N_E(x) iff B_D(x) < B_E(x).

Proof We have that

B_D(x) < B_E(x)
⟺ B_D(x)/B_E(x) < 1
⟺ [B_D(x)/B_E(x)]^(1/x) < 1^(1/x) = 1
⟺ [B_D(x)]^(1/x) / [B_E(x)]^(1/x) < 1
⟺ [B_D(x)]^(1/x) < [B_E(x)]^(1/x)
⟺ N_D(x) < N_E(x).

Thus, between D and E, the norm-profile of a distribution is closer to N_n(x) for exactly those values of x at which its base-profile is closer to B_n(x). In particular, N_D(x) and N_E(x) intersect at exactly those values of x where B_D(x) and B_E(x) intersect.

Norm-profiles provide a better setup for a closer study of intersections, especially when these profiles are considered for representations of distributions written in our [[. . .]] notation, instead of the actual distributions. In this case, they exhibit properties of L_p-norms on vector spaces (see Trefethen and Bau [13]), and it is for this reason that we chose the term norm-profile for them. Let [[∆n]] denote the set of all representations of distributions in ∆n. Note that [[∆n]] is a subspace of the n-dimensional real unit cube [0, 1]ⁿ, and permuted representations of the same distribution, for example [[0.2, 0.3, 0.5]] and [[0.3, 0.5, 0.2]], are distinct points in [[∆n]]. For any given distribution in ∆n, there may be anywhere from one to n! representations in [[∆n]]. In particular, n has just one representation, while n̂ has n. It can also be seen that [[∆n]] is the convex hull of all representations of the distribution n̂, i.e., the smallest convex set of points in [0, 1]ⁿ containing these representations.
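Propositions 4.1.1 and 4.2.1 both lend themselves to a quick numerical spot-check. The sketch below is ours; it tests monotonicity and midpoint convexity of both profiles on a grid (with small tolerances for floating-point noise), and the order-equivalence of the two profiles, on random distributions.

```python
import random

random.seed(1)

def rand_dist(n):
    """A random point of Delta_n with all probabilities nonzero."""
    ws = [random.random() for _ in range(n)]
    s = sum(ws)
    return [w / s for w in ws]

def B(dist, x):
    """Base-profile B_D(x)."""
    return sum(p ** x for p in dist if p > 0)

def N(dist, x):
    """Norm-profile N_D(x) = B_D(x)^(1/x)."""
    return B(dist, x) ** (1.0 / x)

xs = [0.1 * k for k in range(1, 101)]            # grid on (0, 10]
for _ in range(50):
    D, E = rand_dist(5), rand_dist(5)
    # Proposition 4.1.1: both profiles nonincreasing and concave upwards.
    for prof in (B, N):
        ys = [prof(D, x) for x in xs]
        assert all(y1 >= y2 - 1e-9 for y1, y2 in zip(ys, ys[1:]))
        assert all(ys[i + 1] <= (ys[i] + ys[i + 2]) / 2 + 1e-9
                   for i in range(len(ys) - 2))
    # Proposition 4.2.1: the two profiles order D and E identically.
    x = random.uniform(0.5, 10)
    assert (N(D, x) < N(E, x)) == (B(D, x) < B(E, x))
print("all checks passed")
```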
For example, in the case of n = 2, all legal representations are points on the line segment that connects [[0, 1]] and [[1, 0]]. And for n = 3, they are points on the triangular face that connects [[1, 0, 0]], [[0, 1, 0]], and [[0, 0, 1]]. We originally defined norm-profiles for just distributions, but that definition can be trivially extended to all points in the cube [0, 1]ⁿ. Now, for all x ≥ 1 and α ∈ [0, 1], we define the family of sets N_x^α as follows:

N_x^α = { P ∈ [0, 1]ⁿ | N_P(x) = α }.

Then, N_x^α ∩ [[∆n]] is the set of those representations the norm-profiles of whose underlying distributions intersect at x, with the value α. When the points in N_x^α are plotted graphically, the shape of that locus of points depends upon x, and its size depends upon α. Figure 4.3 shows some of these sets for the 2-dimensional case: [[∆2]] and, for an arbitrary value of α, the sets N_x^α for a few example values of x.

Figure 4.3: [[∆2]] and loci N_x^α for an arbitrary value of α, and some example values of x.

Let us consider any one of these loci, say N_2^α. If α is sufficiently large, S = N_2^α ∩ [[∆2]] will be nonempty (with at most two points). The distributions corresponding to representations in this set are those whose norm-profiles intersect at x = 2 (with value α). Existing anonymity metrics will make a misjudgment while comparing two such distributions. Furthermore, if α is now reduced slightly, to say α′, some locus N_{x′}^{α′} for an x′ > x = 2 will have a nonempty intersection with S. Norm-profiles of distributions denoted by representations in S ∩ N_{x′}^{α′} intersect with each other at x as well as x′. And so on. The astute reader may have noticed that, in two dimensions, any locus has at most two common points with [[∆2]] and, due to symmetry, both of these points represent the same distribution.
Therefore, in the 2-dimensional case, norm-profiles of distinct distributions never intersect and all existing anonymity metrics work fine. This observation is correct. We simply used this case to introduce the loci and illustrate the phenomenon of their intersection with the set of all representations. In all higher dimensions, however, this problem with existing metrics is very real. Figure 4.4 shows the 3-dimensional situation.

Figure 4.4: [[∆3]] and its intersections with some loci.

Due to the difficulty in displaying 3-dimensional loci, unlike Figure 4.3 where entire loci were displayed, Figure 4.4 shows only their intersections with [[∆3]]. The set [[∆3]] is the triangular face shown, and N_2^α is a sphere centered at the origin. If the radius of this sphere (i.e., α) is large enough, the sphere intersects the triangle in a circle, shown in the figure as N_2^α ∩ [[∆3]]. This circle has an infinite number of points on it that, despite symmetry, collectively represent an infinite number of distinct distributions. The points [[7/16, 7/16, 1/8]] and [[13/24, 11/48, 11/48]] of Example 4.1.1, given earlier, lie on one such circle (for α = √(51/128) ≈ 0.63). As these points represent distinct distributions, existing anonymity metrics give incorrect results for them, an observation we already made in Section 4.1.

CHAPTER 5

A NEW, GLOBAL ANONYMITY METRIC

We now present a new anonymity metric that does not err due to a narrow view, which becomes inevitable when only some local aspects of distribution profiles are examined. By being sensitive to entire profile contours and, in particular, to intersections between them, the new metric always correctly determines the underlying system's degree of anonymity.
The fundamental premise upon which our metric is constructed is that profiles are infinite objects, and the complete profiles of the distributions n̂ and n should be considered as the two extremes, for the no-anonymity and full-anonymity cases, respectively. Thus, the anonymity measure assigned to a given distribution D ∈ ∆n should be based upon how close the profiles of D are, on the whole, to those of n, or alternatively, how far they are from those of n̂. Given Proposition 4.2.1, we have the freedom of working with either base-profiles or norm-profiles, because from the point of view of closeness to the corresponding profile of n, both profiles exhibit similar behavior. We choose to adopt base-profiles, because this choice requires less algebraic simplification in arriving at the final metric expression.

5.1 Raw Metric

We propose to use the reciprocal of the area under the base-profile of a distribution as the raw measure of the anonymity left in the system. For any distribution D ∈ ∆n, the system's raw degree of anonymity is thus given by:

R(D) = 1 / ∫₁^∞ B_D(x) dx.

Using x = ∞ as the upper limit of the above definite integral ensures that the entire base-profile contour is considered before arriving at the metric value, the crux of our thesis. The value x = 1 as the lower limit suffices because, as we have seen, base-profiles of all distributions meet at that value of x. The differences among the contours of the base-profiles begin to manifest from thereon, due to differences in the amount of anonymity associated with their corresponding distributions. By applying the definition of base-profiles and elementary rules of integration, the above expression simplifies to:

R(D) = −1 / Σ{ σ·D(σ)/log(σ) | σ ∈ [0, 1] }.

On the surface, there are some similarities between the above raw, area-based metric and the metric of Serjantov and Danezis [6] based on Shannon entropy, reproduced below:

S(D) = −Σ{ σ·log(σ)·D(σ) | σ ∈ [0, 1] }.
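Both R(D) and S(D) reduce to one-line computations over the nonzero probabilities. The sketch below is ours; it assumes log base 2 throughout (the thesis writes log generically), so that both metrics range over [0, log2(n)].

```python
import math

def R(dist):
    """Raw area-based metric: R(D) = -1 / sum(p / log(p)).
    When some p = 1, the area under B_D is infinite, so R(D) = 0."""
    if any(p == 1 for p in dist):
        return 0.0
    return -1.0 / sum(p / math.log2(p) for p in dist if p > 0)

def S(dist):
    """Shannon entropy based metric of Serjantov and Danezis [6]."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

uniform = [0.25] * 4
print(R(uniform), S(uniform))    # both 2.0 = log2(4) at full anonymity
skewed = [0.7, 0.1, 0.1, 0.1]
assert R(skewed) <= S(skewed)    # Proposition 5.1.1 below: R(D) <= S(D)
```

Note that the closed form absorbs the constant introduced by the choice of logarithm base; only the normalized metric of Section 5.2 is fully base-independent.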
First, the same subexpressions appear in the expressions of both R(D) and S(D), albeit in a different arrangement. Second, both metrics range between 0 and log(n). The following proposition follows from the properties of distributions and logarithms:

Proposition 5.1.1 For any distribution D ∈ ∆n, R(D) ≤ S(D), with equality iff D is one of the extreme distributions, i.e., n̂ or n.

Proof Let p_1, p_2, ..., p_n be the probabilities in D, listed with multiplicity. We have that

R(D) ≤ S(D)
⟺ S(D) · (1/R(D)) ≥ 1
⟺ ( Σ_i p_i log p_i ) · ( Σ_j p_j / log p_j ) ≥ ( Σ_i p_i )²
⟺ Σ_i Σ_j p_i p_j (log p_i / log p_j) ≥ Σ_i Σ_j p_i p_j
⟺ Σ_i p_i² + Σ_{i≠j} p_i p_j (log p_i / log p_j) ≥ Σ_i p_i² + 2·Σ_{i<j} p_i p_j
⟺ Σ_{i<j} p_i p_j (log p_i / log p_j + log p_j / log p_i) ≥ Σ_{i<j} 2·p_i p_j,

and the last inequality holds because each term log p_i/log p_j + log p_j/log p_i is of the form (a + 1/a), and for any a > 0, (a + 1/a) ≥ 2.

In other words, our area-based metric is more conservative than the one based on Shannon entropy. Figure 5.1 gives an idea of the values of these metrics, when viewed on [[∆2]]. The situation is similar in all other dimensions.

Figure 5.1: Metrics S(D) and R(D) on [[∆2]].

Note that, while the metric R exhibits a more traditional bell curve, the metric S is far quicker than R to assign higher anonymity values to points close to the zero-anonymity extreme points, i.e., near all representations of n̂. At points close to full anonymity, however, S is less sensitive to change than R. Stated precisely, the slope of S(D) at D = [[0, 1]] is ∞, whereas at D = [[1/2, 1/2]] it is 0. Such an extreme disparity in sensitivity to small changes in distributions at these points is difficult to explain intuitively. On the other hand, the slope of R(D) at these points is 1 and 0, respectively.

5.2 Normalized Metric

Just as the metric of Diaz et al.
[7], given by d(D) = S(D)/log(n), is a normalized version of that of Serjantov and Danezis [6], we define a normalized version of our raw metric as the ratio of the areas under the profiles B_n(x) and B_D(x). For any distribution D ∈ ∆n, the system's normalized degree of anonymity is given by:

a(D) = ∫₁^∞ B_n(x) dx / ∫₁^∞ B_D(x) dx.

Figure 5.2 shows the two areas mentioned in the above expression.

Figure 5.2: Areas considered for the system's normalized degree of anonymity.

It is clear from this figure that as D varies from n to n̂, a(D) varies from 1 (for full anonymity) to 0 (for no anonymity). The above expression can also be simplified to see that a(D) = R(D)/log(n), i.e.,

a(D) = −1 / ( log(n) · Σ{ σ·D(σ)/log(σ) | σ ∈ [0, 1] } ).

The relationship between metrics a and d is thus the same as that between R and S, already depicted in Figure 5.1.

5.3 Differences from Metrics Based on Shannon Entropy

Figure 5.1 fails to capture all the ramifications of the differences between our area-based metrics and the existing metrics based on Shannon entropy. First, due to the different rationales adopted by these metrics for judging the closeness of distributions in ∆n to n, the main differences between them manifest especially when they determine, given two distributions in ∆n, which of those distributions is closer to n. Second, as we already saw in Section 4.2, these differences can only be observed in dimensions higher than two, i.e., in systems with more than two users. We start by revisiting our earlier examples. For the distributions A and B of Example 4.0.1, we saw that d(A) ≈ d(B), i.e., the normalized Shannon entropies of A and B are nearly the same, despite the fact that A exposes one of the users to a far greater extent than B exposes any user. Upon computing our normalized, area-based metric, we see that a(A) ≈ 0.26563, which is less than half of a(B) ≈ 0.579053.
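The figures just quoted for Example 4.0.1 can be reproduced directly from the closed form a(D) = R(D)/log(n). This check is ours; log base 2 is assumed, and A and B are expanded into full 101-entry probability lists.

```python
import math

n = 101
A = [0.5] + [0.005] * 100              # distributions of Example 4.0.1
B = [0.086518] * 10 + [0.001481] * 91

def d(dist):
    """Normalized Shannon entropy metric of Diaz et al. [7]."""
    s = -sum(p * math.log2(p) for p in dist if p > 0)
    return s / math.log2(len(dist))

def a(dist):
    """Normalized area-based metric a(D) = R(D) / log(n)."""
    t = sum(p / math.log2(p) for p in dist if p > 0)
    return -1.0 / (math.log2(len(dist)) * t)

print(round(d(A), 3), round(d(B), 3))  # both ~0.649: d cannot separate A and B
print(round(a(A), 5), round(a(B), 5))  # ~0.26563 vs ~0.57905: a can
```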
By taking the entire base-profile contours of A and B into account, instead of just the slopes of these profiles at x = 1, the new metric performs a more accurate comparison of the anonymity levels underlying these distributions.

The distribution D = [[7/16, 7/16, 1/8]] of Example 4.1.1 is assigned a lower anonymity level by metrics based on Shannon entropy than the distribution E = [[13/24, 11/48, 11/48]] of the same example, as shown by the computation: d(D) ≈ 0.895 < 0.917 ≈ d(E). As stated earlier, this is counterintuitive because the most suspicious user in E is in a class by itself and is exposed more than the two most suspicious users in D. The base-profiles of these distributions intersect as shown in Figure 4.2(a). The area-based metric correctly takes that phenomenon into account and declares D as having higher anonymity than E: a(D) ≈ 0.814 > 0.762 ≈ a(E).

In general, while comparing distributions D, E ∈ ∆n, the chance of disagreement between metrics a and d increases with n. We already know that for n = 2, there is never a disagreement between them, i.e., a(D) < a(E) if and only if d(D) < d(E), for all D, E ∈ ∆2. Our pseudo-exhaustive simulation results have shown that for n = 3, these two metrics disagree in about 7.5% of all cases. In other words, for about 7.5% of pairs of distributions ⟨D, E⟩ ∈ ∆3 × ∆3, the two metrics disagree on which, among D and E, is the distribution closer to n. For n = 4, the disagreement rate was found by our pseudo-exhaustive simulation to be about 9%.

Intersections between the profiles of D and E are the root cause of disagreements between these metrics. If 1 ≤ x1 < x2, the interval (x1, x2) is said to be a (maximal) region of dominance of D over E if B_D(x1) = B_E(x1), B_D(x2) = B_E(x2) and, for all x ∈ (x1, x2), B_D(x) < B_E(x). The regions of dominance of D and E alternate, as shown in Figure 5.3, at each intersection of their profiles.
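The disagreement on Example 4.1.1, and the location of the single profile intersection that causes it, can be checked numerically. The sketch below is ours; it locates region-of-dominance boundaries as sign changes of B_D(x) − B_E(x) on a coarse grid chosen to avoid landing on x = 2 itself.

```python
import math

D = [7/16, 7/16, 1/8]        # distributions of Example 4.1.1
E = [13/24, 11/48, 11/48]

def d(dist):                 # normalized Shannon entropy (log base 2)
    s = -sum(p * math.log2(p) for p in dist)
    return s / math.log2(len(dist))

def a(dist):                 # normalized area-based metric
    t = sum(p / math.log2(p) for p in dist)
    return -1.0 / (math.log2(len(dist)) * t)

assert d(D) < d(E)           # entropy favors E ...
assert a(D) > a(E)           # ... while the area-based metric favors D

def B(dist, x):
    return sum(p ** x for p in dist)

# Scan for sign changes of B_D - B_E over [1.05, 4.95].
xs = [1.05 + 0.1 * k for k in range(40)]
flips = [x for x, y in zip(xs, xs[1:])
         if (B(D, x) - B(E, x)) * (B(D, y) - B(E, y)) < 0]
print(flips)                 # a single region boundary, near x = 2
```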
The area gain achieved by D in any of its regions of dominance over E is the difference between the areas under their base-profiles in that region. Our area-based metrics assign a higher anonymity level to D if its total area gains from all its regions of dominance over E are higher than its total area losses in all other regions. On the other hand, metrics based on Shannon entropy assign a higher anonymity level to the distribution with the sharper profile slope at x = 1. A disagreement between these metrics arises when these two measures are not both pointing in the same direction, i.e., when the distribution with the more acute slope at x = 1 turns out to have more total area losses than gains.

Figure 5.3: Initial base-profile slopes and alternating regions of dominance of distributions D and E.

We end this chapter with an analogy of a race between two runners, D and E. Suppose the race begins at x = 1 and ends at x = ∞. For any given value of x0 between these extremes, the speed of runner D at x0 is a function of the value B_D(x0), and similarly for E. The distance between D and E at x0 corresponds to the difference between the area gains and losses from 1 to x0. Metrics based on Shannon entropy determine the winner based upon which runner got out of the blocks faster at the start of the race, whereas our area-based metrics determine the winner based upon who is ahead at the end of the race.

CHAPTER 6

CONCLUSIONS

Finding the right metric for measuring the amount of anonymity left in an anonymity system in the aftermath of an attack has been a goal almost ever since the need for such systems was first recognized. Although several metrics to this end have been proposed in the past, none has yet demonstrated an ability to completely fit the bill. For any existing metric, examples of probability distributions resulting from attacks can be found in the literature for which that metric performs counterintuitive anonymity evaluation.
As conformance to human intuition is the ultimate standard of correctness of a metric, such examples have always hinted at the need for a better metric. In this thesis, we first constructed a graphical framework within which we represented existing metrics, including the currently popular ones based on Shannon entropy. Our framework is made primarily of two profiles of probability distributions, which we call the base-profile and the norm-profile, that are functions over the infinite domain of nonnegative real numbers. By placing the existing metrics in our framework, we showed that these metrics base their anonymity evaluation on just some local aspects of profiles, such as a profile's value at some point, or its slope at another. We also showed that as profile contours can fluctuate, especially causing intersections between them, any such local observation made to determine the anonymity level inevitably ignores profile intersections, and is thus inadequate. This explains, in a graphic way, why existing metrics fail to give a good, intuitive evaluation in some cases. We then proposed a new metric that evaluates the level of anonymity based upon a global property of the underlying profile, namely the area swept under the entire base-profile contour. By doing so, our new metric avoids the pitfall of a local profile aspect not accurately reflecting its global contour. Just as for the popular metric based on Shannon entropy, we gave two versions of our metric, a raw and a normalized one. We compared our approach with the Shannon entropy approach for measuring anonymity, and noted that the differences between these approaches are best appreciated not on any single probability distribution, but by observing how these approaches determine which of two given distributions reflects higher anonymity. We identified conditions under which their results disagree.
For systems with few senders, we also gave an estimate, obtained from our pseudo-exhaustive simulation, of the percentage of cases that involve disagreement. While there is no disagreement for systems with two users, there is about a 7.5% disagreement rate for systems with three users, and about 9% for four users. We have left determination of the disagreement rate between our metrics for systems with a larger number of users as future work. Although we presented our technique only in the context of measuring sender anonymity, it can be used as well for receiver anonymity, or for any other situation with a need to measure the amount of uncertainty contained in a probability distribution. Recently, Bagai et al. [14] gave a system-wide metric based on distributions over perfect matchings between a system's input and output messages. Gierlichs et al. [15] considered multiple messages sent or received by system users. These are some examples of works that employed Shannon entropy, and whose results can be improved by our area-based metric. Finally, we believe that while efficient computability of an anonymity metric is a welcome bonus, intuitiveness of its anonymity evaluations is essential. We feel that a systematic study of that aspect of all available metrics, including the one we proposed in this thesis, is an important subject for future research.

LIST OF REFERENCES

[1] C. Diaz. "Anonymity and privacy in electronic services." PhD thesis, Katholieke Universiteit Leuven, Belgium, 2005.

[2] D. Kelly, R. Raines, M. Grimaila, R. Baldwin, and B. Mullins. "A survey of state-of-the-art in anonymity metrics." In Proceedings of the 1st ACM Workshop on Network Data Anonymization, pages 31–39, 2008.

[3] R. Bagai and N. Jiang. "Profiling probability distributions for measuring anonymity." Submitted to ACM CCS, 2011.

[4] D. Chaum. "The dining cryptographers problem: Unconditional sender and recipient untraceability." Journal of Cryptology, 1(1):65–75, 1988.

[5] D.
Kesdogan, J. Egner, and R. Büschkes. "Stop-and-Go-MIXes providing probabilistic anonymity in an open system." In Proceedings of the International Information Hiding Workshop, pages 83–98. Lecture Notes in Computer Science 1525, 1998.

[6] A. Serjantov and G. Danezis. "Towards an information theoretic metric for anonymity." In Proceedings of the 2nd International Privacy Enhancing Technologies Symposium (PETS), pages 41–53. Lecture Notes in Computer Science 2482, 2002.

[7] C. Diaz, S. Seys, J. Claessens, and B. Preneel. "Towards measuring anonymity." In Proceedings of the 2nd International Privacy Enhancing Technologies Symposium (PETS), pages 54–68. Lecture Notes in Computer Science 2482, 2002.

[8] G. Tóth, Z. Hornák, and F. Vajda. "Measuring anonymity revisited." In Proceedings of the 9th Nordic Workshop on Secure IT Systems, pages 85–90, 2004.

[9] S. Clauß and S. Schiffner. "Structuring anonymity metrics." In Proceedings of the ACM Workshop on Digital Identity Management, pages 55–62, 2006.

[10] C. Andersson and R. Lundin. "On the fundamentals of anonymity metrics." In The Future of Identity in the Information Society, edited by S. Fischer-Hübner, P. Duquenoy, A. Zuccato, and L. Martucci, volume 262 of IFIP International Federation for Information Processing, pages 325–341. Springer, Boston, 2008.

[11] A. Rényi. "On measures of entropy and information." In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561, 1961.

[12] C. Shannon. "A mathematical theory of communication." Bell System Technical Journal, 27:379–423 and 623–656, 1948.

[13] L. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM Publishers, 1997.

[14] R. Bagai, H. Lu, R. Li, and B. Tang. "An accurate system-wide anonymity metric for probabilistic attacks." In Proceedings of the 11th International Privacy Enhancing Technologies Symposium (PETS), pages 117–133, 2011.

[15] B. Gierlichs, C. Troncoso, C. Diaz, B.
Preneel, and I. Verbauwhede. "Revisiting a combinatorial approach toward measuring anonymity." In Proceedings of the 7th ACM Workshop on Privacy in the Electronic Society, pages 111–116, Alexandria, VA, USA, 2008.