AN IMPROVED MEASURE OF NETWORK ANONYMITY USING PROFILE FUNCTIONS

A Thesis by Nan Jiang
MS, Xi'an University of Sci. and Tech., China, 2008

Submitted to the Department of Electrical Engineering and Computer Science and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Master of Science

July 2011

© Copyright 2011 by Nan Jiang. All Rights Reserved.

The following faculty members have examined the final copy of this thesis for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Master of Science with a major in Computer Science.

Rajiv Bagai, Committee Chair
Bin Tang, Committee Member
Tianshi Lu, Committee Member

DEDICATION

To my parents and my family, who have supported me throughout my life. Without their help, it would not have been possible to finish this thesis.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Rajiv Bagai, who has made it possible for me to complete this thesis. His support, knowledge, and patience have guided me from the very beginning to the end. I would also like to thank Dr. Bin Tang and Dr. Tianshi Lu for their kind help and for serving as my committee members. I would like to thank Dr. Buma Fridman of the Department of Mathematics and Statistics at Wichita State University for helpful discussions on profile intersections. Finally, I would like to thank my friend Dylan Holmes for his patient help with my English.

ABSTRACT

We present a graphical framework containing certain infinite profiles of probability distributions that result from an attack on an anonymity system. We represent currently popular anonymity metrics within our framework to show that existing metrics base their decisions on just some small piece of the information contained in a distribution.
This explains the counterintuitive, and thus unsatisfactory, anonymity evaluation performed by each of these metrics on carefully constructed examples in the literature. We then propose a new anonymity metric that takes entire profiles into consideration in arriving at the degree of anonymity associated with a probability distribution. The comprehensive approach of our metric results in correct measurement. A detailed comparison of our new metric, especially with the popular metrics based on Shannon entropy, gives the rationale for, and the degree of, disagreement between these approaches.

TABLE OF CONTENTS

Chapter                                                           Page

1 INTRODUCTION . . . 1
  1.1 Thesis Contributions . . . 1
  1.2 Thesis Organization . . . 2
2 PRELIMINARIES AND NOTATIONS . . . 4
  2.1 Multisets . . . 4
  2.2 Probability Distributions . . . 4
  2.3 Base-Profiles . . . 6
3 A COMMON GRAPHICAL FRAMEWORK FOR EXISTING ANONYMITY METRICS . . . 10
  3.1 Anonymity Set Size Metrics . . . 10
  3.2 Metrics Based on Shannon Entropy . . . 10
  3.3 Maximal Probability Metric . . . 12
    3.3.1 Norm-Profiles . . . 13
  3.4 Metrics Based on Rényi Entropies . . . 16
  3.5 Euclidean Distance Metric . . . 17
4 AN ANALYSIS OF EXISTING METRICS . . . 19
  4.1 Base-Profile Contours . . . 21
  4.2 Norm-Profile Contours . . . 25
5 A NEW, GLOBAL ANONYMITY METRIC . . . 30
  5.1 Raw Metric . . . 30
  5.2 Normalized Metric . . . 32
  5.3 Differences from Metrics Based on Shannon Entropy . . . 33
6 CONCLUSIONS . . . 37
REFERENCES . . . 39

LIST OF FIGURES

Figure                                                            Page

2.1 Base-profiles of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n . . . 9
3.1 A graphical representation of the entropy-based metric of Serjantov and Danezis [6] as −B′_D(1) . . . 11
3.2 B_n̄(x) curves, for n = 1, 2, and 3 . . . 13
3.3 Base- and norm-profiles for n̂ and an arbitrary D ∈ ∆n . . . 14
3.4 Range of all Rényi entropies for a distribution D ∈ ∆n . . . 17
3.5 Square of the Euclidean distance metric, shown as an asterisk (∗) . . . 18
4.1 Two distributions with the same normalized Shannon entropy, but significantly different maximal probabilities . . . 21
4.2 Base-profiles intersecting at x > 1: (a) one such intersection, (b) two such intersections . . . 24
4.3 [[∆2]] and loci N_x^α for an arbitrary value of α, and some example values of x . . . 27
4.4 [[∆3]] and its intersections with some loci . . . 28
5.1 Metrics S(D) and R(D) on [[∆2]] . . . 32
5.2 Areas considered for a system's normalized degree of anonymity . . . 33
5.3 Initial base-profile slopes and alternating regions of dominance of distributions D and E . . . 35

CHAPTER 1
INTRODUCTION

A fundamental problem in the area of anonymity systems is to measure the amount of sender anonymity that remains for a particular message sent via a system in the aftermath of an attack. It has become customary to consider attacks that result in probabilities, for each system user, of being the actual sender of that message. Over the years, several metrics have been proposed, ranging from simple ones based on the size of the underlying anonymity set to information-theoretic ones based on entropy. Some surveys of metrics can be found in Diaz [1] and Kelly et al. [2]. While each metric has usually arrived with a more convincing rationale than the ones preceding it, truth be told, none of them has yet been completely satisfactory. For each of these metrics, irksome instances of probability distributions still exist on which the metric's result does not conform to our intuition of the system's level of anonymity.

1.1 Thesis Contributions

The contributions of this thesis are twofold. We first construct a graphical framework suitable for placement of all popular anonymity metrics to date. By placing these metrics in our framework, we illustrate that, in their attempt to measure anonymity, the existing metrics base their decisions on just some finite aspects of a given probability distribution, whose information content is in fact potentially infinite. This provides an understanding, in fact a graphic visualization, of why none of the existing metrics works correctly for all probability distributions.
We then propose a new anonymity metric that arrives at its values after taking into account the entire information content of a probability distribution. The comprehensive approach of our metric results in correct measurement. The results of this thesis have been submitted for publication as Bagai and Jiang [3].

The specific existing metrics analyzed in this thesis are the following:

(a) the anonymity set size metric of Chaum [4];
(b) the reduced anonymity set size metric of Kesdogan, Egner and Büschkes [5];
(c) the Shannon entropy based metric of Serjantov and Danezis [6];
(d) the normalized Shannon entropy based metric of Diaz et al. [7];
(e) the maximal probability metric of Tóth, Hornák and Vajda [8];
(f) the Rényi entropies based metric family of Clauß and Schiffner [9]; and
(g) the Euclidean distance metric of Andersson and Lundin [10].

The metrics based on Shannon entropy (items c and d in the above list) have been the most popular for a number of years. Much of our analysis, and of the comparison with our own metric, is therefore devoted to them.

At the core of our graphical framework are two related functions, over the infinite domain of nonnegative real values, that are induced by a given probability distribution. We call these functions the base-profile and the norm-profile of the distribution. Our analysis shows that all of the above metrics essentially consider some local properties of these profiles, and do not take their entire contours into account. A global consideration of these contours, especially of intersections between them, is, as we reveal, essential for a metric to always produce the correct anonymity measure. The new metric we then propose does exactly that and, in that respect, it is an improved metric.

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 contains mathematical preliminaries and notations used in this thesis. This chapter also introduces base-profiles, one of the two important structures in our graphical framework.
Chapter 3 presents an overview of the existing anonymity metrics and places each of them within our common framework. Norm-profiles, the other important structure in our framework, are introduced in this chapter at an appropriate point. Chapter 4 exposes the inadequacy of the local observations underlying existing metrics by studying the phenomenon of profile intersections, which such observations inevitably ignore. Our new metric, based on global observations of profile contours, is then presented in Chapter 5. An analysis of the similarities and differences between our proposed metric and the existing metrics based on Shannon entropy occupies much of this chapter. Finally, Chapter 6 concludes our results and mentions some directions for future work.

CHAPTER 2
PRELIMINARIES AND NOTATIONS

In this chapter, we give the mathematical preliminaries and notations used in this thesis. Probability distributions are often defined as sequences of nonnegative real values that add up to 1. From the point of view of anonymity metrics, however, the order of the values in such a sequence is not important; only the actual values, along with the number of occurrences of each value, are relevant. We therefore immediately depart from standard practice and define probability distributions as multisets, as that view retains just the amount of information needed for our purpose.

2.1 Multisets

Let a multiset M over any set S be a function M : S → N, where N is the set {0, 1, 2, . . .} of natural numbers. For any σ ∈ S, M(σ) is the frequency of σ in the multiset M. The cardinality of M, denoted |M|, is the sum Σ{M(σ) | σ ∈ S} of the frequencies in M of all members of S. We employ double brackets [[. . .]] as the notation for multisets, in which each member of the underlying set occurs (in any order) exactly as many times as its frequency in the multiset.
For example, if the underlying set is S = {a, b, c}, then [[a, a, a, c]] denotes the multiset of cardinality 4 in which the frequencies of a, b, and c are 3, 0, and 1, respectively. The same multiset is also denoted by [[a, a, c, a]], [[a, c, a, a]], or [[c, a, a, a]].

2.2 Probability Distributions

A probability distribution (or just distribution) on any set U is a multiset D over the closed interval [0, 1] of real numbers, such that:

|U| = |D|, and Σ{σ · D(σ) | σ ∈ [0, 1]} = 1.

In the context of anonymity systems, U is typically a set of users, any of whom may have sent a certain message via an anonymity system, and D is a distribution on U, arrived at by an attacker, containing the probability of each of those users being the originator of that message.

Example 2.2.1 Suppose U contains 6 users. Then, the distribution [[0.5, 0.2, 0.2, 0.1, 0, 0]] assigns a 50% probability to one of those users of being the message originator, 20% each to two other users, 10% to another user, and zero probability to the two remaining users.

We limit ourselves to systems with a finite number of users, thus to distributions with finite cardinalities. For any natural number n ≥ 1, let ∆n be the set of all distributions of cardinality n. It is easily seen that ∆1 has just one distribution, namely [[1]], and that for all n > 1, ∆n is uncountably infinite. Two special distributions in ∆n, namely n̂ and n̄, are defined as follows:

n̂(σ) = n − 1 if σ = 0; 1 if σ = 1; 0 otherwise.
n̄(σ) = n if σ = 1/n; 0 otherwise.

In other words, n̂ is the peak distribution [[1, 0, 0, . . . , 0]], which contains a single 1 and exactly (n − 1) occurrences of 0. This corresponds to the attacker having determined, with full certainty, the actual sender of the underlying message out of n potential senders. On the other hand, n̄ is the even distribution [[1/n, 1/n, . . . , 1/n]], in which the value 1/n occurs n times.
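Viewed this way, distributions are easy to experiment with programmatically. The sketch below is our own illustration, not part of the thesis; all helper names are hypothetical. It uses Python's collections.Counter as the multiset type:

```python
from collections import Counter

def is_distribution(D, n):
    """A multiset over [0, 1] is a distribution of cardinality n when its
    frequencies sum to n and its values, weighted by frequency, sum to 1."""
    return (sum(D.values()) == n
            and abs(sum(s * c for s, c in D.items()) - 1.0) < 1e-9)

def peak(n):
    """The peak distribution n-hat: a single 1 and exactly n - 1 zeros."""
    return Counter({1.0: 1, 0.0: n - 1})

def even(n):
    """The even distribution n-bar: the value 1/n occurring n times."""
    return Counter({1.0 / n: n})

# The distribution of Example 2.2.1; the order of listing is irrelevant.
D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])
```

Because a Counter records only values and frequencies, the listings [[a, a, c, a]] and [[c, a, a, a]] of the same multiset compare equal, exactly as the notation intends.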
This distribution corresponds to the attacker having made no headway in determining the actual sender of the message. In general, an attack results in a distribution D ∈ ∆n that lies somewhere between the two extreme distributions, n̄ (no information gained by the attack) and n̂ (full information gained by the attack). All anonymity metrics considered in this thesis attempt to measure the amount of anonymity left in the system in the aftermath of such an attack.

2.3 Base-Profiles

A central notion in our graphical framework is that of the base-profile of any distribution D ∈ ∆n, which is a function B_D : R → R, where R is the set of all real values. It is defined as:

B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]}.

Example 2.3.1 Let D ∈ ∆6 be the distribution [[0.5, 0.2, 0.2, 0.1, 0, 0]] of Example 2.2.1. Then, the base-profile function of D is:

B_D(x) = 0.5^x · 1 + 0.2^x · 2 + 0.1^x · 1 + 0^x · 2.

Throughout this thesis, we interpret 0^0 as 1. We will be particularly interested in the values of the base-profile function B_D(x) for nonnegative values of x and, among those, mainly for x ≥ 1. An interesting observation is that, in general, this function may be discontinuous at x = 0, as implied by the following proposition, which follows immediately from the definitions so far:

Proposition 2.3.1 For any distribution D ∈ ∆n:
(a) B_D(0) = |D| = n.
(b) lim_{x→0} B_D(x) is the number of nonzero values in D, i.e., Σ{D(σ) | σ ≠ 0}.
(c) B_D(1) = 1, i.e., all base-profiles intersect at x = 1, with value 1.
(d) lim_{x→∞} B_D(x) = 1 if D = n̂, and 0 otherwise.
(e) If D ≠ n̂ and D ≠ n̄, then for all x > 1, B_n̄(x) < B_D(x) < B_n̂(x), i.e., all base-profiles lie between the extreme ones B_n̄(x) and B_n̂(x).

Proof
(a) B_D(0) = Σ{σ^0 · D(σ) | σ ∈ [0, 1]} = Σ{1 · D(σ) | σ ∈ [0, 1]} = |D|, and |D| = n since D ∈ ∆n. Notice that here we make use of our convention 0^0 = 1.
(b) lim_{x→0} B_D(x) = lim_{x→0} Σ{σ^x · D(σ) | σ ∈ [0, 1]} = Σ{D(σ) · lim_{x→0} σ^x | σ ∈ [0, 1]}.
We now split the sum into two parts, for σ = 0 and σ ≠ 0, yielding D(0) · 0 + Σ{D(σ) · σ^0 | σ ≠ 0} = Σ{D(σ) | σ ≠ 0}.
(c) B_D(1) = Σ{σ^1 · D(σ)}, which, by the definition of a distribution, equals 1.
(d) First, suppose D = n̂. Then, by definition, B_n̂(x) = Σ{σ^x · n̂(σ) | σ ∈ [0, 1]}. But n̂(1) = 1, n̂(0) = n − 1, and n̂(σ) = 0 otherwise, so this sum reduces to B_n̂(x) = 1. Evidently, lim_{x→∞} B_n̂(x) = lim_{x→∞} 1 = 1.
Next, suppose D ≠ n̂. Then D(σ) is nonzero only for σ < 1. So, B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]} = Σ{σ^x · D(σ) | σ < 1}, and lim_{x→∞} B_D(x) = lim_{x→∞} Σ{σ^x · D(σ) | σ < 1} = Σ{D(σ) · lim_{x→∞} σ^x | σ < 1}. Since lim_{x→∞} σ^x = 0 for every 0 ≤ σ < 1, the sum vanishes.
(e) Fix x > 1. First, we show that B_D(x) < B_n̂(x). Indeed, B_D(x) = Σ{σ^x · D(σ) | σ ∈ [0, 1]} < Σ{σ · D(σ) | σ ∈ [0, 1]} = 1. But as proved in (d), B_n̂(x) = 1; hence, B_D(x) < 1 = B_n̂(x).
Second, we show that B_n̄(x) < B_D(x). The proof uses the method of Lagrange multipliers to show that n̄ minimizes B_D(x) over all distributions D ∈ ∆n. Fix a distribution D = [[p_1, p_2, · · · , p_n]] in ∆n, where the p_i are not necessarily distinct. Define f(D) = p_1^x + · · · + p_n^x, and let g(D) = p_1 + p_2 + · · · + p_n = 1 be the normalization constraint. The Lagrange auxiliary function is L(D, λ) = f(D) + λ · (g(D) − 1), in which p_1, p_2, · · · , p_n and λ are the arguments. Taking the partial derivative of L with respect to each argument and setting it equal to 0, we obtain:

∂L/∂p_1 = x · p_1^{x−1} + λ = 0
∂L/∂p_2 = x · p_2^{x−1} + λ = 0
· · ·
∂L/∂p_n = x · p_n^{x−1} + λ = 0
∂L/∂λ = p_1 + p_2 + · · · + p_n − 1 = 0

This system consists of n + 1 equations.
From the first n equations, we know that, for 1 ≤ i ≤ n,

p_i = (−λ/x)^{1/(x−1)}.  (2.1)

Substituting (2.1) into the last equation of the system gives:

n · (−λ/x)^{1/(x−1)} = 1
(−λ/x)^{1/(x−1)} = 1/n
−λ/x = 1/n^{x−1}
λ = −x/n^{x−1}.

Using (2.1) once more, we can now solve the system:

p_1 = p_2 = · · · = p_n = 1/n, with λ = −x/n^{x−1}.

Hence, D = n̄ minimizes f(D) subject to g(D) = 1.

In the course of proving Proposition 2.3.1(e), we have also proved the following proposition:

Proposition 2.3.2 For all n ≥ 1 and x > 0, B_n̂(x) = 1.

Figure 2.1 shows the base-profile functions of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n. From Proposition 2.3.1, we know that B_D(x) lies between the extreme base-profiles B_n̂(x) and B_n̄(x), and intersects with either one of them at any x > 1 if and only if the profiles are identical, i.e., D is that extreme distribution (n̂ or n̄).

Figure 2.1: Base-profiles of three distributions in ∆n, namely n̂, n̄, and an arbitrary D ∈ ∆n.

CHAPTER 3
A COMMON GRAPHICAL FRAMEWORK FOR EXISTING ANONYMITY METRICS

In this chapter, we show that well-known techniques for measuring a system's degree of anonymity, given a distribution D ∈ ∆n resulting from an attack, in essence just look at different local aspects of the base-profile B_D of that distribution.

3.1 Anonymity Set Size Metrics

One of the first anonymity metrics, proposed by Chaum [4], was the anonymity set size, i.e., the size of the set of users that could have sent a particular message. As stated in Proposition 2.3.1, in our framework this value is B_D(0), for any given distribution D. Proposition 2.3.1 also shows that this value is in fact independent of D. Kesdogan, Egner and Büschkes [5] improved upon this metric by defining the anonymity set as containing just those users whose probability in D is nonzero. The size of this set is then used as the measure of anonymity.
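Both set-size metrics are single readings of the base-profile, which is straightforward to evaluate numerically. The following is a minimal sketch of ours, not the thesis's code; the helper names are hypothetical, and Python's 0.0 ** 0 conveniently evaluates to 1.0, matching the convention 0^0 = 1:

```python
from collections import Counter

def base_profile(D, x):
    """B_D(x) = sum over the values sigma in D of sigma**x * D(sigma)."""
    return sum((sigma ** x) * freq for sigma, freq in D.items())

D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])   # Example 2.2.1, n = 6

chaum    = base_profile(D, 0)      # anonymity set size: B_D(0) = n = 6
kesdogan = base_profile(D, 1e-9)   # ~ lim_{x -> 0+} B_D(x) = 4 nonzero values
```

Approximating the one-sided limit by evaluating at a tiny positive x works here because every nonzero σ satisfies σ^x → 1 as x → 0+, while 0^x stays 0 for all x > 0.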
Again, as seen in Proposition 2.3.1, within our framework this value is:

lim_{x→0} B_D(x).

Both of the above metrics ignore the potentially different probabilities contained in D for different members of the anonymity set. The metrics considered in the following sections are well known to provide a more accurate measure of anonymity by taking those probabilities into account.

3.2 Metrics Based on Shannon Entropy

Anonymity metrics based on the Shannon entropy [12] of the discrete random variable with probability distribution D have been popular for a number of years. In their well-known work, Serjantov and Danezis [6] proposed adopting this entropy value, i.e.,

S(D) = −Σ{σ · log(σ) · D(σ) | σ ∈ [0, 1]},

as a measure of anonymity. It is worth mentioning that the base of the logarithm in the above expression is not important, as its choice corresponds only to the choice of a unit of measurement. Also, 0 · log(0) is consistently interpreted as 0. In our framework of base-profiles, the above is simply the negation of the slope of the curve B_D(x) at the value x = 1, as shown below. Since, for any constant σ, (d/dx) σ^x = σ^x · log(σ), the derivative of the base-profile of D is:

B′_D(x) = (d/dx) B_D(x) = Σ{σ^x · log(σ) · D(σ) | σ ∈ [0, 1]}.

By substituting x = 1 in the above, we get:

S(D) = −B′_D(1).

For any distribution D ∈ ∆n, the minimum value of S(D) is −B′_n̂(1) = 0, while its maximum value is −B′_n̄(1), which is easily seen to be log(n). Figure 3.1 graphically depicts this anonymity metric of Serjantov and Danezis [6].

Figure 3.1: A graphical representation of the entropy-based metric of Serjantov and Danezis [6] as −B′_D(1).

A shortcoming of this metric is that it is based only on the nonzero values contained in a distribution. For example, let D1 = [[1/2, 1/2]] and D2 = [[1/2, 1/2, 0]]. Then, S(D1) = S(D2) = log(2). A completely toothless attack that makes no inroads at breaking anonymity will result in D1.
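The identity S(D) = −B′_D(1), and the tie S(D1) = S(D2) just noted, can be checked numerically. The sketch below is our own illustration (base_profile, shannon, slope_at_one, and the finite-difference step h are all our choices, not the thesis's):

```python
import math
from collections import Counter

def base_profile(D, x):
    """B_D(x) = sum of sigma**x * D(sigma) over the values sigma in D."""
    return sum((s ** x) * c for s, c in D.items())

def shannon(D):
    """S(D) = -sum of sigma * log(sigma) * D(sigma), with 0 * log(0) = 0."""
    return -sum(s * math.log(s) * c for s, c in D.items() if s > 0)

def slope_at_one(D, h=1e-6):
    """Central-difference estimate of the base-profile slope B'_D(1)."""
    return (base_profile(D, 1 + h) - base_profile(D, 1 - h)) / (2 * h)

D1 = Counter([0.5, 0.5])        # two users, nothing learned by the attack
D2 = Counter([0.5, 0.5, 0.0])   # three users, one completely ruled out
```

Both distributions give S = log 2, and in each case −B′_D(1), estimated by the central difference, agrees with S(D) to within the finite-difference error.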
On the other hand, an attack that results in D2 has achieved partial success towards its goal by completely ruling out one of the three possible users, thus lowering anonymity. A metric that yields the same anonymity measure in both cases is subject to misinterpretation if the maximum possible measures of the respective systems are not taken into account. In another well-known work, Diaz et al. [7] proposed a similar metric that is additionally normalized by the maximum possible measure, and can therefore be readily used to compare systems with different numbers of users. They proposed using the value of the expression

d(D) = S(D) / log(n)

as the measure of anonymity. Clearly, this is always a real value in the closed interval [0, 1]. As B′_n̄(1) = −log(n), in our framework this normalized metric is essentially the ratio of the slopes of the curves B_D(x) and B_n̄(x) at the value x = 1, i.e.,

d(D) = B′_D(1) / B′_n̄(1).

Figure 3.2 shows the B_n̄(x) curves for the values n = 1, 2, and 3, and graphically illustrates the expected change of the normalization factor, B′_n̄(1), as n increases.

Figure 3.2: B_n̄(x) curves, for n = 1, 2, and 3.

3.3 Maximal Probability Metric

Tóth, Hornák and Vajda [8] argued for adopting the maximal probability contained in a distribution D, i.e.,

MAX(D) = max{σ ∈ [0, 1] | D(σ) ≥ 1},

as an anonymity metric because, from the users' point of view, this worst-case measure may be more important than the average case considered by the above metrics based on Shannon entropy.

3.3.1 Norm-Profiles

In order to place this metric within our framework, we first define the norm-profile of any distribution D as the function N_D : R → R, given by:

N_D(x) = (B_D(x))^{1/x}.

We now show that the maximal probability metric is the limit value of the norm-profile.

Proposition 3.3.1 For any distribution D ∈ ∆n, MAX(D) = lim_{x→∞} N_D(x).

Proof Let MAX(D) = m. By definition,

N_D(x) = (Σ{σ^x · D(σ) | σ ∈ [0, 1]})^{1/x}.
Since m is the largest value in D, and Σ{D(σ) | σ ∈ [0, 1]} = |D| = n, we have that for all x ≥ 1,

N_D(x) ≤ (m^x · n)^{1/x} = m · n^{1/x}.

Thus, lim_{x→∞} N_D(x) ≤ lim_{x→∞} m · n^{1/x} = m. Now, since D(m) ≥ 1, we have that for all x > 0,

m = (m^x)^{1/x} ≤ (m^x · D(m))^{1/x} ≤ N_D(x).

Thus, m ≤ lim_{x→∞} N_D(x), and the proposition holds.

Figure 3.3 shows the above result graphically, along with another important relationship between the two profiles, described below.

Figure 3.3: Base- and norm-profiles for n̂ and an arbitrary D ∈ ∆n.

The following proposition contains some easy to verify, yet important, properties of the two kinds of profiles in our framework:

Proposition 3.3.2 For any distribution D ∈ ∆n:
(a) lim_{x→0} N_D(x) = 1 if D = n̂, and ∞ otherwise.
(b) N_D(1) = 1, i.e., just as base-profiles, all norm-profiles also intersect at x = 1, with value 1.
(c) B′_D(1) = N′_D(1), i.e., the metrics of Section 3.2 based on Shannon entropy can be expressed and viewed in terms of the norm-profile as well:

S(D) = −B′_D(1) = −N′_D(1), and d(D) = B′_D(1)/B′_n̄(1) = N′_D(1)/N′_n̄(1).

Our proof relies on the following:

Lemma 3.3.1 Suppose h(x) = f(x)^{g(x)}, where f(x) is strictly positive. Then

h′(x) = f(x)^{g(x)} · [ g′(x) · ln f(x) + g(x) · f′(x)/f(x) ].

Proof We have that

h(x) = f(x)^{g(x)}
⇐⇒ ln h(x) = g(x) · ln f(x)
⇐⇒ (d/dx) ln h(x) = (d/dx)[ g(x) · ln f(x) ]
⇐⇒ (1/h(x)) · h′(x) = g′(x) · ln f(x) + g(x) · f′(x)/f(x)
⇐⇒ h′(x) = h(x) · [ g′(x) · ln f(x) + g(x) · f′(x)/f(x) ].

Making the substitution h(x) = f(x)^{g(x)} gives the desired result.

Proof (of Proposition 3.3.2)
(a) First, suppose D = n̂. Then B_D(x) = B_n̂(x) = 1 for all x ≠ 0. So, lim_{x→0} N_n̂(x) = lim_{x→0} (B_D(x))^{1/x} = lim_{x→0} 1^{1/x} = 1.
Next, suppose D ≠ n̂, and put k = lim_{x→0} B_D(x). As proved earlier, k is the number of nonzero values in D; if D ≠ n̂, we have k > 1. So, lim_{x→0} N_D(x) = lim_{x→0} (B_D(x))^{1/x} = lim_{x→0} k^{1/x}.
Since 1/x grows arbitrarily large as x → 0+, and since k > 1, the limit is infinite.
(b) As proved earlier, B_D(1) = 1. Hence, since N_D(x) = (B_D(x))^{1/x}, we get N_D(1) = 1, as required.
(c) Put f(x) = B_D(x) and g(x) = 1/x. Then N_D(x) = f(x)^{g(x)} and, by the above lemma,

N′_D(x) = B_D(x)^{1/x} · [ (−1/x²) · ln B_D(x) + B′_D(x)/(x · B_D(x)) ].

Plugging in x = 1 gives

N′_D(1) = B_D(1) · [ −ln B_D(1) + B′_D(1)/B_D(1) ].

As proved earlier, B_D(1) = 1, giving N′_D(1) = B′_D(1).

3.4 Metrics Based on Rényi Entropies

Clauß and Schiffner [9] proposed, as a framework for some anonymity metrics, the parametric family of entropy measures of Rényi [11]. For any distribution D ∈ ∆n, this family is given by:

R_α(D) = (1/(1 − α)) · log( Σ{σ^α · D(σ) | σ ∈ [0, 1]} ),

where α ∈ [0, 1) ∪ (1, ∞) is a real-valued parameter of the family. The maximum value of R_α(D) occurs at α = 0. It is immediately seen that R_0(D) = log(n). Thus,

R_0(D) = −B′_n̄(1) = −N′_n̄(1).

It is well known that Shannon entropy is a special case of Rényi entropies, as α approaches 1, i.e., lim_{α→1} R_α(D) = S(D). A proof of this using l'Hôpital's rule is given in [9]. We therefore have that:

lim_{α→1} R_α(D) = −B′_D(1) = −N′_D(1).

The minimum value of R_α(D) is obtained as α approaches ∞. In order to represent this value in our framework, we first need to extend our functions B_n̄(x) and N_n̄(x), which were defined earlier only for natural numbers n ≥ 1. Now, for any real value µ > 0, we define:

B_µ̄(x) = (1/µ)^x · µ, and N_µ̄(x) = (B_µ̄(x))^{1/x}.

We make two observations about these generalized definitions. First, although the above expression defining B_µ̄(x) can be simplified to 1/µ^{x−1}, we intentionally define it as above to make it easy to see that these definitions coincide with the earlier ones whenever µ is an integer. Second, note that the profile functions were originally defined just for distributions, and µ̄ is a distribution only when µ is a positive integer.
However, the B_µ̄(x) and N_µ̄(x) functions, as defined above, make perfect sense even for non-integral values of µ. Given these two observations, we adopt these definitions, albeit at the expense of a slight abuse of notation. Just as in Proposition 3.3.1, it can be shown that:

MAX(D) = lim_{x→∞} N_µ̄(x), where µ = 1/MAX(D).

Also, N′_µ̄(1) = log(MAX(D)) for this µ. The minimum value of R_α(D), i.e., lim_{α→∞} R_α(D), is shown in Clauß and Schiffner [9] to be −log(MAX(D)). We thus obtain:

lim_{α→∞} R_α(D) = −B′_µ̄(1) = −N′_µ̄(1), with µ = 1/MAX(D).

Figure 3.4 shows the range of all Rényi entropies for D.

Figure 3.4: Range of all Rényi entropies for a distribution D ∈ ∆n.

As is evident from the figure, all entropy measures in this parametric family are negations of the slopes, at the value x = 1, of some norm-profile curves in the vicinity of N_D(x). Shannon entropy, a particular member of this family, is the negation of the slope of N_D(x) itself at x = 1.

3.5 Euclidean Distance Metric

Andersson and Lundin [10] suggested as a metric the Euclidean distance between the distributions D ∈ ∆n and n̄, when these distributions are viewed as points in the space R^n. This distance is given by:

√( Σ{ (σ − 1/n)² · D(σ) | σ ∈ [0, 1] } ).

By algebraic simplification and the definitions of the profiles, the above expression can be seen to equal:

√( B_D(2) − lim_{x→∞} N_n̄(x) ),

where lim_{x→∞} N_n̄(x) = 1/n. The asterisk (∗) in Figure 3.5 graphically shows the argument of the square root in the above expression. It illustrates that this metric depends on just the value of the base-profile B_D(x) at the value x = 2.

Figure 3.5: Square of the Euclidean distance metric, shown as an asterisk (∗).

CHAPTER 4
AN ANALYSIS OF EXISTING METRICS

In the previous chapter, we saw that existing anonymity metrics essentially consider just some local aspects of a distribution's base- and/or norm-profile.
We now show that such local observation is insufficient and that, by not taking entire profile contours into consideration, these metrics end up producing counterintuitive and incorrect evaluations in some cases. We begin by summarizing the formulations, within the framework constructed in Chapter 3, of the existing anonymity metrics for any given distribution D ∈ ∆n.

Anonymity set size metric of Chaum [4]: Number of users in the system, given by: n = B_D(0).

Reduced anonymity set size metric of Kesdogan, Egner and Büschkes [5]: Number of users in the system with a nonzero probability in D, given by: lim_{x→0} B_D(x).

Shannon entropy based metric of Serjantov and Danezis [6]: Negation of the slope of the profiles of D at x = 1, given by: S(D) = −B′_D(1) = −N′_D(1).

Normalized Shannon entropy based metric of Diaz et al. [7]: Ratio of the slopes of the profiles of D and n̄ at x = 1, given by: d(D) = B′_D(1)/B′_n̄(1) = N′_D(1)/N′_n̄(1).

Maximal probability metric of Tóth, Hornák and Vajda [8]: Largest probability in D, given by: MAX(D) = lim_{x→∞} N_D(x).

Rényi entropies based metric family of Clauß and Schiffner [9]: Negations of slopes at x = 1 of profiles close to the profiles of D, given by: R_0(D) = −B′_n̄(1) = −N′_n̄(1); lim_{α→1} R_α(D) = −B′_D(1) = −N′_D(1); and lim_{α→∞} R_α(D) = −B′_µ̄(1) = −N′_µ̄(1), where µ = 1/MAX(D).

Euclidean distance metric of Andersson and Lundin [10]: Euclidean distance between D and n̄, when viewed as points in the space R^n, given by: √( B_D(2) − lim_{x→∞} N_n̄(x) ).

As is evident from the above summary, each of these metrics measures some local property of one or both profiles of the distribution D. From an anonymity level point of view, the information in a distribution is far richer than what can be captured by such a local observation, dictating the need for a metric that measures some global profile property.
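The entire summary above can be exercised on the distribution of Example 2.2.1. The following sketch is our own illustration, not the thesis's code; the helper names, the finite-difference step h, and the stand-in x = 500 for x → ∞ are all arbitrary choices:

```python
import math
from collections import Counter

def B(D, x):
    """Base-profile B_D(x); Python's 0.0**0 is 1.0, as the convention requires."""
    return sum((s ** x) * c for s, c in D.items())

def N(D, x):
    """Norm-profile N_D(x) = B_D(x)**(1/x)."""
    return B(D, x) ** (1.0 / x)

def renyi(D, alpha):
    """Renyi entropy R_alpha(D) = log(B_D(alpha)) / (1 - alpha), alpha != 1."""
    return math.log(B(D, alpha)) / (1.0 - alpha)

D = Counter([0.5, 0.2, 0.2, 0.1, 0.0, 0.0])   # Example 2.2.1

n = int(B(D, 0))                               # Chaum: anonymity set size
h = 1e-6
S = -(B(D, 1 + h) - B(D, 1 - h)) / (2 * h)     # Serjantov-Danezis: -B'_D(1)
d = S / math.log(n)                            # Diaz et al.: normalized entropy
mx = N(D, 500.0)                               # Toth et al.: ~ lim N_D(x) = MAX(D)
euclid = math.sqrt(B(D, 2) - 1.0 / n)          # Andersson-Lundin distance
r0 = renyi(D, 0.0)                             # Clauss-Schiffner family at alpha = 0
```

Here n = 6, MAX(D) comes out close to 0.5, r0 equals log 6 = −B′_n̄(1), renyi(D, α) for α near 1 approaches S, and euclid matches the direct distance to the even distribution.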
It is for this reason that, for each of these metrics, examples of distributions exist in the literature on which that metric gives counterintuitive results and is out of line with one or more of the other metrics. For instance, consider the following example, adapted from [8].

Example 4.0.1 For a system with n = 101 users, let A and B be the following distributions:

A(σ) = 1 if σ = 0.5; 100 if σ = 0.005; 0 otherwise.
B(σ) = 10 if σ = 0.086518; 91 if σ = 0.001481; 0 otherwise.

Then, d(A) ≈ d(B) ≈ 0.649, whereas MAX(A) = 0.5 > 0.086518 = MAX(B), i.e., the normalized Shannon entropies of A and B are the same, although A exposes a single user to a far greater extent than B does.

The phenomenon highlighted by the above example is depicted in Figure 4.1 where, while N′_A(1) = N′_B(1), the limits lim_{x→∞} N_A(x) and lim_{x→∞} N_B(x) are significantly far apart.

Figure 4.1: Two distributions with the same normalized Shannon entropy, but significantly different maximal probabilities.

In fact, as we saw in Section 3.4 on Rényi entropies, for any given combination of n, s and µ such that log(1/n) ≤ s ≤ log(µ), there exists a distribution D ∈ ∆n for which MAX(D) = µ and B′_D(1) = N′_D(1) = s. In other words, countless distributions usually exist, all with the same Shannon entropy, but over a wide range of maximal probabilities that, as in Example 4.0.1, make us assign to them disturbingly different intuitive anonymity levels. Furthermore, as shown later in Sections 4.1 and 4.2, it is even possible for Shannon entropy to suggest higher anonymity for a distribution that one would intuitively assign a lower reading to (in comparison with another distribution).

4.1 Base-Profile Contours

It is instructive to recapitulate that the base-profiles of all distributions in ∆n lie somewhere between the extreme profiles B_n̂(x) and B_n̄(x).
Loosely speaking, the closer the base-profile B_D(x) of a given distribution D is to B_n(x), the higher the anonymity level that should be associated with D. The same observation holds for norm-profiles as well. We just saw that existing anonymity metrics essentially attempt to measure this closeness of B_D(x) to B_n(x) by observing some local property of one or both profiles of D. We now show that, as profile contours fluctuate with x, such local observation is inadequate. In the next chapter, we will construct a metric that takes the entire profiles of D into account. Such a global consideration of profiles results in a metric that is consistently correct. We begin by studying some fundamental properties of profiles. The following proposition about their contours follows immediately from the definitions:

Proposition 4.1.1 For any distribution D ∈ ∆n:
(a) For all x > 0, B′_D(x) ≤ 0 and N′_D(x) ≤ 0, i.e., both profiles are monotonically nonincreasing.
(b) For all x > 0, B″_D(x) ≥ 0 and N″_D(x) ≥ 0, i.e., the curves of both profiles are concave upwards.

Proof (a) Since B_D(x) = Σ{ σ^x · D(σ) | σ ∈ [0, 1] }, we have B′_D(x) = Σ{ σ^x · ln σ · D(σ) }. For any x > 0 and σ ∈ (0, 1], ln σ is nonpositive, σ^x is positive, and D(σ) is a nonnegative integer; hence the entire sum is nonpositive, so that B_D(x) is monotonically nonincreasing for x > 0. Now suppose 0 < x1 ≤ x2. As shown above, B_D(x1) ≥ B_D(x2). Since x2 > 0, it follows that [B_D(x1)]^(1/x2) ≥ [B_D(x2)]^(1/x2). But since x2 ≥ x1, we also have that [B_D(x1)]^(1/x1) ≥ [B_D(x1)]^(1/x2). Combining these two inequalities gives [B_D(x1)]^(1/x1) ≥ [B_D(x2)]^(1/x2), i.e., N_D(x1) ≥ N_D(x2). Hence, N_D(x) is monotonically nonincreasing for x > 0.

(b) Since B_D(x) = Σ{ σ^x · D(σ) | σ ∈ [0, 1] }, we have B″_D(x) = Σ{ ln²σ · σ^x · D(σ) }, and an argument similar to the one above shows that this sum is nonnegative; hence, B″_D(x) is nonnegative for x > 0, i.e., B′_D(x) is nondecreasing there.
To show that N″_D(x) ≥ 0, it suffices to show that N′_D(x) is nondecreasing, i.e., that N′_D(x2) − N′_D(x1) ≥ 0 for 0 < x1 < x2. By Lemma 3.3.1, we have that

N′_D(x) = N_D(x) · [ B′_D(x)/(x·B_D(x)) − ln B_D(x)/x² ].

Hence,

N′_D(x2) − N′_D(x1) = N_D(x2) · [ B′_D(x2)/(x2·B_D(x2)) − ln B_D(x2)/x2² ] − N_D(x1) · [ B′_D(x1)/(x1·B_D(x1)) − ln B_D(x1)/x1² ].

Because N_D(x) is nonincreasing, it follows that

N′_D(x2) − N′_D(x1) ≥ N_D(x2) · [ B′_D(x2)/(x2·B_D(x2)) − ln B_D(x2)/x2² − B′_D(x1)/(x1·B_D(x1)) + ln B_D(x1)/x1² ].

Now put

α = B′_D(x2)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))   and   β = ln B_D(x1)/x1² − ln B_D(x2)/x2².

We have just shown that N′_D(x2) − N′_D(x1) ≥ N_D(x2)·(α + β). Now, α ≥ 0. Indeed,

α = B′_D(x2)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))
  ≥ B′_D(x1)/(x2·B_D(x2)) − B′_D(x1)/(x1·B_D(x1))
  = B′_D(x1) · [ 1/(x2·B_D(x2)) − 1/(x1·B_D(x1)) ]
  ≥ 0,

where the second step follows from the fact that B′_D(x), as we have shown, is nondecreasing, and the last step from the fact that B_D(x) is nonincreasing. We also have that β ≥ 0. Indeed,

β = ln B_D(x1)/x1² − ln B_D(x2)/x2²
  ≥ ln B_D(x1)/x2² − ln B_D(x2)/x2²
  ≥ 0,

where the last step follows because B_D(x) is nonincreasing and ln(x) is nondecreasing. Hence, α and β are both nonnegative. Since N_D(x2) is also nonnegative, it follows that N′_D(x2) − N′_D(x1) ≥ N_D(x2)·(α + β) ≥ 0, which completes the proof.

Figure 4.2: Base-profiles intersecting at x > 1: (a) One such intersection, (b) Two such intersections.

We already know that all profile values are 1 at x = 1. An important characteristic of contours with the properties given in Proposition 4.1.1 is that different such contours can intersect each other for values x > 1. In fact, they can intersect multiple times. The following is a simple, yet informative example.
Example 4.1.1 Consider the following distributions: D = [[7/16, 7/16, 1/8]] and E = [[13/24, 11/48, 11/48]]. Then, B_D(x) and B_E(x) intersect at x = 2, i.e., B_D(2) = B_E(2), because:

2·(7/16)² + (1/8)² = (13/24)² + 2·(11/48)².

Figure 4.2(a) depicts such a situation. In general, if D, E ∈ ∆n are distinct distributions, neither of which is one of the extreme distributions n̂ or n, then their base-profiles, B_D(x) and B_E(x), may intersect zero or more times for values of x > 1. Figure 4.2(b) shows a case with two such intersections. Which of the two intersecting profiles is closer to B_n(x) changes at each intersection.

While comparing such distributions, the metrics based on Shannon and Rényi entropies base their final determination of closeness to B_n(x) on just the slopes of the distribution profiles at the x = 1 intersection point. This misses the bigger picture by ignoring the possibility of other intersections at values x > 1, which always alter the closeness relationship. Example 4.1.1, given above, with D = [[7/16, 7/16, 1/8]] and E = [[13/24, 11/48, 11/48]], illustrates this phenomenon. In it, there is one intersection at x = 2, as in the situation shown in Figure 4.2(a). The entropy-based metrics declare E as the distribution with higher anonymity because B′_E(1) is closer to B′_n(1). But it is clear that the attack resulting in E is stronger than the one resulting in D, because the most suspicious user in E is in a class by itself and is exposed more than the two most suspicious users in D. Thus, D should be assigned a higher anonymity. The metric we propose in Chapter 5 correctly achieves this by taking into account all such later intersections, and beyond. The Euclidean distance metric suffers from this tunnel-vision phenomenon too, as its decision is based on just the value of base-profiles at x = 2. For the same example, it assigns the same anonymity level to both D and E.
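The intersection at x = 2 claimed in Example 4.1.1, and the resulting blindness of the Euclidean distance metric, can be verified with exact rational arithmetic. The following check is ours (using Python's `fractions` module), not part of the thesis.

```python
from fractions import Fraction as F

# The two distributions of Example 4.1.1.
D = [F(7, 16), F(7, 16), F(1, 8)]
E = [F(13, 24), F(11, 48), F(11, 48)]

def B(dist, x):
    """Base-profile B_D(x) at an integer x, computed exactly."""
    return sum(p ** x for p in dist)

assert B(D, 1) == B(E, 1) == 1           # every profile passes through (1, 1)
assert B(D, 2) == B(E, 2) == F(51, 128)  # the intersection at x = 2
assert B(D, 3) != B(E, 3)                # the profiles separate again

# Since B_D(2) = B_E(2), the Euclidean distance metric, which depends
# only on the profile value at x = 2, cannot tell D and E apart.
```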
4.2 Norm-Profile Contours

With the exception of their asymptotic values as x → ∞, norm-profiles of distributions have many similarities with base-profiles. First, recall that the slopes at x = 1 of both profiles of any distribution are identical. Also, as stated in Proposition 4.1.1, norm-profiles are monotonically nonincreasing and concave upwards, just like base-profiles. We now make the following important observation:

Proposition 4.2.1 For any distributions D, E ∈ ∆n and x > 0, N_D(x) < N_E(x) iff B_D(x) < B_E(x).

Proof We have that

B_D(x) < B_E(x)
⟺ B_D(x)/B_E(x) < 1
⟺ [B_D(x)/B_E(x)]^(1/x) < 1^(1/x) = 1
⟺ [B_D(x)]^(1/x) / [B_E(x)]^(1/x) < 1
⟺ [B_D(x)]^(1/x) < [B_E(x)]^(1/x)
⟺ N_D(x) < N_E(x).

Thus, between D and E, the norm-profile of a distribution is closer to N_n(x) for exactly those values of x at which its base-profile is closer to B_n(x). In particular, N_D(x) and N_E(x) intersect at exactly those values of x where B_D(x) and B_E(x) intersect.

Norm-profiles provide a better setup for a closer study of intersections, especially when these profiles are considered for representations of distributions written in our [[. . .]] notation, instead of the actual distributions. In this case, they exhibit properties of L_p-norms on vector spaces (see Trefethen and Bau [13]), and it is for this reason that we chose the term norm-profile for them. Let [[∆n]] denote the set of all representations of distributions in ∆n. Note that [[∆n]] is a subspace of the n-dimensional real unit cube [0, 1]ⁿ, and permuted representations of the same distribution, for example [[0.2, 0.3, 0.5]] and [[0.3, 0.5, 0.2]], are distinct points in [[∆n]]. For any given distribution in ∆n, there may be anywhere from one to n! representations in [[∆n]]. In particular, n has just one representation, while n̂ has n. It can also be seen that [[∆n]] is the convex hull of all representations of the distribution n̂, i.e., the smallest convex set of points in [0, 1]ⁿ containing these representations.
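Propositions 4.1.1 and 4.2.1 both lend themselves to a quick numerical spot-check. The sketch below is ours; it tests monotonicity and midpoint convexity of both profiles on a grid (with small tolerances for floating-point noise), and the order-equivalence of the two profiles, on random distributions.

```python
import random

random.seed(1)

def rand_dist(n):
    """A random point of Delta_n with all probabilities nonzero."""
    ws = [random.random() for _ in range(n)]
    s = sum(ws)
    return [w / s for w in ws]

def B(dist, x):
    """Base-profile B_D(x)."""
    return sum(p ** x for p in dist if p > 0)

def N(dist, x):
    """Norm-profile N_D(x) = B_D(x)^(1/x)."""
    return B(dist, x) ** (1.0 / x)

xs = [0.1 * k for k in range(1, 101)]            # grid on (0, 10]
for _ in range(50):
    D, E = rand_dist(5), rand_dist(5)
    # Proposition 4.1.1: both profiles nonincreasing and concave upwards.
    for prof in (B, N):
        ys = [prof(D, x) for x in xs]
        assert all(y1 >= y2 - 1e-9 for y1, y2 in zip(ys, ys[1:]))
        assert all(ys[i + 1] <= (ys[i] + ys[i + 2]) / 2 + 1e-9
                   for i in range(len(ys) - 2))
    # Proposition 4.2.1: the two profiles order D and E identically.
    x = random.uniform(0.5, 10)
    assert (N(D, x) < N(E, x)) == (B(D, x) < B(E, x))
print("all checks passed")
```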
For example, in the case of n = 2, all legal representations are points on the line segment that connects [[0, 1]] and [[1, 0]]. And for n = 3, they are points on the triangular face that connects [[1, 0, 0]], [[0, 1, 0]], and [[0, 0, 1]]. We originally defined norm-profiles for just distributions, but that definition can be trivially extended to all points in the cube [0, 1]ⁿ. Now, for all x ≥ 1 and α ∈ [0, 1], we define the family of sets N_x^α as follows:

N_x^α = { P ∈ [0, 1]ⁿ | N_P(x) = α }.

Then, N_x^α ∩ [[∆n]] is the set of those representations the norm-profiles of whose underlying distributions intersect at x, with the value α. When the points in N_x^α are plotted graphically, the shape of that locus of points depends upon x, and its size depends upon α. Figure 4.3 shows some of these sets for the 2-dimensional case: [[∆2]] and, for an arbitrary value of α, the sets N_x^α for a few example values of x.

Figure 4.3: [[∆2]] and loci N_x^α for an arbitrary value of α, and some example values of x.

Let us consider any one of these loci, say N_2^α. If α is sufficiently large, S = N_2^α ∩ [[∆2]] will be nonempty (with at most two points). The distributions corresponding to representations in this set are those whose norm-profiles intersect at x = 2 (with value α). Existing anonymity metrics will make a misjudgment while comparing two such distributions. Furthermore, if α is now reduced slightly, to say α′, some locus N_{x′}^{α′} for an x′ > x = 2 will have a nonempty intersection with S. Norm-profiles of distributions denoted by representations in S ∩ N_{x′}^{α′} intersect with each other at x as well as x′. And so on. The astute reader may have noticed that, in two dimensions, any locus has at most two common points with [[∆2]] and, due to symmetry, both of these points represent the same distribution.
Therefore, in the 2-dimensional case, norm-profiles of distinct distributions never intersect and all existing anonymity metrics work fine. This observation is correct. We simply used this case to introduce the loci and illustrate the phenomenon of their intersection with the set of all representations. In all higher dimensions, however, this problem with existing metrics is very real. Figure 4.4 shows the 3-dimensional situation.

Figure 4.4: [[∆3]] and its intersections with some loci.

Due to the difficulty in displaying 3-dimensional loci, unlike Figure 4.3 where entire loci were displayed, Figure 4.4 shows only their intersections with [[∆3]]. The set [[∆3]] is the triangular face shown, and N_2^α is a sphere centered at the origin. If the radius of this sphere (i.e., α) is large enough, the sphere intersects the triangle in a circle, shown in the figure as N_2^α ∩ [[∆3]]. This circle has an infinite number of points on it that, despite symmetry, collectively represent an infinite number of distinct distributions. The points [[7/16, 7/16, 1/8]] and [[13/24, 11/48, 11/48]] of Example 4.1.1, given earlier, lie on one such circle (for α = √(51/128) ≈ 0.63). As these points represent distinct distributions, existing anonymity metrics give incorrect results for them, an observation we already made in Section 4.1.

CHAPTER 5

A NEW, GLOBAL ANONYMITY METRIC

We now present a new anonymity metric that does not err due to a narrow view, which becomes inevitable when only some local aspects of distribution profiles are examined. By being sensitive to entire profile contours and, in particular, to intersections between them, the new metric always correctly determines the underlying system's degree of anonymity.
The fundamental premise upon which our metric is constructed is that profiles are infinite objects, and the complete profiles of the distributions n̂ and n should be considered as the two extremes, for the no-anonymity and full-anonymity cases, respectively. Thus, the anonymity measure assigned to a given distribution D ∈ ∆n should be based upon how close the profiles of D are, on the whole, to those of n, or alternatively, how far they are from those of n̂. Given Proposition 4.2.1, we have the freedom of working with either base-profiles or norm-profiles, because from the point of view of closeness to the corresponding profile of n, both profiles exhibit similar behavior. We choose to adopt base-profiles, because this choice requires less algebraic simplification in arriving at the final metric expression.

5.1 Raw Metric

We propose to use the reciprocal of the area under the base-profile of a distribution as the raw measure of the anonymity left in the system. For any distribution D ∈ ∆n, the system's raw degree of anonymity is thus given by:

R(D) = 1 / ∫₁^∞ B_D(x) dx.

Using x = ∞ as the upper limit of the above definite integral ensures that the entire base-profile contour is considered before arriving at the metric value, the crux of our thesis. The value x = 1 as the lower limit suffices because, as we have seen, base-profiles of all distributions meet at that value of x. The differences among the contours of the base-profiles begin to manifest from thereon, due to differences in the amount of anonymity associated with their corresponding distributions. By applying the definition of base-profiles and elementary rules of integration, the above expression simplifies to:

R(D) = −1 / Σ{ σ·D(σ)/log(σ) | σ ∈ [0, 1] }.

On the surface, there are some similarities between the above raw, area-based metric and the metric of Serjantov and Danezis [6] based on Shannon entropy, reproduced below:

S(D) = −Σ{ σ·log(σ)·D(σ) | σ ∈ [0, 1] }.
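Both R(D) and S(D) reduce to one-line computations over the nonzero probabilities. The sketch below is ours; it assumes log base 2 throughout (the thesis writes log generically), so that both metrics range over [0, log2(n)].

```python
import math

def R(dist):
    """Raw area-based metric: R(D) = -1 / sum(p / log(p)).
    When some p = 1, the area under B_D is infinite, so R(D) = 0."""
    if any(p == 1 for p in dist):
        return 0.0
    return -1.0 / sum(p / math.log2(p) for p in dist if p > 0)

def S(dist):
    """Shannon entropy based metric of Serjantov and Danezis [6]."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

uniform = [0.25] * 4
print(R(uniform), S(uniform))    # both 2.0 = log2(4) at full anonymity
skewed = [0.7, 0.1, 0.1, 0.1]
assert R(skewed) <= S(skewed)    # Proposition 5.1.1 below: R(D) <= S(D)
```

Note that the closed form absorbs the constant introduced by the choice of logarithm base; only the normalized metric of Section 5.2 is fully base-independent.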
First, the same subexpressions appear in the expressions of both R(D) and S(D), albeit in a different arrangement. Second, both metrics range between 0 and log(n). The following proposition follows from the properties of distributions and logarithms:

Proposition 5.1.1 For any distribution D ∈ ∆n, R(D) ≤ S(D), with equality iff D is one of the extreme distributions, i.e., n̂ or n.

Proof Let p_1, p_2, ..., p_n be the probabilities in D, listed with multiplicity. We have that

R(D) ≤ S(D)
⟺ S(D) · (1/R(D)) ≥ 1
⟺ ( Σ_i p_i log p_i ) · ( Σ_j p_j / log p_j ) ≥ ( Σ_i p_i )²
⟺ Σ_i Σ_j p_i p_j (log p_i / log p_j) ≥ Σ_i Σ_j p_i p_j
⟺ Σ_i p_i² + Σ_{i≠j} p_i p_j (log p_i / log p_j) ≥ Σ_i p_i² + 2·Σ_{i<j} p_i p_j
⟺ Σ_{i<j} p_i p_j (log p_i / log p_j + log p_j / log p_i) ≥ Σ_{i<j} 2·p_i p_j,

and the last inequality holds because each term log p_i/log p_j + log p_j/log p_i is of the form (a + 1/a), and for any a > 0, (a + 1/a) ≥ 2.

In other words, our area-based metric is more conservative than the one based on Shannon entropy. Figure 5.1 gives an idea of the values of these metrics, when viewed on [[∆2]]. The situation is similar in all other dimensions.

Figure 5.1: Metrics S(D) and R(D) on [[∆2]].

Note that, while the metric R exhibits a more traditional bell curve, the metric S is far quicker than R to assign higher anonymity values to points close to the zero-anonymity extreme points, i.e., near all representations of n̂. At points close to full anonymity, however, S is less sensitive to change than R. Stated precisely, the slope of S(D) at D = [[0, 1]] is ∞, whereas at D = [[1/2, 1/2]] it is 0. Such an extreme disparity in sensitivity to small changes in distributions at these points is difficult to explain intuitively. On the other hand, the slope of R(D) at these points is 1 and 0, respectively.

5.2 Normalized Metric

Just as the metric of Diaz et al.
[7], given by d(D) = S(D)/log(n), is a normalized version of that of Serjantov and Danezis [6], we define a normalized version of our raw metric as the ratio of the areas under the profiles B_n(x) and B_D(x). For any distribution D ∈ ∆n, the system's normalized degree of anonymity is given by:

a(D) = ∫₁^∞ B_n(x) dx / ∫₁^∞ B_D(x) dx.

Figure 5.2 shows the two areas mentioned in the above expression.

Figure 5.2: Areas considered for the system's normalized degree of anonymity.

It is clear from this figure that as D varies from n to n̂, a(D) varies from 1 (for full anonymity) to 0 (for no anonymity). The above expression can also be simplified to see that a(D) = R(D)/log(n), i.e.,

a(D) = −1 / ( log(n) · Σ{ σ·D(σ)/log(σ) | σ ∈ [0, 1] } ).

The relationship between metrics a and d is thus the same as that between R and S, already depicted in Figure 5.1.

5.3 Differences from Metrics Based on Shannon Entropy

Figure 5.1 fails to capture all the ramifications of the differences between our area-based metrics and the existing metrics based on Shannon entropy. First, due to the different rationales adopted by these metrics for judging the closeness of distributions in ∆n to n, the main differences between them manifest especially when they determine, given two distributions in ∆n, which of those distributions is closer to n. Second, as we already saw in Section 4.2, these differences can only be observed in dimensions higher than two, i.e., in systems with more than two users. We start by revisiting our earlier examples. For the distributions A and B of Example 4.0.1, we saw that d(A) ≈ d(B), i.e., the normalized Shannon entropies of A and B are nearly the same, despite the fact that A exposes one of the users to a far greater extent than B exposes any user. Upon computing our normalized, area-based metric, we see that a(A) ≈ 0.26563, which is less than half of a(B) ≈ 0.579053.
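The figures just quoted for Example 4.0.1 can be reproduced directly from the closed form a(D) = R(D)/log(n). This check is ours; log base 2 is assumed, and A and B are expanded into full 101-entry probability lists.

```python
import math

n = 101
A = [0.5] + [0.005] * 100              # distributions of Example 4.0.1
B = [0.086518] * 10 + [0.001481] * 91

def d(dist):
    """Normalized Shannon entropy metric of Diaz et al. [7]."""
    s = -sum(p * math.log2(p) for p in dist if p > 0)
    return s / math.log2(len(dist))

def a(dist):
    """Normalized area-based metric a(D) = R(D) / log(n)."""
    t = sum(p / math.log2(p) for p in dist if p > 0)
    return -1.0 / (math.log2(len(dist)) * t)

print(round(d(A), 3), round(d(B), 3))  # both ~0.649: d cannot separate A and B
print(round(a(A), 5), round(a(B), 5))  # ~0.26563 vs ~0.57905: a can
```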
By taking the entire base-profile contours of A and B into account, instead of just the slopes of these profiles at x = 1, the new metric performs a more accurate comparison of the anonymity levels underlying these distributions.

The distribution D = [[7/16, 7/16, 1/8]] of Example 4.1.1 is assigned a lower anonymity level by metrics based on Shannon entropy than the distribution E = [[13/24, 11/48, 11/48]] of the same example, as shown by the computation: d(D) ≈ 0.895 < 0.917 ≈ d(E). As stated earlier, this is counterintuitive because the most suspicious user in E is in a class by itself and is exposed more than the two most suspicious users in D. The base-profiles of these distributions intersect as shown in Figure 4.2(a). The area-based metric correctly takes that phenomenon into account and declares D as having higher anonymity than E: a(D) ≈ 0.814 > 0.762 ≈ a(E).

In general, while comparing distributions D, E ∈ ∆n, the chance of disagreement between metrics a and d increases with n. We already know that for n = 2, there is never a disagreement between them, i.e., a(D) < a(E) if and only if d(D) < d(E), for all D, E ∈ ∆2. Our pseudo-exhaustive simulation results have shown that for n = 3, these two metrics disagree in about 7.5% of all cases. In other words, for about 7.5% of pairs of distributions ⟨D, E⟩ ∈ ∆3 × ∆3, the two metrics disagree on which, among D and E, is the distribution closer to n. For n = 4, the disagreement rate was found by our pseudo-exhaustive simulation to be about 9%.

Intersections between the profiles of D and E are the root cause of disagreements between these metrics. If 1 ≤ x1 < x2, the interval (x1, x2) is said to be a (maximal) region of dominance of D over E if B_D(x1) = B_E(x1), B_D(x2) = B_E(x2) and, for all x ∈ (x1, x2), B_D(x) < B_E(x). The regions of dominance of D and E alternate, as shown in Figure 5.3, at each intersection of their profiles.
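The disagreement on Example 4.1.1, and the location of the single profile intersection that causes it, can be checked numerically. The sketch below is ours; it locates region-of-dominance boundaries as sign changes of B_D(x) − B_E(x) on a coarse grid chosen to avoid landing on x = 2 itself.

```python
import math

D = [7/16, 7/16, 1/8]        # distributions of Example 4.1.1
E = [13/24, 11/48, 11/48]

def d(dist):                 # normalized Shannon entropy (log base 2)
    s = -sum(p * math.log2(p) for p in dist)
    return s / math.log2(len(dist))

def a(dist):                 # normalized area-based metric
    t = sum(p / math.log2(p) for p in dist)
    return -1.0 / (math.log2(len(dist)) * t)

assert d(D) < d(E)           # entropy favors E ...
assert a(D) > a(E)           # ... while the area-based metric favors D

def B(dist, x):
    return sum(p ** x for p in dist)

# Scan for sign changes of B_D - B_E over [1.05, 4.95].
xs = [1.05 + 0.1 * k for k in range(40)]
flips = [x for x, y in zip(xs, xs[1:])
         if (B(D, x) - B(E, x)) * (B(D, y) - B(E, y)) < 0]
print(flips)                 # a single region boundary, near x = 2
```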
The area gain achieved by D in any of its regions of dominance over E is the difference between the areas under their base-profiles in that region. Our area-based metrics assign a higher anonymity level to D if its total area gains from all its regions of dominance over E are higher than its total area losses in all other regions. On the other hand, metrics based on Shannon entropy assign a higher anonymity level to the distribution with the sharper profile slope at x = 1. A disagreement between these metrics arises when these two measures are not both pointing in the same direction, i.e., when the distribution with the more acute slope at x = 1 turns out to have more total area losses than gains.

Figure 5.3: Initial base-profile slopes and alternating regions of dominance of distributions D and E.

We end this chapter with an analogy of a race between two runners, D and E. Suppose the race begins at x = 1 and ends at x = ∞. For any given value of x0 between these extremes, the speed of runner D at x0 is a function of the value B_D(x0), and similarly for E. The distance between D and E at x0 corresponds to the difference between the area gains and losses from 1 to x0. Metrics based on Shannon entropy determine the winner based upon which runner got out of the blocks faster at the start of the race, whereas our area-based metrics determine the winner based upon who is ahead at the end of the race.

CHAPTER 6

CONCLUSIONS

Finding the right metric for measuring the amount of anonymity left in an anonymity system in the aftermath of an attack has been a goal almost ever since the need for such systems was first recognized. Although several metrics to this end have been proposed in the past, none has yet demonstrated an ability to completely fit the bill. For any existing metric, examples of probability distributions resulting from attacks can be found in the literature for which that metric performs counterintuitive anonymity evaluation.
As conformance to human intuition is the ultimate standard of correctness of a metric, such examples have always hinted at the need for a better metric. In this thesis, we first constructed a graphical framework within which we represented existing metrics, including the currently popular ones based on Shannon entropy. Our framework is made primarily of two profiles of probability distributions, which we call the base-profile and the norm-profile, that are functions over the infinite domain of nonnegative real numbers. By placing the existing metrics in our framework, we showed that these metrics base their anonymity evaluation on just some local aspects of profiles, such as a profile's value at some point, or its slope at another. We also showed that as profile contours can fluctuate, especially causing intersections between them, any such local observation made to determine the anonymity level inevitably ignores profile intersections, and is thus inadequate. This explains, in a graphic way, why existing metrics fail to give a good, intuitive evaluation in some cases. We then proposed a new metric that evaluates the level of anonymity based upon a global property of the underlying profile, namely the area swept under the entire base-profile contour. By doing so, our new metric avoids the pitfall of a local profile aspect not accurately reflecting its global contour. Just as for the popular metric based on Shannon entropy, we gave two versions of our metric, a raw and a normalized one. We compared our approach with the Shannon entropy approach for measuring anonymity, and noted that the differences between these approaches are best appreciated not on any single probability distribution, but by observing how these approaches determine which of two given distributions reflects higher anonymity. We identified conditions under which their results disagree.
For systems with few senders, we also gave an estimate, obtained from our pseudo-exhaustive simulation, of the percentage of cases that involve disagreement. While there is no disagreement for systems with two users, there is about a 7.5% disagreement rate for systems with three users, and about 9% for four users. We have left determination of the disagreement rate between our metrics for systems with a larger number of users as future work. Although we presented our technique only in the context of measuring sender anonymity, it can be used as well for receiver anonymity, or for any other situation with a need to measure the amount of uncertainty contained in a probability distribution. Recently, Bagai et al. [14] gave a system-wide metric based on distributions over perfect matchings between a system's input and output messages. Gierlichs et al. [15] considered multiple messages sent or received by system users. These are some examples of works that employed Shannon entropy, and whose results can be improved by our area-based metric. Finally, we believe that while efficient computability of an anonymity metric is a welcome bonus, intuitiveness of its anonymity evaluations is essential. We feel that a systematic study of that aspect of all available metrics, including the one we proposed in this thesis, is an important subject for future research.

LIST OF REFERENCES

[1] C. Diaz. "Anonymity and privacy in electronic services." PhD thesis, Katholieke Universiteit Leuven, Belgium, 2005.

[2] D. Kelly, R. Raines, M. Grimaila, R. Baldwin, and B. Mullins. "A survey of state-of-the-art in anonymity metrics." In Proceedings of the 1st ACM Workshop on Network Data Anonymization, pages 31–39, 2008.

[3] R. Bagai and N. Jiang. "Profiling probability distributions for measuring anonymity." Submitted to ACM CCS, 2011.

[4] D. Chaum. "The dining cryptographers problem: Unconditional sender and recipient untraceability." Journal of Cryptology, 1(1):65–75, 1988.

[5] D.
Kesdogan, J. Egner, and R. Büschkes. "Stop-and-Go-MIXes providing probabilistic anonymity in an open system." In Proceedings of the International Information Hiding Workshop, pages 83–98. Lecture Notes in Computer Science 1525, 1998.

[6] A. Serjantov and G. Danezis. "Towards an information theoretic metric for anonymity." In Proceedings of the 2nd International Privacy Enhancing Technologies Symposium (PETS), pages 41–53. Lecture Notes in Computer Science 2482, 2002.

[7] C. Diaz, S. Seys, J. Claessens, and B. Preneel. "Towards measuring anonymity." In Proceedings of the 2nd International Privacy Enhancing Technologies Symposium (PETS), pages 54–68. Lecture Notes in Computer Science 2482, 2002.

[8] G. Tóth, Z. Hornák, and F. Vajda. "Measuring anonymity revisited." In Proceedings of the 9th Nordic Workshop on Secure IT Systems, pages 85–90, 2004.

[9] S. Clauß and S. Schiffner. "Structuring anonymity metrics." In Proceedings of the ACM Workshop on Digital Identity Management, pages 55–62, 2006.

[10] C. Andersson and R. Lundin. "On the fundamentals of anonymity metrics." In The Future of Identity in the Information Society, edited by S. Fischer-Hübner, P. Duquenoy, A. Zuccato, and L. Martucci, volume 262 of IFIP International Federation for Information Processing, pages 325–341. Springer, Boston, 2008.

[11] A. Rényi. "On measures of entropy and information." In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561, 1961.

[12] C. Shannon. "A mathematical theory of communication." Bell System Technical Journal, 27:379–423 and 623–656, 1948.

[13] L. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM Publishers, 1997.

[14] R. Bagai, H. Lu, R. Li, and B. Tang. "An accurate system-wide anonymity metric for probabilistic attacks." In Proceedings of the 11th International Privacy Enhancing Technologies Symposium (PETS), pages 117–133, 2011.

[15] B. Gierlichs, C. Troncoso, C. Diaz, B.
Preneel, and I. Verbauwhede. "Revisiting a combinatorial approach toward measuring anonymity." In Proceedings of the 7th ACM Workshop on Privacy in the Electronic Society, pages 111–116, Alexandria, VA, USA, 2008.