Optimal Risk-Utility Trade-Off: A Decision Problem with Multiple Objectives

Mario Trottini
Departamento de Estadística e I.O., Universidad de Alicante
e-mail: [email protected]

Abstract: This work presents a formalization of the problem of protecting the confidentiality of statistical data as a decision problem with multiple objectives, highlighting the relationships between the proposed approach and the methodology currently in use, together with its advantages and limitations.

Keywords: confidentiality, risk-utility trade-off, decisions with multiple objectives

1. Introduction

Statistical Disclosure Control (SDC) denotes a set of tools aimed at designing and implementing data dissemination strategies for statistical data collected under a pledge of confidentiality. The problem is not simple. An ideal data dissemination procedure, in fact, should: (i) allow legitimate data users to perform the statistical analyses of interest as if they were using the data set originally collected; and (ii) reduce the risk of misuse of the data by potential intruders aiming to disclose confidential information about individual respondents. This identifies two conflicting objectives (which we call "maximize safety" and "maximize usefulness") that no data dissemination procedure can fully achieve simultaneously. Improvement in one of the two objectives usually requires reducing achievement in the other, and there is no data dissemination procedure that is obviously the best. In addition, the above objectives are too ambiguous to be of operational use, and there is no obvious measure that can be used to quantify the extent to which they are achieved by different candidates for data dissemination. Even assuming that we have defined suitable measures S and V that quantify achievement of the "maximize safety" and "maximize usefulness" objectives, S and V are usually expressed in different units and have very different meanings, so that it is not trivial to compare arbitrary pairs (s, v) and (s′, v′). The problem is even more complex because S and V, from the statistical agency's perspective, are random variables. In fact, the values of S and V depend on the users' actions, which are only partially known to the statistical agency (which, for example, has uncertainty about the users' targets, their prior information, the estimation procedures they use, etc.). Thus each data dissemination strategy induces a distribution over the space of consequences, and choosing among alternative strategies is equivalent to choosing among alternative lotteries for (S, V), a much more difficult task than simply expressing preferences over pairs (s, v). The research literature and current practice in data disclosure limitation have addressed these issues only in part and to different extents. Decision theory, we believe, might provide a suitable framework to think about these problems. Within this framework a sensible choice of the data dissemination procedure requires the agency (or agencies) responsible for it to:
a) identify a set of suitable alternatives;
b) define the fundamental objectives in more operational terms;
c) define suitable attributes that can measure the extent to which the objectives are achieved when an arbitrary alternative is considered;
d) assess the trade-off between the fundamental objectives of the problem.
This means that, for any arbitrary subset of the objectives, the agency has to decide how much of those objectives it is willing to sacrifice in order to improve achievement in the others. Decision theory provides guidelines for the implementation of the four-step decision analysis described above. Because of page-limit constraints, in this paper we restrict our discussion to the trade-off assessment (step d).(1) In section 2 we review the main results of the theory concerning the so-called trade-off under certainty and discuss their relationship with selection criteria in common use in SDC. Section 3 presents a more general framework for trade-off assessment that allows the agency's uncertainty about the model's inputs to be taken into account. The relevance of the proposed framework for increasing the value of existing research efforts in statistical confidentiality is also discussed there. Section 4 summarizes the main results of the paper and outlines ideas for future work.

(1) A full discussion of the four-step procedure can be found in Trottini (2008). Sections 1 and 2 of the present work are an extract of sections 1 and 5 of Trottini (2008).

2. Trade-off under certainty

Suppose that, for a given decision problem, a class of alternatives ℳ = {Mk, k ∈ E}, objectives {Oj, j = 1, . . . , m}, and attributes {Xj, j = 1, . . . , m}, with each Xj taking values in a space 𝒳j, have been specified and are appropriate for the problem. Selecting the "best" alternative in ℳ requires the decision maker to trade off the conflicting objectives. In decision theory this can be done in two ways. The simplest approach, also known as trade-off assessment under certainty, assumes that the consequences of the actions in ℳ in terms of the objectives are deterministic. The set of attributes maps any action in ℳ into a point of the consequence space 𝒳. The decision maker's problem, in this case, is to choose the action M* ∈ ℳ with whose consequence {Xj(M*), j = 1, . . . , m} it will be happiest. Current selection criteria for the best data dissemination strategy in data disclosure limitation are special cases of this simpler approach. In particular, assuming that a class ℳ = {Mk, k ∈ E} of competing data dissemination strategies has been identified and that achievement of the objectives "maximize safety" and "maximize usefulness" can be described in terms of multidimensional attributes (measures of disclosure risk and data utility) S and V, taking values in spaces 𝒮 and 𝒱 respectively, existing criteria for the selection of the best masking assume that the attributes S and V map each Mk ∈ ℳ into a point (S(Mk), V(Mk)) in 𝒮 × 𝒱. The best data dissemination strategy can then be selected from ℳ according to one of three criteria:
C1: maximize V subject to a lower bound on S (this requires S and V to be scalars);
C2: restrict the selection problem to the subset ℳ′ of strategies in ℳ that belong to the efficient frontier (that is, that are not dominated by other strategies in ℳ), and then select the best strategy in ℳ′ according to some subjective criterion;
C3: define an index or score based on S and V.
Criterion C1 is the standard in the current practice of data disclosure limitation. Duncan and Keller-McNulty (2001) have proposed a graphical representation tool, which they refer to as the R-U confidentiality map, that provides an implementation of this approach and has become quite popular among users. The use of a threshold value for safety avoids the problem of the different scales of S and V.
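To fix ideas, the following sketch (in Python) implements criterion C1 for the scalar case: it discards every candidate release whose safety score falls below the threshold and returns the most valid release among those that remain. The candidate strategies and their (S, V) scores are purely hypothetical and serve only as an illustration.

    # Minimal sketch of criterion C1: maximize data validity V subject to a
    # lower bound on data safety S. The candidates and their scores are
    # purely hypothetical.
    candidates = {
        "M1_noise_addition":   {"S": 0.82, "V": 0.74},
        "M2_microaggregation": {"S": 0.90, "V": 0.61},
        "M3_swapping":         {"S": 0.75, "V": 0.88},
    }

    def select_c1(candidates, safety_threshold):
        """Return the strategy with the highest validity among those whose
        safety score meets the threshold, or None if none is feasible."""
        feasible = {k: c for k, c in candidates.items()
                    if c["S"] >= safety_threshold}
        if not feasible:
            return None
        return max(feasible, key=lambda k: feasible[k]["V"])

    print(select_c1(candidates, safety_threshold=0.80))  # -> "M1_noise_addition"

Note that in this toy example the release M3, whose validity score is by far the highest, is discarded for falling just below the safety threshold; this kind of rigidity is discussed next.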
At the same time, however, the C1 criterion can be excessively "rigid". According to C1, in fact, given a threshold t, a pair (S = t + ε, V = v) is always preferred to a pair (S = t − ε, V = v + ∆) for all ε, ∆ ∈ (0, ∞), i.e. an arbitrarily large increase ∆ in data validity is not worth even an arbitrarily small decrease in data safety. Another limitation of the C1 criterion is that it requires S and V to be scalars, which is difficult given the complexity of the objectives "maximize safety" and "maximize usefulness". The criterion C2 has received some attention only in the recent past (Karr et al., 2006), while very few instances of C3 have been discussed in the research literature on statistical confidentiality; an example is the score criterion proposed by Domingo-Ferrer and Torra (2001). Since the standard criteria C1-C3 described above are special cases of the trade-off under certainty approach, one may ask whether, and under which conditions, they correspond to optimal procedures within that approach. The next subsections answer these questions.

2.1. Value functions

The most ambitious solution, within the trade-off under certainty approach, is to formalize the agency's preference structure over the consequence space by specifying a scalar-valued function ν(·), ν : 𝒮 × 𝒱 → ℝ, with the property that, for arbitrary pairs (s, v), (s′, v′) in the consequence space,

ν(s, v) ≥ ν(s′, v′) ⇔ (s, v) ⪰ (s′, v′),   (1)

where the symbol ⪰ means "preferred or indifferent to". Following Keeney and Raiffa (1976), we refer to the function ν(·) in (1), which associates with each pair (s, v) in the consequence space a scalar index of preferability, as a multiattribute value function. Existing score criteria (i.e. the C3-type criteria of section 2) are, in fact, value functions. The function that defines the score is usually an additive function of data safety and data validity measures, with weights chosen in an ad hoc way based on heuristic considerations (see, for example, the score criterion proposed by Domingo-Ferrer and Torra, 2001). Unfortunately, the assessment of a suitable value function is not that simple. It requires the specification of the agency's preferential ordering over all possible points in the consequence space 𝒮 × 𝒱. Heuristic solutions that fail to take this into account are likely to produce value functions that formalize preference structures that probably no agency would agree with (see Trottini, 2008). As an alternative to heuristic proposals, standard results in multiple-objective decision theory can be used in data disclosure limitation as a tool to define sensible value functions in agreement with the agency's actual preference structure, providing at the same time a useful framework to better understand heuristic score criteria. Much of the work in multiple-objective decision theory has focused on identifying features of the decision maker's preference structure that constrain the form of the value function and allow the assessment of a multiattribute value function to be decomposed into simpler problems in which single-attribute value functions are assessed and scaled consistently (see Keeney and Raiffa, 1976, chapter 3). In the two-attribute case the key feature of the decision maker's preference structure is the so-called corresponding tradeoffs condition. Roughly speaking, the condition says that the agency's preferences for increments in data safety do not depend, in a relative sense, on the level of data validity, and vice versa.
If an increment of data safety from s1 to s1′ is considered as valuable as an increment of data safety from s2 to s2′ when validity is held at a fixed level v, then the two increments should be considered equally valuable whatever the level of validity, although for different levels of validity the "price" in data-validity units that the agency is willing to pay might be different. The same should hold when the roles of data safety and data validity are interchanged. The corresponding tradeoffs condition is particularly important in SDC for at least three reasons. First of all, it is a reasonable assumption, at least approximately, for a large number of disclosure limitation problems. Most statistical agencies, in fact, would probably be comfortable defining equally preferable increments in terms of data safety (data validity) for a given level of data validity (data safety) without knowing the actual level of data validity (data safety). In addition, there exist simple tests that can be used to check whether the assumption is appropriate for the problem (see Trottini, 2008, for an illustration). Finally, if it is ascertained that the corresponding tradeoffs condition between safety and validity holds, then the assessment of a value function is a feasible task. Under the corresponding tradeoffs condition, in fact, the following theorem can be applied to derive a two-attribute value function.

Theorem 1: A preference structure is additive, and therefore has an associated value function of the form ν(s, v) = νS(s) + νV(v), where νS(·) and νV(·) are value functions (expressing the decision maker's preferences over S and V respectively), if and only if the corresponding tradeoffs condition is satisfied.

Theorem 1 provides a necessary and sufficient condition for the value function to be additive. Both implications are important. Necessity: if we express the agency's preferences over possible pairs (s, v) using an additive function, then we are implicitly assuming that the corresponding tradeoffs condition holds. Thus, for example, any score criterion that can be expressed as the sum of a function of S and a function of V implicitly assumes that the corresponding tradeoffs condition between S and V holds. Sufficiency: if the corresponding tradeoffs condition is satisfied then, by Theorem 1, we can decompose the original problem of assessing a bivariate value function into two simpler problems involving the assessment of univariate value functions. Trottini (2008) uses the necessity part of Theorem 1 to review the score criterion of Domingo-Ferrer and Torra (2001) and the sufficiency part to develop a bivariate value function in a simplified disclosure scenario.
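As an illustration of the sufficiency part of Theorem 1, the following sketch assembles an additive two-attribute value function and uses it to rank candidate releases. The single-attribute value functions and the candidate (s, v) scores are hypothetical choices made only for illustration; in practice they would be elicited from the agency.

    # Sketch of an additive value function as in Theorem 1. The forms of
    # nu_S and nu_V and the candidate scores are hypothetical.
    import math

    def nu_S(s):
        # Single-attribute value for safety, increasing and concave on [0, 1].
        return math.sqrt(s)

    def nu_V(v):
        # Single-attribute value for validity, here simply linear on [0, 1].
        return v

    def nu(s, v):
        # Additive two-attribute value function implied by the
        # corresponding tradeoffs condition (Theorem 1).
        return nu_S(s) + nu_V(v)

    candidates = {"M1": (0.82, 0.74), "M2": (0.90, 0.61), "M3": (0.75, 0.88)}
    for k, (s, v) in candidates.items():
        print(k, round(nu(s, v), 3))
    print("best:", max(candidates, key=lambda k: nu(*candidates[k])))

By the necessity part of Theorem 1, ranking releases by such a sum implicitly commits the agency to the corresponding tradeoffs condition.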
2.2. Selection procedures that do not formalize the preference structure

When the formalization of the agency's preference structure through a value function is too difficult (e.g., because the assumptions that allow the assessment of a multiattribute value function to be decomposed into several lower dimensional assessments do not hold), the decision maker might want to adopt alternative procedures that do not require formalizing preference structures. These procedures have the advantage that they can be applied in virtually any disclosure limitation problem. Although they do not usually provide an "optimal" solution (with respect to the unknown agency's preference structure), they allow the agency to explore a set of available solutions (masked data sets in ℳ) that yield a "satisfactory" balance between data safety and data validity. These procedures rely on the notion of "dominance", which we introduce next. Let Mi and Mj be two arbitrary masked data sets in ℳ and let (si, vi) and (sj, vj) be the consequences of the release of Mi and Mj in terms of the agency's conflicting objectives "maximize safety" and "maximize validity". In the easiest case, where S and V are univariate, we say that Mi dominates Mj if and only if si ≥ sj and vi ≥ vj, with strict inequality for at least one of the two. In selecting the "best" data dissemination strategy we can restrict our attention to the strategies that are not dominated by any other strategy in ℳ. Such a set is called the efficient frontier. When S and V are univariate the efficient frontier can be drawn and it may be relatively easy to select the best data dissemination strategy in ℳ. However, when the sum of the dimensions of the safety measure S and the validity measure V is greater than three, we cannot display the efficient frontier and the graphical approach is not feasible. We must then rely on some alternative method to "explore" the efficient frontier. One such method makes use of artificial constraints and consists of five steps (a sketch of the scalar case is given at the end of this subsection):
Step 1) the agency sets "aspiration levels" for all the components of the data safety and data validity vectors but one;
Step 2) the analyst determines the set C_temp consisting of all the masked data sets in ℳ that satisfy the artificial constraints defined at step 1;
Step 3) if C_temp is empty, the analyst repeats steps 1 and 2, changing the aspiration levels of step 1, until C_temp ≠ ∅;
Step 4) the analyst determines the masked data set M_temp in C_temp that maximizes data validity;
Step 5) the decision maker decides either to remain satisfied with the current solution M_temp or to explore the efficient frontier further, changing some of the aspiration levels and repeating steps 1-4.
Note that in the context of data disclosure limitation the aspiration levels of step 1 have a natural interpretation as thresholds for minimum tolerable safety and minimum tolerable validity. In particular, in the simplest case where both data safety and data validity are univariate, the iterative procedure described above yields the well-known criterion for data masking selection that chooses as the best masked data set the one that maximizes validity subject to a constraint of minimum safety (criterion C1 of section 2). There is, however, one important difference. The standard criterion that maximizes validity under a constraint of minimum safety does not incorporate step 5. This step is crucial, since it allows the aspiration level (i.e., the threshold for the minimum tolerable data safety) to be changed and the existence of better solutions to be explored. Omitting it might lead to a "myopic" selection criterion (where an arbitrarily large increase in data utility is not worth an arbitrarily small decrease in data safety if the new safety value falls below the pre-specified threshold). In the more general case where S and V are multivariate, the iterative procedure just described provides a useful algorithm to implement the C2-type criteria discussed in section 2.
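The following sketch, again with hypothetical scalar scores, shows the two ingredients of this subsection: extraction of the efficient frontier by pairwise dominance checks, and steps 1-4 of the aspiration-level procedure, with the revision of the aspiration level in step 5 left to the decision maker.

    # Sketch of dominance-based selection with scalar S and V.
    # The candidate (safety, validity) pairs are hypothetical.
    candidates = {
        "M1": (0.82, 0.74),
        "M2": (0.90, 0.61),
        "M3": (0.75, 0.88),
        "M4": (0.70, 0.70),   # dominated by M3 (lower safety and lower validity)
    }

    def dominates(a, b):
        # a dominates b if a is at least as good in both attributes
        # and the two consequence pairs are not identical.
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    def efficient_frontier(candidates):
        return {k: c for k, c in candidates.items()
                if not any(dominates(other, c) for other in candidates.values())}

    def explore(candidates, safety_aspiration):
        # Steps 1-4 of the iterative procedure: impose an artificial constraint
        # on safety and maximize validity within the feasible set C_temp.
        c_temp = {k: c for k, c in candidates.items() if c[0] >= safety_aspiration}
        if not c_temp:
            return None  # step 3: the aspiration level must be revised
        return max(c_temp, key=lambda k: c_temp[k][1])

    print(sorted(efficient_frontier(candidates)))  # ['M1', 'M2', 'M3']
    print(explore(candidates, 0.80))               # M1
    print(explore(candidates, 0.72))               # M3: a lower aspiration level
                                                   # reveals a better trade-off (step 5)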
3. Trade-off under uncertainty

Being special cases of the trade-off assessment under certainty approach, the existing criteria underestimate the actual uncertainty in the problem. As noted in the introduction, for a given data dissemination strategy, achievement of the "maximize safety" and "maximize usefulness" objectives depends on several features of the data users' inferences that are only partially known to the agency. Thus, from the agency's perspective, S and V are random variables. Each data dissemination strategy induces a distribution over the space of consequences, and choosing among alternative strategies is equivalent to choosing among alternative lotteries for (S, V). Note that the trade-off under certainty approach can take uncertainty into account, but the result is usually a very conservative procedure, i.e. a data dissemination strategy of very limited usefulness for legitimate data users. When uncertainty is present, in fact, the standard solution in SDC (within the trade-off under certainty approach) is to consider a "worst-case scenario", obtained by choosing for each of the components of S and V the least favourable value. A better approach, which might result in less conservative procedures, is to define criteria that allow one to compare distributions over 𝒮 × 𝒱. One way of defining such criteria is to assess a suitable multiattribute utility function u : 𝒮 × 𝒱 → ℝ with the property that, given two probability distributions Gi and Gj on 𝒮 × 𝒱, Gi is preferred to Gj if and only if the expected value of u(·) under Gi is greater than the expected value of u(·) under Gj. In the next subsections we describe standard techniques in multiattribute utility theory that can be used to assess a multiattribute utility function. For simplicity, we consider the case where a two-attribute utility function needs to be assessed. The general ideas and results have a straightforward generalization to the case of three or more attributes.(2)

(2) Our review of the results in multiple attribute utility theory is a very short summary of a more detailed discussion of the topic by Keeney and Raiffa (1976, chapter 5).

3.1. Eliciting a two-attribute utility function

Suppose that we have identified (univariate) attributes S and V for the two objectives "maximize safety" and "maximize validity" and that they are appropriate for the problem. The goal is to build a two-attribute utility function u : 𝒮 × 𝒱 → ℝ. If the space of possible consequences 𝒮 × 𝒱 contains few points (say, fewer than 50), then direct assessment of the utility function is possible using certainty equivalents (see Keeney and Raiffa, 1976, p. 222). Unfortunately, in most data disclosure limitation problems 𝒮 × 𝒱 is too big for a direct assessment to be feasible. As for the assessment of value functions described in section 2, the idea in these cases is to identify relevant features of the decision maker's preferences that allow us to put strong constraints on the form of the multiattribute utility function. One feature of particular importance is utility independence. It has been shown that if certain utility independence assumptions hold, then the multiattribute utility function must be of a specified form. What makes utility independence operational in data disclosure limitation (as well as in several real applications) is that: (i) utility independence assumptions in several real disclosure limitation problems can be ascertained and verified in practice; (ii) utility independence allows for great variability in the final form of the multiattribute utility function, that is, it can be used to formalize many different preference structures for data release; (iii) under utility independence, the assessment of a multiattribute utility function is relatively easy (it is equivalent to assessing several lower dimensional (conditional) utility functions with proper scaling). In the next subsection we present the main utility independence definitions and their consequences for the form of the multiattribute utility function.
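Before turning to utility independence, a small simulation may help to fix ideas about the expected-utility criterion itself. In the sketch below, both the utility function and the distributions of (S, V) induced by two hypothetical masking strategies are invented for illustration; each strategy is ranked by a Monte Carlo estimate of its expected utility.

    # Sketch of trade-off under uncertainty: each strategy induces a lottery
    # over (S, V), and strategies are ranked by expected utility.
    # The utility function and the sampling models are purely illustrative.
    import random

    random.seed(0)

    def u(s, v):
        # A hypothetical two-attribute utility on [0, 1] x [0, 1],
        # increasing in both safety and validity.
        return 0.6 * s + 0.4 * v - 0.2 * (1 - s) * (1 - v)

    def sample_consequence(strategy):
        # Toy model of the agency's uncertainty about users' actions:
        # each strategy induces a different distribution over (S, V).
        if strategy == "heavy_masking":
            return random.betavariate(8, 2), random.betavariate(3, 4)
        else:  # "light_masking"
            return random.betavariate(4, 3), random.betavariate(7, 2)

    def expected_utility(strategy, n=10000):
        draws = (sample_consequence(strategy) for _ in range(n))
        return sum(u(s, v) for s, v in draws) / n

    for strategy in ("heavy_masking", "light_masking"):
        print(strategy, round(expected_utility(strategy), 3))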
3.2. Utility independence

A first notion of utility independence relevant in SDC is the one that occurs when safety is utility independent of validity.

Definition 1: We say that S is utility independent of V if and only if preferences for lotteries on S given V = v do not depend on the particular level v.

Our claim is that in several real applications it seems reasonable to assume that S is utility independent of V. This independence assumption essentially reflects the dominant role that safety plays in data disclosure limitation problems. Most statistical agencies appear comfortable choosing among alternative data dissemination strategies that yield the same validity v′ (but different lotteries for safety) without knowing the actual value of v′. If this is the case, we can apply the result in the following theorem to derive the multiattribute utility function.

Theorem 2 (Keeney and Raiffa, 1976, p. 244): If S is utility independent of V, then

u(s, v) = u(s0, v)[1 − u(s, v0)] + u(s1, v) u(s, v0),

where u(s, v) is normalized by u(s0, v0) = 0 and u(s1, v0) = 1, and u(s0, v), u(s1, v) and u(s, v0) are the conditional utility functions of V given S = s0, of V given S = s1, and of S given V = v0, respectively.

Note that, under utility independence, in order to assess the multiattribute utility function it is sufficient to specify three (lower dimensional) conditional utility functions and scale them properly (for an illustration of such an assessment in a simplified disclosure scenario see Trottini, 2004, chapter 6). Suppose now that an analyst, A, has ascertained that in the decision maker's preferences S is utility independent of V. As a second step in the utility assessment, A could verify whether S and V are additive independent.

Definition 2: Attributes S and V are additive independent if the paired comparison of any two lotteries, defined by two joint probability distributions on 𝒮 × 𝒱, depends only on the marginal probability distributions.

If this is the case, then the result in the next theorem greatly simplifies the assessment of the multiattribute utility function.

Theorem 3 (Keeney and Raiffa, 1976, p. 231): Attributes S and V are additive independent if and only if the utility function is additive. The additive form might be written as u(s, v) = u(s, v0) + u(s0, v), where u(s, v) is normalized by u(s0, v0) = 0 and u(s1, v1) = 1 for arbitrary s1 and v1 such that (s1, v0) ≻ (s0, v0) and (s0, v1) ≻ (s0, v0).

Note that additive independence assumes no interaction between the decision maker's preferences for different amounts of the two attributes. This assumption is too restrictive in many real applications. It is often the case that the desirability of one attribute increases (or decreases) with the level of the other. For data disclosure limitation problems, in particular, it seems reasonable to expect that the desirability of safety (validity) increases with the level of validity (safety), so that S and V are not additive independent. Note also that Theorem 3 provides a necessary and sufficient condition for additive independence. This means that multiattribute utility functions that can be expressed as an additive function of a measure of safety, S, and a measure of validity, V, implicitly assume that S and V are additive independent. Keller-McNulty et al. (2005), for example, propose an additive utility function as in Theorem 3, with univariate utility functions for safety and validity defined in terms of Shannon's entropy, thus implicitly assuming that S and V are additive independent.
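To illustrate how the decomposition of Theorem 2 operates, the following sketch assembles a two-attribute utility function from three conditional utilities. The functional forms and the corner points s0, s1, v0 are hypothetical choices made only for illustration; in practice the conditional utilities would be elicited from the agency and scaled consistently.

    # Sketch of the decomposition in Theorem 2, assuming S is utility
    # independent of V. The three conditional utilities are hypothetical,
    # scaled so that u(s0, v0) = 0 and u(s1, v0) = 1.
    import math

    s0, s1 = 0.0, 1.0          # worst and best safety levels considered
    v0, v1 = 0.0, 1.0          # worst and best validity levels considered

    def u_S_given_v0(s):
        # Conditional utility of S given V = v0, equal to 0 at s0 and 1 at s1.
        return math.sqrt(s)

    def u_V_given_s0(v):
        # Conditional utility of V given S = s0 (equals u(s0, v)).
        return 0.4 * v

    def u_V_given_s1(v):
        # Conditional utility of V given S = s1 (equals u(s1, v)).
        return 1.0 + 0.6 * v

    def u(s, v):
        # Theorem 2: u(s, v) = u(s0, v)[1 - u(s, v0)] + u(s1, v) u(s, v0).
        return (u_V_given_s0(v) * (1.0 - u_S_given_v0(s))
                + u_V_given_s1(v) * u_S_given_v0(s))

    # Normalization checks: u(s0, v0) = 0 and u(s1, v0) = 1.
    print(u(s0, v0), u(s1, v0))
    print(round(u(0.8, 0.5), 3))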
Although we consider here the simplest case where S and V are univariate, in more general situations the conditional utility functions that appear in Theorems 2 and 3 may be either unidimensional or multidimensional, and the arguments s and v can be scalars or vectors. If they are unidimensional, standard techniques in univariate utility theory (monotonicity, risk aversion, etc.) are appropriate. If they are vectors, it may be possible to use independence properties of the components of s and v to decompose the assessment of the multiattribute conditional utility into simpler assessments of lower dimensional conditional utilities for the components of s and v (for an illustration of such an assessment in a simplified disclosure scenario see Trottini, 2004, chapter 6).

4. Conclusions

In this work we have presented a preliminary attempt to use decision theory as an operational tool in data disclosure limitation. In our opinion the decision-theoretic framework discussed here is relevant in SDC because it introduces: (i) a clear distinction between the agency's and the users' perspectives; (ii) an explicit modeling of these perspectives through multiattribute value (or utility) functions; (iii) an explicit formalization of the assumptions underlying the modeling; (iv) a natural way to take into account the different sources of uncertainty in the problem. Existing applications of the proposed framework, however, refer to oversimplified disclosure scenarios (see Keller-McNulty et al., 2005, and Trottini, 2004, 2008). Only applications to real problems will tell us the extent to which decision theory has a future in the field.

References

Domingo-Ferrer J., Torra V. (2001) Disclosure control methods and information loss for microdata, in: Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, Doyle P., Lane J., Theeuwes J. and Zayatz L. (Eds.), North-Holland, 91-110.
Duncan G.T., Keller-McNulty S. (2001) Disclosure risk vs. data utility: the R-U confidentiality map, Technical Report LA-UR-01-6428, Los Alamos National Laboratory.
Karr A.F., Kohnen C.N., Oganian A., Reiter J.P., Sanil A.P. (2006) A framework for evaluating the utility of data altered to protect confidentiality, The American Statistician, 60, 1-9.
Keeney R.L., Raiffa H. (1976) Decisions with Multiple Objectives, Wiley, New York.
Keller-McNulty S., Nakhleh C.W., Singpurwalla N.D. (2005) A paradigm for masking (camouflaging) information, International Statistical Review, 73, 331-349.
Trottini M. (2004) Decision Models for Data Disclosure Limitation, Ph.D. thesis, Carnegie Mellon University.
Trottini M. (2008) Data disclosure limitation as a decision problem, Metron, to appear.