Research and applications

Toward practicing privacy

Cynthia Dwork,1 Rebecca Pottenger2

An additional appendix is published online only. To view this file please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2012-001047).

1Microsoft Research, Mountain View, California, USA
2Department of Electrical Engineering and Computer Science, EECS, UC Berkeley, Berkeley, California, USA

Correspondence to Dr Cynthia Dwork, SVC, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA; [email protected]

Written while visiting Microsoft Research, SVC

Received 24 April 2012
Revised 2 November 2012
Accepted 9 November 2012
ABSTRACT
Private data analysis—the useful analysis of confidential
data—requires a rigorous and practicable definition of
privacy. Differential privacy, an emerging standard, is the
subject of intensive investigation in several diverse
research communities. We review the definition, explain
its motivation, and discuss some of the challenges to
bringing this concept to practice.
THE SANITIZATION PIPE DREAM
In the sanitization pipe dream, a magical procedure
is applied to a database full of useful but sensitive
information. The result is an ‘anonymized’ or even
‘synthetic’ database that ‘looks just like’ a real
database for the purposes of data analysis, but
which maintains the privacy of everyone in the
original database. The dream is seductive: the
implications of success are enormous and the
problem appears simple. However, time and again,
both in the real world and in the scientific literature, sanitization fails to protect privacy. Some
failures are so famous they have short names:
‘Governor Weld,’1 ‘AOL debacle,’2 ‘Netflix Prize,’3
and ‘GWAS allele frequencies.’4 5 Some are more
technical, such as the composition attacks that can
be launched against k-anonymity, l-diversity, and
m-invariance,6 when two different, but overlapping, databases undergo independent sanitization
procedures.
The mathematical reality is that anything yielding ‘overly accurate’ answers to ‘too many’ questions is, in fact, blatantly non-private: suppose we are given a database of n rows, where each row is the data of a single patient, and each row contains, among other data, a bit indicating the patient’s HIV status. Suppose further that we require a sanitization that yields answers to questions of the following type: a subset S of the database is selected via a combination of attributes, for example, ‘over 5 feet tall and under 140 pounds’ or ‘smokes one pack of cigarettes each day’ (formally, the specification of the set is abstracted as a subset of the rows: S ⊆ {1, 2, ..., n}), and the query is, ‘How many people in the subset S are HIV-positive?’ How accurate can the answers to these counting queries be, while providing even minimal privacy? Perhaps not accurate at all! Any sanitization ensuring that answers to all possible counting queries are within error E permits a reconstruction of the private bits that is correct on all but at most 4E rows.7 For example, if the data set contains n = 10 000 rows, and answers to all counting queries can be estimated to within an error of at most 415 < n/24, then the reconstruction will be correct on all but at most 4 × 415 = 1660 < n/6 rows. If the estimations were correct to within 10, then reconstruction would be correct on all but 40 rows, an overall error rate of at most 40/10 000 = 0.4%, no matter how weakly the secret bit is correlated to the other attributes.
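To see the argument in action, the following Python sketch (ours, for illustration only; the tiny sizes, uniform error model, and brute-force search are toy choices, whereas the real attack uses linear programming) answers every subset-counting query to within error E and then shows that any candidate database consistent with those answers differs from the truth on at most 4E rows.

```python
import itertools
import random

# Toy demonstration of the reconstruction ("blatant non-privacy") argument:
# if every subset-counting query is answered within error E, then any candidate
# consistent with those answers agrees with the true secret bits on all but
# at most 4E rows.
random.seed(0)
n, E = 10, 1                                   # toy sizes only
secret = [random.randint(0, 1) for _ in range(n)]

subsets = list(itertools.product([0, 1], repeat=n))   # all 2^n subsets of rows

# Curator: answer every counting query with error at most E.
answers = {s: sum(b for b, keep in zip(secret, s) if keep) + random.randint(-E, E)
           for s in subsets}

# Adversary: pick any candidate consistent with every answer to within E.
def consistent(cand):
    return all(abs(sum(b for b, keep in zip(cand, s) if keep) - answers[s]) <= E
               for s in subsets)

candidate = next(c for c in itertools.product([0, 1], repeat=n) if consistent(c))
wrong = sum(a != b for a, b in zip(candidate, secret))
print(f"reconstruction wrong on {wrong} of {n} rows (theoretical bound: 4E = {4 * E})")
```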
Formally, blatant non-privacy is defined to be achieved when the adversary can generate a reconstruction that is correct in all but o(n) entries of the database, where the notation ‘o(n)’ denotes any function that grows more slowly than n/c for every constant c as n goes to infinity. The theorem says that if the noise is always o(n) then the system is blatantly non-private (because 4 × o(n) is still o(n)).
Clearly, the unqualified sanitization pipe dream is problematic: sanitization is supposed to give relatively accurate answers to all questions, but any object that does so will compromise privacy. Even on-line databases, which receive and respond to questions, and only provide relatively accurate answers (errors within O(√n)) to most of the questions actually asked (see the online supplementary appendix for further details), risk blatant non-privacy after only a linear number of random queries.7–9 To avoid the blatant non-privacy of the reconstruction attacks, the number of accurate answers must be curtailed; to preserve privacy, utility must inevitably be sacrificed.
This suggests two questions: (1) How do we ensure privacy even for few queries (see the online supplementary appendix for the two-query differencing attack)? and (2) Is there a mathematically rigorous notion of privacy that enables us to argue formally about the degree of risk in a sequence of queries?
Differential privacy is such a definition; differentially private techniques guarantee privacy while
often permitting accurate answers; and the goal of
algorithmic research in differential privacy is precisely to postpone the inevitable for as long as possible. Looking ahead, and focusing, for the sake of
concreteness, on counting queries, the theoretical
work on differential privacy exposes a delicate
balance between the degree of privacy loss,
number of queries, and degree of accuracy (ie, the
distribution on errors).
ON THE PRIMACY OF DEFINITIONS
Differential privacy is inspired by the developments in cryptography of the 1970s and 1980s.
‘Pre-modern’ cryptography witnessed cycles of
‘propose-break-propose-again.’ In these cycles, an
implementation of a cryptographic primitive, say,
a cryptosystem (a method for encrypting and
decrypting messages), would be proposed. Next,
someone would break the cryptosystem, either in
theory or in practice. It would then be ‘fixed’—until the ‘fixed’
version was broken, and so the cycle would go.
Among the stunning contributions of Goldwasser et al10 11
was an understanding of the importance of definitions. What
are the properties we want from a cryptosystem? What does it
mean for an adversary to break the cryptosystem, that is, what
is the goal of the adversary? And what is the power of the
adversary: what are its computational resources; to what information does it have access? Having defined the goals and (it is hoped, realistically) bounded the capabilities of the adversary, the cryptographer then proposes a cryptosystem and rigorously proves its security, formally arguing that no adversary with the stated resources can break the system. (Modern cryptography is based on the assumption that certain problems have no computationally efficient solutions. The proofs of correctness show that any adversary that breaks the system can be ‘converted’ into an efficient algorithm for one of these believed-to-be computationally difficult problems.) This methodology converts a break of the implementation into a break of the definition. That is, a break exposes a weakness in the stated requirements of a cryptosystem. This leads to a strictly stronger set of requirements. Even if the process repeats, the series of increasingly strong definitions converts the propose-break-propose-again cycle into a path of progress.
Sometimes a proposed definition can be too strong. This is
argued by proving that no implementation can exist. In this
case, the definition must be weakened or altered (or the requirements declared unachievable). As we will see, this is what happened with one natural approach to defining privacy for
statistical databases—what we have been calling privacy-preserving data analysis.
A natural suggestion for defining privacy in data analysis is a
suitably weakened version of the cryptographic notion of
semantic security. In cryptography this says that the encryption
of a message m leaks no information that cannot be learned
without access to the ciphertext.11 In the context of privacypreserving databases we might hope to ensure that anything
that can be learned about a member of the database via a
privacy-preserving database should also be learnable without
access to the database. And indeed, 5 years before the invention
of semantic security, the statistician Tore Dalenius framed
exactly this goal for private data analysis.12
Unfortunately, Dalenius’ goal provably cannot be
achieved.13 14 The intuition for this can be captured in three
sentences: A Martian has the (erroneous) prior belief that every
Earthling has two left feet. The database teaches that almost
every Earthling has exactly one left foot and one right foot.
Now the Martian with access to the statistical database knows
something about the first author of this note that cannot be
learned by a Martian without access to the database.
DATABASES THAT TEACH
The impossibility of Dalenius’ goal relies on the fact that the
privacy-preserving database is useful. If the database said
nothing, or produced only random noise unrelated to the data,
then privacy would be absolute, but such databases are, of
course, completely uninteresting. To clarify the significance of
this point we consider the following parable.
A statistical database teaches that smoking causes cancer.
The insurance premiums of Smoker S rise (Smoker S is
harmed). But, learning that smoking causes cancer is the whole
point of databases of this type: Smoker S enrolls in a smoking
cessation program (Smoker S is helped).
The important observation about the parable is that the
insurance premiums of Smoker S rise whether or not he is in the
database. That is, Smoker S is harmed (and later helped) by the
teachings of the database. Since the purpose of the database—
public health, medical data mining—is laudable, the goal in differential privacy is to limit the harms to the teachings, and to
have no additional risk of harm incurred by deciding to participate in the data set. Speaking informally, differential privacy
says that the outcome of any analysis is essentially equally
likely, independent of whether any individual joins, or refrains
from joining, the data set. Here, the probability space is over
the coin flips of the algorithm (which the algorithm implementors, ie, the ‘good guys,’ control). The guarantee holds for all
individuals simultaneously, so the definition simultaneously
protects all members of the database.
An important strength of the definition is that any differentially private algorithm, which we call a mechanism, is differentially private based on its own merits. That is, it remains
differentially private independent of what any privacy adversary knows, now or in the future, and to which other databases
the adversary may have access, now or in the future. Thus, differentially private mechanisms hide the presence or absence of
any individual even in the face of linkage attacks, such as the
Governor Weld, AOL, Netflix Prize, and GWAS attacks mentioned earlier. For the same reason, differential privacy protects
an individual even when all the data of every other individual
in the database are known to the adversary.
Formal definition of differential privacy
A database is modeled as a collection of rows, with each row
containing the data of a different individual. Differential
privacy will ensure that the ability of an adversary to inflict
harm (or good, for that matter)—of any sort, to any set of
people—should be essentially the same, independent of
whether any individual opts in to, or opts out of, the dataset.
This is done indirectly, simultaneously addressing all possible
forms of harm and good, by focusing on the probability of any
given output of a privacy mechanism and how this probability
can change with the addition or deletion of any one person.
Thus, we concentrate on pairs of databases (x, x′) differing only in one row, meaning one is a subset of the other and the larger database contains just one additional row. (Sometimes it is easier to think about pairs of databases of the same size, say, n, in which case they agree on n − 1 rows but one person in x has been replaced, in x′, by someone else.) We will allow our mechanisms to flip coins; such algorithms are said to be randomized.
Definition 3.1.14 15 A randomized mechanism M gives (ε, 0)-differential privacy if for all data sets x and x′ differing on at most one row, and all S ⊆ Range(M),

Pr[M(x) ∈ S] ≤ exp(ε) · Pr[M(x′) ∈ S],

where the probability space in each case is over the coin flips of M.
In other words, consider any possible set S of outputs that the mechanism might produce. Then the probability that the mechanism produces an output in S is essentially the same—specifically, to within an e^ε factor—on any pair of adjacent databases. (When ε is small, e^ε ≈ 1 + ε.) This means that, from the output produced by M, it is hard to tell whether the database is x, which, say, contains the reader’s data, or the adjacent database x′, which does not contain the reader’s data. The intuition is: if the adversary can’t even tell whether or not the database contains the reader’s data, then the adversary can’t learn anything sui generis about the reader’s data.
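As a purely numerical illustration of how the definition is read (the output distributions below are made up for illustration and do not come from any particular mechanism), one can compute the smallest ε for which a pair of distributions over a finite output range satisfies the inequality:

```python
import math

# A minimal numeric reading of Definition 3.1 for a toy mechanism with a finite
# output range on two adjacent databases x and x'.  Toy probabilities only.
p_x      = {"a": 0.50, "b": 0.30, "c": 0.20}   # Pr[M(x)  = output]
p_xprime = {"a": 0.45, "b": 0.33, "c": 0.22}   # Pr[M(x') = output]

# For discrete outputs and delta = 0 it suffices to compare output by output:
# the ratio over any set S is at most the largest single-output ratio.
eps = max(abs(math.log(p_x[o] / p_xprime[o])) for o in p_x)
print(f"smallest eps for this pair: {eps:.3f}; e^eps = {math.exp(eps):.3f}")
```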
Properties of differential privacy
Differential privacy offers several benefits beyond immunity to
auxiliary information.
Addresses arbitrary risks
Differential privacy ensures that even if the participant
removed her data from the data set, no outputs (and thus consequences of outputs) would become significantly more or less
likely. For example, if the database were to be consulted by an
insurance provider before deciding whether or not to insure a
given individual, then the presence or absence of any individual’s data in the database will not significantly affect her
chance of receiving coverage. Protection against arbitrary risks
is, of course, a much stronger promise than the often-stated
goal in sanitization of protection against re-identification. And
so it should be! Suppose that in a given night every elderly
patient admitted to the hospital is diagnosed with one of
tachycardia, influenza, broken arm, and panic attack. Given the
medical encounter data from this night, without re-identifying
anything, an adversary could still learn that an elderly neighbor
was taken to the emergency room (the ambulance was
observed) with one of these four complaints (and moreover she
seems fine the next day, ruling out influenza and broken arm).
Automatic and oblivious composition
Given two differentially private databases, where one is (ε₁, 0)-differentially private and the other is (ε₂, 0)-differentially private, a simple composition theorem shows that the cumulative risk endured by participating in (or opting out of) both databases is at worst (ε₁ + ε₂, 0)-differentially private.15 This is true even if the curators of the two databases are mutually unaware and make no effort to coordinate their responses. In fact, a more sophisticated composition analysis shows that the composition of k mechanisms, each of which is (ε, 0)-differentially private, is, for all δ > 0, at most (√(2k ln(1/δ))·ε + kε(e^ε − 1), δ)-differentially private. ((ε, δ)-differential privacy says that with probability 1 − δ the privacy loss is bounded by ε.17) When ε ≤ 1/√k, this replacing of kε by roughly √k·ε can represent very large savings, translating into much improved accuracy.
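A short calculation, sketched below with illustrative parameter values (not taken from the paper), shows how much smaller the advanced bound can be than simple composition:

```python
import math

# Comparing simple composition (k*eps) with the advanced composition bound
# sqrt(2k ln(1/delta))*eps + k*eps*(exp(eps) - 1) quoted above.
def advanced_bound(eps: float, k: int, delta: float) -> float:
    return math.sqrt(2 * k * math.log(1 / delta)) * eps + k * eps * (math.exp(eps) - 1)

eps, k, delta = 0.01, 1000, 1e-6                       # illustrative values only
print("simple composition  :", k * eps)                # 10.0
print("advanced composition:", round(advanced_bound(eps, k, delta), 2))  # about 1.76
```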
Group privacy
Differential privacy automatically yields group privacy: any (ε, 0)-differentially private algorithm is automatically (kε, 0)-differentially private for any group of size k; that is, it hides the presence or absence of all groups of size k, but the quality of the hiding is only kε.
Building complex algorithms from primitives
The quantitative, rather than binary, nature of differential
privacy enables the analysis of the cumulative privacy loss of
complex algorithms constructed from differentially private primitives, or ‘subroutines.’
Worst-case guarantees
Differential privacy is a worst-case, rather than average-case, guarantee. That means it protects privacy even on ‘unusual’ databases and databases that do not conform to (possibly erroneous) assumptions about the distribution from which the rows are drawn. Thus, differential privacy provides protection for outliers, even if they are strangely numerous—independent of whether this apparent strangeness is just ‘the luck of the draw’ or in fact arises because the algorithm designer’s expectations are simply wrong. (This is in contrast to what happens with average-case notions of privacy, such as the one in Agrawal et al.16)
Quantification of privacy loss
Privacy loss is quantified by the maximum, over all C ⊆ Range(M) and all adjacent databases x, x′, of the ratio

ln( Pr[M(x) ∈ C] / Pr[M(x′) ∈ C] ).     (1)

In particular, (ε, 0)-differential privacy ensures that this privacy loss is bounded by ε. This quantification permits comparison of algorithms: given two algorithms with the same degree of accuracy (quality of responses), which one incurs smaller privacy loss? Or, given two algorithms with the same bound on privacy loss, which permits the more accurate responses? Sometimes a little thought can lead to a much more accurate algorithm with the same bound on privacy loss; therefore, the failure of a specific (ε, 0)-differentially private algorithm to deliver a desired degree of accuracy does not mean that differential privacy is inconsistent with accuracy.

Measuring utility
There is no special definition of utility for differentially private algorithms. For example, if we wish to achieve (ε, 0)-differential privacy for a single counting query, and we choose to use the Laplace mechanism described in the ‘Differentially private algorithms’ section below, then the distribution on the (signed) error is given by the Laplace distribution with parameter 1/ε, which has mean zero, variance 2/ε², and probability density function p(x) = (ε/2)e^(−ε|x|). If we build a differentially private estimation procedure, for example, an M-estimator,18 19 we prove theorems about the probability distribution of the (l1 or l2, say) errors, as a function of ε (in addition to any other parameters, such as size of the data set and confidence requirements).15 19 20 If we use a differentially private algorithm to build a classifier, we can prove theorems about its classification accuracy or its sample complexity, again as a function of ε together with the usual parameters.21 If we use a differentially private algorithm to learn a distribution, we can evaluate the output in terms of its KL divergence from the true distribution or the empirical distribution, as a function of ε and the other usual parameters.22 23
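The claimed error behaviour is easy to check empirically; the sketch below (with an arbitrary ε and sample size of our choosing) draws Laplace errors of parameter 1/ε and compares their mean and variance with the stated values:

```python
import numpy as np

# Empirically checking the error distribution described above for a single
# counting query answered with Laplace noise of parameter 1/eps.
rng = np.random.default_rng(0)
eps = 0.5
errors = rng.laplace(scale=1 / eps, size=200_000)   # signed error added to the count
print("empirical mean     :", round(float(errors.mean()), 3))                 # close to 0
print("empirical variance :", round(float(errors.var()), 3), "vs 2/eps^2 =", 2 / eps**2)
```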
DIFFERING INTELLECTUAL TRADITIONS
The medical informatics and theoretical computer science communities have very different intellectual traditions. We highlight some differences that we have observed between the
perspectives of the differential privacy researchers and some
other works on privacy in the literature. The online supplementary appendix contains an expanded version of this section.
Defense against leveraging attacks
An attacker with partial information about an individual can
leverage this information to learn more details. Knowing the
dates of hospital visits of a neighbor or co-worker, or the medications taken by a parent or a lover, enables a leveraging attack
against a medical database. To researchers approaching privacy
from a cryptographic perspective, failure to protect against
leveraging attacks is unthinkable.
Learning distributions is the whole point
Learning distributions and correlations between publicly observable and hidden attributes is the raison d’être of data analysis. (This is consistent with the Common Rule position that an IRB ‘should not consider possible long-range effects of applying knowledge gained in the research.’ ‘The Common Rule is a federal policy regarding Human Subjects Protection that applies to 17 Federal agencies and offices.’24) The view that it is necessarily a privacy violation to learn the underlying distribution from which a database is drawn25 26 appears to the differential privacy researchers to be at odds with the concept of privacy-preserving data analysis—what are the utility goals in this view?
Re-identification is not the issue
We have argued that preventing re-identification does not
ensure privacy (see the ‘Formal definition of differential
privacy’ section). In addition, the focus on re-identification
does not address privacy compromises that can result from the
release of statistics (of which reconstruction attacks are an
extreme case).
‘Who would do that?’
Some papers in the (non-differential) privacy literature outline
assumptions about the adversary’s motivation and skill,
namely that it is safe to assume relatively low levels for both.
However, there will always be snake oil salesmen who prey on
the fear and desperation of the sick. Blackmail and exploitation
of despair will provide economic motives for compromising the
privacy of medical information, and we can expect sophisticated attacks. ‘PhD-level research,’ said Udi Manber, a former
professor of computer science, commenting on spammers’ techniques, when he was CEO of Amazon subsidiary A9.27
The assumption of an adversarial world
All users of the database are considered adversarial, and they
are assumed to conspire. There are two reasons for this. In a
truly on-line system, such as the American Fact Finder (accessible at http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml), there is no control over who may pose questions to the data, or the extent to which apparently different people, say, with different IP addresses, are in fact the same person. Second, the mathematics of information disclosure are complicated, and even well-intentioned data analysts can publish results that, upon careful
study, compromise privacy. So even two truly different users
can, with the best of intentions, publish results that, when
taken together, could compromise privacy. A general technical
solution is to view all users as part of a giant adversary.
A corollary of the adversarial view yields one of the most
controversial aspects of differential privacy: the analyst is not
given access to raw data. Even a well-intentioned analyst may
inadvertently compromise an individual’s privacy simply in the
choice of statistic to be released; the data of a single outlier can
cause the analyst to behave differently. Analysts find this
aspect very difficult to accept, and ongoing work is beginning
to address this problem for the special case in which the
analyst is trusted to adhere to a protocol.28
DIFFERENTIALLY PRIVATE ALGORITHMS
Privacy by process
Randomized response is a technique developed in the social
sciences to evaluate the frequency of embarrassing or even
illegal behaviors.29 Let XYZ be such an activity. Faced with the
query, ‘Did you XYZ last night?’ the respondent is instructed
to perform the following steps:
1. Flip a coin.
2. If tails, then respond truthfully.
3. If heads, then flip a second coin and respond ‘Yes’ if heads
and ‘No’ if tails.
The intuition behind randomized response is that it provides
‘plausible deniability’: a response of, say, ‘Yes,’ may have been
offered because the first and second coin flips were both heads,
which occurs with probability 1=4. In other words, privacy is
obtained by process. There are no ‘good’ or ‘bad’ responses; the
process by which the responses are obtained affects how they
may legitimately be interpreted. Randomized response thwarts
the reconstruction attacks described in the ‘The sanitization
pipe dream’ section because a constant, randomly selected, fraction of the inputs are replaced by random values.
Claim 5.1
Randomized response is (ln 3, 0)-differentially private. (We can obtain better privacy, that is, reduce ε, by modifying the randomized response protocol to decrease the probability of a non-random answer.)
Proof
The proof is a simple case analysis. Details are in the online
supplementary appendix.
Differential privacy is ‘closed under post-processing,’ meaning that if a statistic is produced via an (ε, δ)-differentially private method, then any computation on the released statistic has no further cost to privacy. Thus, if a data curator creates a database by obtaining randomized responses from, say, n respondents, the curator can release any statistic computed from these responses, with no further privacy cost. In particular, the curator can release an estimate of the fraction who XYZ’ed last night (#XYZ/n) from the fraction who replied ‘Yes’ (#Yes/n) via the formula

#XYZ/n = 2 · (#Yes/n) − 1/2

while maintaining (ln 3, 0)-differential privacy.
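A minimal simulation of the protocol and of this debiasing step, with a made-up population size and true fraction, looks as follows:

```python
import random

# A sketch of randomized response and the debiasing formula above.
random.seed(0)

def randomized_response(truth: bool) -> bool:
    if random.random() < 0.5:          # first coin is tails: answer truthfully
        return truth
    return random.random() < 0.5       # first coin is heads: random Yes/No

n = 100_000
population = [random.random() < 0.3 for _ in range(n)]   # 30% really did XYZ
responses = [randomized_response(b) for b in population]

frac_yes = sum(responses) / n
estimate = 2 * frac_yes - 0.5                            # #XYZ/n ~ 2*(#Yes/n) - 1/2
print(f"true fraction: {sum(population) / n:.3f}   debiased estimate: {estimate:.3f}")
```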
A general method
We now turn to the computation of arbitrary statistics on arbitrarily complex data, showing that the addition of random noise drawn from a carefully chosen distribution ensures differential privacy: if the analyst wishes to learn f(x), where f is some statistic and x is the database, the curator can choose a random value, say v, whose distribution depends only on f and ε, and can respond with f(x) + v. This process (it is a process because it involves the generation of a random value) will be (ε, 0)-differentially private.
Recall that the goal of the random noise is to hide the presence or absence of a single individual, so what matters is how much, in the worst case, an individual’s data can change the statistic. For one-dimensional statistics (counting queries, medians, and so on) this difference is measured in absolute value, |f(x) − f(x′)|. For higher dimensional queries, such as regression analyses, we use the so-called L1 norm of the difference, defined by

||(w_1, w_2, ..., w_d)||_1 = |w_1| + |w_2| + ... + |w_d|,

where d is the dimension of the query.
Definition 5.2
(L1 Sensitivity) For f : D → R^d, the L1 sensitivity of f is

Δf = max_{x,x′} ||f(x) − f(x′)||_1 = max_{x,x′} Σ_{i=1..d} |f(x)_i − f(x′)_i|     (2)

for all x, x′ differing in at most one row.
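For a counting query this definition gives Δf = 1, since adding or removing one row changes the count by at most 1; a tiny illustration (the ‘smokes’ attribute and the databases are hypothetical):

```python
# A small illustration of Definition 5.2 for a counting query.
def count_smokers(rows):
    return sum(1 for r in rows if r["smokes"])

x = [{"smokes": True}, {"smokes": False}, {"smokes": True}]
x_prime = x + [{"smokes": True}]        # adjacent database: one additional row
print(abs(count_smokers(x) - count_smokers(x_prime)))   # 1, never more than Delta f = 1
```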
The Laplace mechanism
The Laplace distribution with parameter b, denoted Lap(b), has density function Pr(z|b) = (1/2b) exp(−|z|/b); that is, the probability of choosing a value z falls off exponentially in the absolute value |z|. The variance of the distribution is 2b², so its SD is √2·b; we say it is scaled to b. Taking b = Δf/ε we have that the density at z is proportional to e^(−ε|z|/Δf). This distribution has highest density at 0 (good for accuracy), and for any z, z′ such that |z − z′| ≤ Δf the density at z is at most e^ε times the density at z′. The distribution gets flatter as Δf increases. This makes sense, since large Δf means that the data of one person can make a big difference in the outcome of the statistic, and therefore there is more to hide. Finally, the distribution gets flatter as ε decreases. This is correct: smaller ε means better privacy, so the noise density should be less ‘peaked’ at 0 and change more gradually as the magnitude of the noise increases.
A fundamental technique for achieving differential privacy is
to add appropriately scaled Laplacian noise to the result of a
computation. Again, the curator performs the desired computation, generates noise, and reports the result. The data are not perturbed. The correct distribution for the noise is the Laplace
distribution, where the parameter depends on the sensitivity Δf (higher sensitivity will mean larger expected noise magnitude) and the privacy goal (smaller ε will mean larger expected noise magnitude).
Theorem 5.3.15 For f : D → R^d, the mechanism M that adds independently generated noise with distribution Lap(Δf/ε) to each of the d output terms is (ε, 0)-differentially private.

Let us focus on the case d = 1. The sensitivity of a counting query, such as ‘How many people in the database are suitable for the ABC patient study?’ is 1, so by Theorem 5.3 it suffices to add noise drawn from Lap(1/ε) to ensure (ε, 0)-differential privacy for this query. If ε = ln 3 it suffices to add noise with SD √2/ln 3. Note that this expected error is much better than the expected error with randomized response, which is Θ(√n) ≫ √2/ln 3, so the Laplace mechanism is better than randomized response for this purpose.
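A sketch of this use of the Laplace mechanism follows; the database of 0/1 suitability flags and the choices of ε are illustrative only, and numpy’s Laplace sampler stands in for the curator’s noise generation.

```python
import math
import numpy as np

# A sketch of the Laplace mechanism of Theorem 5.3 for a single counting query
# (sensitivity 1): compute the true count, then add Lap(sensitivity/eps) noise.
rng = np.random.default_rng(0)

def laplace_count(true_count: int, sensitivity: float, eps: float) -> float:
    return true_count + rng.laplace(scale=sensitivity / eps)

database = rng.integers(0, 2, size=10_000)       # one toy suitability bit per row
true_count = int(database.sum())

for eps in (math.log(3), 0.1):                   # eps = ln 3, then a stricter eps = 1/10
    released = laplace_count(true_count, sensitivity=1.0, eps=eps)
    print(f"eps = {eps:.3f}: true count {true_count}, released {released:.1f}")
```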
If instead we want privacy loss ε ≤ 1/10 then we would add noise with SD 10·√2 ≈ 14.1. In general, the sensitivity of k counting queries is at most k. So, taking ε = 1, the simple composition theorem tells us that adding noise with SD √2·k is sufficient. For k = 10 this is about 14; for k = 100 it would be about 140. However, for large k we can begin to use some of the more advanced techniques to analyze composition, reducing the factor of k to roughly √k. These techniques give errors approaching the lower bounds on noise needed to avoid reconstruction attacks.
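The rough arithmetic behind this comparison can be tabulated as follows (the ‘advanced’ figure ignores the second-order term of the composition bound quoted earlier, so it is indicative only; note that the advantage only appears once k is fairly large):

```python
import math

# Rough per-query Laplace noise scales for k counting queries (sensitivity 1)
# under a fixed total budget eps = 1: simple composition gives each query eps/k
# (scale k), while advanced composition behaves like sqrt(2k ln(1/delta)).
eps_total, delta = 1.0, 1e-6
for k in (10, 100, 1000):
    simple_scale = k / eps_total
    advanced_scale = math.sqrt(2 * k * math.log(1 / delta)) / eps_total
    print(f"k = {k:4d}: simple ~ {simple_scale:7.1f}, advanced ~ {advanced_scale:7.1f}")
```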
A major thread of research has addressed handling truly huge—exponential in the size of the database—numbers of counting or even arbitrary low-sensitivity queries, while still providing reasonably accurate answers.17 30–35 (The dependence on the database size is a bit larger than n^½; there are also polylogarithmic dependencies on the size of the data universe and the number of queries.)
In the online supplementary appendix we discuss two computations for which Theorem 5.3 yields surprisingly small expected error: histogram release and finding a popular value. Both problems seem to require many counts, but a more careful analysis shows sensitivity Δf = 1, and hence they may be addressed with low distortion.
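For histogram release the observation that Δf = 1 translates directly into adding Lap(1/ε) noise to every cell; a sketch with made-up data (the ages, bins, and ε are arbitrary choices of ours):

```python
import numpy as np

# A sketch of differentially private histogram release: one person's data falls
# in exactly one cell, so Delta f = 1 and Lap(1/eps) noise per cell suffices.
rng = np.random.default_rng(0)
eps = 0.5

ages = rng.integers(0, 100, size=5_000)
counts, edges = np.histogram(ages, bins=10, range=(0, 100))
noisy_counts = counts + rng.laplace(scale=1 / eps, size=counts.shape)
print(np.round(noisy_counts, 1))
```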
The privacy budget
We have defined the privacy loss quantity, and noted that loss
is cumulative. Indeed a strength of differential privacy is that it
enables analysis of this cumulative loss. We can therefore set a
privacy budget for a database and the curator can keep track of
the privacy loss incurred since the beginning of the database
operation. This is publicly computable, since it depends only on the queries asked and the value of ε used for each query; it does not depend on the database itself. This raises the policy question, ‘How should the privacy budget be set?’ The mathematics allows us to make statements of the form:

If every database consumes a total privacy budget of ε₀ = 1/801 then, with probability at least 1 − e^(−32), any individual whose data are included in at most 10 000 databases will incur a cumulative lifetime privacy loss of ε = 1, where the probability space is over the coin flips of the algorithms run by the differentially private database algorithms. (These numbers come from setting δ = e^(−32) and k = 10 000 in the composition theorem mentioned in the ‘Properties of differential privacy’ section and proved in Dwork et al.17)
Were we able to get the statistical accuracy we desired with such protective individual databases, that is, with ε₀ = 1/801 (which, as of this writing, we typically are not!), and were we to determine that 10 000 is a reasonable upper bound on the total number of databases in which an individual would participate in his/her lifetime, then these numbers would seem like a fine choice: the semantics of the calculation are that the entire sequence of outputs produced by databases ever containing data of the individual would be at most a factor of e more or less likely were the data of this individual to have been included in none of the databases. Do the semantics cover what we want? Are we too demanding? For example, should we consider only the cumulative effects of, say, 100 or even just 50 databases? Should we consider much larger cumulative lifetime losses?
The differential privacy literature does not distinguish between degrees of social sensitivity of data: enjoying Mozart is treated as equally deserving of privacy as being in therapy. Setting ε differently for different types of data, or even different types of queries against data, may make sense, but these are policy questions that the math does not attempt to address. (Protecting ‘innocuous’ attributes such as taste in music can help to protect an individual against leveraging attacks that exploit knowledge of these attributes.)
Differential privacy protects against arbitrary risks; specifically, via the parameter ε it controls the change in risk occurring
as a consequence of joining or refraining from joining a dataset.
However, what constitutes an acceptable rise in risk may vary
according to the nature of the risk, or the initial probability of
the risk. For example, if the probability of a ‘bad’ outcome is
exceedingly low, say, 1/1 000 000, then an increase by a factor
of 10, or even 1000, might be acceptable. Thus, the policy
choice for 1 for a given data set may depend on how the database is to be used. Caveat emptor: beware of mission creep (and
see Dwork et al36)!
If policy dictates that different types of data and/or databases should be protected by different choices of ε, say ε₁ for music databases and ε₂ for medical databases, and so on, then it will become even harder to interpret theorems about the cumulative loss being bounded by a function of the εᵢ’s.
TOWARD PRACTICING PRIVACY
Differential privacy was motivated by a desire to utilize data
for the public good, while simultaneously protecting the
privacy of individual data contributors. The hope is that, to
paraphrase Alfred Kinsey, better privacy means better data.
The initial positive results that ultimately led to the development of differential privacy (and that we now understand to
provide (ε, δ)-differential privacy for negligibly small δ) were designed for internet-scale databases, where reconstruction attacks requiring n log²n (or even n) queries, each requiring roughly n time to process, are infeasible because n is so huge.37
In this realm, the error introduced for protecting privacy in
counting queries is much smaller than the sampling error, and
privacy is obtained ‘for free,’ even against an almost linear
number of queries. Moreover, counting queries provide a powerful computational interface, and many standard data mining
tasks can be carried out by an appropriately chosen set of
counting queries.38
Inspired by this paradigm at least two groups have produced
programming platforms39 40 enforcing differentially private
access to data. This permits programmers with no understanding of differential privacy or the privacy budget to carry out
data analytics in a differentially private manner.
Translating an approach designed for internet-scale data sets
to data sets of modest size is extremely challenging, and yet
substantial progress has been made; see, for example, the
experimental results in Hardt et al22 on several data sets: an epidemiological study of car factory workers (1841 records, six
attributes per record), a genetic study of barley powdery
mildew isolates (70 records, six attributes), data relating household characteristics (665 records, eight attributes), and the
National Long Term Care Survey (21 574 records, 16
attributes).
As with many things, special-purpose solutions may be of
higher quality than general solutions. Thus, the Laplace mechanism is a general solution for real-valued or vector-valued
functions, but the so-called objective perturbation methods have
yielded better results (lower error, smaller sample complexity)
for certain techniques in machine learning such as logistic
regression.21 More typically, researchers in differentially private
algorithms produce complex algorithms involving many differentially private steps (some of which may use any of the
Laplace mechanism, the Gaussian mechanism (ie, addition of Gaussian noise, which yields (ε, δ)-differential privacy), the
exponential mechanism,41 the propose-test-release paradigm,18
sample and aggregate,42 etc, as computational primitives). The
goal is always to improve utility beyond that obtained with
naive application of the Laplace mechanism.
As of this writing, there is no magic differential privacy bullet, and finding methods for any particular analytical problem or sequence of computational steps that yield good utility on data sets of moderate size can be very difficult. By engaging with researchers in differential privacy the medical informatics community can and should influence the research agenda of this dynamic and talented community.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1. Sweeney L. k-anonymity: a model for protecting privacy. Int’l J Uncertainty, Fuzziness and Knowledge-based Systems 2002;10:557–70.
2. Barbaro M, Zeller T Jr. A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006.
3. Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets (how to break anonymity of the Netflix Prize dataset). In: Proceedings of the 29th IEEE Symposium on Security and Privacy, 2008.
4. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4:e1000167.
5. P3G Consortium, Church G, Heeney C, et al. Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 2009;5.
6. Ganta SR, Kasiviswanathan SP, Smith A. Composition attacks and auxiliary information in data privacy. In: ACM SIGKDD Conference on Data Mining and Knowledge Discovery, 2008:265–73.
7. Dinur I, Nissim K. Revealing information while preserving privacy. In: Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2003:202–10.
8. De A. Lower bounds in differential privacy. In: Cramer R, ed. TCC, volume 7194 of Lecture Notes in Computer Science. Springer, 2012:321–38.
9. Dwork C, McSherry F, Talwar K. The price of privacy and the limits of LP decoding. In: Proceedings of the 39th ACM Symposium on Theory of Computing, San Diego, CA: ACM, 2007:85–94.
10. Goldwasser S, Micali S, Rivest R. A digital signature scheme secure against adaptive chosen-message attacks. SIAM J Comput 1988;17:281–308.
11. Goldwasser S, Micali S. Probabilistic encryption. JCSS 1984;28:270–99.
12. Dalenius T. Towards a methodology for statistical disclosure control. Statistik Tidskrift 1977;15:429–44.
13. Dwork C, Naor M. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Manuscript, 2008.
14. Dwork C. Differential privacy. In: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, 2006:1–12.
15. Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Proceedings of the 3rd Theory of Cryptography Conference, 2006:265–84.
16. Agrawal D, Aggarwal C. On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2001.
17. Dwork C, Rothblum GN, Vadhan SP. Boosting and differential privacy. In: 51st Annual Symposium on Foundations of Computer Science, 2010:51–60.
18. Dwork C, Lei J. Differential privacy and robust statistics. In: Proceedings of the 2009 International ACM Symposium on Theory of Computing, 2009:371–80.
19. Smith A. Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the ACM-SIGACT Symposium on Theory of Computing, 2011:813–22.
20. Barak B, Chaudhuri K, Dwork C, et al. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the 26th Symposium on Principles of Database Systems, 2007:273–82.
21. Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. JMLR 2011;12:1069–109.
22. Hardt M, Ligett K, McSherry F. A simple and practical algorithm for differentially private data release, 2010. CoRR abs/1012.4763.
23. Pottenger R. Manuscript in preparation, 2012.
24. http://ori.dhhs.gov/education/products/ucla/chapter2/page04b.htm.
25. Castellà-Roca J, Sebé F, Domingo-Ferrer J. On the security of noise addition for privacy in statistical databases. In: Proceedings of the Privacy in Statistical Databases Conference, 2004:149–61.
26. Datta S, Kargupta H. On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the IEEE International Conference on Data Mining, 2003:99–106.
27. Manber U. Private communication with the first author, 2003(?).
28. Dwork C, Naor M, Rothblum G, et al. Work in progress.
29. Warner S. Randomized response: a survey technique for eliminating evasive answer bias. JASA 1965;60:63–9.
30. Blum A, Ligett K, Roth A. A learning theory approach to non-interactive database privacy. In: Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing, 2008.
31. Dwork C, Naor M, Reingold O, et al. On the complexity of differentially private data release: efficient algorithms and hardness results. In: Proceedings of the 41st ACM Symposium on Theory of Computing. ACM, 2009:381–90.
32. Gupta A, Hardt M, Roth A, et al. Privately releasing conjunctions and the statistical query barrier. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, San Jose, CA: ACM, 2011:803–12.
33. Hardt M, Rothblum GN. A multiplicative weights mechanism for interactive privacy-preserving data analysis. In: Proceedings of the 51st Annual Symposium on Foundations of Computer Science, 2010:61–70.
34. Hardt M, Rothblum G, Servedio R. Private data release via learning thresholds. In: Proceedings of the Symposium on Discrete Algorithms, 2012:168–87.
35. Roth A, Roughgarden T. The median mechanism: interactive and efficient privacy with multiple queries. In: Proceedings of the ACM-SIGACT Symposium on Theory of Computing, 2010:765–74.
36. Dwork C, Naor M, Pitassi T, et al. Pan-private streaming algorithms. In: Proceedings of Innovations in Computer Science, 2010:66–80.
37. Dwork C, Nissim K. Privacy-preserving datamining on vertically partitioned databases. In: Proceedings of CRYPTO 2004, vol. 3152, 2004:528–44.
38. Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework. In: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005:128–38.
39. McSherry F. Privacy integrated queries (codebase). Available on the Microsoft Research downloads website. See also Proceedings of the ACM SIGMOD Conference on Management of Data, 2009:19–30.
40. Roy I, Setty S, Kilzer A, et al. Airavat: security and privacy for MapReduce. In: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2010. Link to codebase available at www.cs.utexas.edu/indrajit/airavat.html.
41. McSherry F, Talwar K. Mechanism design via differential privacy. In: Proceedings of the 48th Annual Symposium on Foundations of Computer Science, 2007:94–103.
42. Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th ACM Symposium on Theory of Computing, 2007:75–84.
43. Agrawal R, Srikant R. Privacy-preserving data mining. In: Chen W, Naughton JF, Bernstein PA, eds. Proceedings of the ACM SIGMOD Conference on Management of Data. ACM, 2000:439–50.
44. Bleichenbacher D. Chosen ciphertext attacks against protocols based on the RSA encryption standard PKCS #1. In: Proceedings of the 18th Annual International Cryptology Conference on Advances in Cryptology, 1998:1–12.
45. Brickell J, Shmatikov V. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2008:70–8.
46. Dolev D, Dwork C, Naor M. Non-malleable cryptography. In: Proceedings of the 23rd ACM-SIGACT Symposium on Theory of Computing, 1991.
47. Dolev D, Dwork C, Naor M. Non-malleable cryptography. SIAM J Comput 2000;30:391–437.
48. Korolova A. Privacy violations using microtargeted ads: a case study. J Privacy and Confidentiality 2011;3.
49. Liskov B. Private communication, 2007.
50. Narayanan A, Shmatikov V. Myths and fallacies of "personally identifiable information". Communications of the ACM 2010;53:24–6.
51. Zhou S, Ligett K, Wasserman L. Differential privacy with compression. In: Proceedings of the 2009 IEEE International Symposium on Information Theory, 2009:2718–22.