Private Query Processing on Inverted Index

Wee Keong Ng, Yonggang Wen
School of Computer Engineering
NTU, Singapore
Abstract: A private query criterion $Q_K$ is a Boolean logic expression in $\wedge$, $\vee$ and $\neg$ over an input keyword set $K$. A private query processing protocol takes as input a private query criterion $Q_K$ and a public data set $d_C$ and outputs a document $d \in d_C$ such that $Q_K(d) = 1$. This paper studies private query processing protocols in the context of inverted index programs and makes the following three contributions:
1) A new notion of private query processing protocols defined over inverted index programs within the Mapping-Reducing-Filtering framework is introduced and formalized. Our formalization is general and can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes.
2) A new implementation of private query processing protocols based on $(m, n)$-Bloom filters with storage and additively homomorphic public-key encryption is proposed. The idea behind our implementation is that a map function Map is activated to generate a matrix $M$ of the form (document$_j$ : word$_{j,1}$, ..., word$_{j,n}$), $j = 1, \ldots, m$. The reduce function Reduce is then invoked to generate an inverted index $\hat{M}$ of the form (keyword$_i$ : document$_{i,1}$, ..., document$_{i,\alpha_i}$). Finally, an $(m, n)$-Bloom filter with storage is activated to generate matched documents according to the specified query criterion.
3) We show that the proposed query processing protocol on the inverted index is semantically secure assuming that the underlying additively homomorphic public-key encryption is semantically secure.
To the best of our knowledge, this is the first semantically secure query processing protocol defined over inverted index programs, and we expect more applications to be deployed within this framework.
Huafei Zhu
CAS, I2R
A*STAR, Singapore
978-1-4673-0279-1/11/$26.00 ©2011 IEEE
I. INTRODUCTION
The task of retrieving commercial data in the presence of malicious adversaries falls into the general field of private information retrieval (PIR), which has been well studied [12], [13], [3], [4]. For example, Ostrovsky and Skeith (OS) [12], [13] proposed solutions for private searching on streaming data, where a client $P$ queries whether a server stores data containing a keyword key and, if so, obliviously retrieves that data such that a corrupted server learns nothing about which keyword was specified or which data was retrieved.
We argue, however, that the OS protocols may not work efficiently in certain applications. When mining data sets in a pet identification scenario, a retriever may be interested only in the owner information of a pet. As a result, many words in a stored document $d$ can be ignored during the course of PIR. We do not claim that general PIR does not work for mining data sets; rather, we emphasize that the general PIR technique may not work efficiently, and this argument applies to the results presented in [3], [4] as well. One can therefore expect more efficient solutions to these problems than the general methods presented in [12], [13], [3], [4]. Since no known result addresses the motivating problem above, we formalize the following research problem:
On input a private query criterion $Q_K$ and a public document set $d_C$ stored on a possibly corrupted server $C$, how can one implement an efficient query processing protocol that outputs a subset of documents $d \subseteq d_C$ satisfying $Q_K(d) = 1$, while $C$ learns nothing about the specified criterion $Q_K$ or about which $d$ is retrieved from $d_C$?
A. This work
This paper provides an efficient implementation of private query processing protocols in the context of inverted index programs. To help the reader understand the idea behind our implementation, we first sketch the basic notions of MapReduce and the inverted index and then give a high-level description of our implementation.
MapReduce: MapReduce, introduced by Dean and Ghemawat [6], [7], [8], is a programming model whose programs are automatically parallelized and executed on a large cluster of commodity machines. A MapReduce program, which supports distributed computing on large data sets, consists of two functions: a map function and a reduce function. A map function Map transforms a piece of data into (key, value) pairs, whereas a reduce function Reduce merges the emitted values of the same key into a single result (key: value$_1$, ..., value$_n$). The MapReduce model is general and many interesting programs can be easily expressed as MapReduce computations: distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, distributed sort and inverted index [6], [7], [8]. The applications of MapReduce in cloud computing scenarios are discussed in [1], [10], [11].
Inverted index: An inverted index program is an instance of MapReduce that stores, for each keyword occurring somewhere in the collection, information about the locations where it occurs. The map function in the inverted index problem parses each document and emits a sequence of <documentID, word> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(documentID)> pair. The set of all output pairs forms an inverted index. Numerous applications of inverted index programs have been introduced so far; we refer the reader to [5], [17], [18], [15], [16] for further reference.
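The map and reduce steps just described can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: the function names, the whitespace tokenizer and the sample documents are hypothetical, and restricting the map output to a reference keyword set mirrors the role the keyword set plays later in the paper rather than anything prescribed by MapReduce itself.

```python
from collections import defaultdict

def map_document(doc_id, text, keywords):
    """Map: emit one (doc_id, word) pair per distinct reference keyword in the document."""
    words = set(text.lower().split())  # assumption: whitespace tokenization
    return [(doc_id, w) for w in sorted(words & keywords)]

def reduce_pairs(pairs):
    """Reduce: group pairs by word and emit word -> sorted list of document IDs."""
    index = defaultdict(set)
    for doc_id, word in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def build_inverted_index(documents, keywords):
    """Run the map step over every document, then a single reduce step."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_document(doc_id, text, keywords))
    return reduce_pairs(pairs)

# Hypothetical sample data in the spirit of the pet identification scenario.
docs = {1: "rex owner alice", 2: "tom owner bob", 3: "rex vaccinated"}
index = build_inverted_index(docs, {"rex", "owner", "bob"})
```

Here `index` maps each reference keyword to its posting list, which is the row structure of the inverted index used in the rest of the paper.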
A high-level description of our implementation: Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary such that $|D| = \alpha < \infty$. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor and $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ be a set of keywords selected by the inverted index program. We assume that $K \subseteq \hat{K} \subseteq D$. Let $d_C = (d_1, \ldots, d_l)$ be a document set stored in a server $C$. Our implementation is sketched below.
1) Inverted index generation procedure: on input $d_C$, the query processor invokes the MapReduce function to perform the following computations:
• invoking a map function Map with input $(d_1, \ldots, d_l)$ and $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ to generate a matrix $M$ of the form (document, keyword) below:
$$M = \begin{pmatrix} d_1 : \hat{k}_{1,1}, \ldots, \hat{k}_{1,\alpha_1} \\ d_2 : \hat{k}_{2,1}, \ldots, \hat{k}_{2,\alpha_2} \\ \vdots \\ d_l : \hat{k}_{l,1}, \ldots, \hat{k}_{l,\alpha_l} \end{pmatrix}$$
• invoking a reduce function Reduce with input matrix $M$ to generate the inverted index $\hat{M}$:
$$\hat{M} = \begin{pmatrix} \hat{k}_1 : d_{1,1}, \ldots, d_{1,\beta_1} \\ \hat{k}_2 : d_{2,1}, \ldots, d_{2,\beta_2} \\ \vdots \\ \hat{k}_\lambda : d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda} \end{pmatrix}$$
2) Document filtering procedure: on input $\hat{M}$, the query processor invokes an additively homomorphic encryption scheme (say, Paillier's encryption scheme) to filter all documents containing keywords in $K$ by the following processing:
• The query processor encodes $\hat{K}$ by invoking Paillier's homomorphic encryption scheme $E_{pk}()$ to generate ciphertexts $\{\hat{w}_i\}_{i=1}^{\lambda}$ of the set $\hat{K}$, where $\hat{w}_i = E_{pk}(1, r_i)$ if $\hat{k}_i \in K$ and $\hat{w}_i = E_{pk}(0, r_i)$ if $\hat{k}_i \in \hat{K} \setminus K$.
• Let $\hat{W} = \{\hat{w}_i\}_{i=1}^{\lambda}$. Let $\hat{d}_i = (d_{i,1}, \ldots, d_{i,\beta_i})$ for $i = 1, \ldots, \lambda$. The query processor generates $c_i \leftarrow \hat{w}_i^{\hat{d}_i}$, and then randomly distributes $m$ copies of $c_i$ into the $n$ bins of the $(m, n)$-Bloom Filter.
Note that we can adjust the system parameter $s \geq 1$ in Paillier's encryption to guarantee that the message space is sufficiently large for encrypting each plaintext $\hat{d}_i$. See Section 3 for more details.
3) Basis retrieving procedure: the query processor retrieves the matched document set $\hat{d}_K$ from the $n$ bins of the $(m, n)$-Bloom Filter, where $\hat{d}_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. By the correctness of the protocol in Section 4, the basis is retrievable from the $(m, n)$-Bloom Filter with Storage with overwhelming probability.
4) Off-line document processing procedure: given $\hat{d}_K$, the query processor computes $d$ from the specified Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$ by substituting $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$.
This ends a brief description of our protocol. Clearly, the query processor in our model only needs to generate a set of ciphertexts $\{\hat{w}_i\}_{i=1}^{|\hat{K}|}$ of the common reference keyword set $\hat{K}$ projected on the specified private keyword set $K$. The retrieved basis $\hat{d}_K$ for the Boolean criterion $Q_K$ is sufficient for the query processor to compute $d = Q_K(\hat{d}_K)$. Consequently, the computation complexity of the query protocol is significantly reduced in the case that $|\hat{K}| \ll |d|$: the number of ciphertexts generated during private query processing is reduced to $|\hat{K}|$ rather than $|d|$ as in the OS protocols.
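To make the filtering step of the description concrete, the following worked identity spells out why raising a keyword ciphertext to the power of its document entry keeps only matched documents. It is a sketch under two assumptions: Paillier encryption of the form $E_{pk}(m, r) = (1+N)^m r^N \bmod N^2$ as recalled in Section 3, and an integer encoding of each $\hat{d}_i$ that is smaller than $N$ (the encoding is not fixed by the text). The indicator $b_i$ is notation introduced only for this illustration.

```latex
\hat{w}_i^{\hat{d}_i}
  = \bigl((1+N)^{b_i}\, r_i^{N}\bigr)^{\hat{d}_i} \bmod N^2
  = E_{pk}\bigl(b_i \hat{d}_i \bmod N,\; r_i^{\hat{d}_i} \bmod N\bigr),
\qquad
b_i =
\begin{cases}
  1 & \text{if } \hat{k}_i \in K, \\
  0 & \text{if } \hat{k}_i \in \hat{K} \setminus K.
\end{cases}
```

Under these assumptions the ciphertext decrypts to $\hat{d}_i$ when $\hat{k}_i \in K$ and to $0$ otherwise, while a party without the secret key cannot tell the two cases apart if the encryption is semantically secure.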
B. The novelty of our query protocol
One can see that the mapping-reducing-filtering model is general in the following sense: if the MapReduce keyword set $\hat{K}$ is the whole dictionary $D$ and the map function Map and the reduce function Reduce are both dummy, then the mapping-reducing-filtering model reduces to the MapReduce-free, filtering-only model [12], [13]; if only the reduce function is dummy, then the reduced model is equivalent to the mapping-filtering model. As a result, our framework can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes (see [5], [17], [18], [15], [16] for details). The mapping-reducing-filtering model also allows an inverted index to select a common reference keyword set independently of the selection of a private keyword set and of the dictionary $D$. Such a model therefore avoids encrypting every word in the dictionary, as the protocols presented in [12], [13] do, and thus a private query processing protocol defined in the mapping-reducing-filtering model is more efficient and flexible than one defined in the MapReduce-free, filtering-only model or in the mapping-filtering model.
C. The result
We remark that the private query processing protocol described above guarantees that non-matching documents are not collected while matching documents are collected with overwhelming probability. We show that the proposed query processing protocol is privacy-preserving assuming that the underlying additively homomorphic public-key encryption is semantically secure. To the best of our knowledge, this is the first private query processing protocol defined over the inverted index program in the Mapping-Reducing-Filtering framework. We therefore expect more applications to be deployed within this framework.
Road map: The rest of this paper is organized as follows. The syntax and security definition of query processing protocols are introduced and formalized in Section 2. In Section 3, the building blocks, namely an additively homomorphic public-key encryption scheme (say, Paillier's encryption scheme) and $(m, n)$-Bloom Filters with Storage, are sketched. An implementation of the private query processing protocol is described and analyzed in Section 4. We conclude this work in Section 5.
II. PRIVACY-PRESERVING QUERY PROCESSING ON INVERTED INDEX
A. Syntax
Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor $P$. $P$ keeps $K$ secret (hence $K$ is called a private keyword set). Let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ be a set of keywords selected by the inverted index. The keyword set $\hat{K}$ is publicly known (hence $\hat{K}$ is called a common reference keyword set). We assume that $K \subseteq \hat{K} \subseteq D$.
Let $\mathcal{Q}$ be a class of query types; for example, $\mathcal{Q}$ could be the class of logical expressions in $\wedge$, $\vee$ and $\neg$. Let $d_C = (d_1, \ldots, d_l)$ be $l$ documents stored in a server $C$. Given a set of keywords $K \subset D$ and a query $Q \in \mathcal{Q}$, we define $Q_K : d \to \{0, 1\}$, which takes a subset of documents $d$ as input and returns 1 if and only if $d$ matches the criterion.
Definition 1: (Syntax of query processing) For a query $Q_K$ over a set of keywords $K$, and for a subset $d \subseteq d_C$, we say $d$ matches query $Q_K$ if and only if $Q_K(d) = 1$.
To compute $d$ from a given set $d_K = \{d_1, \ldots, d_\gamma\}$, the query processor first maps the private query criterion $Q_K$ to $d_K$ by the following procedure:
• $k_i \wedge k_j$ if and only if $d_i \wedge d_j$;
• $k_i \vee k_j$ if and only if $d_i \vee d_j$;
• $\neg k_i$ if and only if $\neg d_i$. Note that since $\neg k_i = K \setminus \{k_i\}$, it follows that $\neg d_i = d_K \setminus \{d_i\}$.
The query processor then substitutes $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$. Finally, the query processor obtains $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.
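The substitution step above can be illustrated with a small evaluator over retrieved document sets. This is a minimal sketch, not the authors' procedure: it assumes that $\wedge$ and $\vee$ on document sets are read as set intersection and union, and that the note on negation is read as "the documents listed under the other keywords of $K$". The nested-tuple query encoding and all names are hypothetical.

```python
def evaluate(query, basis):
    """Evaluate a Boolean query over the retrieved basis.

    query: a keyword string, or ("and", q1, q2), ("or", q1, q2), ("not", keyword).
    basis: dict mapping each private keyword to its retrieved document IDs.
    """
    if isinstance(query, str):
        return set(basis[query])
    op = query[0]
    if op == "and":
        # Assumed reading: conjunction of document sets is their intersection.
        return evaluate(query[1], basis) & evaluate(query[2], basis)
    if op == "or":
        # Assumed reading: disjunction of document sets is their union.
        return evaluate(query[1], basis) | evaluate(query[2], basis)
    if op == "not":
        # Assumed reading of the negation note: documents of all other keywords.
        others = [set(docs) for kw, docs in basis.items() if kw != query[1]]
        return set().union(*others)
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical retrieved basis for K = (rex, owner, bob).
basis = {"rex": [1, 3], "owner": [1, 2], "bob": [2]}
matched = evaluate(("and", "rex", ("or", "owner", "bob")), basis)
```

Because this step runs entirely on the query processor's side after retrieval, it needs no cryptographic operations.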
B. The correctness
The correctness of a query processing protocol means that matched documents are saved with overwhelming probability and non-matched documents are saved with negligible probability. That is, the buffer decryption algorithm can distinguish collisions in the buffer from valid documents.
Definition 2: (Correctness of query processing protocol) Let $\nu(k)$ be a negligible function and $k$ be a security parameter. Let $d_C$ be the available documents stored at a server $\mathcal{C}$. Let $B^*$ be a subset of the matching documents. We say that a query processing protocol is correct if
$$\Pr[B^* = \{d \in d_C \mid Q_K(d) = 1\}] > 1 - \nu(k).$$
C. The security
Since the off-line document processing procedure is performed by the query processor itself, the definition of security of a private query processing protocol should be isolated from the off-line processing procedure. We therefore consider the following game between an adversary $\mathcal{A}$ and a challenger $\mathcal{S}$.
• $\mathcal{S}$ first invokes the key generation algorithm of an additively homomorphic encryption scheme to obtain $(pk, sk)$, and then sends $pk$ to $\mathcal{A}$;
• $\mathcal{A}$ chooses two queries for two sets of keywords $K_0, K_1 \subset \hat{K}$, and sends $(Q_{K_0}, Q_{K_1})$ to $\mathcal{S}$;
• $\mathcal{S}$ chooses a bit $b$ and invokes the encryption scheme to generate $\hat{W}_b$, where $\hat{W}_b = \{\hat{w}_{b,1}, \ldots, \hat{w}_{b,\lambda}\}$ and $\hat{w}_{b,j} = E_{pk}(1, r_{b,j})$ if $\hat{k}_j \in K_b$ and $\hat{w}_{b,j} = E_{pk}(0, r_{b,j})$ if $\hat{k}_j \in \hat{K} \setminus K_b$, $j = 1, \ldots, \lambda$;
• $\mathcal{S}$ creates an instance of the filtering algorithm with $\hat{W}_b$, and sends the state $B$ of the underlying Bloom filter to $\mathcal{A}$;
• $\mathcal{A}$ can experiment with the code of $B$ and finally outputs $b' \in \{0, 1\}$.
The adversary $\mathcal{A}$ wins the game if $b' = b$ and loses otherwise. We define the adversary's advantage in this game to be
$$\mathrm{Adv}_{\mathcal{A}}(k) = |\Pr(b' = b) - 1/2|.$$
Definition 3: (Semantic security of query processing protocols) A query processing protocol is semantically secure if for any adversary $\mathcal{A}$ described in the above game, $\mathrm{Adv}_{\mathcal{A}}(k)$ is a negligible function, where the probability is taken over the coin tosses of the challenger and the adversary.
III. BUILDING BLOCKS
This section sketches the building blocks that will be used to construct private query processing protocols on an inverted index program: $(m, n)$-Bloom Filters and additively homomorphic public-key schemes.
A. $(m, n)$-Bloom Filter with Storage
An $(m, n)$-Bloom Filter consists of an array of $n$ bits $B[1], \ldots, B[n]$, initially set to 0, together with $m$ independent random hash functions $h_1, \ldots, h_m$ with range $[1, \ldots, n]$. This work uses a variation of a Bloom Filter, called an $(m, n)$-Bloom Filter with Storage, first introduced and formalized in [2].
Definition 4: An $(m, n)$-Bloom Filter with Storage is a collection $\{h_i\}_{i=1}^{m}$ of functions together with a collection of sets $\{B_j\}_{j=1}^{n}$, where $h_i : \{0, 1\}^* \to [1, \ldots, n]$. To insert a pair $(u, v)$ into this structure, $v$ is added to $B_{h_i(u)}$ for all $i \in [m]$, where $[m] = [1, \ldots, m]$. To determine whether or not $v$ is stored for an element $u$ of a set $U$, one examines all of the sets $B_{h_i(u)}$, $i \in [m]$, and returns true if $v \in B_{h_i(u)}$ holds for every $i$.
As usual, we model the $h_i$ as uniform, independent randomness. For each $u \in U$, we define $H_u = \{h_i(u) \mid i \in [m]\}$. The correctness of our construction relies on the following lemma due to Boneh et al. [2].
Lemma 1: Let $(\{h_i\}_{i=1}^{m}, \{B_j\}_{j=1}^{n})$ be an $(m, n)$-Bloom Filter with Storage. Suppose the filter has been initialized to store some set $U$ of size $|U|$ and associated values. Suppose also that $n = \lceil c m |U| \rceil$, where $c > 1$ is a constant. Denote the relationship of element-value associations by $R(\cdot, \cdot)$. Then for any $u \in U$, the following statements hold true with probability $1 - \mathrm{neg}(k)$, where the probability is over the uniform randomness used to model the $h_i$ and $\mathrm{neg}(k)$ is a negligible function:
1) $u \in U$ if and only if $B_{h_i(u)} \neq \emptyset$ for all $i \in [m]$;
2) $\bigcap_{i \in [m]} B_{h_i(u)} = \{v \mid R(u, v) = 1\}$.
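The insert and lookup operations of an $(m, n)$-Bloom Filter with Storage can be sketched as follows. This is a toy illustration rather than the construction of [2]: salted SHA-256 stands in for the $m$ random hash functions (an assumption), bins are indexed from 0 instead of 1, and the class and method names are hypothetical.

```python
import hashlib

class BloomFilterWithStorage:
    """Toy (m, n)-Bloom filter with storage: n bins (sets) and m hash functions."""

    def __init__(self, m, n):
        self.m = m
        self.n = n
        self.bins = [set() for _ in range(n)]

    def _locations(self, u):
        # Assumption: SHA-256 salted with the index i models the hash function h_i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{u}".encode()).digest(), "big") % self.n
            for i in range(self.m)
        ]

    def insert(self, u, v):
        """Add the value v to the bin at h_i(u) for every i in [m]."""
        for loc in self._locations(u):
            self.bins[loc].add(v)

    def contains(self, u):
        """True if every bin at h_i(u) is non-empty (statement 1 of Lemma 1)."""
        return all(self.bins[loc] for loc in self._locations(u))

    def lookup(self, u):
        """Intersection of the bins at h_i(u) over all i (statement 2 of Lemma 1)."""
        return set.intersection(*(self.bins[loc] for loc in self._locations(u)))

bf = BloomFilterWithStorage(m=4, n=64)
bf.insert("rex", "doc1")
bf.insert("rex", "doc3")
bf.insert("owner", "doc2")
values_for_rex = bf.lookup("rex")
```

The lookup always contains every value inserted for the element; by Lemma 1, with $n$ chosen as $\lceil c m |U| \rceil$ it contains nothing else except with negligible probability.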
B. Additively homomorphic encryption scheme
Paillier investigated a novel computational problem called the composite residuosity class problem (CRS), and its applications to public-key cryptography, in [14]. The decisional composite residuosity class problem is the following: given $z \in_r Z^*_{N^2}$, decide whether $z$ is an $N$-th residue or a non-$N$-th residue. The decisional composite residuosity class assumption states that there exists no polynomial-time distinguisher for $N$-th residues modulo $N^2$.
Paillier's encryption scheme: The public key is a $2k$-bit RSA modulus $N = pq$, where $p, q$ are two large safe primes of length $k$, and the secret key is $(p, q)$. The plaintext space is $Z_N$ and the ciphertext space is $Z^*_{N^2}$. To encrypt a message $m \in Z_N$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext as $E_{PK}(m, r) = g^m r^N \bmod N^2$, where $g = (1 + N)$ has order $N$ in $Z^*_{N^2}$. To decrypt a ciphertext $c = (1 + N)^m r^N \bmod N^2$ with the help of the trapdoor information $(p, q)$, one first computes $c_1 = c \bmod N$, and then computes $r$ from the equation $r = c_1^{N^{-1} \bmod \phi(N)} \bmod N$. Finally, one computes $m$ from the equation $c r^{-N} \bmod N^2 = 1 + mN$. Paillier's public-key cryptosystem is homomorphic, i.e., $E_{PK}(m_1, r_1) \times E_{PK}(m_2, r_2) \bmod N^2 = E_{PK}(m_1 + m_2 \bmod N, r_1 \times r_2 \bmod N)$, and it is semantically secure if the decisional composite residuosity class problem is hard. We refer the reader to [14] for more details.
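The encryption and decryption equations above can be exercised with a toy implementation. This is a sketch for illustration only: the primes 1019 and 1187 are tiny and insecure, the function names are hypothetical, and the modular inverses rely on the three-argument `pow` with a negative exponent available in Python 3.8 and later.

```python
import math
import secrets

def keygen(p=1019, q=1187):
    """Return (N, phi(N)) for toy primes; real keys use large safe primes."""
    return p * q, (p - 1) * (q - 1)

def encrypt(n, m, r=None):
    """E(m, r) = (1 + N)^m * r^N mod N^2 with r drawn from Z_N^*."""
    n2 = n * n
    while r is None or math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, phi, c):
    """Recover r from c mod N, strip r^N, then read m off 1 + m*N."""
    n2 = n * n
    c1 = c % n                          # c1 = r^N mod N
    r = pow(c1, pow(n, -1, phi), n)     # r = c1^(N^-1 mod phi(N)) mod N
    u = (c * pow(r, -n, n2)) % n2       # u = 1 + m*N
    return (u - 1) // n

n, phi = keygen()
c_sum = (encrypt(n, 20) * encrypt(n, 22)) % (n * n)   # encrypts 20 + 22
c_hit = pow(encrypt(n, 1), 4242, n * n)               # E(1)^d encrypts d
c_miss = pow(encrypt(n, 0), 4242, n * n)              # E(0)^d encrypts 0
```

The last two lines show the property the filtering step relies on: exponentiating an encryption of 1 by an integer yields an encryption of that integer, while exponentiating an encryption of 0 yields an encryption of 0.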
The Damgård and Jurik [9] public-key encryption scheme, a length-flexible variant of Paillier's encryption scheme, will be used when the size of $\hat{d}_j \in \hat{d}$ is large (e.g., $|\hat{d}_j| > N$).
• The public key is a $2k$-bit RSA modulus $N = PQ$, where $P, Q$ are two large safe primes. The plaintext space is $Z_{N^s}$ and the ciphertext space is $Z^*_{N^{s+1}}$. The private key is $(P, Q)$ and the public key is $(N, s)$, where $s \geq 1$.
• To encrypt $m \in Z_{N^s}$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext $c$ as $E_{PK}(m, r) = (1 + N)^m r^{N^s} \bmod N^{s+1}$.
• To decrypt a ciphertext $c = (1 + N)^m r^{N^s} \bmod N^{s+1}$, the decryption algorithm $\mathcal{D}_{sk}$ first computes $c' = c \bmod N$ and then uses the secret key $\phi(N)$ to calculate $r \in Z^*_N$. Once given $r$, $\mathcal{D}_{sk}$ outputs the message $m \in Z_{N^s}$ accordingly.
The plaintext length of Damgård and Jurik's public-key encryption scheme is flexible and thus enables us to encrypt documents of any length, say $N < |\hat{d}_j| < N^s$. We will not distinguish Paillier's encryption from the Damgård and Jurik length-flexible encryption scheme. Both schemes are denoted by $(E_{pk}(), D_{sk}())$ uniformly throughout the paper.
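As a small illustration of the length-flexible encryption equation, the sketch below encrypts plaintexts larger than $N$ with $s = 2$ and checks the additive homomorphism directly on ciphertexts. It assumes toy primes, fixes the randomness so that the comparison is deterministic, and deliberately omits decryption; the function name is hypothetical.

```python
def dj_encrypt(n, s, m, r):
    """Damgard-Jurik encryption: (1 + N)^m * r^(N^s) mod N^(s+1)."""
    modulus = n ** (s + 1)
    return (pow(1 + n, m, modulus) * pow(r, n ** s, modulus)) % modulus

n = 1019 * 1187               # toy modulus, far too small for real use
s = 2
m1, m2 = n + 5, 2 * n + 7     # plaintexts above N, allowed because they are below N^s
c1 = dj_encrypt(n, s, m1, 17)
c2 = dj_encrypt(n, s, m2, 23)
product = (c1 * c2) % n ** (s + 1)
```

Multiplying the two ciphertexts gives exactly the encryption of `m1 + m2` under randomness `17 * 23`, and setting `s = 1` recovers the Paillier encryption formula.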
IV. PRIVATE QUERY PROCESSING PROTOCOL
In this section, an implementation of the private query processing protocol in the mapping-reducing-filtering framework is introduced and formalized. We show that the proposed query protocol is semantically secure if the underlying homomorphic public-key encryption scheme is semantically secure.
A. The description
1) The input and output: The input of a query processor $P$ is a private keyword set $K = (k_1, \ldots, k_\gamma)$. The input of a possibly corrupted server $S$ is a document set $d_C = (d_1, \ldots, d_l)$. The output of $P$ is a document set $d_K \subset d_C$, where $d_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. The output of $S$ is $\perp$.
2) Common reference keyword set: The common reference string of the query processing protocol consists of a common reference keyword set $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ and a public description of an inverted index program. We assume that $K \subset \hat{K} \subset D$.
3) An initialization algorithm: The initialization algorithm $\mathcal{I}$ comprises two PPT algorithms: an additively homomorphic encryption key generation algorithm $\mathcal{I}_1$ (an instance of Paillier's encryption scheme throughout the paper) and a keyword hiding algorithm $\mathcal{I}_2$. The details of the algorithms are described below:
• On input a security parameter $1^k$, $P$ invokes $\mathcal{I}_1$ to generate two large safe prime numbers $p$ and $q$ such that $|p| = |q| = k$. Let $N = pq$, $pk = (N, s)$ ($s \geq 1$) and $sk = (p, q)$. Let $E_{pk}()$ be Paillier's encryption scheme defined over $pk$ and $D_{sk}()$ be the corresponding decryption algorithm.
• On input $\hat{K}$ and $K$, $P$ invokes $\mathcal{I}_2$ to generate a ciphertext set $\hat{W}$ of $\hat{K}$ projected on $K$, where $\hat{W} = \{\hat{w}_1, \ldots, \hat{w}_\lambda\}$ and, with fresh randomness $r_j$ for each ciphertext, $\hat{w}_j = E_{pk}(1, r_j)$ if $\hat{k}_j \in K$ and $\hat{w}_j = E_{pk}(0, r_j)$ if $\hat{k}_j \in \hat{K} \setminus K$, $j = 1, \ldots, \lambda$.
4) An inverted index program: On input $d_C$, $P$ invokes the inverted index program to generate an inverted index $\hat{M}$ below:
$$\hat{M} = \begin{pmatrix} \hat{k}_1 : d_{1,1}, \ldots, d_{1,\beta_1} \\ \hat{k}_2 : d_{2,1}, \ldots, d_{2,\beta_2} \\ \vdots \\ \hat{k}_\lambda : d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda} \end{pmatrix}$$
5) A filtering algorithm: A filtering algorithm $\mathcal{F}$ comprises the following three algorithms: a collection algorithm $\mathcal{F}_0$, a buffer encoding algorithm $\mathcal{F}_1$ and a buffer decoding algorithm $\mathcal{F}_2$. On input a query $Q_K$, the filtering algorithm performs the following computations:
• For $j = 1, \ldots, \lambda$, $P$ invokes the collection algorithm $\mathcal{F}_0$ to construct a temporary collection $c(j) = \hat{w}_j^{\hat{d}_j}$, where $\hat{d}_j = (d_{j,1}, \ldots, d_{j,\beta_j})$.
• For $j = 1, \ldots, \lambda$, let $u(j) \leftarrow \hat{k}_j$ and $v(j) \leftarrow c(j)$. Given $(u(j), v(j))$, $P$ invokes the encoding algorithm $\mathcal{F}_1$ to throw $m$ copies of $v(j)$ into the $n$ bins of the $(m, n)$-Bloom Filter at locations $\{h_i(u(j))\}_{i=1}^{m}$. Let $B$ be the current state of the $(m, n)$-Bloom Filter.
• Given $B$, $P$ invokes $\mathcal{F}_2$ to compute the locations $h_i(u(j))$ ($1 \leq i \leq m$, $1 \leq j \leq \lambda$) and then checks that each specified location stores some data. If some specified location is empty, $\mathcal{F}_2$ outputs 0, indicating the failure of the buffer storage. In case the output is 1, $\mathcal{F}_2$ checks that $u(j) \stackrel{?}{=} \hat{k}_j$; if the check is valid, $\mathcal{F}_2$ decrypts $c(j)$ to obtain $\hat{d}_j$.
• Given $\hat{d}_K$, the query processor computes $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.
This ends the description of the query processing protocol.
B. The correctness
Before proving the security of the scheme, we show the correctness of the private query processing protocol. Let $\lambda$ be the number of documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ generated by the inverted index program. Each document has $m$ copies that are thrown into the $m$-out-of-$n$ bins randomly specified by the hash functions $\{h_i\}_{i=1}^{m}$. Borrowing the notation from [12], [13], we call a document $\hat{d}_j$ a color $C_j$ ($j = 1, \ldots, \lambda$) and call a copy of the color $C_j$ a ball $B(j, k)$, where $k = 1, \ldots, m$. Thus, we have a total of $m\lambda$ balls that are thrown into $n$ bins. A ball survives if no other ball lands in its bin. We say a color $C_j$ survives if at least one ball of color $C_j$ survives. We say that the color-survival game succeeds if all $\lambda$ colors survive; otherwise, we say that it fails.
Let $E$ be the event that a single specified ball survives this process. Then $\Pr[E] = \left(\frac{n-1}{n}\right)^{m\lambda - 1} > \frac{1}{\sqrt{e}}$ assuming that $n \geq 2m\lambda$. Let $E_j$ be the event that the $j$-th ball of a certain color does not survive. Then the probability that none of the $m$ balls of this color survives is $\Pr[\bigcap_{j=1}^{m} E_j] \leq \left(1 - \frac{1}{\sqrt{e}}\right)^m < (1/2)^m$. Let $E^*$ be the event that at least one of the colors does not survive and $E_j^*$ be the event that the color $C_j$ does not survive. Then $\Pr[E^*] \leq \Pr[\bigcup_{j=1}^{\lambda} E_j^*] \leq \sum_{j=1}^{\lambda} \Pr[E_j^*] \leq \frac{\lambda}{2^m}$, which is clearly negligible in $m$. This means that, with overwhelming probability, all colors survive and hence all $\lambda$ documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ are retrievable from our Bloom Filter with Storage.
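The two bounds in the correctness argument are easy to check numerically. The sketch below evaluates them for one hypothetical parameter choice ($m = 40$ copies, $\lambda = 1000$ keywords, $n = 2m\lambda$ bins); the function names and parameter values are assumptions made only for this illustration.

```python
import math

def single_ball_survival(m, lam, n):
    """Pr[E] = ((n - 1) / n)^(m*lam - 1): the other m*lam - 1 balls all miss the bin."""
    return ((n - 1) / n) ** (m * lam - 1)

def failure_bound(m, lam):
    """Union bound lam / 2^m on the probability that some color does not survive."""
    return lam / 2 ** m

m, lam = 40, 1000
n = 2 * m * lam               # the smallest n allowed by the assumption n >= 2*m*lam
p_survive = single_ball_survival(m, lam, n)
bound = failure_bound(m, lam)
```

For these parameters the single-ball survival probability sits just above $1/\sqrt{e}$ and the failure bound is below $10^{-9}$, which shows how quickly the bound shrinks as $m$ grows.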
C. The proof of security
Theorem 1: The query processing protocol described above is semantically secure assuming that the underlying Paillier's encryption is semantically secure.
Proof: Suppose there exists an adversary $\mathcal{A}$ that can gain a non-negligible advantage $\epsilon$ in the semantic security game of Definition 3. We will show that $\mathcal{A}$ can be used to gain an advantage in breaking the semantic security of the underlying public-key encryption scheme.
A challenger $\mathcal{S}$ is first given an encryption $c$ of a message $m_b \in \{0, 1\}$ chosen uniformly at random, i.e., $c = E_{pk}(m_b)$ (note that the challenger $\mathcal{S}$ is also given the public key $pk$ but not the secret key $sk$ of the underlying Paillier's encryption scheme).
The challenger $\mathcal{S}$ is also given two sets of keywords $K_0 \subset \hat{K}$ and $K_1 \subset \hat{K}$ such that $K_0 \neq K_1$. The challenger $\mathcal{S}$ now generates a ciphertext set $\hat{W}$ of the given reference keyword set $\hat{K}$ by the following procedure: for each keyword $w$, it uses a re-randomized encryption $E_{pk}(0)$ if $w \in \hat{K} \setminus K_b$ and $E_{pk}(0) \cdot c$ if $w \in K_b$.
• If $m_b = 1$, then the construction of the MapReduce filter is exactly the same as in the real protocol described above; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2 + \epsilon$.
• If $m_b = 0$, then the simulated MapReduce filter searches nothing; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2$.
$\mathcal{S}$ now outputs what the adversary outputs. Since the two cases are equally likely, $\mathcal{S}$ succeeds with probability $\frac{1}{2}\left(\frac{1}{2} + \epsilon\right) + \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{2} + \frac{\epsilon}{2}$, i.e., it obtains a non-negligible advantage $\epsilon/2$ in breaking the semantic security of Paillier's encryption.
V. CONCLUSION
We have implemented a private query processing protocol on an inverted index program and have shown that the
proposed protocol is semantically secure if the underlying
homomorphic public-key encryption scheme is semantically
secure. The formalization of the private query processing
protocol is general and we expect more applications can be
deployed within the proposed framework.
REFERENCES
[1] M.Armbrust, A.Fox, R.Griffith, A.D.Joseph, R.H.Katz, A.Konwinski, G.Lee, D.A.Patterson, A.Rabkin, I.Stoica, M.Zaharia: Above the Clouds: A Berkeley View of Cloud Computing, Technical Report No. UCB/EECS-2009-28
[2] D.Boneh, E.Kushilevitz, R.Ostrovsky and W.E.Skeith III: Public Key
Encryption That Allows PIR Queries. CRYPTO 2007: 50-67
[3] J.Bethencourt, D.X.Song and B.Waters: New Constructions and Practical
Applications for Private Stream Searching (Extended Abstract). IEEE
Symposium on Security and Privacy 2006: 132-139
[4] John Bethencourt, Dawn Xiaodong Song, Brent Waters: New Techniques
for Private Stream Searching. ACM Trans. Inf. Syst. Secur. 12(3): (2009)
[5] S.Ding, J.Attenberg and T.Suel: Scalable techniques for document identifier assignment in inverted indexes. WWW 2010: 311-320
[6] J.Dean and S.Ghemawat: MapReduce: Simplified Data Processing on
Large Clusters. OSDI 2004: 137-150
[7] J.Dean and S.Ghemawat: MapReduce: simplified data processing on large
clusters. Commun. ACM 51(1): 107-113 (2008)
[8] J.Dean and S.Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010)
[9] I.Damgård and M.Jurik: A Generalisation, a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. Public Key Cryptography 2001: 119-136.
[10] R.L.Grossman. The Case for Cloud Computing. IT Professional 11(2):
23-27 (2009)
[11] Robert L. Grossman, Yunhong Gu, Michal Sabala, Wanzhi Zhang:
Compute and storage clouds using wide area high performance networks.
Future Generation Comp. Syst. 25(2): 179-183 (2009)
[12] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
CRYPTO 2005: 223-240
[13] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
J. Cryptology 20(4): 397-430 (2007)
[14] P.Paillier: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. EUROCRYPT 1999: 223-238.
[15] S.Petrovic and P.Brown: Large Scale Analysis of the eDonkey P2P File
Sharing System. INFOCOM 2009: 2746-2750
[16] H.Wan, C.Tan and Q.Li: Snoogle: A Search Engine for the Physical
World. INFOCOM 2008: 1382-1390
[17] H.Yan, S.Ding and T.Suel: Compressing term positions in web indexes.
SIGIR 2009: 147-154
[18] H.Yan, S.Ding and T.Suel: Inverted index compression and query
processing with optimized document ordering. WWW 2009: 401-410