Private Query Processing on Inverted Index
Wee Keong Ng
School of Computer Engineering
NTU, Singapore
Yonggang Wen
School of Computer Engineering
NTU, Singapore
Huafei Zhu
CAS, I2R
A*STAR, Singapore
978-1-4673-0279-1/11/$26.00 ©2011 IEEE
Abstract: A private query criteria $Q_K$ is a Boolean logic expression in $\wedge$, $\vee$ and $\neg$ over an input keyword set $K$. A private query processing protocol takes as input a private query criteria $Q_K$ and a public data set $S_C$ and outputs a document set $S \subseteq S_C$ such that $Q_K(S) = 1$. This paper studies private query processing protocols in the context of inverted index programs and makes the following three contributions:
1) First, a new notion of private query processing protocols defined over inverted index programs within the Mapping-Reducing-Filtering framework is introduced and formalized. Our formalization is general and can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes.
2) Second, a new implementation of private query processing protocols based on $(k, m)$-Bloom filters with storage and additively homomorphic public-key encryption is proposed. The idea behind our implementation is that a map function Map is activated to generate a matrix $M$ of form (document$_i$ : word$_{i,1}$, $\ldots$, word$_{i,j}$; $i = 1, \ldots, n$). The reduce function Reduce is then invoked to generate an inverted index $\hat{M}$ of form (keyword$_i$, document$_{i,1}$, $\ldots$, document$_{i,\alpha_i}$). Finally, a $(k, m)$-Bloom Filter with storage is activated to generate matched documents according to the specified query criteria.
3) Third, we show that the proposed query processing protocol on the inverted index is semantically secure assuming that the underlying additively homomorphic public-key encryption is semantically secure.
To the best of our knowledge, this is the first semantically secure query processing protocol defined over inverted index programs, and we expect more applications to be deployed within this framework.
I. INTRODUCTION
The task of retrieving commercial data in the presence of malicious adversaries falls into the general field of private information retrieval (PIR), which has been well studied [12], [13], [3], [4]. For example, Ostrovsky and Skeith [12], [13] proposed solutions for private searching on streaming data, where a client queries whether a server stores data containing a keyword key and, if some data contains key, the client would like to obliviously retrieve this data such that the corrupted server learns neither the specified keyword nor which data is retrieved.
We demonstrate, however, that the Ostrovsky-Skeith (OS) protocols may not work efficiently in certain applications. When mining data sets in a pet identification scenario, a retriever may be interested in the owner information of a pet rather than other information. As a result, many words in a stored document $s$ can be ignored during the course of PIR. (We do not claim that general PIR does not work for mining data sets; rather, we emphasize that the general PIR technique may not work efficiently. This argument applies to the results presented in [3], [4] as well.) One can therefore expect more efficient solutions to these problems than the general methods presented in [12], [13], [3], [4]. Since no known result deals with the motivating problem above, we formalize the following research problem:
On input a private query criteria $Q_K$ and a public document set $S_C$ stored on a possibly corrupted server $C$, how can one implement an efficient query processing protocol that outputs a subset of documents $S \subseteq S_C$ satisfying $Q_K(S) = 1$, while $C$ learns nothing about the specified criteria $Q_K$ or about which $S$ is retrieved from $S_C$?
A. This work
This paper intends to provide an efficient implementation
of private processing protocols in the context of inverted
index programs. To help the reader understand the idea of
our implementation, we would like to first sketch the basic
notions of MapReduce and Inverted index and then provide a
high level description of our implementation.
MapReduce: MapReduce, introduced by Dean and Ghemawat [6], [7], [8], is a programming model whose programs are automatically parallelized and executed on a large cluster of commodity machines. A MapReduce program, which supports distributed computing on large data sets, consists of two functions: a map function and a reduce function. The map function Map transforms a piece of data into (key, value) pairs, whereas the reduce function Reduce merges the emitted values of the same key into a single result (key: value$_1$, $\ldots$, value$_n$).
The MapReduce model is general, and many interesting programs can be easily expressed as MapReduce computations: distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, distributed sort and inverted index [6], [7], [8]. Applications of MapReduce in cloud computing scenarios are discussed in [1], [10], [11].
Inverted index: An inverted index program is an instance of MapReduce that stores, for each keyword occurring somewhere in the collection, information about the locations where it occurs. The map function in the inverted index problem parses each document and emits a sequence of <documentID, word> pairs. The reduce function accepts all pairs of a given word, sorts the corresponding document IDs and emits a <word, list(documentID)> pair. The set of all output pairs forms an inverted index. Numerous applications of inverted index programs have been introduced so far; we refer the reader to [5], [17], [18], [15], [16] for further reference.
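The map and reduce steps of an inverted index program can be illustrated with a minimal single-machine sketch. The function names `map_document`, `reduce_pairs` and `build_inverted_index`, the whitespace tokenization and the sample documents are all assumptions made for illustration; a real MapReduce deployment would distribute both steps over a cluster.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (doc_id, word) pair per distinct word of the document."""
    return [(doc_id, word) for word in sorted(set(text.lower().split()))]


def reduce_pairs(pairs):
    """Reduce step: group the pairs by word and emit word -> sorted document IDs."""
    postings = defaultdict(set)
    for doc_id, word in pairs:
        postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def build_inverted_index(documents):
    """Run the map step on every document, then a single reduce step."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_document(doc_id, text))
    return reduce_pairs(pairs)


# Hypothetical toy collection in the spirit of the pet identification scenario.
documents = {1: "cat owner alice", 2: "dog owner bob", 3: "cat food"}
index = build_inverted_index(documents)
```

Here `index["cat"]` lists the documents that contain the word "cat", which is exactly the per-keyword posting list that the rest of the paper filters privately.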
A high-level description of our implementation: Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary such that $|D| = \alpha < \infty$. Let $K = (k_1, \ldots, k_{|K|})$ be a set of keywords selected by a query processor, and let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_l)$ be a set of keywords selected by the inverted index program. We assume that $K \subseteq \hat{K} \subseteq D$. Let $S_C = (s_1, \ldots, s_n)$ be a document set stored in a server $C$. Our implementation is sketched below.
1) Inverted index generation procedure: on input $S_C$, the query processor invokes the MapReduce function to perform the following computations:
- invoking a map function Map with input $(s_1, \ldots, s_n)$ and $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_l)$ to generate a matrix $M$ of form (document, keyword):
$$M = \begin{pmatrix} s_1 : & \hat{k}_{1,1}, \ldots, \hat{k}_{1,\alpha_1} \\ s_2 : & \hat{k}_{2,1}, \ldots, \hat{k}_{2,\alpha_2} \\ \vdots & \vdots \\ s_n : & \hat{k}_{n,1}, \ldots, \hat{k}_{n,\alpha_n} \end{pmatrix}$$
- invoking a reduce function Reduce with input matrix $M$ to generate an inverted index $\hat{M}$:
$$\hat{M} = \begin{pmatrix} \hat{k}_1 : & s_{1,1}, \ldots, s_{1,\beta_1} \\ \hat{k}_2 : & s_{2,1}, \ldots, s_{2,\beta_2} \\ \vdots & \vdots \\ \hat{k}_l : & s_{l,1}, \ldots, s_{l,\beta_l} \end{pmatrix}$$
2) Document filtering procedure: on input $\hat{M}$, the query processor invokes an additively homomorphic encryption scheme (say, Paillier's encryption scheme) to filter all documents containing keywords in $K$ by the following processing:
- the query processor encodes $\hat{K}$ by invoking Paillier's homomorphic encryption scheme $E_{pk}()$ to generate ciphertexts $\{\hat{w}_i\}_{i=1}^{l}$ of the set $\hat{K}$, where $\hat{w}_i = E_{pk}(1, r_i)$ if $\hat{k}_i \in K$ and $\hat{w}_i = E_{pk}(0, r_i)$ if $\hat{k}_i \in \hat{K} \setminus K$.
- Let $\hat{W} = \{\hat{w}_i\}_{i=1}^{l}$. Let $\hat{s}_i = (s_{i,1}, \ldots, s_{i,\beta_i})$ for $i = 1, \ldots, l$. The query processor generates $c_i \leftarrow \hat{w}_i^{\hat{s}_i}$, and then randomly distributes $k$ copies of $c_i$ into the $m$ bins of the $(k, m)$-Bloom Filter.
Note that we can adjust the system parameter $s \geq 1$ in Paillier's encryption to guarantee that the message space is sufficiently large for encrypting each plaintext $\hat{s}_i$. See Section 3 for more details.
3) Basis retrieving procedure: the query processor retrieves the matched document set $\hat{S}_K$ from the $m$ bins of the $(k, m)$-Bloom Filter, where $\hat{S}_K = \{\hat{s}_i \mid (\hat{k}_i, \hat{s}_i) \in \hat{M}, \hat{k}_i \in K\}$. By the correctness of the protocol in Section 4, the basis is retrievable from the $(k, m)$-Bloom Filter with storage with overwhelming probability.
4) Off-line document processing procedure: given $\hat{S}_K$, the query processor computes $S$ from the specified Boolean logic expression $Q_K(\hat{s}_1, \ldots, \hat{s}_{|K|})$ by substituting $k_i$ in $Q_K$ with the corresponding document set $\hat{s}_i$.
This ends a brief description of our protocol. Clearly, the query processor in our model only needs to generate a set of ciphertexts $\{\hat{w}_i\}_{i=1}^{|\hat{K}|}$ of the common reference keyword set $\hat{K}$ projected on the specified private keyword set $K$. The retrieved basis $\hat{S}_K$ for the Boolean criteria $Q_K$ is sufficient for the query processor to compute $S = Q_K(\hat{S}_K)$. Consequently, the computational complexity of the query protocol is significantly reduced: the number of ciphertexts generated during the private query processing is $|\hat{K}|$ rather than one ciphertext per dictionary word as in the OS protocols, which pays off whenever $\hat{K}$ is much smaller than the dictionary.
B. The novelty of our query protocol
One can see that the mapping-reducing-filtering model is general in the following sense. If the MapReduce keyword set $\hat{K}$ is the whole dictionary $D$ and the map function Map and the reduce function Reduce are both dummy, then the mapping-reducing-filtering model reduces to the MapReduce-free, filtering-only model [12], [13]; if only the reduce function is dummy, then the reduced model is equivalent to the mapping-filtering model. As a result, our framework can be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes as well (see [5], [17], [18], [15], [16] for details). The mapping-reducing-filtering model also allows an inverted index to select a common reference word set independently of the selection of a private keyword set and of the dictionary $D$. Such a model therefore allows us to avoid encrypting every word in the dictionary, as the protocols presented in [12], [13] do, and thus a private query processing protocol defined in the mapping-reducing-filtering model is much more efficient and flexible than one defined in the MapReduce-free, filtering-only model or in the mapping-filtering model.
C. The result
We remark that the private query processing protocol described above guarantees that non-matching documents are not collected, while matching documents are collected with overwhelming probability. We show that the query processing protocol proposed in this paper is privacy-preserving assuming that the underlying additively homomorphic public-key encryption is semantically secure. To the best of our knowledge, this is the first private query processing protocol defined over the inverted index program in the Mapping-Reducing-Filtering framework. We therefore expect more applications to be deployed within this framework.
Roadmap: The rest of this paper is organized as follows. The syntax and security definition of query processing protocols are introduced and formalized in Section 2. In Section 3, the building blocks, namely an additively homomorphic public-key encryption scheme (say, Paillier's encryption scheme) and $(k, m)$-Bloom Filters with Storage, are sketched. An implementation of the private query processing protocol is described and analyzed in Section 4. We conclude this work in Section 5.
II. PRIVACY-PRESERVING QUERY PROCESSING ON INVERTED INDEX
A. Syntax
Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary. Let $K = (k_1, \ldots, k_{|K|})$ be a set of keywords selected by a query processor, who keeps $K$ secret (hence $K$ is called a private keyword set). Let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_l)$ be a set of keywords selected by the inverted index. The keyword set $\hat{K}$ is publicly known (hence $\hat{K}$ is called a common reference keyword set). We assume that $K \subseteq \hat{K} \subseteq D$.
Let $\mathcal{Q}$ be a class of query types; for example, $\mathcal{Q}$ could be the class of logical expressions in $\wedge$, $\vee$ and $\neg$. Let $S_C = (s_1, \ldots, s_n)$ be $n$ documents stored in a server $C$. Given a set of keywords $K \subseteq D$ and a query $Q \in \mathcal{Q}$, we define $Q_K : S \rightarrow \{0, 1\}$, which takes a document subset $S$ as input and returns 1 if and only if $S$ matches the criteria.
Definition 1: (Syntax of query processing) For a query $Q_K$ over a set of keywords $K$, and for a subset $S \subseteq S_C$, we say $S$ matches query $Q_K$ if and only if $Q_K(S) = 1$.
To compute $S$ from a given set $S_K = \{s_1, \ldots, s_{|K|}\}$, the query processor first maps the private query criteria $Q_K$ to $S_K$ by the following procedure:
- $k_i \wedge k_j$ if and only if $s_i \wedge s_j$;
- $k_i \vee k_j$ if and only if $s_i \vee s_j$;
- $\neg k_i$ if and only if $\neg s_i$. Note that $\neg k_i = K \setminus \{k_i\}$; it follows that $\neg s_i = S_K \setminus \{s_i\}$.
The query processor then substitutes $k_i$ in $Q_K$ with the corresponding document set $\hat{s}_i$. Finally, the query processor obtains $S$ from the Boolean logic expression $Q_K(\hat{s}_1, \ldots, \hat{s}_{|K|})$.
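The substitution step can be illustrated with a small sketch that evaluates a Boolean query over posting sets. This is one possible reading, not the authors' code: it assumes that conjunction is interpreted as set intersection, disjunction as set union, and negation as the complement with respect to the whole document collection. The nested-tuple query encoding, the function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(query, postings, universe):
    """Evaluate a Boolean query over posting sets.

    query is either a keyword string, ("and", q1, q2), ("or", q1, q2) or ("not", q).
    postings maps each keyword to the set of document IDs that contain it.
    universe is the set of all document IDs (used for negation; an assumption).
    """
    if isinstance(query, str):
        return set(postings.get(query, ()))
    op = query[0]
    if op == "and":
        return evaluate(query[1], postings, universe) & evaluate(query[2], postings, universe)
    if op == "or":
        return evaluate(query[1], postings, universe) | evaluate(query[2], postings, universe)
    if op == "not":
        return set(universe) - evaluate(query[1], postings, universe)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical retrieved basis: one posting set per private keyword.
postings = {"cat": {1, 3}, "owner": {1, 2}, "food": {3}}
universe = {1, 2, 3}

# Documents that mention "cat" but not "food".
result = evaluate(("and", "cat", ("not", "food")), postings, universe)
```

Because this step runs entirely on the query processor's side after retrieval, it reveals nothing further to the server, which is why the security definition can be isolated from it.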
B. The correctness
The correctness of a query processing protocol means that matched documents must be saved with overwhelming probability, while non-matched documents are saved with only negligible probability. That is, the buffer decryption algorithm can distinguish collisions in the buffer from valid documents.
Definition 2: (Correctness of query processing protocol) Let $\epsilon(k)$ be a negligible function and $k$ be a security parameter. Let $S_C$ be the available documents stored at a server $C$. Let $B^*$ be a subset of the matching documents. We say that a query processing protocol is correct if
$$\Pr[B^* = \{s \in S_C \mid Q_K(s) = 1\}] > 1 - \epsilon(k)$$
C. The security
Since the off-line document processing procedure is performed by the query processor itself, the definition of security of a private query processing protocol should be isolated from the off-line processing procedure. We therefore consider the following game between an adversary $\mathcal{A}$ and a challenger $\mathcal{S}$.
- $\mathcal{S}$ first invokes the key generation algorithm of an additively homomorphic encryption scheme to obtain $(pk, sk)$, and then sends $pk$ to $\mathcal{A}$;
- $\mathcal{A}$ chooses two queries for two sets of keywords $K_0, K_1 \subseteq \hat{K}$, and sends $(Q_{K_0}, Q_{K_1})$ to $\mathcal{S}$;
- $\mathcal{S}$ chooses a bit $b$ and invokes the encryption scheme to generate $\hat{W}_b$, where $\hat{W}_b = \{\hat{w}_{b,1}, \ldots, \hat{w}_{b,l}\}$ and $\hat{w}_{b,i} = E_{pk}(1, r_{b,i})$ if $\hat{k}_{b,i} \in K_b$ and $\hat{w}_{b,i} = E_{pk}(0, r_{b,i})$ if $\hat{k}_{b,i} \in \hat{K} \setminus K_b$, $i = 1, \ldots, l$;
- $\mathcal{S}$ creates an instance of the filtering algorithm with $\hat{W}_b$, and sends the state $B$ of the underlying Bloom filter to $\mathcal{A}$;
- $\mathcal{A}$ can experiment with the code of $B$ and finally outputs $b' \in \{0, 1\}$.
The adversary $\mathcal{A}$ wins the game if $b' = b$ and loses otherwise. We define the adversary's advantage in this game to be
$$\mathrm{Adv}_{\mathcal{A}}(k) = |\Pr(b' = b) - 1/2|$$
Definition 3: (Semantic security of query processing protocols) A query processing protocol is semantically secure if, for any adversary $\mathcal{A}$ described in the above game, $\mathrm{Adv}_{\mathcal{A}}(k)$ is a negligible function, where the probability is taken over the coin tosses of the challenger and the adversary.
III. BUILDING BLOCKS
This section sketches the building blocks that will be used to construct private query processing protocols on an inverted index program: $(k, m)$-Bloom Filters and additively homomorphic public-key schemes.
A. $(k, m)$-Bloom filter with Storage
A $(k, m)$-Bloom Filter consists of an array of $m$ bits $B[1], \ldots, B[m]$, initially set to 0, together with $k$ independent random hash functions $h_1, \ldots, h_k$ with range $[1, \ldots, m]$. This work uses a variation of a Bloom Filter, called a $(k, m)$-Bloom Filter with Storage, first introduced and formalized in [2].
Definition 4: A $(k, m)$-Bloom Filter with Storage is a collection $\{h_i\}_{i=1}^{k}$ of functions together with a collection of sets $\{B_j\}_{j=1}^{m}$, where $h_i : \{0, 1\}^* \rightarrow [1, \ldots, m]$. To insert a pair $(u, v)$ into this structure, $v$ is added to $B_{h_i(u)}$ for all $i \in [k]$, where $[k] = [1, \ldots, k]$. To determine whether or not $v$ is stored in a set $S$, one checks that $v \in B_{h_i(u)}$ for all $i \in [k]$ and returns true if all checks are valid.
As usual, we model the $h_i$ as uniform, independent randomness. For each $u \in W$, we define $H_u = \{h_i(u) \mid i \in [k]\}$. The correctness of our construction relies on the following lemma due to Boneh et al. [2].
Lemma 1: Let $(\{h_i\}_{i=1}^{k}, \{B_j\}_{j=1}^{m})$ be a $(k, m)$-Bloom Filter with Storage. Suppose the filter has been initialized to store some set $S$ of size $|S|$ and associated values. Suppose also that $m = \lceil c k |S| \rceil$, where $c > 1$ is a constant. Denote the relationship of element-value associations by $R(\cdot, \cdot)$. Then for any $u \in W$, the following statements hold true with probability $1 - \mathrm{neg}(k)$, where the probability is over the uniform randomness used to model the $h_i$ and $\mathrm{neg}(k)$ is a negligible function:
1) $u \in S$ if and only if $B_{h_i(u)} \neq \emptyset$ for all $i \in [k]$;
2) $\bigcap_{i \in [k]} B_{h_i(u)} = \{v \mid R(u, v) = 1\}$.
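The data structure of Definition 4 can be sketched in a few lines. In this illustration the $k$ hash functions are simulated by salting SHA-256 with the index $i$, which is an assumption standing in for the independent random functions of the model; the class name `BloomFilterWithStorage` and its method names are hypothetical. Bins are indexed from 0 rather than 1.

```python
import hashlib


class BloomFilterWithStorage:
    """A (k, m)-Bloom filter with storage: m bins, each holding a set of values."""

    def __init__(self, k, m):
        self.k = k
        self.m = m
        self.bins = [set() for _ in range(m)]

    def _locations(self, u):
        # h_i(u) simulated by SHA-256 over "i:u" (assumption, not a random oracle).
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{u}".encode()).digest(), "big") % self.m
            for i in range(self.k)
        ]

    def insert(self, u, v):
        """Add the value v to bin B_{h_i(u)} for every i in [k]."""
        for j in self._locations(u):
            self.bins[j].add(v)

    def contains(self, u):
        """Statement 1 of Lemma 1: u is stored iff all of its k bins are non-empty."""
        return all(self.bins[j] for j in self._locations(u))

    def values(self, u):
        """Statement 2 of Lemma 1: intersect the k bins to recover the values of u."""
        return set.intersection(*(self.bins[j] for j in self._locations(u)))


# Two stored elements, k = 8 and c = 4, so m = c * k * |S| = 64 bins.
bf = BloomFilterWithStorage(k=8, m=64)
bf.insert("cat", "doc1")
bf.insert("cat", "doc3")
bf.insert("dog", "doc2")
```

The intersection in `values` always contains every value inserted under `u`; it can contain extra values only if all $k$ bins of `u` collide with bins of another element, which is the low-probability event the lemma bounds.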
B. Additively homomorphic encryption scheme
Paillier investigated a novel computational problem called the composite residuosity class problem (CRS), and its applications to public-key cryptography, in [14]. The decisional composite residuosity class problem is stated as follows: given a random $z \in Z^*_{N^2}$, decide whether $z$ is an $N$-th residue or a non-$N$-th residue. The decisional composite residuosity class assumption states that there exists no polynomial-time distinguisher for $N$-th residues modulo $N^2$.
Paillier's encryption scheme: The public key is a $2k$-bit RSA modulus $N = pq$, where $p, q$ are two large safe primes of length $k$, and the secret key is $(p, q)$. The plaintext space is $Z_N$ and the ciphertext space is $Z^*_{N^2}$. To encrypt a message $m \in Z_N$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext as $E_{pk}(m, r) = g^m r^N \bmod N^2$, where $g = (1 + N)$ has order $N$ in $Z^*_{N^2}$. To decrypt a ciphertext $c = (1 + N)^m r^N \bmod N^2$ with the help of the trapdoor information $(p, q)$, one first computes $c_1 = c \bmod N$, and then computes $r$ from the equation $r = c_1^{N^{-1} \bmod \phi(N)} \bmod N$; finally, one computes $m$ from the equation $c r^{-N} \bmod N^2 = 1 + mN$. Paillier's public-key cryptosystem is homomorphic, i.e., $E_{pk}(m_1, r_1) \times E_{pk}(m_2, r_2) \bmod N^2 = E_{pk}(m_1 + m_2 \bmod N, r_1 \times r_2 \bmod N)$, and it is semantically secure if the decisional composite residuosity class problem is hard. We refer the reader to [14] for more details.
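The encryption and decryption equations above can be checked on a toy instance. The sketch below follows the three decryption steps as stated in the text; the primes 1789 and 1861 are an assumption chosen only to keep the numbers small (they are neither safe primes nor of cryptographic size), and the function names are hypothetical. It relies on Python 3.8 or later for modular inverses via `pow`.

```python
import math
import random

# Toy parameters (assumption): real deployments need large safe primes.
p, q = 1789, 1861
N = p * q
N2 = N * N
phi = (p - 1) * (q - 1)


def encrypt(m, r=None):
    """E(m, r) = (1 + N)^m * r^N mod N^2 with r drawn from Z_N^*."""
    if r is None:
        r = random.randrange(1, N)
        while math.gcd(r, N) != 1:
            r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Recover m by first recovering r, as in the description above."""
    c1 = c % N                         # c1 = r^N mod N
    r = pow(c1, pow(N, -1, phi), N)    # r = c1^(N^-1 mod phi(N)) mod N
    u = (c * pow(r, -N, N2)) % N2      # u = c * r^(-N) mod N^2 = 1 + m*N
    return (u - 1) // N


c_a = encrypt(5)
c_b = encrypt(7)
sum_ciphertext = (c_a * c_b) % N2      # encrypts 5 + 7
scaled_ciphertext = pow(c_a, 3, N2)    # encrypts 3 * 5
```

Multiplying ciphertexts adds the plaintexts, and raising a ciphertext to a power multiplies the plaintext by that power; the second property is the one the filtering step of the protocol uses.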
The Damgård and Jurik [9] public-key encryption scheme, a length-flexible variant of Paillier's encryption scheme, will be used when the size of $\hat{s}_i \in \hat{S}$ is large (e.g., $|\hat{s}_i| > N$).
- The public key is a $2k$-bit RSA modulus $N = pq$, where $p, q$ are two large safe primes. The plaintext space is $Z_{N^s}$ and the ciphertext space is $Z^*_{N^{s+1}}$. The private key is $(p, q)$ and the public key is $(N, s)$, where $s \geq 1$.
- To encrypt $m \in Z_{N^s}$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext $c$ as $E_{pk}(m, r) = (1 + N)^m r^{N^s} \bmod N^{s+1}$.
- To decrypt a ciphertext $c = (1 + N)^m r^{N^s} \bmod N^{s+1}$, the decryption algorithm $D_{sk}$ first computes $c' = c \bmod N$ and then uses the secret key $\phi(N)$ to calculate $r \in Z^*_N$; once given $r$, $D_{sk}$ outputs the message $m \in Z_{N^s}$ accordingly.
The plaintext space of Damgård and Jurik's public-key encryption scheme is flexible and thus enables us to encrypt documents of any length, say $N < |\hat{s}_i| < N^s$. We will not distinguish Paillier's encryption from the Damgård and Jurik length-flexible encryption scheme. Both schemes are denoted by $(E_{pk}(), D_{sk}())$ uniformly throughout the paper.
IV. PRIVATE QUERY PROCESSING PROTOCOL
In this section, an implementation of a private query processing protocol in the mapping-reducing-filtering framework is introduced and formalized. We show that the proposed query protocol is semantically secure if the underlying homomorphic public-key encryption scheme is semantically secure.
A. The description
1) The input and output: The input of the query processor is a private keyword set $K = (k_1, \ldots, k_{|K|})$. The input of the possibly corrupted server $C$ is a document set $S_C = (s_1, \ldots, s_n)$. The output of the query processor is a document set $S_K \subseteq S_C$, where $S_K = \{\hat{s}_i \mid (\hat{k}_i, \hat{s}_i) \in \hat{M}, \hat{k}_i \in K\}$. The output of the server is $\perp$.
2) Common reference keyword set: The common reference string of the query processing protocol consists of a common reference keyword set $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_l)$ and a public description of an inverted index program. We assume that $K \subseteq \hat{K} \subseteq D$.
3) An initialization algorithm: The initialization algorithm $\mathcal{I}$ comprises two PPT algorithms: a key generation algorithm $\mathcal{I}_1$ for an additively homomorphic encryption scheme (an instance of Paillier's encryption scheme throughout the paper) and a keyword hiding algorithm $\mathcal{I}_2$. The details of the algorithms are described below.
- On input a security parameter $1^k$, the query processor invokes $\mathcal{I}_1$ to generate two large safe prime numbers $p$ and $q$ such that $|p| = |q| = k$. Let $N = pq$, $pk = (N, s)$ ($s \geq 1$) and $sk = (p, q)$. Let $E_{pk}()$ be Paillier's encryption scheme defined over $Z_N$ and $D_{sk}()$ be the corresponding decryption algorithm.
- On input $\hat{K}$ and $K$, the query processor invokes $\mathcal{I}_2$ to generate a ciphertext set $\hat{W}$ of $\hat{K}$ projected on $K$, where $\hat{W} = \{\hat{w}_1, \ldots, \hat{w}_l\}$ and $\hat{w}_i = E_{pk}(1, r_w)$ if $\hat{k}_i \in K$ and $\hat{w}_i = E_{pk}(0, r_w)$ if $\hat{k}_i \in \hat{K} \setminus K$, $i = 1, \ldots, l$.
4) An inverted index program: On input $S_C$, the inverted index program is invoked to generate an inverted index $\hat{M}$ below:
$$\hat{M} = \begin{pmatrix} \hat{k}_1 : & s_{1,1}, \ldots, s_{1,\beta_1} \\ \hat{k}_2 : & s_{2,1}, \ldots, s_{2,\beta_2} \\ \vdots & \vdots \\ \hat{k}_l : & s_{l,1}, \ldots, s_{l,\beta_l} \end{pmatrix}$$
5) A filtering algorithm: The filtering algorithm $\mathcal{F}$ comprises three algorithms: a collection algorithm $\mathcal{F}_0$, a buffer encoding algorithm $\mathcal{F}_1$ and a buffer decoding algorithm $\mathcal{F}_2$. On input a query $Q_K$, the filtering algorithm performs the following computations:
- For $i = 1, \ldots, l$, the collection algorithm $\mathcal{F}_0$ is invoked to construct a temporary collection $c(i) = \hat{w}_i^{\hat{s}_i}$, where $\hat{s}_i = (s_{i,1}, \ldots, s_{i,\beta_i})$ for $i = 1, \ldots, l$.
- For $i = 1, \ldots, l$, let $u(i) \leftarrow \hat{k}_i$ and $v(i) \leftarrow c(i)$. Given $(u(i), v(i))$, the encoding algorithm $\mathcal{F}_1$ is invoked to throw $k$ copies of $v(i)$ into the bins of the $(k, m)$-Bloom Filter at locations $\{h_j(u(i))\}_{j=1}^{k}$. Let $B$ be the current state of the $(k, m)$-Bloom Filter.
- Given $B$, the decoding algorithm $\mathcal{F}_2$ is invoked to compute the locations $h_j(u(i))$ ($1 \leq j \leq k$, $1 \leq i \leq l$) and to check that each specified location stores some data. If some specified location is empty, $\mathcal{F}_2$ outputs 0, indicating the failure of the buffer storage. In case that the output is 1, $\mathcal{F}_2$ checks that $u(i) = \hat{k}_i$; if the check is valid, $\mathcal{F}_2$ decrypts $c(i)$ to obtain $\hat{s}_i$.
- Given $\hat{S}_K$, the query processor computes $S$ from the Boolean logic expression $Q_K(\hat{s}_1, \ldots, \hat{s}_{|K|})$.
This ends the description of the query processing protocol.
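The effect of the collection step, raising each keyword ciphertext to the power of its posting list, can be seen in a toy run. The sketch assumes a hypothetical integer packing of posting lists (`encode_postings`, 8 bits per document ID, IDs between 1 and 255), toy primes that are far too small for real use, and the common decryption shortcut for $g = 1 + N$ that raises the ciphertext to $\phi(N)$. The Bloom filter buffer is omitted so that only the homomorphic filtering is shown.

```python
import math
import random

# Toy Paillier parameters (assumption): not safe primes, not secure sizes.
p, q = 1789, 1861
N = p * q
N2 = N * N
phi = (p - 1) * (q - 1)


def encrypt(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    u = pow(c, phi, N2)                          # u = 1 + (m * phi mod N) * N
    return ((u - 1) // N) * pow(phi, -1, N) % N  # strip the factor phi


def encode_postings(doc_ids, width=8):
    """Pack document IDs (1..255) into one integer; a hypothetical encoding."""
    value = 0
    for d in doc_ids:
        value = (value << width) | d
    return value


def decode_postings(value, width=8):
    ids = []
    while value:
        ids.append(value & ((1 << width) - 1))
        value >>= width
    return ids[::-1]


reference_keywords = ["cat", "dog", "owner"]          # public reference set
inverted_index = {"cat": [1, 3], "dog": [2], "owner": [1, 2]}
private_keywords = {"cat"}                            # known only to the querier

# Keyword hiding: encrypt 1 for private keywords, 0 for the rest.
w_hat = {kw: encrypt(1 if kw in private_keywords else 0) for kw in reference_keywords}

# Collection: c(i) = w_hat_i ^ s_hat_i, computable without the secret key.
c = {kw: pow(w_hat[kw], encode_postings(inverted_index[kw]), N2) for kw in reference_keywords}

# Decoding: matched keywords decrypt to their posting list, others to 0.
recovered = {kw: decode_postings(decrypt(c[kw])) for kw in reference_keywords}
```

An encryption of 1 raised to the power $\hat{s}_i$ decrypts to $\hat{s}_i$, while an encryption of 0 decrypts to 0, so only the posting lists of the private keywords survive, and the party computing `c` never learns which ones they are.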
B. The correctness
Before proving the security of the scheme, we show the correctness of the private query processing protocol. Let $l$ be the number of documents $(\hat{s}_1, \ldots, \hat{s}_l)$ generated by the inverted index program. Each document has $k$ copies that are thrown into $k$ out of $m$ bins randomly specified by the hash functions $\{h_j\}_{j=1}^{k}$. Thus a total of $kl$ documents are thrown into $m$ bins. Borrowing the notation from [12], [13], we call a document $\hat{s}_i$ a color $C_i$ ($i = 1, \ldots, l$) and call a copy of the color $C_i$ a ball $B(i, j)$, where $j = 1, \ldots, k$. Thus, a total of $kl$ balls are thrown into $m$ bins. We say that a ball survives if no other ball lands in its bin, and that a color $C_i$ survives if at least one ball of color $C_i$ survives. We say that the color-survival game succeeds if all $l$ colors survive; otherwise, we say that it fails.
Let $E$ be the event that a single specified ball survives this process. Then $\Pr[E] = \left(\frac{m-1}{m}\right)^{kl-1} > \frac{1}{\sqrt{e}}$, assuming that $m \geq 2kl$. Let $E_j$ be the event that the $j$-th ball of a certain color does not survive. Then the probability that all $k$ balls of this color do not survive is $\Pr[\bigcap_{j=1}^{k} E_j] \leq \left(1 - \frac{1}{\sqrt{e}}\right)^k < (1/2)^k$. Let $E^*$ be the event that at least one of the colors does not survive and $E_i^*$ be the event that the color $C_i$ does not survive. Then $\Pr[E^*] \leq \Pr[\bigcup_{i=1}^{l} E_i^*] \leq \sum_{i=1}^{l} \Pr[E_i^*] \leq \frac{l}{2^k}$, which is clearly negligible in $k$. This means that all colors survive with overwhelming probability, and hence all $l$ documents $(\hat{s}_1, \ldots, \hat{s}_l)$ are retrievable from our Bloom Filter with storage with overwhelming probability.
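The bounds in this argument are easy to evaluate numerically. The following sketch plugs in sample parameters ($k = 40$, $l = 1000$ and $m = 2kl$, all chosen here purely for illustration) and computes the single-ball survival probability and the union bound $l / 2^k$; the function names are hypothetical.

```python
import math


def single_ball_survival(k, l, m):
    """Probability that none of the other k*l - 1 balls lands in a given ball's bin."""
    return ((m - 1) / m) ** (k * l - 1)


def failure_bound(k, l):
    """Union bound l / 2^k on the probability that some color does not survive."""
    return l / 2 ** k


k, l = 40, 1000          # sample parameters (assumption)
m = 2 * k * l            # smallest m allowed by the condition m >= 2kl

p_ball = single_ball_survival(k, l, m)
per_color_failure = (1 - 1 / math.sqrt(math.e)) ** k
bound = failure_bound(k, l)
```

With these parameters the failure bound is already below one in a billion, and each increment of $k$ halves it, at the cost of a buffer that grows linearly in $k$.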
C. The proof of security
Theorem 1: The query processing protocol described above is semantically secure assuming that the underlying Paillier's encryption is semantically secure.
Proof: Suppose there exists an adversary $\mathcal{A}$ that can gain a non-negligible advantage $\epsilon$ in our semantic security game from Definition 3. We will show that $\mathcal{A}$ can be used to gain an advantage in breaking the semantic security of the underlying public-key encryption scheme.
A challenger $\mathcal{S}$ is first given an encryption $c$ of a message $m_b \in \{0, 1\}$ chosen uniformly at random, i.e., $c = E_{pk}(m_b)$ (note that the challenger $\mathcal{S}$ is also given the public key $pk$ but not the secret key $sk$ of the underlying Paillier's encryption scheme).
The challenger $\mathcal{S}$ is also given two sets of keywords $K_0 \subseteq \hat{K}$ and $K_1 \subseteq \hat{K}$ such that $K_0 \neq K_1$. The challenger $\mathcal{S}$ now generates a ciphertext set $\hat{W}$ of the given reference keyword set $\hat{K}$ by the following procedure: a re-randomized encryption $E_{pk}(0)$ if $w \in \hat{K} \setminus K_b$, and $E_{pk}(0) \cdot c$ if $w \in K_b$.
- If $m_b = 1$, then the construction of the MapReduce Filter is exactly the same as that of the real protocol described above; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2 + \epsilon$.
- If $m_b = 0$, then the simulated MapReduce Filter searches for nothing; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2$.
The challenger $\mathcal{S}$ now outputs what the adversary outputs. Since $m_b$ is uniform, $\mathcal{S}$ succeeds with probability $\frac{1}{2}\left(\frac{1}{2} + \epsilon\right) + \frac{1}{2} \cdot \frac{1}{2} = 1/2 + \epsilon/2$; that is, $\mathcal{S}$ obtains the non-negligible advantage $\epsilon/2$ in breaking the semantic security of Paillier's encryption. □
V. CONCLUSION
We have implemented a private query processing protocol on an inverted index program and have shown that the
proposed protocol is semantically secure if the underlying
homomorphic public-key encryption scheme is semantically
secure. The formalization of the private query processing
protocol is general and we expect more applications can be
deployed within the proposed framework.
REFERENCES
[1] M.Armbrust, A.Fox, R.Griffith, A.D.Joseph, R.H.Katz, A.Konwinski, G.Lee, D.A.Patterson, A.Rabkin, I.Stoica, M.Zaharia: Above the Clouds: A Berkeley View of Cloud Computing, Technical Report No. UCB/EECS-2009-28
[2] D.Boneh, E.Kushilevitz, R.Ostrovsky and W.E.Skeith III: Public Key
Encryption That Allows PIR Queries. CRYPTO 2007: 50-67
[3] J.Bethencourt, D.X.Song and B.Waters: New Constructions and Practical
Applications for Private Stream Searching (Extended Abstract). IEEE
Symposium on Security and Privacy 2006: 132-139
[4] John Bethencourt, Dawn Xiaodong Song, Brent Waters: New Techniques
for Private Stream Searching. ACM Trans. Inf. Syst. Secur. 12(3): (2009)
[5] S.Ding, J.Attenberg and T.Suel: Scalable techniques for document identifier assignment in inverted indexes. WWW 2010: 311-320
[6] J.Dean and S.Ghemawat: MapReduce: Simplified Data Processing on
Large Clusters. OSDI 2004: 137-150
[7] J.Dean and S.Ghemawat: MapReduce: simplified data processing on large
clusters. Commun. ACM 51(1): 107-113 (2008)
[8] J.Dean and S.Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010)
[9] I.Damgård and M.Jurik: A Generalisation, a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. Public Key Cryptography 2001: 119-136.
[10] R.L.Grossman. The Case for Cloud Computing. IT Professional 11(2):
23-27 (2009)
[11] Robert L. Grossman, Yunhong Gu, Michal Sabala, Wanzhi Zhang:
Compute and storage clouds using wide area high performance networks.
Future Generation Comp. Syst. 25(2): 179-183 (2009)
[12] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
CRYPTO 2005: 223-240
[13] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
J. Cryptology 20(4): 397-430 (2007)
[14] P.Paillier: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. EUROCRYPT 1999: 223-238.
[15] S.Petrovic and P.Brown: Large Scale Analysis of the eDonkey P2P File
Sharing System. INFOCOM 2009: 2746-2750
[16] H.Wan, C.Tan and Q.Li: Snoogle: A Search Engine for the Physical
World. INFOCOM 2008: 1382-1390
[17] H.Yan, S.Ding and T.Suel: Compressing term positions in web indexes.
SIGIR 2009: 147-154
[18] H.Yan, S.Ding and T.Suel: Inverted index compression and query
processing with optimized document ordering. WWW 2009: 401-410