Private Query Processing on Inverted Index

Wee Keong Ng
Yonggang Wen
School of Computer Engineering
NTU, Singapore
School of Computer Engineering
NTU, Singapore
Abstract—A private query criteria 𝑄𝐾 is a Boolean logic
expression in ∧, ∨ and ¬ of an input set 𝐾. A private query
processing protocol takes as input a private query criteria 𝑄𝐾
and a public data set 𝑑𝐶 and outputs a document 𝑑 ∈ 𝑑𝐶
such that 𝑄𝐾 (𝑑) =1. This paper studies private query processing
protocols in the context of inverted index programs and makes
the following 3-fold contributions:
1) First, a new notion of private query processing protocols defined over inverted index programs within the Mapping-Reducing-Filtering framework is introduced and formalized. Our formalization is general and can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes.
2) Second, a new implementation of private query processing protocols based on $(m, n)$-Bloom filters with storage and additively homomorphic public-key encryption is proposed. The idea behind our implementation is that a map function Map is activated to generate a matrix $M$ of form $(\text{document}_j : \text{word}_{j,1}, \ldots, \text{word}_{j,n})$, $j = 1, \ldots, m$. The reduce function Reduce is then invoked to generate an inverted index $\hat{M}$ of form $(\text{keyword}_i : \text{document}_{i,1}, \ldots, \text{document}_{i,\alpha_i})$. Finally, an $(m, n)$-Bloom filter with storage is activated to generate the matched documents according to the specified query criterion.
3) Third, we show that the proposed query processing protocol on the inverted index is semantically secure assuming that the underlying additively homomorphic public-key encryption is semantically secure.
To the best of our knowledge, this is the first semantically secure query processing protocol defined over inverted index programs, and we expect more applications to be deployed within this framework.
I. INTRODUCTION
The task of retrieving commercial data in the presence of malicious adversaries falls into the general field of private information retrieval (PIR), which has been well studied to date [12], [13], [3], [4]. For example, Ostrovsky and Skeith [12], [13] proposed solutions for private searching on streaming data, where a client 𝑃 queries whether a server stores data containing a keyword key and, if some data contains key, 𝑃 would like to obliviously retrieve this data such that the corrupted server learns nothing about the specified keyword or about which data is retrieved. We argue, however, that the Ostrovsky-Skeith (OS) protocols may not work efficiently in certain applications. For mining data sets in a Pet Identification scenario, a retriever may be interested in the owner information of a pet rather than other information. As a result, many words in a stored document 𝑑 can be ignored during the course of PIR.
978-1-4673-0279-1/11/$26.00 ©2011 IEEE
Huafei Zhu
CAS, I2R
A*STAR, Singapore
We do not claim that general PIR does not work for mining data sets; rather, we emphasize that the general PIR technique may not work efficiently (this argument applies to the results presented in [3], [4] as well). One can therefore expect more efficient solutions to these problems than the general methods presented in [12], [13], [3], [4]. Since no known result deals with the motivating problem above, we formalize an interesting research problem below:
On input a private query criterion 𝑄𝐾 and a public document set 𝑑𝐶 stored on a possibly corrupted server 𝐶, how can one implement an efficient query processing protocol that outputs a subset of documents 𝑑 ⊆ 𝑑𝐶 satisfying 𝑄𝐾 (𝑑) = 1, while 𝐶 learns nothing about the specified criterion 𝑄𝐾 or about which 𝑑 is retrieved from 𝑑𝐶 ?
A. This work
This paper intends to provide an efficient implementation of private query processing protocols in the context of inverted index programs. To help the reader understand the idea of our implementation, we first sketch the basic notions of MapReduce and the inverted index and then provide a high-level description of our implementation.
MapReduce: MapReduce, introduced by Dean and Ghemawat [6], [7], [8], is a programming model whose programs are automatically parallelized and executed on a large cluster of commodity machines. A MapReduce program, which supports distributed computing on large data sets, consists of two functions: a map function and a reduce function. The map function Map transforms a piece of data into (key, value) pairs, whereas the reduce function Reduce merges the values emitted for the same key into a single result (key: value1, . . ., value𝑛). The MapReduce model is general, and many interesting programs can easily be expressed as MapReduce computations: distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, distributed sort and inverted index [6], [7], [8]. Applications of MapReduce in Cloud Computing scenarios are discussed in [1], [10], [11].
Inverted index: An inverted index program is an instance of MapReduce that stores, for each keyword occurring somewhere in the collection, information about the locations where it occurs. The map function in the inverted index problem parses each document and emits a sequence of <documentID, word> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(documentID)> pair. The set of all output pairs forms an inverted index. Numerous applications of inverted index programs have been introduced so far; we refer the reader to [5], [17], [18], [15], [16] for further reference.
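The map and reduce steps of an inverted index program can be sketched in a few lines of Python. This is a minimal single-process illustration, not the distributed MapReduce runtime of [6]; the function names and the toy document collection are assumptions made for the example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (doc_id, word) pair per distinct word in the document."""
    for word in sorted(set(text.lower().split())):
        yield doc_id, word


def reduce_pairs(pairs):
    """Reduce step: group pairs by word and emit word -> sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, word in pairs:
        postings[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in postings.items()}


def build_inverted_index(documents):
    """Run the map step over every document, then reduce all emitted pairs."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_document(doc_id, text))
    return reduce_pairs(pairs)


# Hypothetical toy collection in the spirit of the Pet Identification scenario.
documents = {
    1: "rex dog owner alice",
    2: "tom cat owner bob",
    3: "fido dog owner alice",
}
index = build_inverted_index(documents)
print(index["dog"])  # document IDs of the documents containing "dog"
```

In a real deployment the pairs emitted by the map step would be shuffled across machines by key before the reduce step runs; here both steps run in one process only to show the data flow.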
A high-level description of our implementation: Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary such that $|D| = \alpha < \infty$. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor, while $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ is a set of keywords selected by the inverted index program. We assume that $K \subseteq \hat{K} \subseteq D$. Let $d_C = (d_1, \ldots, d_l)$ be a document set stored in a server $C$. Our implementation is sketched below.
1) Inverted index generation procedure: on input $d_C$, the query processor invokes the MapReduce function to perform the following computations:
∙ invoking a map function Map with input $(d_1, \ldots, d_l)$ and $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ to generate a matrix $M$ of form (document, keyword) below
\[
M = \begin{pmatrix}
d_1 : \hat{k}_{1,1}, \ldots, \hat{k}_{1,\alpha_1} \\
d_2 : \hat{k}_{2,1}, \ldots, \hat{k}_{2,\alpha_2} \\
\vdots \\
d_l : \hat{k}_{l,1}, \ldots, \hat{k}_{l,\alpha_l}
\end{pmatrix}
\]
∙ invoking a reduce function Reduce with input matrix $M$ to generate the inverted index $\hat{M}$
\[
\hat{M} = \begin{pmatrix}
\hat{k}_1 : d_{1,1}, \ldots, d_{1,\beta_1} \\
\hat{k}_2 : d_{2,1}, \ldots, d_{2,\beta_2} \\
\vdots \\
\hat{k}_\lambda : d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda}
\end{pmatrix}
\]
2) Document filtering procedure: on input $\hat{M}$, the query processor invokes an additively homomorphic encryption scheme (say, Paillier's encryption scheme) to filter all documents containing keywords in $K$ by the following processing:
∙ The query processor encodes $\hat{K}$ by invoking Paillier's homomorphic encryption scheme $E_{pk}()$ to generate ciphertexts $\{\hat{w}_i\}_{i=1}^{\lambda}$ of the set $\hat{K}$. Let $\hat{w} = E_{pk}(1, r_k)$ if $k \in K$ and $\hat{w} = E_{pk}(0, r_k)$ if $k \in \hat{K} \setminus K$.
∙ Let $\hat{W} = \{\hat{w}_i\}_{i=1}^{\lambda}$. Let $\hat{d}_i = (d_{i,1}, \ldots, d_{i,\beta_i})$ for $i = 1, \ldots, \lambda$. The query processor generates $c_i \leftarrow \hat{w}_i^{\hat{d}_i}$, and then randomly distributes $m$ copies of $c_i$ into the $n$ bins of the $(m, n)$-Bloom Filter.
Note that we can adjust the system parameter $s \geq 1$ in Paillier's encryption to guarantee that the message space is sufficiently large for encrypting each plaintext $\hat{d}_i$. See Section 3 for more details.
3) Basis retrieving procedure: the query processor retrieves the matched document set $\hat{d}_K$ from the $n$ bins of the $(m, n)$-Bloom Filter, where $\hat{d}_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. By the correctness of the protocol in Section 4, the basis is retrievable from the $(m, n)$-Bloom Filter with Storage with overwhelming probability.
4) Off-line document processing procedure: given $\hat{d}_K$, the query processor computes $d$ from the specified Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$ by substituting each $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$.
This ends a brief description of our protocol. Clearly, the query processor in our model only needs to generate a set of ciphertexts $\{\hat{w}_i\}_{i=1}^{|\hat{K}|}$ of the common reference keyword set $\hat{K}$ projected on the specified private keyword set $K$. The retrieved basis $\hat{d}_K$ for the Boolean criterion $Q_K$ is sufficient for the query processor to compute $d = Q_K(\hat{d}_K)$. Consequently, the computational complexity of the query protocol is significantly reduced in case that $|\hat{K}| \ll |d|$: the size of the ciphertext set generated during private query processing is reduced to $|\hat{K}|$ rather than $|d|$ as in the OS protocols.
B. The novelty of our query protocol
One can see that the mapping-reducing-filtering model is general in the following sense. If the MapReduce keyword set $\hat{K}$ is the whole dictionary $D$ and the map function Map and the reduce function Reduce are both dummy, then the mapping-reducing-filtering model reduces to the MapReduce-free, filtering-only model [12], [13]; if only the reduce function is dummy, then the reduced model is equivalent to the mapping-filtering model. As a result, our framework can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes (see [5], [17], [18], [15], [16] for details). The mapping-reducing-filtering model also allows an inverted index to select a common reference word set independently of the selection of a private keyword set and of the dictionary $D$. Such a model therefore allows us to avoid encrypting every word in the dictionary, as the protocols presented in [12], [13] do, and thus a private query processing protocol defined in the mapping-reducing-filtering model is much more efficient and flexible than one defined in the MapReduce-free, filtering-only model or in the mapping-filtering model.
C. The result
We remark that the private query processing protocol described above guarantees that non-matching documents are not collected, while matching documents are collected with overwhelming probability. We show that the query processing protocol proposed in this paper is privacy-preserving assuming that the underlying additively homomorphic public-key encryption is semantically secure. To the best of our knowledge, this is the first private query processing protocol defined over the inverted index program in the Mapping-Reducing-Filtering framework. We therefore expect more applications to be deployed within this framework.
Roadmap: The rest of this paper is organized as follows. The syntax and security definitions of query processing protocols are introduced and formalized in Section 2. In Section 3, the building blocks, namely an additively homomorphic public-key encryption scheme (say, Paillier's encryption scheme) and (𝑚, 𝑛)-Bloom Filters with Storage, are sketched. An implementation of the private query processing protocol is described and analyzed in Section 4. We conclude this work in Section 5.
II. PRIVACY-PRESERVING QUERY PROCESSING ON INVERTED INDEX
A. Syntax
Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor $P$. $P$ keeps $K$ secret (hence $K$ is called a private keyword set). Let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ be a set of keywords selected by the inverted index. The keyword set $\hat{K}$ is publicly known (hence $\hat{K}$ is called a common reference keyword set). We assume that $K \subseteq \hat{K} \subseteq D$.
Let $\mathcal{Q}$ be a class of query types. A query type in $\mathcal{Q}$ could be a class of logical expressions in ∧, ∨ and ¬. Let $d_C = (d_1, \ldots, d_l)$ be $l$ documents stored in a server $C$. Given a set of keywords $K \subset D$ and a query $Q \in \mathcal{Q}$, we define $Q_K : d \to \{0, 1\}$, which takes a subset of documents $d$ as input and returns 1 if and only if $d$ matches the criterion.
Definition 1: (Syntax of query processing) For a query 𝑄𝐾
over a set of keywords 𝐾, and for a subset 𝑑 ⊆ 𝑑𝐶 , we say 𝑑
matches query 𝑄𝐾 if and only if 𝑄𝐾 (𝑑) =1.
To compute $d$ from a given set $d_K = \{d_1, \ldots, d_\gamma\}$, the query processor first maps the private query criterion $Q_K$ to $d_K$ by the following procedure:
∙ $k_i \wedge k_j$ if and only if $d_i \wedge d_j$;
∙ $k_i \vee k_j$ if and only if $d_i \vee d_j$;
∙ $\neg k_i$ if and only if $\neg d_i$. Note that since $\neg k_i = K \setminus \{k_i\}$, it follows that $\neg d_i = d_K \setminus \{d_i\}$.
The query processor then substitutes each $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$. Finally, the query processor obtains $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.
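The substitution step can be illustrated with ordinary set operations. The sketch below is only one possible reading and rests on assumptions that the text does not fix: a query is written as a nested tuple, ∧ and ∨ are interpreted as set intersection and union of document IDs, and ¬ is interpreted as the complement relative to the union of all retrieved document sets. The names `evaluate`, `basis` and `query` are hypothetical.

```python
def evaluate(query, basis):
    """Evaluate a Boolean query over keyword -> set-of-document-IDs mappings.

    query is either a keyword string or a tuple:
    ("and", q1, q2), ("or", q1, q2) or ("not", q1).
    """
    # Assumption: negation is taken relative to all retrieved documents.
    universe = set().union(*basis.values())

    def walk(node):
        if isinstance(node, str):
            return set(basis[node])      # substitute k_i by its document set
        op = node[0]
        if op == "and":
            return walk(node[1]) & walk(node[2])
        if op == "or":
            return walk(node[1]) | walk(node[2])
        if op == "not":
            return universe - walk(node[1])
        raise ValueError(f"unknown operator: {op}")

    return walk(query)


# Hypothetical retrieved basis: one document set per private keyword.
basis = {"dog": {1, 3}, "alice": {1, 3, 4}, "cat": {2}}
query = ("and", "alice", ("not", "dog"))
result = evaluate(query, basis)
print(result)  # documents mentioning "alice" but not "dog"
```

Because this step runs entirely on the query processor's side after retrieval, it needs no cryptography, which is why the security definition can be isolated from it.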
B. The correctness
The correctness of a query processing protocol means that matched documents must be saved with overwhelming probability, while non-matched documents are saved with only negligible probability. That is, the buffer decryption algorithm can distinguish collisions in the buffer from valid documents.
Definition 2: (Correctness of query processing protocol) Let $\nu(k)$ be a negligible function and $k$ be a security parameter. Let $d_C$ be the available documents stored at a server $\mathcal{C}$. Let $B^*$ be a subset of the matching documents. We say that a query processing protocol is correct if
\[ \Pr[B^* = \{d \in d_C \mid Q_K(d) = 1\}] > 1 - \nu(k). \]
C. The security
Since the off-line document processing procedure is performed by the query processor itself, the definition of security of a private query processing protocol should be isolated from the off-line processing procedure. We therefore consider the following game between an adversary $\mathcal{A}$ and a challenger $\mathcal{S}$.
∙ $\mathcal{S}$ first invokes the key generation algorithm of an additively homomorphic encryption scheme to obtain $(pk, sk)$, and then sends $pk$ to $\mathcal{A}$;
∙ $\mathcal{A}$ chooses two queries for two sets of keywords $K_0, K_1 \subset \hat{K}$, and sends $(Q_{K_0}, Q_{K_1})$ to $\mathcal{S}$;
∙ $\mathcal{S}$ chooses a bit $b$ and invokes the encryption scheme to generate $\hat{W}_b$, where $\hat{W}_b = \{\hat{w}_{b,1}, \ldots, \hat{w}_{b,\lambda}\}$ and $\hat{w}_{b,j} = E_{pk}(1, r_{b,j})$ if $\hat{k}_j \in K_b$ and $\hat{w}_{b,j} = E_{pk}(0, r_{b,j})$ if $\hat{k}_j \in \hat{K} \setminus K_b$, $j = 1, \ldots, \lambda$;
∙ $\mathcal{S}$ creates an instance of the filtering algorithm with $\hat{W}_b$, and sends the state $B$ of the underlying Bloom filter to $\mathcal{A}$;
∙ $\mathcal{A}$ can experiment with the code of $B$ and finally outputs $b' \in \{0, 1\}$.
The adversary $\mathcal{A}$ wins the game if $b' = b$ and loses otherwise. We define the adversary's advantage in this game to be
\[ \mathrm{Adv}_{\mathcal{A}}(k) = \left| \Pr(b' = b) - 1/2 \right|. \]
Definition 3: (Semantic security of query processing protocols) A query processing protocol is semantically secure if, for any adversary $\mathcal{A}$ described in the above game, $\mathrm{Adv}_{\mathcal{A}}(k)$ is a negligible function, where the probability is taken over the coin tosses of the challenger and the adversary.
III. BUILDING BLOCKS
This section sketches the building blocks that will be used to construct private query processing protocols on an inverted index program: $(m, n)$-Bloom Filters and additively homomorphic public-key encryption schemes.
A. $(m, n)$-Bloom Filter with Storage
An $(m, n)$-Bloom Filter consists of an array of $n$ bits $B[1], \ldots, B[n]$, initially set to 0, together with $m$ independent random hash functions $h_1, \ldots, h_m$ with range $[1, \ldots, n]$. This work uses a variation of a Bloom Filter, called an $(m, n)$-Bloom Filter with Storage, first introduced and formalized in [2].
Definition 4: An $(m, n)$-Bloom Filter with Storage is a collection $\{h_i\}_{i=1}^{m}$ of functions together with a collection of sets $\{B_j\}_{j=1}^{n}$, where $h_i : \{0, 1\}^* \to [1, \ldots, n]$. To insert a pair $(u, v)$ into this structure, $v$ is added to $B_{h_i(u)}$ for all $i \in [m]$, where $[m] = [1, \ldots, m]$. To determine whether or not $v$ is stored for an element $u$ of a set $U$, one checks whether $v \in B_{h_i(u)}$ for every $i \in [m]$ and returns true if all checks are valid.
As usual, we model the $h_i$ as uniform, independent randomness. For each $u \in U$, we define $H_u = \{h_i(u) \mid i \in [m]\}$. The correctness of our construction relies on the following lemma due to Boneh et al. [2].
Lemma 1: Let $(\{h_i\}_{i=1}^{m}, \{B_j\}_{j=1}^{n})$ be an $(m, n)$-Bloom Filter with Storage. Suppose the filter has been initialized to store some set $U$ of size $|U|$ and associated values. Suppose also that $n = \lceil c m |U| \rceil$, where $c > 1$ is a constant. Denote the relationship of element-value associations by $R(\cdot, \cdot)$. Then for any $u \in U$, the following statements hold true with probability $1 - \mathrm{neg}(k)$, where the probability is over the uniform randomness used to model the $h_i$ and $\mathrm{neg}(k)$ is a negligible function:
1) $u \in U$ if and only if ($B_{h_i(u)} \neq \emptyset$, $\forall i \in [m]$);
2) $\bigcap_{i \in [m]} B_{h_i(u)} = \{v \mid R(u, v) = 1\}$.
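To make Definition 4 and Lemma 1 concrete, the following is a minimal Python sketch of an (m, n)-Bloom Filter with Storage. Deriving the m hash functions from SHA-256 with an index prefix is an assumption made for illustration (the text only models them as uniform and independent), the class and method names are hypothetical, and bins are indexed from 0 rather than 1.

```python
import hashlib


class BloomFilterWithStorage:
    """Toy (m, n)-Bloom Filter with Storage: n bins, each holding a set of values."""

    def __init__(self, m, n):
        self.m = m
        self.n = n
        self.bins = [set() for _ in range(n)]

    def _locations(self, u):
        # Assumed instantiation of h_1, ..., h_m: SHA-256 of "i:u" reduced mod n.
        locs = []
        for i in range(self.m):
            digest = hashlib.sha256(f"{i}:{u}".encode()).digest()
            locs.append(int.from_bytes(digest, "big") % self.n)
        return locs

    def insert(self, u, v):
        """Add v to the bin B[h_i(u)] for every i in [m]."""
        for loc in self._locations(u):
            self.bins[loc].add(v)

    def contains(self, u):
        """Item 1 of Lemma 1: u is reported as stored iff every bin B[h_i(u)] is non-empty."""
        return all(self.bins[loc] for loc in self._locations(u))

    def lookup(self, u):
        """Item 2 of Lemma 1: intersect the bins B[h_i(u)] to recover the values of u."""
        return set.intersection(*(self.bins[loc] for loc in self._locations(u)))


bf = BloomFilterWithStorage(m=4, n=64)
bf.insert("dog", "c1")
bf.insert("cat", "c2")
print(bf.contains("dog"), "c1" in bf.lookup("dog"))
```

The intersection in `lookup` always contains the values inserted for `u`; it can contain additional values only when another element collides with `u` on all m locations, which is the event Lemma 1 bounds.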
B. Additively homomorphic encryption scheme
Paillier investigated a novel computational problem called the composite residuosity class problem (CRS), and its applications to public-key cryptography, in [14]. The decisional composite residuosity class problem is the following: given $z \in_r Z^*_{N^2}$, decide whether $z$ is an $N$-th residue or a non-$N$-th residue. The decisional composite residuosity class assumption states that there exists no polynomial-time distinguisher for $N$-th residues modulo $N^2$.
Paillier's encryption scheme: The public key is a $2k$-bit RSA modulus $N = pq$, where $p, q$ are two large safe primes of length $k$, and the secret key is $(p, q)$. The plaintext space is $Z_N$ and the ciphertext space is $Z^*_{N^2}$. To encrypt a message $m \in Z_N$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext as $E_{PK}(m, r) = g^m r^N \bmod N^2$, where $g = (1 + N)$ has order $N$ in $Z^*_{N^2}$. To decrypt a ciphertext $c = (1 + N)^m r^N \bmod N^2$ with the help of the trapdoor information $(p, q)$, one first computes $c_1 = c \bmod N$, and then computes $r$ from the equation $r = c_1^{N^{-1} \bmod \phi(N)} \bmod N$; finally, one computes $m$ from the equation $c\, r^{-N} \bmod N^2 = 1 + mN$. Paillier's public-key cryptosystem is homomorphic, i.e., $E_{PK}(m_1, r_1) \times E_{PK}(m_2, r_2) \bmod N^2 = E_{PK}(m_1 + m_2 \bmod N, r_1 \times r_2 \bmod N)$, and it is semantically secure if the decisional composite residuosity class problem is hard. We refer the reader to [14] for more details.
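The encryption, decryption and homomorphic equations above can be checked on toy parameters. The sketch below follows the decryption route described in the text (recover $r$ first, then strip $r^N$); the primes 1009 and 1013 are hypothetical, are not safe primes, and are far too small for any real use, and the function names are illustrative. It relies on Python 3.8+ for `pow` with a negative exponent and a modulus.

```python
import math
import random


def keygen(p, q):
    """Return (public modulus N, secret key (N, phi(N))) for toy primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (n, phi)


def encrypt(n, m, r=None):
    """E(m, r) = (1 + N)^m * r^N mod N^2 with r drawn from Z_N^*."""
    n2 = n * n
    if r is None:
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2


def decrypt(sk, c):
    """Recover r from c mod N, then read m off c * r^(-N) = 1 + m*N mod N^2."""
    n, phi = sk
    n2 = n * n
    c1 = c % n
    r = pow(c1, pow(n, -1, phi), n)      # r = c1^(N^-1 mod phi(N)) mod N
    u = (c * pow(r, -n, n2)) % n2        # u = 1 + m*N
    return (u - 1) // n


n, sk = keygen(1009, 1013)
n2 = n * n

# Additive homomorphism: multiplying ciphertexts adds plaintexts.
total = decrypt(sk, encrypt(n, 5) * encrypt(n, 7) % n2)

# Exponentiation used by the filtering step: E(1)^d decrypts to d, E(0)^d to 0.
d = 1234
matched = decrypt(sk, pow(encrypt(n, 1), d, n2))
unmatched = decrypt(sk, pow(encrypt(n, 0), d, n2))
print(total, matched, unmatched)
```

The last two lines are the property the protocol exploits: raising the encrypted indicator bit to the power of an encoded document list yields an encryption of the list for a matching keyword and an encryption of 0 otherwise, without revealing which case occurred.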
The Damgård and Jurik [9] public-key encryption scheme, a length-flexible variant of Paillier's encryption scheme, will be used when the size of $\hat{d}_j \in \hat{d}$ is large (e.g., $|\hat{d}_j| > N$).
∙ The public key is a $2k$-bit RSA modulus $N = PQ$, where $P, Q$ are two large safe primes. The plaintext space is $Z_{N^s}$ and the ciphertext space is $Z^*_{N^{s+1}}$. The private key is $(P, Q)$ and the public key is $(N, s)$, where $s \geq 1$.
∙ To encrypt $m \in Z_{N^s}$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext $c$ as $E_{PK}(m, r) = (1 + N)^m r^{N^s} \bmod N^{s+1}$.
∙ To decrypt a ciphertext $c = (1 + N)^m r^{N^s} \bmod N^{s+1}$, the decryption algorithm $\mathcal{D}_{sk}$ first computes $c' = c \bmod N$ and then uses the secret key $\phi(N)$ to calculate $r \in Z^*_N$; once given $r$, $\mathcal{D}_{sk}$ outputs the message $m \in Z_{N^s}$ accordingly.
The plaintext space of Damgård and Jurik's public-key encryption scheme is flexible and thus enables us to encrypt documents of any length, say $N < |\hat{d}_j| < N^s$. We will not distinguish Paillier's encryption from the Damgård and Jurik length-flexible encryption scheme. Both schemes are denoted by $(E_{pk}(), D_{sk}())$ uniformly throughout the paper.
IV. PRIVATE QUERY PROCESSING PROTOCOL
In this section, an implementation of a private query processing protocol in the mapping-reducing-filtering framework is introduced and formalized. We show that the proposed query protocol is semantically secure if the underlying homomorphic public-key encryption scheme is semantically secure.
A. The description
1) The input and output: An input of the query processor $P$ is a private keyword set $K = (k_1, \ldots, k_\gamma)$; an input of a possibly corrupted server $S$ is a document set $d_C = (d_1, \ldots, d_l)$; an output of $P$ is a document set $d_K \subset d_C$, where $d_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. An output of $S$ is ⊥.
2) Common reference keyword set: The common reference string of the query processing protocol consists of a common reference keyword set $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ and a public description of an inverted index program. We assume that $K \subset \hat{K} \subset D$.
3) An initialization algorithm: The initialization algorithm $\mathcal{I}$ comprises two PPT algorithms: an additively homomorphic encryption generation algorithm $\mathcal{I}_1$ (an instance of Paillier's encryption scheme throughout the paper) and a keyword hiding algorithm $\mathcal{I}_2$. The details of the algorithms are described below:
∙ On input a security parameter $1^k$, $P$ invokes $\mathcal{I}_1$ to generate two large safe prime numbers $p$ and $q$ such that $|p| = |q| = k$. Let $N = pq$, $pk = (N, s)$ ($s \geq 1$) and $sk = (p, q)$. Let $E_{pk}()$ be Paillier's encryption scheme defined over $pk$ and $D_{sk}()$ be the corresponding decryption algorithm.
∙ On input $\hat{K}$ and $K$, $P$ invokes $\mathcal{I}_2$ to generate a ciphertext set $\hat{W}$ of $\hat{K}$ projected on $K$, where $\hat{W} = \{\hat{w}_1, \ldots, \hat{w}_\lambda\}$ and $\hat{w}_j = E_{pk}(1, r_w)$ if $\hat{k}_j \in K$ and $\hat{w}_j = E_{pk}(0, r_w)$ if $\hat{k}_j \in \hat{K} \setminus K$, $j = 1, \ldots, \lambda$.
4) An inverted index program: On input $d_C$, $P$ invokes the inverted index program to generate an inverted index $\hat{M}$ below
\[
\hat{M} = \begin{pmatrix}
\hat{k}_1 : d_{1,1}, \ldots, d_{1,\beta_1} \\
\hat{k}_2 : d_{2,1}, \ldots, d_{2,\beta_2} \\
\vdots \\
\hat{k}_\lambda : d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda}
\end{pmatrix}
\]
5) A filtering algorithm: The filtering algorithm $\mathcal{F}$ comprises the following three algorithms: a collection algorithm $\mathcal{F}_0$, a buffer encoding algorithm $\mathcal{F}_1$ and a buffer decoding algorithm $\mathcal{F}_2$. On input a query $Q_K$, the filtering algorithm performs the following computations:
∙ For $j = 1, \ldots, \lambda$, $P$ invokes the collection algorithm $\mathcal{F}_0$ to construct a temporary collection $c(j) = \hat{w}_j^{\hat{d}_j}$, where $\hat{d}_j = (d_{j,1}, \ldots, d_{j,\beta_j})$.
∙ For $j = 1, \ldots, \lambda$, let $u(j) \leftarrow \hat{k}_j$ and $v(j) \leftarrow c(j)$. Given $(u(j), v(j))$, $P$ invokes the encoding algorithm $\mathcal{F}_1$ to throw $m$ copies of $v(j)$ into the $n$ bins of the $(m, n)$-Bloom Filter at the locations $\{h_i(u(j))\}_{i=1}^{m}$. Let $B$ be the current state of the $(m, n)$-Bloom Filter.
∙ Given $B$, $P$ invokes $\mathcal{F}_2$ to compute the locations $h_i(u(j))$ ($1 \leq i \leq m$, $1 \leq j \leq \lambda$) and then checks that each specified location stores some data. If some specified location is empty, $\mathcal{F}_2$ outputs 0, indicating the failure of the buffer storage. If the output is 1, $\mathcal{F}_2$ checks that $u(j) = \hat{k}_j$; if the check is valid, $\mathcal{F}_2$ decrypts $c(j)$ to obtain $\hat{d}_j$.
∙ Given $\hat{d}_K$, the query processor computes $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.
This ends the description of the query processing protocol.
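The collection, encoding and decoding steps can be strung together in a short end-to-end sketch. Everything specific here is an assumption for illustration: the toy (insecure, non-safe) primes, the SHA-256-based hash locations, the encoding of a posting list as an integer bitmask, and the choice to run all steps in a single process.

```python
import hashlib
import math
import random

# Toy Paillier parameters (hypothetical and far too small to be secure).
P, Q = 1009, 1013
N, PHI = P * Q, (P - 1) * (Q - 1)
N2 = N * N


def enc(bit):
    """Paillier encryption of an indicator bit with fresh randomness."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(1 + N, bit, N2) * pow(r, N, N2) % N2


def dec(c):
    """Paillier decryption via recovery of r, as in Section III."""
    r = pow(c % N, pow(N, -1, PHI), N)
    return (c * pow(r, -N, N2) % N2 - 1) // N


def locations(keyword, m, n):
    """Assumed hash functions h_1..h_m: SHA-256 of 'i:keyword' reduced mod n."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{keyword}".encode()).digest(), "big") % n
            for i in range(m)]


def pack(doc_ids):
    """Assumed encoding of a posting list as a bitmask; the result must stay below N."""
    return sum(1 << d for d in doc_ids)


def unpack(value):
    return [d for d in range(value.bit_length()) if (value >> d) & 1]


def filter_and_retrieve(index, private_keywords, m=4, n=64):
    bins = [set() for _ in range(n)]
    for keyword, doc_ids in index.items():
        w = enc(1 if keyword in private_keywords else 0)   # keyword hiding
        c = pow(w, pack(doc_ids), N2)                      # F0: c(j) = w_j^{d_j}
        for loc in locations(keyword, m, n):               # F1: m copies into the bins
            bins[loc].add(c)
    basis = {}
    for keyword in private_keywords:                       # F2: decode the buffer
        cells = [bins[loc] for loc in locations(keyword, m, n)]
        if not all(cells):
            return None                                    # buffer storage failure
        plaintexts = {dec(c) for c in set.intersection(*cells)}
        # Non-matching entries decrypt to 0 and are dropped.
        basis[keyword] = [unpack(x) for x in plaintexts if x != 0]
    return basis


index = {"dog": [1, 3], "cat": [2], "alice": [1, 3, 4]}
basis = filter_and_retrieve(index, private_keywords={"dog", "alice"})
print(basis)
```

Each entry of `basis` is a list of candidate posting lists: in the rare event that another matching keyword collides on all m locations, more than one candidate appears, which is the failure mode bounded by the correctness analysis.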
B. The correctness
Before proving the security of the scheme, we show the correctness of the private query processing protocol. Let $\lambda$ be the number of documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ generated by the inverted index program. Each document has $m$ copies that are thrown into the $m$-out-of-$n$ bins randomly specified by the hash functions $\{h_i\}_{i=1}^{m}$. Borrowing the notation from [12], [13], we call a document $\hat{d}_j$ a color $C_j$ ($j = 1, \ldots, \lambda$) and call a copy of the color $C_j$ a ball $B(j, k)$, where $k = 1, \ldots, m$. Thus, we have a total of $m\lambda$ balls that are thrown into $n$ bins. A ball survives if no other ball lands in its bin. We say a color $C_j$ survives if at least one ball of color $C_j$ survives. We say that the color-survival game succeeds if all $\lambda$ colors survive; otherwise, we say that it fails.
Let $E$ be the event that a single specified ball survives this process. Then
\[ \Pr[E] = \left(\frac{n-1}{n}\right)^{m\lambda - 1} > \frac{1}{\sqrt{e}} \]
assuming that $n \geq 2m\lambda$. Let $E_j$ be the event that the $j$-th ball of a certain color does not survive. Then the probability that all $m$ balls of this color do not survive is
\[ \Pr\Big[\bigcap_{j=1}^{m} E_j\Big] \leq \Big(1 - \frac{1}{\sqrt{e}}\Big)^{m} < (1/2)^{m}. \]
Let $E^*$ be the event that at least one of the colors does not survive and $E_j^*$ be the event that the color $C_j$ does not survive. Then
\[ \Pr[E^*] \leq \Pr\Big[\bigcup_{j=1}^{\lambda} E_j^*\Big] \leq \sum_{j=1}^{\lambda} \Pr[E_j^*] \leq \frac{\lambda}{2^{m}}, \]
which is clearly negligible in $m$. This means that all colors survive with overwhelming probability, and hence all $\lambda$ documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ are retrievable from our Bloom Filter with Storage with overwhelming probability.
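The two bounds in the argument above are easy to evaluate numerically. The following sketch computes the survival probability of a single ball and the union bound on the failure probability; the function names and the parameter values are illustrative assumptions.

```python
import math


def single_ball_survival(m, lam, n):
    """Pr[E] = ((n - 1) / n)^(m * lam - 1): no other ball lands in the same bin."""
    return ((n - 1) / n) ** (m * lam - 1)


def failure_bound(m, lam):
    """Union bound lam * (1 - 1/sqrt(e))^m on the event that some color is lost."""
    return lam * (1 - 1 / math.sqrt(math.e)) ** m


# Hypothetical parameters: lam = 100 index entries, m = 40 copies, n = 2 * m * lam bins.
lam, m = 100, 40
n = 2 * m * lam
print(single_ball_survival(m, lam, n), 1 / math.sqrt(math.e))
print(failure_bound(m, lam), lam / 2 ** m)
```

With the number of bins fixed at $n = 2m\lambda$, the failure bound shrinks geometrically in $m$ while the buffer grows only linearly in $m$, which is the trade-off that the choice of $m$ controls.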
C. The proof of security
Theorem 1: The query processing protocol described above
is semantically secure assuming that the underlying Paillier’s
encryption is semantically secure.
Proof: Suppose there exists an adversary $\mathcal{A}$ that can gain a non-negligible advantage $\epsilon$ in the semantic security game of Definition 3. We will show that $\mathcal{A}$ can be used to gain an advantage in breaking the semantic security of the underlying public-key encryption scheme.
A challenger $\mathcal{S}$ is first given an encryption $c$ of a message $m_b \in \{0, 1\}$ chosen uniformly at random, i.e., $c = E_{pk}(m_b)$ (note that the challenger $\mathcal{S}$ is also given the public key $pk$ but not the secret key $sk$ of the underlying Paillier encryption scheme).
The challenger $\mathcal{S}$ is also given two sets of keywords $K_0 \subset \hat{K}$ and $K_1 \subset \hat{K}$ such that $K_0 \neq K_1$. The challenger $\mathcal{S}$ now generates a ciphertext set $\hat{W}$ of the given reference keyword set $\hat{K}$ by the following procedure: it uses a re-randomized encryption $E_{pk}(0)$ if $w \in \hat{K} \setminus K_b$ and $E_{pk}(0) \cdot c$ if $w \in K_b$.
∙ If $m_b = 1$, then the construction of the MapReduce Filter is exactly the same as in the real protocol described above; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2 + \epsilon$.
∙ If $m_b = 0$, then the simulated MapReduce Filter searches for nothing; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2$.
$\mathcal{S}$ now outputs what the adversary outputs. As a result, the challenger $\mathcal{S}$ succeeds with probability $1/2 + \epsilon/2$, i.e., it obtains the non-negligible advantage $\epsilon/2$ in breaking the semantic security of Paillier's encryption.
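The final quantity in the proof follows from averaging over the two equally likely cases for $m_b$. The short derivation below spells out that arithmetic; reading $1/2 + \epsilon/2$ as the challenger's success probability, so that its advantage is $\epsilon/2$, is our interpretation of the argument.

```latex
\begin{align*}
\Pr[\mathcal{S}\ \text{succeeds}]
  &= \Pr[m_b = 1]\cdot\left(\tfrac{1}{2} + \epsilon\right)
   + \Pr[m_b = 0]\cdot\tfrac{1}{2} \\
  &= \tfrac{1}{2}\left(\tfrac{1}{2} + \epsilon\right) + \tfrac{1}{2}\cdot\tfrac{1}{2}
   = \tfrac{1}{2} + \tfrac{\epsilon}{2},
\qquad
\left|\Pr[\mathcal{S}\ \text{succeeds}] - \tfrac{1}{2}\right| = \tfrac{\epsilon}{2}.
\end{align*}
```

Since $\epsilon$ is non-negligible by assumption, so is $\epsilon/2$, which contradicts the semantic security of the underlying encryption scheme.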
V. CONCLUSION
We have implemented a private query processing protocol on an inverted index program and have shown that the
proposed protocol is semantically secure if the underlying
homomorphic public-key encryption scheme is semantically
secure. The formalization of the private query processing
protocol is general and we expect more applications can be
deployed within the proposed framework.
REFERENCES
[1] M.Armbrust, A.Fox, R.Griffith, A.D.Joseph, R.H.Katz, A.Konwinski, G.Lee, D.A.Patterson, A.Rabkin, I.Stoica, M.Zaharia: Above the Clouds: A Berkeley View of Cloud Computing, Technical Report No. UCB/EECS-2009-28
[2] D.Boneh, E.Kushilevitz, R.Ostrovsky and W.E.Skeith III: Public Key
Encryption That Allows PIR Queries. CRYPTO 2007: 50-67
[3] J.Bethencourt, D.X.Song and B.Waters: New Constructions and Practical
Applications for Private Stream Searching (Extended Abstract). IEEE
Symposium on Security and Privacy 2006: 132-139
[4] J.Bethencourt, D.X.Song and B.Waters: New Techniques for Private Stream Searching. ACM Trans. Inf. Syst. Secur. 12(3): (2009)
[5] S.Ding, J.Attenberg and T.Suel: Scalable techniques for document identifier assignment in inverted indexes. WWW 2010: 311-320
[6] J.Dean and S.Ghemawat: MapReduce: Simplified Data Processing on
Large Clusters. OSDI 2004: 137-150
[7] J.Dean and S.Ghemawat: MapReduce: simplified data processing on large
clusters. Commun. ACM 51(1): 107-113 (2008)
[8] J.Dean and S.Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010)
[9] I.Damgård and M.Jurik: A Generalisation, a Simplification and Some
Applications of Paillier’s Probabilistic Public-Key System. Public Key
Cryptography 2001: 119-136.
[10] R.L.Grossman. The Case for Cloud Computing. IT Professional 11(2):
23-27 (2009)
[11] R.L.Grossman, Y.Gu, M.Sabala and W.Zhang: Compute and storage clouds using wide area high performance networks. Future Generation Comp. Syst. 25(2): 179-183 (2009)
[12] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
CRYPTO 2005: 223-240
[13] R.Ostrovsky and W.E.Skeith III: Private Searching on Streaming Data.
J. Cryptology 20(4): 397-430 (2007)
[14] P.Paillier: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. EUROCRYPT 1999: 223-238.
[15] S.Petrovic and P.Brown: Large Scale Analysis of the eDonkey P2P File
Sharing System. INFOCOM 2009: 2746-2750
[16] H.Wan, C.Tan and Q.Li: Snoogle: A Search Engine for the Physical
World. INFOCOM 2008: 1382-1390
[17] H.Yan, S.Ding and T.Suel: Compressing term positions in web indexes.
SIGIR 2009: 147-154
[18] H.Yan, S.Ding and T.Suel: Inverted index compression and query
processing with optimized document ordering. WWW 2009: 401-410
