Bandwidth-Efficient Continuous Query Processing over DHTs

Yingwu Zhu
Background

- Instantaneous Query
- Continuous Query
Instantaneous Query (1)

- Documents are indexed
  - Node responsible for keyword t stores the IDs of documents containing that term (i.e., inverted lists)
- Retrieve "one-time" relevant docs
- Latency is a top priority
- Query Q = t1 Λ t2 …
  - Fetch the lists of doc IDs stored under t1, t2, …
  - Intersect these lists
- E.g., the Google search engine
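The fetch-and-intersect step can be sketched as follows (a minimal illustration; the helper name is ours, not from the talk):

```python
def intersect_postings(*postings):
    """Resolve Q = t1 Λ t2 … by intersecting the inverted lists of doc IDs
    fetched from the nodes responsible for t1, t2, …"""
    result = set(postings[0])
    for lst in postings[1:]:
        result &= set(lst)
    return sorted(result)
```

For "cat Λ dog" with the lists cat: 1, 4, 7, 19, 20 and dog: 1, 5, 7, 26, this yields docs 1 and 7.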
Instantaneous Query (2)

[Figure: DHT example for the query "cat Λ dog". Node B stores the inverted list cat: 1, 4, 7, 19, 20; node D stores dog: 1, 5, 7, 26 (other nodes store cow: 2, 4, 8, 18 and bat: 1, 8, 31). Querier A fetches the two lists from B and D, intersects them, and sends the result: docs 1 and 7.]
Continuous Query (1)

- Reverse the roles of documents and queries
- Queries are indexed
  - Query Q = t1 Λ t2 … is stored at the node responsible for one of its terms t1, t2, …
  - Question 1: How is the index term selected? (query indexing)
- "Push" new relevant docs (incrementally)
  - Enabled by "long-lived" queries
- E.g., the Google News Alert feature
Continuous Query (2)

- Upon insertion of a new doc D = t1 Λ t2
  - Contact the nodes responsible for the inverted query lists of D's keywords t1 and t2
    - Question 2: How are these nodes (query nodes, QNs) located? (document announcement)
  - Resolve the query lists → the final list of queries satisfied by D
    - Question 3: What is the resolution strategy? (query resolution)
    - E.g., Term Dialogue, Bloom filters (Infocom'06)
  - Notify the owners of satisfied queries
Query Resolution: Term Dialogue

[Figure: A new doc with terms {cat, dog, cow} is announced (1) to node B, which holds the inverted query list for "cat"; the queries' remaining terms are Q1: dog, Q2: horse Λ dog, Q3: horse Λ cow. B asks the doc's node about "dog" & "cow" (2) and gets back the bit vector "11" (3); it then asks about "horse" (4) and gets back "0" (5). Only Q1 is satisfied, so its owner is notified. Nodes C and D hold the inverted query lists for "dog" and "cow".]
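One round of the dialogue can be sketched as below (the function name and batching are ours; the actual protocol details are in the Infocom'06 paper): the query node sends a batch of unresolved terms, and the doc node replies with a bit vector of memberships.

```python
def dialogue_round(asked_terms, doc_terms):
    """One Term Dialogue round: the doc node answers a batch of term
    membership questions with a bit vector ('1' = term is in the doc)."""
    return "".join("1" if t in doc_terms else "0" for t in asked_terms)

doc = {"cat", "dog", "cow"}
# Queries under "cat" have remaining terms: Q1: dog, Q2: horse Λ dog, Q3: horse Λ cow
step3 = dialogue_round(["dog", "cow"], doc)   # "11": dog and cow are in the doc
step5 = dialogue_round(["horse"], doc)        # "0": horse is not, so Q2 and Q3 fail
```

Only Q1's terms all come back "1", so only Q1's owner is notified.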
Query Resolution: Bloom filters

[Figure: The same doc {cat, dog, cow} is announced to node B as the Bloom filter "10110" (1). B tests its queries' remaining terms against the filter locally: negatives ("horse") are definitive and eliminate Q2 and Q3 on the spot, while the positive hit "dog" may be a false positive and is confirmed by Term Dialogue (2), which returns the bit vector "1" (3). Only Q1 is satisfied, so its owner is notified. Nodes C and D hold the inverted query lists for "dog" and "cow".]
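The announcement filter can be sketched as a tiny Bloom filter (parameters m and k below are illustrative, not the paper's): the doc node ships a filter of its terms, query nodes test remaining query terms locally, and only uncertain positives fall back to Term Dialogue.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k bit positions per term in an m-bit vector."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, term):
        d = hashlib.sha256(term.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, term):
        for p in self._positions(term):
            self.bits |= 1 << p

    def might_contain(self, term):
        # No false negatives: a miss definitively rules the term out.
        # A hit may be a false positive and is confirmed via Term Dialogue.
        return all(self.bits >> p & 1 for p in self._positions(term))
```

A doc node announcing {cat, dog, cow} adds each term; a query node holding Q2 = horse Λ dog can reject Q2 locally as soon as the filter misses on "horse".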
Motivation

- Latency is not the primary concern, but bandwidth can be one of the important design issues
  - Different query indexing schemes incur different costs
  - Different query resolution strategies incur different costs
- → Design a bandwidth-efficient continuous query system with "proper" query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches
Contributions

- Novel query indexing schemes → Question #1
  - Focus of this talk!
- Multicast-based document announcement → Question #2
  - In the paper
- Adaptive query resolution → Question #3
  - Make intelligent decisions in resolving query terms
  - Minimize the bandwidth cost
  - In the full tech report
Design

- Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ … Λ tn
- Leverage DHTs
  - Location & storage of documents and continuous queries
- Query indexing
  - How to choose index terms for queries?
- Doc. announcement, query resolution
  - Not covered in this talk!
Current Indexing Schemes

- Random Indexing (RI)
- Optimal Indexing (OI)
Random Indexing (RI)

- Randomly chooses a term as the index term
  - Q = t1 Λ … Λ tm
  - Index term ti is randomly selected
  - Q is indexed at the DHT node responsible for ti
- Pros: simple
- Cons:
  - Popular terms are more likely to become index terms for queries
    - Load imbalance
    - Many irrelevant queries enter query resolution, wasting bandwidth
Optimal Indexing (OI)

- Q = t1 Λ … Λ tm
  - Index term ti is deterministically chosen as the most selective term, i.e., the one with the lowest frequency
  - Q is indexed at the DHT node responsible for ti
- Pros:
  - Maximizes load balance & minimizes bandwidth cost
- Cons:
  - Assumes perfect knowledge of term statistics
  - Impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …
Solution 1: MHI

- Minimum Hash Indexing
  - Order query terms by their hashes
  - Select the term with the minimum hash as the index term
- Q = t1 Λ … Λ tm
  - Index term ti is deterministically chosen s.t. h(ti) < h(tx) for all x ≠ i
  - Q is indexed at the DHT node responsible for ti
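MHI needs nothing but a hash function that every node computes identically; a minimal sketch (SHA-1 is an assumption here, matching Chord's key space):

```python
import hashlib

def h(term: str) -> int:
    """Map a term onto the DHT key space with a hash all nodes agree on."""
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def mhi_index_term(query_terms):
    """MHI: deterministically select the query term with the minimum hash."""
    return min(query_terms, key=h)
```

Because the choice depends only on the terms themselves, every node picks the same index term for the same query, and no term statistics are required.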
RI vs. MHI

- D = {t2, t4, t5, t6}, drawn from terms t1 … t7, where h(ti) < h(tj) for i < j
- 3 queries, all irrelevant to D:
  - Q1 = t1 Λ t2 Λ t4
  - Q2 = t3 Λ t4 Λ t5
  - Q3 = t3 Λ t5 Λ t6
- (1) RI: Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (two of each query's three terms appear in D), requiring terms t1 through t6 to be resolved
- (2) MHI: all of them will be filtered out! → bandwidth savings!
  - How?
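The filtering claim on this slide can be checked directly; here the ordering h(t1) < … < h(t7) is emulated by using each term's index as its hash (an illustrative stand-in for a real hash function):

```python
# Emulate h(t1) < h(t2) < ... < h(t7) with the term's index as its hash.
h = {f"t{i}": i for i in range(1, 8)}

def mhi_index_term(query):
    return min(query, key=h.get)

D = {"t2", "t4", "t5", "t6"}
queries = {
    "Q1": ["t1", "t2", "t4"],
    "Q2": ["t3", "t4", "t5"],
    "Q3": ["t3", "t5", "t6"],
}
# A query is reached by D's announcement only if its index term is in D.
considered = {name for name, q in queries.items() if mhi_index_term(q) in D}
```

`considered` comes out empty: Q1 is indexed under t1, and Q2 and Q3 under t3, none of which appear in D, so all three irrelevant queries are filtered.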
MHI: filtering irrelevant queries!

[Figure: DHT nodes A–G hold inverted query lists under MHI: t1 → {Q1}, t3 → {Q2, Q3}; t2, t4, t5, t6 → none. When D = {t2, t4, t5, t6} is announced, it reaches only the nodes responsible for t2, t4, t5, and t6, whose query lists are empty, so no action is taken. Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!]

Q1 = t1 Λ t2 Λ t4
Q2 = t3 Λ t4 Λ t5
Q3 = t3 Λ t5 Λ t6
MHI

- Pros:
  - Simple and deterministic
  - Does not require term stats
  - Saves bandwidth over RI (up to 39.3% saving across various query types)
- Cons:
  - Some popular terms can still become index terms when they have the minimum hash in their queries!
    - Load imbalance & irrelevant queries to process
Solution 2: SAP-MHI

- MHI is good but may still index queries under popular terms
- SAmPling-based MHI (SAP-MHI)
  - Sampling (a synopsis of K popular terms) + MHI
  - Avoid indexing queries under the K popular terms
  - Challenge: support duplicate-sensitive aggregates of popular terms, as synopses may be gossiped over multiple DHT overlay links and term frequencies may be overestimated!
  - → Borrow the idea of duplicate-sensitive aggregation from sensor networks
SAP-MHI

- Duplicate-sensitive aggregation
  - Goal: a synopsis of K popular terms
  - Based on a coin-tossing experiment CT(y)
    - Toss a fair coin until either the first head occurs or y tosses end up with no head, and return the number of tosses
  - Each node a:
    - Produces a local synopsis Sa containing K popular terms (the terms with the highest CT(y) values)
    - Gossips Sa to its neighbor nodes
    - Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb into a new synopsis Sa (max() operations)
    - Thus, each node holds a synopsis of the K popular terms after a sufficient number of gossip rounds
  - Intuition: if a term appears in more documents, its CT(y) value will tend to be larger than the values of rare terms
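The aggregation above can be sketched as follows (function names and the choice of y are our assumptions); the key property is that max() is duplicate-insensitive, so a synopsis arriving twice over different overlay links does not inflate any term's value:

```python
import random

def ct(y, rng):
    """CT(y): toss a fair coin until the first head, or until y tosses all
    come up tails; return the number of tosses."""
    for tosses in range(1, y + 1):
        if rng.random() < 0.5:  # head
            return tosses
    return y

def local_synopsis(term_doc_counts, k, y=32, seed=0):
    """Node a's local synopsis Sa: the K terms with the highest CT(y)
    values, running one experiment per document occurrence of a term."""
    rng = random.Random(seed)
    values = {t: max(ct(y, rng) for _ in range(n))
              for t, n in term_doc_counts.items()}
    return dict(sorted(values.items(), key=lambda kv: -kv[1])[:k])

def merge(sa, sb, k):
    """Aggregate two synopses with per-term max(), keeping the top K."""
    out = dict(sa)
    for t, v in sb.items():
        out[t] = max(out.get(t, 0), v)
    return dict(sorted(out.items(), key=lambda kv: -kv[1])[:k])
```

Merging a synopsis with itself leaves it unchanged, which is exactly what makes the aggregate safe under gossip duplication.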
SAP-MHI: Indexing Example

- Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
- Synopsis S = {t1, t2}
- Q is indexed at the node responsible for t3, instead of t1
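Given a synopsis, SAP-MHI's index-term choice is a one-line change to MHI (a sketch, with SHA-1 standing in for the DHT hash): skip terms in the synopsis and take the minimum hash of what remains.

```python
import hashlib

def h(term: str) -> int:
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def sap_mhi_index_term(query_terms, synopsis):
    """SAP-MHI: the minimum-hash query term NOT in the synopsis of K
    popular terms; fall back to plain MHI if every term is popular."""
    candidates = [t for t in query_terms if t not in synopsis]
    return min(candidates or query_terms, key=h)
```

The fallback keeps the scheme total even when every term of a query is popular.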
Simulations

Parameter                   Value
--------------------------  ----------------------------
DHT                         1000-node Chord
Document collection         TREC-1,2-AP
Mean of query sizes         5
# of continuous queries     100,000
# of docs                   10,000
# of unique terms           46,654
# of unique terms per doc   178
Query types                 Skew, Uniform, InverSkew
Query resolution            Term Dialogue, Bloom filters
SAP-MHI vs. MHI

[Figure: SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.]
SAP-MHI vs. MHI

[Figure: Bandwidth saving (%) of SAP-MHI over MHI as synopsis size K grows from 100 to 3000, for Skew, Uniform, and InverSkew queries; Bloom filters are used in query resolution.]
SAP-MHI vs. MHI

[Figure: Bandwidth saving (%) of SAP-MHI over MHI as synopsis size K grows from 100 to 3000, for Skew, Uniform, and InverSkew queries; Term Dialogue is used in query resolution.]
SAP-MHI vs. MHI

[Figure: Percentage of queries filtered as synopsis size K grows from 100 to 3000, for Skew, Uniform, and InverSkew queries.]

This shows why SAP-MHI saves bandwidth over MHI!
Summary

- Focus on a simple keyword query model
- Bandwidth is a top priority
- Query indexing impacts bandwidth cost
  - Goal: sift out as many irrelevant queries as possible!
  - MHI and SAP-MHI
  - SAP-MHI is the more viable solution
    - Load is more balanced, and more bandwidth is saved!
    - Sampling cost is controlled
      - # of popular terms is relatively low
      - Membership of the popular-term set does not change rapidly
- Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)
Thank You!