Bandwidth-Efficient Continuous
Query Processing over DHTs
Yingwu Zhu
Background
Instantaneous Query
Continuous Query
Instantaneous Query (1)
Documents are indexed
Retrieve “one-time” relevant docs
Latency is a top priority
Query Q = t1 ∧ t2 …
The node responsible for keyword t stores the IDs of documents containing that term (i.e., its inverted list)
Fetch the lists of doc IDs stored under t1, t2, …
Intersect these lists
E.g.: Google search engine
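The fetch-and-intersect step can be sketched in a few lines of Python. This is a toy in-memory stand-in for the DHT lookup, with inverted lists mirroring the example on the next slide:

```python
# Toy stand-in for the DHT: term -> set of IDs of docs containing it
# (the lists mirror the "cat"/"dog" example on the next slide).
inverted_lists = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
    "cow": {2, 4, 8, 18},
    "bat": {1, 8, 31},
}

def answer(query_terms):
    """Fetch each term's inverted list and intersect them."""
    return set.intersection(*(inverted_lists[t] for t in query_terms))

print(sorted(answer(["cat", "dog"])))  # docs containing both "cat" and "dog"
```

In a real deployment each list lives on the node whose DHT ID owns the term's hash, so each lookup is a network fetch rather than a dictionary access.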
Instantaneous Query (2)
[Figure: example over four DHT nodes A-D holding inverted lists: cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31. To answer the query “cat ∧ dog”, the lists for “cat” and “dog” are fetched, intersected, and the result (docs 1 and 7) is sent back.]
Continuous Query (1)
Reverse the role of documents and queries
Queries are indexed
“Push” new relevant docs (incrementally)
Query Q = t1 ∧ t2 … is stored at the node responsible for one of its terms t1, t2, …
Question 1: How is the index term selected? (query indexing)
Enabled by “long-lived” queries
E.g.: Google News Alert feature
Continuous Query (2)
Upon insertion of a new doc D = {t1, t2}:
Contact the nodes responsible for the inverted query lists of D's keywords t1 and t2
Resolve the query lists into the final list of queries satisfied by D
Question 2: How to locate these nodes (query nodes, QN)? (document announcement)
Question 3: What is the resolution strategy? (query resolution)
E.g., Term Dialogue, Bloom filters (Infocom'06)
Notify the owners of the satisfied queries
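The flow above can be sketched with a minimal in-memory model. Here `query_index` is a hypothetical stand-in for the per-term query nodes, and the two stored queries are made up for illustration:

```python
# Hypothetical stand-in for the DHT's query nodes: index term -> queries
# (each query is the set of its terms) stored under that term.
query_index = {
    "cat": [{"cat", "dog"}],
    "horse": [{"horse", "cow"}],
}

def on_new_document(doc_terms):
    """Contact the query nodes for each of the doc's terms and collect
    the queries whose terms all appear in the document."""
    satisfied = []
    for t in sorted(doc_terms):              # one lookup per doc term
        for q in query_index.get(t, []):
            if q <= doc_terms:               # all query terms satisfied by D
                satisfied.append(q)
    return satisfied
```

For a new doc {cat, dog, cow}, the first query matches; the second is never even reached, because its index term "horse" is not among the doc's terms.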
Query Resolution: Term Dialogue
[Figure: a new document with terms {cat, dog, cow} is announced (step 1). Node B, holding the inverted query list for “cat” (Q1: dog; Q2: horse ∧ dog; Q3: horse ∧ cow), engages in a dialogue with the doc node: it asks about “dog” and “cow” (step 2) and gets back the bit vector “11” (step 3); it then asks about “horse” (step 4) and gets back “0” (step 5). Q1 is satisfied, so its owner is notified; Q2 and Q3 are not. Nodes C and D hold the inverted query lists for “dog” and “cow”.]
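The dialogue in the figure can be sketched as follows; this is a minimal model of the doc-node side of one round, which answers membership questions with a bit vector:

```python
doc_terms = {"cat", "dog", "cow"}   # the announced document

def term_dialogue(asked_terms):
    """Doc-node side of one dialogue round: one membership bit per term."""
    return [t in doc_terms for t in asked_terms]

# The query node for "cat" holds Q1: dog, Q2: horse & dog, Q3: horse & cow.
bits = term_dialogue(["dog", "cow"])   # bit vector "11": Q1 is satisfied
bits2 = term_dialogue(["horse"])       # bit vector "0": Q2 and Q3 fail
```

Each round costs one message pair, so the query node tries to order its questions to eliminate as many queries as possible per round.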
Query Resolution: Bloom filters
[Figure: the same document {cat, dog, cow} is now announced together with a Bloom filter of its terms, “10110” (step 1). The node holding the inverted query list for “cat” (Q1: dog; Q2: horse ∧ dog; Q3: horse ∧ cow) first tests query terms against the filter locally; “horse” fails, eliminating Q2 and Q3, so only “dog” must be confirmed via a Term Dialogue round (step 2), answered with the bit vector “1” (step 3). Q1 is satisfied and its owner is notified. Nodes C and D hold the inverted query lists for “dog” and “cow”.]
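A Bloom filter lets the query node run most membership tests locally. A minimal sketch; the filter size M, hash count K, and SHA-1-based hashing are illustrative choices, not taken from the talk:

```python
import hashlib

M, K = 64, 3  # illustrative filter size (bits) and number of hash functions

def positions(term):
    """K bit positions for a term, derived from salted SHA-1 digests."""
    for i in range(K):
        digest = hashlib.sha1(f"{i}:{term}".encode()).digest()
        yield int.from_bytes(digest[:4], "big") % M

def make_filter(terms):
    """Set the K bits of every term; this is the document announcement."""
    bits = 0
    for t in terms:
        for p in positions(t):
            bits |= 1 << p
    return bits

def maybe_contains(bits, term):
    """True for every member; may rarely be a false positive otherwise."""
    return all((bits >> p) & 1 for p in positions(term))

bf = make_filter({"cat", "dog", "cow"})
# Terms that fail the local test eliminate their queries with no extra
# messages; terms that pass still need a Term Dialogue round, because
# Bloom filters admit false positives.
```

The trade-off is announcement size versus saved dialogue rounds, which is why the choice of resolution strategy affects bandwidth.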
Motivation
Latency is not the primary concern, but bandwidth is an important design issue
Different query indexing schemes incur different costs
Different query resolution strategies incur different costs
Design a bandwidth-efficient continuous query system with “proper” query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches
Contributions
Novel query indexing schemes (Question #1) [focus of this talk!]
Multicast-based document announcement (Question #2) [in the paper]
Adaptive query resolution (Question #3) [in the full tech. report]
Make intelligent decisions in resolving query terms
Minimize the bandwidth cost
Design
Focus on simple keyword queries, e.g., Q = t1 ∧ t2 ∧ … ∧ tn
Leverage DHTs
Query indexing
Location & storage of documents and continuous
queries
How to choose index terms for queries?
Doc. announcement, query resolution
Not covered in this talk!
Current Indexing Schemes
Random Indexing (RI)
Optimal Indexing (OI)
Random Indexing (RI)
Randomly chooses a term as the index term
Q = t1 ∧ … ∧ tm
Index term ti is randomly selected
Q is indexed in the DHT node responsible for ti
Pros: simple
Cons:
Popular terms are more likely to be index terms for queries
Load imbalance
Introduces many irrelevant queries into query resolution, wasting bandwidth
Optimal Indexing (OI)
Q = t1 ∧ … ∧ tm
Index term ti is deterministically chosen as the most selective term, i.e., the one with the lowest frequency
Q is indexed in the DHT node responsible for ti
Pros:
Maximizes load balance & minimizes bandwidth cost
Cons:
Assumes perfect knowledge of term statistics
Impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …
Solution 1: MHI
Minimum Hash Indexing
Order query terms by their hashes
Select the term with the minimum hash as the index term
Q = t1 ∧ … ∧ tm
Index term ti is deterministically chosen, s.t. h(ti) < h(tx) for all x ≠ i
Q is indexed in the DHT node responsible for ti
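MHI's index-term selection is a one-liner. In this sketch the SHA-1-based `h` is an assumption standing in for whatever hash function the DHT uses:

```python
import hashlib

def h(term):
    """Stand-in for the DHT's hash function (SHA-1 here, as an assumption)."""
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def mhi_index_term(query_terms):
    """Deterministically select the query term with the minimum hash."""
    return min(query_terms, key=h)
```

Because every node computes the same hash, any node can recompute Q's index term with no term statistics at all.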
RI vs. MHI
D = {t2, t4, t5, t6}; terms t1, t2, …, t7, where h(ti) < h(tj) for i < j.
3 queries, all irrelevant to D:
Q1 = t1 ∧ t2 ∧ t4
Q2 = t3 ∧ t4 ∧ t5
Q3 = t3 ∧ t5 ∧ t6
(1) RI: Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (terms t1, t2, t3, t4, t5, and t6 need to be resolved)
(2) MHI: all of them will be filtered out! Bandwidth savings!
How?
MHI: filtering irrelevant queries!
[Figure: D = {t2, t4, t5, t6} is announced to the nodes responsible for its terms. Those nodes' inverted query lists (t2: none; t4: none; t5: none; t6: none) are empty, so no action is taken. Q1 is indexed at t1's node, and Q2 and Q3 at t3's node (their minimum-hash index terms); since t1 and t3 are not in D, those nodes are never contacted. Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!
Q1 = t1 ∧ t2 ∧ t4
Q2 = t3 ∧ t4 ∧ t5
Q3 = t3 ∧ t5 ∧ t6]
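The filtering in this example can be checked directly. Using each term's index as its hash value reproduces the slide's assumption that h(ti) < h(tj) for i < j:

```python
# h(t1) < ... < h(t7): use each term's index as its "hash".
h = {f"t{i}": i for i in range(1, 8)}

D = {"t2", "t4", "t5", "t6"}
queries = [{"t1", "t2", "t4"}, {"t3", "t4", "t5"}, {"t3", "t5", "t6"}]

def reached_by(doc_terms, query):
    """Under MHI, a query is examined only if its minimum-hash index
    term is one of the announced document's terms."""
    return min(query, key=h.get) in doc_terms

# Q1 is indexed at t1, Q2 and Q3 at t3; neither term is in D, so the
# announcement of D never reaches any of the three queries.
assert not any(reached_by(D, q) for q in queries)
```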
MHI
Pros:
Simple and deterministic
Does not require term stats
Saves bandwidth over RI (up to 39.3% savings for various query types)
Cons:
A popular term can still become the index term if it has the minimum hash in its query!
Load imbalance & irrelevant queries to process
Solution 2: SAP-MHI
MHI is good but may still index queries under popular terms
SAmPling-based MHI (SAP-MHI)
Sampling (a synopsis of the K most popular terms) + MHI
Avoid indexing queries under these K popular terms
Challenge: naive aggregation of term frequencies is duplicate-sensitive; synopses may be gossiped over multiple DHT overlay links, so term frequencies may be overestimated!
Borrow the idea of duplicate-insensitive aggregation from sensor networks
SAP-MHI
Duplicate-insensitive aggregation
Goal: a synopsis of the K most popular terms
Based on a coin-tossing experiment CT(y)
Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses
Each node a:
Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y))
Gossips Sa to its neighbor nodes
Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb, producing a new synopsis Sa (max() operations)
Thus, each node has a synopsis of K popular terms after a sufficient number of gossip rounds
Intuition: if a term appears in more documents, then its value produced by CT(y) will be larger than the values of rare terms
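The scheme can be sketched as follows (one CT draw per document occurrence of a term; the parameter values are illustrative). Because the per-term aggregate is a max, re-receiving the same synopsis over multiple overlay links cannot inflate any value:

```python
import random

def ct(y):
    """Toss a fair coin until the first head, giving up after y tosses;
    return the number of tosses performed."""
    for tosses in range(1, y + 1):
        if random.random() < 0.5:   # head
            return tosses
    return y

def local_synopsis(term_doc_counts, K, y=32):
    """One CT(y) draw per document occurrence; keep the per-term max,
    then the K terms with the highest values."""
    vals = {t: max(ct(y) for _ in range(n)) for t, n in term_doc_counts.items()}
    top = sorted(vals, key=vals.get, reverse=True)[:K]
    return {t: vals[t] for t in top}

def aggregate(sa, sb, K):
    """Gossip step: per-term max of two synopses, then top-K.
    max() is idempotent, so duplicate deliveries cannot overestimate."""
    merged = {t: max(sa.get(t, 0), sb.get(t, 0)) for t in sa.keys() | sb.keys()}
    top = sorted(merged, key=merged.get, reverse=True)[:K]
    return {t: merged[t] for t in top}
```

Aggregating a synopsis with itself leaves it unchanged, which is exactly the duplicate-insensitivity needed when gossip paths overlap.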
SAP-MHI: Indexing Example
Query Q = t1 ∧ t2 ∧ t3 ∧ t4 ∧ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
Synopsis S = {t1, t2}
Q is indexed on the node responsible for t3, instead of t1
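This selection rule is a one-line change to MHI. In this sketch, falling back to plain MHI when every query term is popular is an assumption, not something the slides specify:

```python
def sap_mhi_index_term(query_terms, synopsis, h):
    """MHI restricted to terms outside the popular-term synopsis;
    falls back to plain MHI if all terms are popular (an assumption)."""
    candidates = [t for t in query_terms if t not in synopsis] or query_terms
    return min(candidates, key=h)

# The slide's example: h(t1) < ... < h(t5), synopsis S = {t1, t2}.
h = {f"t{i}": i for i in range(1, 6)}
q = ["t1", "t2", "t3", "t4", "t5"]
assert sap_mhi_index_term(q, {"t1", "t2"}, h.get) == "t3"
```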
Simulations

Parameter                  Value
DHT                        1000-node Chord
Document collection        TREC-1,2-AP
Mean of query sizes        5
# of continuous queries    100,000
# of docs                  10,000
# of unique terms          46,654
# of unique terms per doc  178
Query types                Skew, Uniform, InverSkew
Query resolution           Term Dialogue, Bloom filters
SAP-MHI vs. MHI
SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.
SAP-MHI vs. MHI
[Chart: bandwidth saving (%) of SAP-MHI over MHI (y-axis 0-70) vs. synopsis size K (100-3000), for Skew, Uniform, and InverSkew queries; Bloom filters are used in query resolution.]
SAP-MHI vs. MHI
[Chart: bandwidth saving (%) of SAP-MHI over MHI (y-axis 0-100) vs. synopsis size K (100-3000), for Skew, Uniform, and InverSkew queries; Term Dialogue is used in query resolution.]
SAP-MHI vs. MHI
[Chart: % of queries filtered (y-axis 0-100) vs. synopsis size K (100-3000), for Skew, Uniform, and InverSkew queries.]
This shows why SAP-MHI saves bandwidth over MHI!
Summary
Focus on a simple keyword query model
Bandwidth is a top priority
Query indexing impacts bandwidth cost
Goal: Sift out as many irrelevant queries as possible!
MHI and SAP-MHI
SAP-MHI is the more viable solution:
Load is more balanced, and more bandwidth is saved!
Sampling cost is controlled:
# of popular terms is relatively low
Membership of the popular-term set does not change rapidly
Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)
Thank You!