ODISSEA
(Open Distributed Search Engine Architecture)

A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval

Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang,
Alex Delis, Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram

Presented by Daniel Porta ([email protected])
Talk Outline

- Motivation
- Design Overview
- System Design Details
- Target Applications
- Implementation Details
- Efficient Query Processing
- Open Questions
Motivation

- Today, most of the web search infrastructure is provided by only a few large crawl-based search engines
- There has been strong research activity in the field of P2P systems over the last few years
- Computers have become, and will continue to become, faster, and network bandwidth has increased and will keep growing
- This raises two issues:
  - the vast amount of data in P2P networks requires the ability to search within these networks
  - the significant computing resources provided by a P2P system could be used to search content residing inside or outside the system
- ODISSEA addresses both issues with a "distributed global indexing and query execution service"
Design Overview

- ODISSEA differs from many other approaches to P2P search
- It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system
- In a global index, in contrast to a local index, a single node holds the entire inverted index for a particular term
Two Layer Approach

- The lower layer provides
  - maintenance of the global index structure under document insertions and updates
  - handling of node joins and failures
  - efficient execution of simple search queries
- The upper layer interacts with the P2P-based lower layer via two classes of clients
  - update clients (e.g. crawlers, web servers)
  - query clients (which implement their own optimized query execution plans)

[Figure: a crawler (update client) feeds pages from the WWW into the ODISSEA lower layer, while search servers (query clients) send queries to it.]
Two Layer Approach

- Enables a large variety of (client-based) search tools that more fully exploit client computing resources
- These tools could all share the same lower-layer web search infrastructure
- Tools are developed against an open API that provides access to the search infrastructure
- When processing a query, this can in the most general case (i.e. when no pre-evaluation is done on the server side) result in large amounts of data being transferred to the query client
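To make the two client roles concrete, here is a hypothetical Java sketch of the kind of open API the upper layer could program against; the interface and method names are my own assumptions for illustration, not the actual ODISSEA API.

    import java.util.List;

    // Hypothetical sketch only: interface and method names are assumptions,
    // not the actual ODISSEA API.
    interface UpdateClient {
        // Update clients (e.g. a crawler or a web server) push new or modified
        // documents into the lower layer, which parses and indexes them.
        void insertDocument(String url, String content);
    }

    interface QueryClient {
        // Query clients pull (prefixes of) global inverted lists from the lower
        // layer and run their own optimized query execution plan on the client;
        // with no server-side pre-evaluation this may transfer a lot of data.
        // Each returned entry encodes one posting, e.g. "docId:score".
        List<String> fetchInvertedList(String term, int maxPostings);
    }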
Global vs. Local index

- A posting has the form [DocID, position, additional information]
- An inverted list is a list of postings that represents all occurrences of a term in the document collection
- The inverted index for a set of terms is the set of the corresponding inverted lists
- Suppose a query "chair AND table". With a local index, each node indexes only its own documents, so the search client has to send the query to every node (A, B, C) and merge their results; with a global index, node A holds the complete inverted list for "chair" and node B the one for "table", so only these two nodes are involved

[Figure: query processing with a local index (the search client contacts nodes A, B and C) vs. a global index (node A holds "chair", node B holds "table")]
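As a concrete illustration of these definitions, the following minimal Java sketch (class and field names are my own, not ODISSEA code) shows a posting and the per-term inverted lists a node would hold under a global index organization.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only; class and field names are assumptions.
    final class Posting {
        final long docId;      // document identifier
        final int position;    // position of the occurrence within the document
        final float weight;    // "additional information", e.g. a precomputed term score

        Posting(long docId, int position, float weight) {
            this.docId = docId;
            this.position = position;
            this.weight = weight;
        }
    }

    // Global index organization: a single node holds the *entire* inverted
    // list for each term assigned to it.
    final class GlobalIndexNode {
        private final Map<String, List<Posting>> invertedLists = new HashMap<>();

        void addPosting(String term, Posting p) {
            invertedLists.computeIfAbsent(term, t -> new ArrayList<>()).add(p);
        }

        List<Posting> invertedList(String term) {
            return invertedLists.getOrDefault(term, List.of());
        }
    }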
Global vs. Local index

- A local index organization is very inefficient in very large networks (e.g. the web) if result quality is the major concern, because the query has to be transmitted to all nodes and all of them have to respond
- In a global index organization, on the other hand, large amounts of data need to be transmitted between nodes when
  - initially building the index
  - evaluating a query → bad response time
- This can be overcome with smart algorithmic techniques, as shown later
- The choice depends on the types of queries and the frequency of document updates, as well as on how dynamic the system is
Crawling and Fault Tolerance

Crawling approach
- Client-based, non-P2P crawlers have the advantage that they can easily be adjusted when web site operators complain about the bot
- Smart crawling strategies beyond BFS are hard to implement in a P2P environment unless there is a centralized scheduler

P2P systems and fault tolerance
- The system design relies on the assumption of a reasonably stable P2P environment, since otherwise administration (inserts, updates, replication) would be too expensive
Target Applications

- Full-text search in large document collections located within P2P communities
- Search in large intranet environments
- Web search: a powerful API supports the anticipated shift towards client-based search tools that better exploit the resources of today's desktop machines
- Search middleware: instead of inserting documents, clients could directly insert index entries. This might speed up query execution, since for a document only certain "strong" keywords need to be inserted. A drawback is that the identification of such keywords lies in the clients' hands
Implementation Details

- Currently, a first prototype is being implemented in Java, using Pastry as the P2P substrate (lower layer) and a DHT mapping for resolving IDs to the appropriate IP address
- Each node runs an indexer that stores inverted lists in compressed form in a Berkeley DB (which contains a B+ tree); each document is also stored in a Berkeley DB
- Using MD5, all documents and term lists are hashed to an 80-bit ID that is used for lookups in the system
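A minimal sketch of how such an ID could be derived, assuming the 80-bit ID is simply the first 80 bits (10 bytes) of the MD5 hash; the class name, the truncation choice and the hex formatting are my assumptions.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch only: assumes the 80-bit ID is an MD5 prefix; details may differ.
    final class ObjectId {
        static String toId(String key) {
            try {
                byte[] md5 = MessageDigest.getInstance("MD5")
                        .digest(key.getBytes(StandardCharsets.UTF_8));
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 10; i++) {            // 10 bytes = 80 bits
                    sb.append(String.format("%02x", md5[i]));
                }
                return sb.toString();                     // used as the DHT lookup key
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }

        public static void main(String[] args) {
            System.out.println(toId("http://example.com/page.html"));  // a document
            System.out.println(toId("chair"));                         // a term list
        }
    }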
Implementation Details

Parsing and Routing Postings
- New or updated documents are parsed at the node where they reside, as determined by the DHT mapping
- For each term, the parser generates a posting that is routed via several intermediate nodes, as determined by the topology of the Pastry network, until it reaches its destination node
- The index structure of a node is split into a small structure residing in main memory that is eventually merged into a bigger structure on disk, to avoid disk accesses during inserts/updates → lower amortized complexity
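A minimal sketch of this two-level update scheme, assuming new postings are buffered in memory and bulk-merged into the on-disk structure once a size threshold is reached; the on-disk Berkeley DB B+ tree is replaced here by a plain in-memory stand-in, and all names are my own.

    import java.util.TreeMap;

    // Sketch only: the "disk" structure stands in for the Berkeley DB B+ tree.
    final class TwoLevelTermIndex {
        private final TreeMap<Long, Float> memory = new TreeMap<>(); // small, in main memory
        private final TreeMap<Long, Float> disk = new TreeMap<>();   // large, on disk in ODISSEA
        private final int memoryLimit;

        TwoLevelTermIndex(int memoryLimit) {
            this.memoryLimit = memoryLimit;
        }

        // Inserts and updates touch only the in-memory structure; the expensive
        // disk work is paid once per bulk merge, giving lower amortized cost.
        void insert(long docId, float score) {
            memory.put(docId, score);
            if (memory.size() >= memoryLimit) {
                disk.putAll(memory);   // one bulk merge into the big structure
                memory.clear();
            }
        }
    }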
Implementation Details

Groups and Splits
- Initially, all objects (documents, index lists) whose IDs coincide in the first w bits (here w=16) are placed into a common group identified by this w-bit string
- Locally, each group maintains a Berkeley DB with all objects it contains
- When a group (of documents) becomes too large (here >1GB), it is split into two groups identified by (w+1)-bit strings, leaving behind a stub structure that points to the new groups, which are assigned to new nodes
- If the index structure for a term becomes too large (here >100MB), it is split into two lists according to the document IDs they contain
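The following sketch illustrates the labeling scheme under the assumption that group labels are simply bit prefixes of the 80-bit object IDs; the method names and the example ID are made up for illustration.

    import java.math.BigInteger;

    // Sketch only: assumes group labels are bit prefixes of the 80-bit IDs.
    final class Groups {
        // The w-bit group label of an object, taken from its 80-bit hex ID.
        static String groupLabel(String hexId, int w) {
            String bits = new BigInteger(hexId, 16).toString(2);
            bits = "0".repeat(80 - bits.length()) + bits;   // restore leading zeros
            return bits.substring(0, w);
        }

        // Splitting a group that grew too large: one w-bit group becomes
        // two (w+1)-bit groups (the old label remains as a stub/pointer).
        static String[] split(String label) {
            return new String[] { label + "0", label + "1" };
        }

        public static void main(String[] args) {
            String id = "3f8a21c4e09b7d5612aa";   // arbitrary 80-bit ID (hex), for illustration
            String label = groupLabel(id, 16);
            System.out.println(label);             // the initial 16-bit group
            System.out.println(String.join(", ", split(label)));
        }
    }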
Implementation Details

Replication
- Performed at the group level by attaching "/0", "/1", etc. to the group label (e.g. 0100101/2)
- This new label is then what is actually presented to Pastry/the DHT during lookups
- All replicas of a group form a "clique" and communicate periodically to update their status
- If a group replica fails, the others are in charge of detecting this and, if necessary, performing repair
- Each node can contain several distinct group replicas and therefore participate in several cliques
- Postings are first routed to only one replica, which is then in charge of forwarding them to the others over a period of a few minutes
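A small sketch of this naming convention, assuming each replica label is just the group label plus a "/i" suffix and that it is this suffixed label that gets hashed for the DHT lookup; the class and method names are mine.

    // Sketch only: replica labels as group label + "/i" suffix.
    final class ReplicaLabels {
        static String[] labels(String groupLabel, int replicationDegree) {
            String[] out = new String[replicationDegree];
            for (int i = 0; i < replicationDegree; i++) {
                out[i] = groupLabel + "/" + i;   // e.g. "0100101/2"
            }
            return out;
        }

        public static void main(String[] args) {
            // Each suffixed label hashes to a (generally) different node, so the
            // replicas of one group are spread over the system.
            for (String l : labels("0100101", 3)) {
                System.out.println(l);
            }
        }
    }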
Implementation Details

Faults, Unavailability and Synchronization
- When a node leaves the system, its group replicas eventually have to be replaced to maintain the desired degree of replication
- A node is considered failed if it has been unavailable for an extended period of time
- New replicas are created for a failed node, or if a certain number of nodes are unavailable
- Formerly unavailable nodes have to synchronize their index structures using logs of the updates they missed
Efficient Query Processing

Information-Theoretic Background
- Let d be a document, q = q0 ... q(m-1) a query consisting of m terms, and F a function that assigns to d (depending on q) a value F(d, q). Such a function is called a ranking function.
- The top-k ranking problem for a query q is to find the k documents with the highest values F(d, q).
- A common form of such a function is

      F(d, q) = Σ_{i=0}^{m-1} f(d, q_i)

  For a two-term query q = q0 AND q1 this reduces to F(d, q) = f(d, q0) + f(d, q1).
- Since queries typically have at most 2 search terms, the following algorithms focus on the top-k ranking problem for queries with exactly 2 search terms (for one-term queries there is nothing to do: the inverted list is already sorted by score, so its first k postings are the answer)
Efficient Query Processing

Fagin's Algorithm (FA)
- Intuitively, an item that is ranked near the top overall is likely to be ranked very high in at least one of the contributing subcategories
- Assume a query q = q0 AND q1 and postings of the form (d, f(d, qi)) that are sorted by the second component, with the highest values on top
- Also assume that the inverted lists for q0 and q1 are located on the same machine, so that no network communication is required
- Goal: compute the top k documents as fast as possible
Efficient Query Processing

FA example for k=2:
List A (term q0): (d1, 0.9), (d2, 0.8), (d3, 0.7), (d4, 0.69), (d5, 0.67)
List B (term q1): (d6, 0.6), (d5, 0.5), (d3, 0.4), (d1, 0.3), (d7, 0.2), (d8, 0.1)

1. Scan both lists from the beginning, reading one element from each list in every step, until k documents have been encountered in both lists (here k=2: after four steps, d3 and d1 have appeared in both lists).
2. Compute the scores of these k documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine its score.
3. Return the k documents with the highest score (here d1 with 1.2 and d5 with 1.17).
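The following is a minimal in-memory Java sketch of FA for two lists that reproduces the example above; class and variable names are my own, and documents missing from one list are simply treated as contributing 0 there.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of Fagin's Algorithm (FA) for two lists; illustration only.
    final class FaginTwoLists {
        static List<Map.Entry<String, Double>> topK(Map<String, Double> a,
                                                    Map<String, Double> b, int k) {
            List<String> docsA = new ArrayList<>(a.keySet());   // lists are sorted by score
            List<String> docsB = new ArrayList<>(b.keySet());
            Set<String> seenA = new HashSet<>(), seenB = new HashSet<>(), seenBoth = new HashSet<>();

            // Step 1: scan both sorted lists in lockstep until k documents
            // have been encountered in both of them.
            for (int i = 0; seenBoth.size() < k && (i < docsA.size() || i < docsB.size()); i++) {
                if (i < docsA.size()) {
                    seenA.add(docsA.get(i));
                    if (seenB.contains(docsA.get(i))) seenBoth.add(docsA.get(i));
                }
                if (i < docsB.size()) {
                    seenB.add(docsB.get(i));
                    if (seenA.contains(docsB.get(i))) seenBoth.add(docsB.get(i));
                }
            }

            // Step 2: for every document seen so far, look up its score in the
            // other list and compute its total score.
            Set<String> seen = new HashSet<>(seenA);
            seen.addAll(seenB);
            Map<String, Double> totals = new LinkedHashMap<>();
            for (String d : seen) totals.put(d, a.getOrDefault(d, 0.0) + b.getOrDefault(d, 0.0));

            // Step 3: return the k documents with the highest total score.
            return totals.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(k).toList();
        }

        public static void main(String[] args) {
            Map<String, Double> a = new LinkedHashMap<>();   // list for q0
            a.put("d1", 0.9); a.put("d2", 0.8); a.put("d3", 0.7); a.put("d4", 0.69); a.put("d5", 0.67);
            Map<String, Double> b = new LinkedHashMap<>();   // list for q1
            b.put("d6", 0.6); b.put("d5", 0.5); b.put("d3", 0.4);
            b.put("d1", 0.3); b.put("d7", 0.2); b.put("d8", 0.1);
            System.out.println(topK(a, b, 2));               // [d1=1.2..., d5=1.17]
        }
    }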
Efficient Query Processing

Threshold Algorithm (TA)
- Scan both lists simultaneously, reading (d, f(d, q0)) from the first and (d', f(d', q1)) from the second list
- Compute the threshold t = f(d, q0) + f(d', q1)
- For each document d read from one of the lists, immediately perform a lookup into the other list in order to compute its complete score
- The algorithm terminates when k documents have been found whose scores are higher than the current value of t

Because it does not make sense to scan two lists simultaneously while they are distributed across a P2P network, the above techniques have to be adapted. This leads to the following protocol, which aims at minimizing the amount of data to be transferred.
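Here is a minimal sketch of TA for two lists, again with my own names and with missing documents scored as 0; note how the threshold lets it stop after scanning only a prefix of the lists. Called with the lists from the FA sketch above and k=2, it stops after reading three postings from each list and returns d1 and d5.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Threshold Algorithm (TA) for two lists; illustration only.
    final class ThresholdTwoLists {
        static List<Map.Entry<String, Double>> topK(Map<String, Double> a,
                                                    Map<String, Double> b, int k) {
            List<Map.Entry<String, Double>> la = new ArrayList<>(a.entrySet());  // sorted by score
            List<Map.Entry<String, Double>> lb = new ArrayList<>(b.entrySet());
            Map<String, Double> totals = new LinkedHashMap<>();

            for (int i = 0; i < Math.max(la.size(), lb.size()); i++) {
                // Read the next posting from each list and immediately look the
                // document up in the other list to get its complete score.
                if (i < la.size()) {
                    String d = la.get(i).getKey();
                    totals.put(d, la.get(i).getValue() + b.getOrDefault(d, 0.0));
                }
                if (i < lb.size()) {
                    String d = lb.get(i).getKey();
                    totals.put(d, a.getOrDefault(d, 0.0) + lb.get(i).getValue());
                }
                // Threshold t: no document not yet seen can score higher than t.
                double t = (i < la.size() ? la.get(i).getValue() : 0.0)
                         + (i < lb.size() ? lb.get(i).getValue() : 0.0);
                List<Map.Entry<String, Double>> top = totals.entrySet().stream()
                        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                        .limit(k).toList();
                if (top.size() == k && top.get(k - 1).getValue() > t) return top;
            }
            return totals.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(k).toList();
        }
    }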
Efficient Query Processing

A simple distributed pruning protocol (DPP)

1. Node A (holding the shorter list) sends its first x postings to node B. Let rmin be the smallest value f(d, q0) transmitted.
2. Node B receives the postings from A and performs lookups into its own list in order to compute their total scores. It retains the k documents with the highest scores; let rk be the smallest value among these.
3. Node B then transmits to A all postings among its first x postings with f(d, q1) > rk - rmin, together with the total scores of the k documents from the previous step.
4. Node A performs lookups into its own list for the postings received from B and determines the overall top-k documents.
Efficient Query Processing

DPP example for k=2 and x=3:
A, holding term q0: (d1, 0.9), (d2, 0.8), (d3, 0.7), (d4, 0.69), (d5, 0.67)
B, holding term q1: (d6, 0.6), (d5, 0.5), (d3, 0.4), (d1, 0.3), (d7, 0.2), (d8, 0.1)

1. A to B: (d1, 0.9), (d2, 0.8), (d3, 0.7)  →  rmin = 0.7
2. B computes: (d1, 0.9 + 0.3 = 1.2), (d2, 0.8 + ----), (d3, 0.7 + 0.4 = 1.1)
   B retains (d1, 1.2) and (d3, 1.1)  →  rk = 1.1, so rk - rmin = 1.1 - 0.7 = 0.4
3. B to A: (d6, 0.6) and (d5, 0.5), because their f(d, q1) > 0.4, together with (d1, 1.2) and (d3, 1.1)
4. A computes: (d6, 0.6 + ----), (d5, 0.5 + 0.67 = 1.17)
   Top 2 documents: 1. (d1, 1.2), 2. (d5, 1.17)
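Below is a single-process Java sketch of the DPP that reproduces this example; the three network messages are modeled as local data structures, all names are my own, and documents missing from a list (the "----" entries above) contribute 0. Running it prints d1 (1.20) and d5 (1.17).

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the distributed pruning protocol (DPP); illustration only.
    final class DppExample {
        public static void main(String[] args) {
            // Inverted lists (docId -> f(d, qi)), sorted by descending score.
            Map<String, Double> a = new LinkedHashMap<>();   // node A, term q0 (shorter list)
            a.put("d1", 0.9); a.put("d2", 0.8); a.put("d3", 0.7); a.put("d4", 0.69); a.put("d5", 0.67);
            Map<String, Double> b = new LinkedHashMap<>();   // node B, term q1
            b.put("d6", 0.6); b.put("d5", 0.5); b.put("d3", 0.4);
            b.put("d1", 0.3); b.put("d7", 0.2); b.put("d8", 0.1);
            int k = 2, x = 3;

            // (1) A -> B: A's first x postings; rmin is the smallest value sent.
            List<Map.Entry<String, Double>> msgAtoB = new ArrayList<>(a.entrySet()).subList(0, x);
            double rmin = msgAtoB.get(x - 1).getValue();

            // (2) B looks the received documents up in its own list, keeps the
            //     top k totals, and determines rk (the k-th best total so far).
            Map<String, Double> totalsAtB = new LinkedHashMap<>();
            msgAtoB.forEach(e -> totalsAtB.put(e.getKey(), e.getValue() + b.getOrDefault(e.getKey(), 0.0)));
            List<Map.Entry<String, Double>> topAtB = totalsAtB.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed()).limit(k).toList();
            double rk = topAtB.get(k - 1).getValue();

            // (3) B -> A: B's first x postings with f(d, q1) > rk - rmin, plus the k totals.
            List<Map.Entry<String, Double>> msgBtoA = new ArrayList<>(b.entrySet()).subList(0, x).stream()
                    .filter(e -> e.getValue() > rk - rmin).toList();

            // (4) A looks the returned postings up in its own list and outputs the top k.
            Map<String, Double> finalScores = new LinkedHashMap<>();
            topAtB.forEach(e -> finalScores.put(e.getKey(), e.getValue()));
            msgBtoA.forEach(e -> finalScores.put(e.getKey(), e.getValue() + a.getOrDefault(e.getKey(), 0.0)));
            finalScores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed()).limit(k)
                    .forEach(e -> System.out.printf("%s -> %.2f%n", e.getKey(), e.getValue()));
        }
    }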
Efficient Query Processing

Problems with the DPP
- It only works for queries with 2 search terms
- Random lookups can cause disk accesses, since large index structures reside on disk → bad response time
- How should the value of x be chosen?
  (x should be the number of postings transmitted by A and B such that the DPP works correctly without an extra round trip; it depends on k and on the lengths of the inverted lists)
  - by deriving appropriate formulae based on extensive testing
  - by sampling-based methods that estimate the number of documents appearing in both lists
Efficient Query Processing

Evaluation of the DPP
- 900 two-term queries selected from a set of over 1 million
- Test corpus: 120 million web pages (1.8 TB) crawled with their own crawler
- Value of x determined by experiments on TA
- Computation within the nodes is not taken into account
- Communication costs and estimated times of the DPP for the top-10 documents under the standard cosine measure (queries grouped into quintiles by the length of the shorter list):
                                shortest 20%   shorter 20%   middle 20%   longer 20%   longest 20%
  Length of shorter list              10.401        63.853      222.948      666.717     3.371.176
  # postings A → B                     2.057         4.083        2.904        4.417         3.745
  # postings B → A                     1.486         4.084        2.891        4.413         3.745
  Total bytes transferred             28.344        65.336       46.360       70.640        59.920
  Total comm. time (400Kbps)           1.052         1.477        1.216        1.550         1.405
  Total comm. time (2Mbps)               833         1.368        1.107        1.441         1.295

  (Numbers use "." as a thousands separator.)
Future Work

- A framework for generating optimized query execution plans for multi-keyword queries
- New algorithmic techniques for the index synchronization problem
- New strategies for load balancing and for rebuilding lost replicas
- More experimental evaluation concerning different types of queries
Questions?

"The general question remains whether the near future will see massive P2P-based systems for challenging applications such as web search and large-scale IR, beyond the current simple applications such as file sharing."