WEB SEARCH and P2P
Advisor: Dr Sushil Prasad
Presented By: DM Rasanjalee Himali
OUTLINE
Introduction to web search engines
What is a web search engine?
Web Search engine architecture
How does a web search engine work?
Relevance and Ranking
Limitations in current Web Search Engines
P2P Web Search Engines
YouSearch
Coopeer
ODISSEA
Conclusion
What is a web search engine?
A Web search engine is a search engine
designed to search for information on the
World Wide Web.
Information may consist of web pages,
images and other types of files.
Some search engines also mine data
available in newsgroups, databases, or
open directories
History…
Before there were search engines,
there was a complete list of all
web servers.
The very first tool used for
searching on the Internet was
Archie
downloaded directory listings of
files on FTP sites
did not index the contents of
these sites
Soon after, many search engines
appeared
Excite, Infoseek, Northern Light,
AltaVista, Yahoo!, Google, MSN
Search
Company        Millions of searches   Relative market share
Google         28,454                 46.47%
Yahoo!         10,505                 17.16%
Baidu           8,428                 13.76%
Microsoft       7,880                 12.87%
NHN             2,882                  4.71%
eBay            2,428                  3.9%
Time Warner     1,062                  1.6%
Ask.com           728                  1.1%
Yandex            566                  0.9%
Alibaba.com       531                  0.8%
Total          61,221                100.0%
How Web Search Engine Work
A search engine operates in the
following order:
Web crawling
Indexing
Searching
Web Crawling
A web crawler
a program which browses the World Wide Web in a
methodical, automated manner
a means of providing up-to-date data
creates a copy of all the visited pages for later processing by a
search engine
starts with a list of URLs to visit, called the seeds.
As the crawler visits these URLs, it identifies all the
hyperlinks in the page and adds them to the list of URLs to
visit, called the crawl frontier.
URLs from the frontier are recursively visited according to a
set of policies.
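The seed-and-frontier loop described above can be sketched in Python. Here `fetch_links` is a hypothetical stand-in for downloading a page and extracting its hyperlinks; a real crawler would also apply politeness policies and robots.txt rules.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Sketch of a crawl loop over a frontier of URLs.

    `fetch_links(url)` is a hypothetical helper that returns the
    hyperlinks found on a page; it is not a real library call."""
    frontier = deque(seeds)        # URLs waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()   # FIFO order gives a breadth-first crawl
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)  # grow the crawl frontier
    return visited
```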
Robot Exclusion Protocol
also known as the robots.txt protocol
is a convention to prevent cooperating web robots from accessing all or part of a
website which is otherwise publicly viewable.
Example robots.txt:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml.gz
Crawl-delay: 10
Allow: /folder1/myfile.html
Request-rate: 1/5      # maximum rate is one page every 5 seconds
Visit-time: 0600-0845  # only visit between 06:00 and 08:45 UTC (GMT)
It relies on the cooperation of the web robot, so that marking an area of a site out of
bounds with robots.txt does not guarantee privacy.
The standard complements Sitemaps, a robot inclusion standard for websites.
SiteMap Protocol
allows a webmaster to inform search engines about URLs on a
website that are available for crawling.
A Sitemap is an XML file that lists the URLs for a site.
It allows webmasters to include additional information about
each URL:
when it was last updated,
how often it changes, and
how important it is in relation to other URLs in the site.
This allows search engines to crawl the site more intelligently.
Sitemaps are a URL inclusion protocol and complement robots.txt, an exclusion protocol
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
Indexing
The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search query
Search engine indexing collects, parses, and stores data to
facilitate fast and accurate information retrieval.
The contents of each page are analyzed to determine how it
should be indexed
Ex: words are extracted from the titles, headings, or special
fields called meta tags
Meta search engines reuse the indices of other services and
do not store a local index
Inverted Indices
inverted index stores a list of the documents containing
each word
search engine can use direct access to find the
documents associated with each word in the query to
retrieve the matching documents quickly
Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7
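The table above can be built and queried with a few lines of Python. This is a minimal in-memory sketch; real engines store postings on disk with positions, fields, and compression.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """AND-semantics lookup: the docs containing every query word,
    found by intersecting the posting sets (direct access per word)."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for w in words[1:]:
        result &= index.get(w, set())
    return result
```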
Searching
web search query
a query that a user enters into a web
search engine to satisfy his or her
information needs
is distinctive in that it is unstructured
and often ambiguous
varies greatly from standard query
languages, which are governed by strict
syntax rules
Web search engine architecture
(Figure: the crawling, indexing, and query pipeline, from “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page)
- URL list → fetched pages → compress + store in the repository
- Read the repository, uncompress + parse docs into hit lists
- Distribute the hit lists to barrels by docID (partially sorted forward index)
- Parse out links, resolve relative URLs to absolute URLs and docIDs, and store them in the anchors file
- Re-sort the barrels by wordID to build the inverted index (with the lexicon)
- Calculate PageRank (PR) of all docs
- Answer queries using the lexicon, the inverted index, and PR
Relevance and Ranking
Exactly how a particular search engine's algorithm works is a
closely-kept trade secret.
However, all major search engines follow the general rules
below.
Location, Location, Location...and Frequency
Location:
Search engines will also check to see if the search keywords appear
near the top of a web page, such as in the headline or in the first
few paragraphs of text. They assume that any page relevant to the
topic will mention those words right from the beginning.
Frequency:
A search engine will analyze how often keywords appear in relation
to other words in a web page. Those with a higher frequency are
often deemed more relevant than other web pages.
Precision and Recall
two widely used measures for evaluating the quality
of results in Information Retrieval
Precision
the fraction of the documents retrieved that are relevant to
the user's information need
Precision = (number of relevant documents retrieved by a search)
            / (total number of documents retrieved by that search)
Recall
the fraction of the documents relevant to the query that are
successfully retrieved
Recall = (number of relevant documents retrieved by a search)
         / (total number of existing relevant documents which
            should have been retrieved)
Often, there is an inverse relationship between
Precision and Recall
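Both measures reduce to simple set arithmetic over document IDs, as this short sketch shows:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant docs actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Note the inverse relationship: retrieving everything drives recall to 1 while precision collapses; retrieving one sure hit does the opposite.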
Relevance and Ranking
webmasters constantly rewrite their web
pages in an attempt to gain better
rankings.
Some sophisticated webmasters may
even "reverse engineer" the
location/frequency systems used by a
particular search engine
Because of this, all major search engines
now also make use of "off the page"
ranking criteria
Relevance and Ranking
Off the page factors:
those that webmasters cannot easily influence
Link analysis
The search engine analyzes how pages link to each other.
This helps determine what a page is about and whether
that page is "important" and thus deserving of a ranking
boost.
Click through measurement
The search engine watches which results users select for a
particular search;
it eventually drops high-ranking pages that aren't attracting
clicks,
and promotes lower-ranking pages that do pull in visitors.
Limitations in current web
search engines
Centralized search engines have limited scalability.
crawler based indices are stale and incomplete
Fundamental issue: How much of the web is
‘crawlable’
If you follow the rules, many sites say “robots get lost”
What about Dynamic content? (Deep Web)
The deep web is around 500 times larger than surface web.
These deep web resources mainly include data held by
databases which can be accessed only through queries.
Since crawlers discover resources only through links, they
cannot discover these resources.
There’s no guarantee that current search engines index or
even crawl the total surface web space
Limitations in current web
search engines
Single point of failure
Ambiguous ‘words’
Polysemy - one word with multiple meanings: “train car” vs.
“train a neural network”
Synonymy - multiple words with the same meaning: “the neural
network is trained as follows” vs. “the neural network learns as follows”
What about ‘phrases’ - searches are not ‘bag of words’
Positional information? Structural (throw out case & punctuation)?
Non-text content data worth storing
Most web search engines today crawl only surface web.
P2P Web Search
There has been an explosion of activity in the area of
peer-to-peer (P2P) systems in the last few years.
Since an increasing amount of content now resides in P2P
networks, it becomes necessary to provide search facilities
within P2P networks.
The significant computing resources provided by a P2P
system could also be used to implement search and data
mining functions for content located outside the system
e.g., for search and mining tasks across large intranets or
global enterprises, or even to build a P2P-based
alternative to the current major search engines.
P2P Web Search
The characteristics that distinguish P2P
systems from previous technologies:
low maintenance overhead
improved scalability
improved reliability
synergistic performance
increased autonomy and privacy
dynamism
P2P Web Search Engines
YouSearch
Coopeer
ODISSEA
YouSearch
YouSearch :
is a distributed search application for
personal webservers operating within a
shared context
Allows peers to aggregate into groups and
users to search over specific groups
Goal:
Provide fast, fresh and complete results to
users
YouSearch
System Overview
participants in YouSearch:
Peer-nodes
run YouSearch enabled
clients
Browsers
search YouSearch enabled
content through their web
browsers
Registrar
centralized light-weight
service that
acts like a “blackboard” on
which peer nodes store and
lookup (summarized)
network state.
YouSearch
System Overview
Search System:
Each peer node closely monitors its own content to
maintain a fresh local index
A bloom filter content summary is created by each peer
and pushed to the registrar.
When a browser issues a search query at a peer p , the
peer p first queries the summaries at the registrar to
obtain a set of peers R in the network that are hosting
relevant documents.
The peers in R are then directly contacted with the
query to obtain the URLs for the results.
To quickly satisfy any subsequently issued queries with
identical terms, the results from each query issued at a
peer p are cached for a limited time at p
YouSearch
Indexing
Indexing is periodically
executed at every peer node.
Inspector examines each
shared file for its last
modification date and time.
If the file is new or the file
has changed, the file is
passed to the Indexer.
The Indexer maintains a
disk-based inverted-index
over the shared content.
The name and path
information of the file are
indexed as well.
YouSearch
Indexing
Summarizer:
The Summarizer obtains a list of terms T from the Indexer
and creates a bloom filter from them in the following way.
A bit vector V of length L is created with each bit set to 0.
A specified hash function H with range {1,...,L} is used to
hash each term t in T and the bit at position H(t) in V is set
to 1
YouSearch uses k independent hash functions H1,H2,...,Hk
and constructs k different bloom filters, one for each hash
function
In YouSearch,
the length of each bloom filter is L = 64 Kbits and
the number of bloom filters k is set to 3
The Summary Manager at the registrar aggregates these Bloom
filters into a structure that maps each bit position to the set of
peers whose Bloom filters have the corresponding bit set
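A rough sketch of this registrar-side aggregation, assuming each peer's filter is a plain 0/1 bit list. The function names and the tiny filter length are illustrative, not the YouSearch implementation:

```python
from collections import defaultdict

L = 16  # filter length; YouSearch uses L = 64 Kbits, shortened for illustration

def aggregate(filters):
    """Map each set bit position to the set of peers whose filter
    has that bit set.  `filters` maps peer id -> 0/1 list of length L."""
    index = defaultdict(set)
    for peer, bits in filters.items():
        for pos, bit in enumerate(bits):
            if bit:
                index[pos].add(peer)
    return index

def candidate_peers(index, positions, all_peers):
    """Peers whose filters have every queried bit set: intersect the
    peer sets stored for each bit position of the hashed query term."""
    result = set(all_peers)
    for pos in positions:
        result &= index.get(pos, set())
    return result
```

As with any Bloom-filter lookup, the candidate set may contain false positives but never misses a peer that actually set those bits.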
YouSearch
Querying
1. The querying peer computes the hash of the query keywords.
2. It determines the corresponding bits of each of the k bloom filters.
3. It looks up the bit-position-to-IP-address mapping at the registrar
to obtain a list of candidate peers.
4. It contacts each of the peers in the list and obtains a list of URLs
for matching documents, which form the results.
YouSearch
Caching
Every time a global query is answered that
returns non-zero results, the querying peer
temporarily caches the result set of URLs U.
The peer then informs the registrar of this fact.
The registrar adds a mapping from the query to
the IP address of the caching peer in its cache
table.
YouSearch
Limitations:
False positive results: 17.38%
Central registrar >> single point of failure
No extensive phrase search
No attention has been given to result
ranking
No human user collaboration
Coopeer
Coopeer:
is a P2P web search engine where each
user's computer stores a part of the web
model used for indexing and retrieving
web resources in response to queries
Goal:
complement centralized search engines
to provide more humanized and
personalized results by utilizing users’
collaboration
Coopeer
(a)Collaboration
One may look for interesting web pages in the P2P knowledge
repository, which consists of shared web pages.
A novel collaborative filtering technique called PeerRank is
presented to rank pages in proportion to the votes from relevant
peers.
(b)Humanization
Coopeer uses a query-based representation for documents.
The relevant words are not directly extracted from page content
but introduced by human users with high proficiency in their
expertise domains.
(c)Personalization
Similar users are self-organized according to the semantic
content of their search sessions.
Thus, the requestor peer can extend routing paths along its
neighbors, rather than just take a blind shot.
User-customized results can be obtained along personal routing
paths, in contrast with centralized search engines (CSEs).
Coopeer
System Overview
The requestor forwards the query based on semantic routing.
Peers maintain a local index about the semantic content of
remote peers.
On receiving a query message from a remote peer, the current
peer checks it against its local store.
To facilitate this, a novel query-based representation of
documents is introduced.
Based on the query representation, the cosine similarity between
the new query and documents can be computed.
The documents are relevant enough if the similarity exceeds a
certain threshold.
These results are then returned to the requestor.
On receiving the returned results, the requestor peer ranks
them in terms of the preferences of its human owner using the
PeerRank method.
Coopeer
The Coopeer client consists of four main
software agents:
1. The User Agent
is responsible for interacting with the users.
It provides a friendly user interface, so that
users can conveniently manage and
manipulate whole search sessions.
2. The Web-searcher Agent
is the resource of the P2P knowledge repository.
It performs the user's individual searching
with several search engines from the Internet.
3. The Collaborator Agent
is the key component for performing users'
real-time collaborative searching.
It facilitates maintaining the P2P knowledge
repository: information sharing, searching,
and fusion.
4. The Manager Agent
is the key component of Coopeer, which
coordinates and manages the other types of
agents.
It is also responsible for updating and
maintaining data.
Coopeer
PeerRank
All the users are taken as a "Referrer Network".
Determines a page's relevance by examining a radiating
network of "referrers".
Documents with more referrers gain higher ranks.
Obtains a better rank order, as collaborative evaluation by
human users is much more precise than term frequency or
link counts.
Prevents spam, since it is difficult to forge evaluations
from human users.
Coopeer
PeerRank
For a given search session, first compute the similarity between the
requestor's favorite list and each referrer's; this similarity is the
baseline of the recommending degree of the referrer.
Firstly, as shown in equation (1), the similarity of the local list and
a recommended list is given by the Kendall measure.
Secondly, the rank of a given URL in its recommended list is converted
to a moderate score.

Notation:
- R(e): weight of URL e
- C(e): set constituted by e's referrers
- Z: constant > 1
- p: local peer; pi: a remote peer
- Lp, Lpi: lists of p and pi respectively
- K(r)(Lp, Lpi): Kendall function measuring the distance between the
  local list and the recommended list
- r: decay factor
- SLpi(e): score of e in the recommended list
- Re: rank of e
- RMax: highest rank of list pi (= the length of the list)
Coopeer
Kendall Measure
Kendall is used to measure the distance between two lists of the
same length.
The paper extends it to measure two lists of different lengths.

Notation for the Kendall function:
- τ1, τ2: two lists composed of URLs
- K(r)(τ1, τ2): the distance between τ1 and τ2
- r: fixed parameter with 0 ≤ r ≤ 1
- C(2L, 2): the possible maximum of the distance, used for
  normalization
- U(τ1, τ2): set consisting of all the URLs in τ1 and τ2
- K'(r)i,j(τ1, τ2): the penalty of the URL pair (i, j)
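As a point of reference, the classic same-length Kendall tau distance can be computed as below. The paper's extension to lists of different lengths and its decay factor r are not reproduced here; this sketch only shows the pairwise-disagreement idea the measure is built on:

```python
from itertools import combinations

def kendall_distance(list1, list2):
    """Unnormalized Kendall tau distance between two rankings of the
    same items: the number of item pairs ordered differently."""
    pos1 = {item: i for i, item in enumerate(list1)}
    pos2 = {item: i for i, item in enumerate(list2)}
    discordant = 0
    for a, b in combinations(list1, 2):
        # A pair is discordant when the two lists order it oppositely.
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            discordant += 1
    return discordant

def normalized_kendall(list1, list2):
    """Divide by the maximum possible distance, C(n, 2) pairs."""
    n = len(list1)
    max_pairs = n * (n - 1) // 2
    return kendall_distance(list1, list2) / max_pairs if max_pairs else 0.0
```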
Coopeer
Query Based Representation
A novel type of representation based on the
relevant words introduced by human users
with high proficiency in their expertise
domains.
It is efficient on the P2P platform, as the user's
evaluation can be utilized easily through the
client application.
It represents and organizes the local documents
for responding to remote queries.
Coopeer
Each peer maintains:
an inverted index table that
represents local documents for responding to remote
queries
and stores the IDs of the documents that were returned
in reply to each query;
the keys of the inverted index are terms extracted from
previous queries.
Ex: peer j issues two queries, "P2P
Overlay" and "P2P Routing", and obtains two sets of
documents, {d1, d2, d3} and {d3, d4} respectively.
The retrieved documents are updated with
their corresponding query terms.
When any other peer issues a query for
"Overlay Routing Algorithm", peer j looks up
relevant documents in the inverted index using
VSM cosine similarity as the ranking algorithm, and
d3 gains the highest ranking.
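The VSM cosine-similarity matching in the example can be sketched over bags of words. This is illustrative only; Coopeer's exact term weighting is not specified here:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """VSM cosine similarity between two whitespace-tokenized texts,
    using raw term counts as vector components."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

With the example above, d3 (tagged by both "P2P Overlay" and "P2P Routing") scores higher against "Overlay Routing Algorithm" than a document tagged by only one of the queries.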
Coopeer
Semantic Routing Algorithm
Each Coopeer client maintains a local
Topic Neighbor Index.
The index records the observed performance
of remote peers which have topics similar
to the local peer's.
These search sessions' queries are used
to represent the peers' semantic content.
Session 1 >> is the local peer, which has two topics (queries);
the other sessions denote remote peers the local peer is interested
in, in some aspect.
Sessions 2 and 3 are relevant to the "P2P Routing" topic of the local
peer, while the others are about "Pattern Recognition".
The peers on the same topic are kept in descending order of rate.
The peers providing more interesting resources move to the top of an
individual's local index.
Coopeer
with query-based inverted
index, the precision of
matching results of
different subjects was
almost 100%
system uses information
coming from centralized
search engines, so the
system is not aimed to
replace CSEs, but to
complement them.
Coopeer
Query-based representation is efficient in P2P
because the user's evaluation can be utilized easily
through the client application.
It is inefficient in CSEs, because gaining user
evaluations through a web browser is inefficient, and it is
impractical to store and index documents for every user's
query.
Prevents spam, since it is difficult to forge
evaluations from human users.
Uses human searching experience >> better
results
ODISSEA
A distributed global indexing and query execution
service
Maintains a global index structure under document
insertions and updates and node joins and failures
the inverted index for a particular term (word) is located at
a single node, or partitioned over a small number of nodes
in some hybrid organizations.
Assumes a two-tier architecture.
The system is implemented on top of an underlying
global address space provided by a DHT structure
ODISSEA
System provide the lower tier of the two tier
architecture.
In the upper tier, there are two classes of clients
that interact with this P2P-based lower tier:
Update clients
insert new or updated documents into the system, which
stores and indexes them.
An update client could be a crawler inserting crawled pages, a
web server pushing documents into the index, or a node in a
file sharing system.
Query clients
design optimized query execution plans, based on statistics
about term frequencies and correlations, and issue them to
the lower tier.
ODISSEA
Global Index
An inverted index for a document collection is a
data structure that contains for each word in the
collection a list of all its occurrences, or a list of
postings.
Each posting contains the document ID of the
occurrence of the word, its position inside the
document, and other information (in title? bold
face?)
each node holds a complete global postings list
for a subset of the words, as determined by a
hash function.
ODISSEA
Query Processing
a ranking function is a function F that, given a
query consisting of a set of search terms
q0,q1,…,qm-1 , assigns to each document d a
score F(d, q0,q1,…,qm-1) . The top- k ranking
problem is then the problem of identifying the k
documents in the collection with the highest
scores.
ODISSEA
We focus on two families of ranking functions,
The first family includes the common families of
term-based ranking functions used in IR, where we
add up the scores of each document with respect to
all words in the queries.
The second formula adds a query-independent value
g(d) to the score of each page;
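The two families described above can be written out as follows (a sketch consistent with the slide's description; the exact weighting in the paper may differ):

```latex
% Term-based family: sum the per-term scores of each document
F(d, q_0, \ldots, q_{m-1}) \;=\; \sum_{i=0}^{m-1} f(d, q_i)

% Second family: add a query-independent value g(d), e.g. PageRank
F'(d, q_0, \ldots, q_{m-1}) \;=\; g(d) \;+\; \sum_{i=0}^{m-1} f(d, q_i)
```

Here f(d, qi) is the term-based score of document d with respect to term qi, and g(d) is the query-independent value added to each page's score.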
ODISSEA
Fagin’s Algorithm
Consider the inverted lists for a search query with two
terms q0 and q1 .
Assume they are located on the same machine, and that
the postings in each list are pairs (d, f(d,qi)), i ∈ {0,1}, where
d is an integer identifying the document and f(d,qi) is
real-valued.
Assume each inverted list is sorted by the second
attribute, so that documents with the largest f(d,qi) are at
the start of the list.
Then the following algorithm, called FA, computes the
top-k results:
ODISSEA
FA:
(1) Scan both lists from the beginning, by
reading one element from each list in every step,
until there are k documents that have each been
encountered in both of the lists.
(2) Compute the scores of these documents.
Also, for each document that was encountered
in only one of the lists, perform a lookup into
the other list to determine the score of the
document. Return the k documents with the
highest scores.
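The two phases above can be sketched in Python for two lists, using dict lookups as the random-access step of phase (2). The data layout is illustrative, not the ODISSEA implementation:

```python
import heapq

def fagin_top_k(list0, list1, k):
    """Fagin's Algorithm (FA) for two inverted lists.

    Each list holds (doc_id, score) pairs sorted by score descending.
    Returns the k documents with the highest summed score."""
    seen0, seen1 = {}, {}
    both = set()
    i = 0
    # Phase 1: scan both lists in parallel until k docs appear in both.
    while len(both) < k and i < max(len(list0), len(list1)):
        if i < len(list0):
            d, s = list0[i]
            seen0[d] = s
            if d in seen1:
                both.add(d)
        if i < len(list1):
            d, s = list1[i]
            seen1[d] = s
            if d in seen0:
                both.add(d)
        i += 1
    # Phase 2: score every document seen so far; the dict lookups below
    # stand in for random access into the other list.
    full0, full1 = dict(list0), dict(list1)
    candidates = set(seen0) | set(seen1)
    scores = {d: full0.get(d, 0.0) + full1.get(d, 0.0) for d in candidates}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

Because the lists are score-sorted, any document never seen in phase 1 cannot outscore the candidates, which is why FA can stop scanning early.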
Conclusion
Still no P2P web search engine has
outperformed Google!
(+) Lots of resources for complex data mining
tasks and for crawling the whole surface web
(+)Emergence of semantic communities also
has a positive impact on p2p web search
performance
(-)lack of global knowledge
(-)smart crawling strategies beyond BFS are
hard to implement in a P2P environment
without a centralized scheduler.
Some Open Problems
how to uniformly sample web pages on a web
site if one does not have an exhaustive list of
these pages?
Bar-Yosseff converted the web graph into an
undirected, connected, and regular graph.
The equilibrium of a random walk on this graph
is the uniform distribution.
It is not clear how many steps such a walk
needs to perform.
A more significant problem, however, is that
there is no reliable way of converting the web
graph into an undirected graph.
Some Open Problems
Data Streams
The query logs of a web search engine contain all the queries
issued at this search engine.
The most frequent queries change only slowly over time.
However, the queries with the largest increase or decrease
from one time period over the next show interesting trends in
user interests. We call them the top gainers and losers.
Since the number of queries is huge, the top gainers and
losers need to be computed by making only one pass over the
query logs.
This leads to the following data stream problem:
Another interesting variant is to find all items above a certain
frequency whose relative increase (i.e., their increase divided
by their frequency in the first sequence) is the largest.
References
S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual
Web Search Engine," Computer Networks and ISDN Systems, 30(1-7),
1998.
"Make It Fresh, Make It Quick: Searching a Network of Personal
Webservers," International World Wide Web Conference,
Budapest, Hungary, 2003.
J. Zhou, K. Li, and L. Tang, "Towards a Fully Distributed P2P Web
Search Engine," Proceedings of the 10th IEEE International Workshop
on Future Trends of Distributed Computing Systems, 2004.
T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long,
and K. Shanmugasunderam, "ODISSEA: A Peer-to-Peer Architecture for
Scalable Web Search and Information Retrieval," 2003.
B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable
Errors," Communications of the ACM, 13(7), pp. 422-426, 1970.
www.en.wikipedia.org
Extra Slides
Bloom Filters
a space-efficient probabilistic data
structure that is used to test whether
an element is a member of a set.
False positives are possible, but false
negatives are not.
The more elements that are added to
the set, the larger the probability of
false positives.
Bloom Filters
An empty Bloom filter is a bit array of m bits, all set to 0.
There must also be k different hash functions defined, each
of which maps a key value to one of the m array positions.
To add an element, feed it to each of the k hash functions
to get k array positions. Set the bits at all these positions
to 1.
To query for an element (test whether it is in the set), feed
it to each of the k hash functions to get k array positions.
If any of the bits at these positions are 0, the element is
not in the set – if it were, then all the bits would have been
set to 1 when it was inserted.
If all are 1, then either the element is in the set, or the bits
have been set to 1 during the insertion of other elements.
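The add/query procedure above can be sketched as a small Python class. Deriving the k hash functions from salted SHA-256 is an illustrative choice, not what any particular system uses:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array with k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m      # empty filter: all bits 0

    def _positions(self, item):
        # Derive k positions by salting the hash with the function index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit proves absence; all 1s means "probably present"
        # (false positives possible, false negatives impossible).
        return all(self.bits[pos] for pos in self._positions(item))
```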
Bloom Filters
An example of a Bloom filter, representing the set {x, y, z}. The colored arrows
show the positions in the bit array that each set element is mapped to. The
element w, not in the set, is detected as a nonmember as it is mapped to a
position containing a 0.
Bloom Filters
Hash("Uncle John's Band") = {0, 3, 7}
Hash("Box of Rain") = {1, 3, 8}

Bit array of width (w) = 9 after inserting both items:

index: 0 1 2 3 4 5 6 7 8
bits:  1 1 0 1 0 0 0 1 1