WEB SEARCH and P2P
Advisor: Dr. Sushil Prasad
Presented by: DM Rasanjalee Himali

OUTLINE
- Introduction to web search engines
  - What is a web search engine?
  - Web search engine architecture
  - How does a web search engine work?
  - Relevance and ranking
- Limitations of current web search engines
- P2P web search engines
  - YouSearch
  - Coopeer
  - ODISSEA
- Conclusion

What is a web search engine?
- A web search engine is a search engine designed to search for information on the World Wide Web.
- The information may consist of web pages, images, and other types of files.
- Some search engines also mine data available in newsgroups, databases, or open directories.

History
- Before there were search engines, there was a complete list of all web servers.
- The very first tool used for searching on the Internet was Archie. It downloaded directory listings of files on FTP sites, but did not index the contents of those sites.
- Soon after, many search engines appeared: Excite, Infoseek, Northern Light, AltaVista, Yahoo!, Google, MSN Search.

Company        Millions of searches   Relative market share
Google               28,454                 46.47%
Yahoo!               10,505                 17.16%
Baidu                 8,428                 13.76%
Microsoft             7,880                 12.87%
NHN                   2,882                  4.71%
eBay                  2,428                  3.9%
Time Warner           1,062                  1.6%
Ask.com                 728                  1.1%
Yandex                  566                  0.9%
Alibaba.com             531                  0.8%
Total                61,221                100.0%

How a Web Search Engine Works
A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web Crawling
- A web crawler is a program that browses the World Wide Web in a methodical, automated manner.
- It is a means of providing up-to-date data: it creates a copy of all the visited pages for later processing by a search engine.
- A crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
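The seed/frontier loop described above can be sketched in a few lines of Python. Here `get_links` stands in for the fetch-and-parse step and is a hypothetical callback of this sketch, not part of any real crawler API:

```python
from collections import deque

def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl: start from the seed URLs and repeatedly
    expand the frontier with newly discovered hyperlinks."""
    frontier = deque(seeds)          # the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)             # "fetch" the page
        for link in get_links(url):  # hyperlinks found on the page
            if link not in visited:
                frontier.append(link)
    return visited
```

A real crawler would of course fetch pages over HTTP, obey politeness policies such as robots.txt and crawl delays, and normalize URLs before enqueueing them.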
Robot Exclusion Protocol
- Also known as the robots.txt protocol, this is a convention to prevent cooperating web robots from accessing all or part of a website that is otherwise publicly viewable.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml.gz
Crawl-delay: 10
Allow: /folder1/myfile.html
Request-rate: 1/5      # maximum rate is one page every 5 seconds
Visit-time: 0600-0845  # only visit between 06:00 and 08:45 UTC (GMT)

- The protocol relies on the cooperation of the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee privacy.
- The standard complements Sitemaps, a robot inclusion standard for websites.

Sitemap Protocol
- Allows a webmaster to inform search engines about URLs on a website that are available for crawling.
- A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is relative to the other URLs on the site. This allows search engines to crawl the site more intelligently.
- Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Indexing
- The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
- Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
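Python's standard library ships a parser for the robots.txt convention shown earlier; a small sketch using `urllib.robotparser` (the `/index.html` path is just an illustrative URL of this sketch):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes the raw lines, no HTTP needed

print(rp.can_fetch("*", "http://www.example.com/cgi-bin/query"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))     # True
print(rp.crawl_delay("*"))                                        # 10
```

A cooperating crawler would call `can_fetch` before every request and sleep for `crawl_delay` seconds between requests to the same host.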
- The contents of each page are analyzed to determine how it should be indexed. For example, words are extracted from the titles, headings, or special fields called meta tags.
- Meta search engines reuse the indices of other services and do not store a local index.

Inverted Indices
- An inverted index stores, for each word, a list of the documents containing that word.
- The search engine can use direct access to find the documents associated with each word in the query, and so retrieve the matching documents quickly.

Word   Documents
the    Document 1, Document 3, Document 4, Document 5
cow    Document 2, Document 3, Document 4
says   Document 5
moo    Document 7

Searching
- A web search query is a query that a user enters into a web search engine to satisfy his or her information needs.
- It is distinctive in that it is unstructured and often ambiguous, differing greatly from standard query languages, which are governed by strict syntax rules.

Web search engine architecture
The architecture diagram (from "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Sergey Brin and Lawrence Page) shows the main data flow:
- A URL list drives the crawlers; fetched pages are compressed and stored in a repository.
- The indexer reads the repository, uncompresses and parses each document into hit lists, and distributes the hits into "barrels" partitioned by docID. It also parses out links (resolving relative URLs to absolute URLs and docIDs) and stores them in an anchors file.
- The sorter resorts the barrels by wordID, turning the partially sorted forward index into the inverted index, alongside the lexicon.
- The PageRank (PR) of all documents is calculated from the link structure.
- The searcher answers queries using the inverted index, the lexicon, and PageRank.

Relevance and Ranking
- Exactly how a particular search engine's algorithm works is a closely kept trade secret. However, all major search engines follow the general rules below.

Location, Location, Location... and Frequency
- Location: search engines check whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
- Frequency: a search engine analyzes how often keywords appear in relation to other words in a web page.
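The word-to-documents table shown above is straightforward to build and query. A minimal sketch, assuming whitespace tokenization and AND semantics for multi-word queries:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text; returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())   # intersect postings lists
    return result
```

The intersection of per-word postings sets is the "direct access" step: only documents containing every query term survive.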
- Pages with a higher keyword frequency are often deemed more relevant than other web pages.

Precision and Recall
Two widely used measures for evaluating the quality of results in information retrieval:

- Precision: the fraction of the retrieved documents that are relevant to the user's information need.

  Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

- Recall: the fraction of the relevant documents that are successfully retrieved.

  Recall = (number of relevant documents retrieved) / (total number of existing relevant documents that should have been retrieved)

- Often there is an inverse relationship between precision and recall.

Relevance and Ranking
- Webmasters constantly rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters may even "reverse engineer" the location/frequency systems used by a particular search engine.
- Because of this, all major search engines now also make use of "off the page" ranking criteria.

Off-the-page factors: those that webmasters cannot easily influence.
- Link analysis: the search engine analyzes how pages link to each other. This helps determine what a page is about and whether that page is "important" and thus deserving of a ranking boost.
- Click-through measurement: a search engine watches which results users select for a particular search; eventually it drops high-ranking pages that aren't attracting clicks and promotes lower-ranking pages that do pull in visitors.

Limitations of current web search engines
- Centralized search engines have limited scalability.
- Crawler-based indices are stale and incomplete.
- Fundamental issue: how much of the web is "crawlable"? If you follow the rules, many sites say "robots get lost".
- What about dynamic content (the Deep Web)? The deep web is around 500 times larger than the surface web.
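The precision and recall fractions above translate directly into code:

```python
def precision_recall(retrieved, relevant):
    """Both arguments are collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant          # relevant documents retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving {1, 2, 3, 4} when {2, 4, 5} are relevant gives precision 0.5 and recall 2/3; widening the result set tends to raise recall while lowering precision, which is the inverse relationship mentioned above.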
- These deep web resources mainly consist of data held in databases that can be accessed only through queries. Since crawlers discover resources only through links, they cannot discover these resources.
- There is no guarantee that current search engines index, or even crawl, the whole surface web.

Limitations of current web search engines (continued)
- Single point of failure.
- Ambiguous words:
  - Polysemy: words with multiple meanings, e.g. "train car" vs. "train neural network".
  - Synonymy: multiple words with the same meaning, e.g. "neural network is trained as follows" vs. "neural network learns as follows".
- What about phrases? Searches are not just "bags of words". What about positional information? What about structure (case and punctuation are thrown out)?
- Non-text content may also be data worth storing; most web search engines today crawl only the surface web.

P2P Web Search
- The last few years have seen an explosion of activity in the area of peer-to-peer (P2P) systems.
- Since an increasing amount of content now resides in P2P networks, it becomes necessary to provide search facilities within P2P networks.
- The significant computing resources provided by a P2P system could also be used to implement search and data mining functions for content located outside the system, e.g., for search and mining tasks across large intranets or global enterprises, or even to build a P2P-based alternative to the current major search engines.
P2P Web Search
Characteristics that distinguish P2P systems from previous technologies:
- Low maintenance overhead
- Improved scalability
- Improved reliability
- Synergistic performance
- Increased autonomy and privacy
- Dynamism

P2P Web Search Engines
- YouSearch
- Coopeer
- ODISSEA

YouSearch
- YouSearch is a distributed search application for personal web servers operating within a shared context.
- It allows peers to aggregate into groups, and users to search over specific groups.
- Goal: provide fast, fresh, and complete results to users.

YouSearch System Overview
Participants in YouSearch:
- Peer nodes: run YouSearch-enabled clients.
- Browsers: search YouSearch-enabled content through their web browsers.
- Registrar: a centralized, lightweight service that acts like a "blackboard" on which peer nodes store and look up (summarized) network state.

Search system:
- Each peer node closely monitors its own content to maintain a fresh local index.
- A Bloom filter content summary is created by each peer and pushed to the registrar.
- When a browser issues a search query at a peer p, the peer p first queries the summaries at the registrar to obtain a set of peers R in the network that are hosting relevant documents.
- The peers in R are then directly contacted with the query to obtain the URLs for the results.
- To quickly satisfy subsequent queries with identical terms, the results of each query issued at a peer p are cached for a limited time at p.

YouSearch Indexing
- Indexing is periodically executed at every peer node.
- The Inspector examines each shared file for its last modification date and time. If the file is new or has changed, it is passed to the Indexer.
- The Indexer maintains a disk-based inverted index over the shared content. The name and path information of each file are indexed as well.
- Summarizer: the Summarizer obtains a list of terms T from the Indexer and creates a Bloom filter from them in the following way.
- A bit vector V of length L is created with each bit set to 0. A specified hash function H with range {1, ..., L} is used to hash each term t in T, and the bit at position H(t) in V is set to 1.
- YouSearch uses k independent hash functions H1, H2, ..., Hk and constructs k different Bloom filters, one for each hash function.
- In YouSearch, the length of each Bloom filter is L = 64 Kbits, and the number of Bloom filters k is set to 3.
- The Summary Manager at the registrar aggregates these Bloom filters into a structure that maps each bit position to the set of peers whose Bloom filters have the corresponding bit set.

YouSearch Querying
- The querying peer computes the hashes of the query keywords to determine the corresponding bits of each of the k Bloom filters.
- It looks up the registrar's bit-position-to-IP-address mapping to find the peers whose filters have all those bits set.
- It then contacts each of the peers in the resulting list and obtains a list of URLs for matching documents.

YouSearch Caching
- Every time a global query returns non-zero results, the querying peer caches the result set of URLs U (temporarily).
- The peer then informs the registrar of this fact, and the registrar adds a mapping from the query to the IP address of the caching peer in its cache table.

YouSearch Limitations
- False positive results: 17.38%
- Central registrar >> single point of failure
- No extensive phrase search
- No attention has been given to query ranking
- No human user collaboration

Coopeer
- Coopeer is a P2P web search engine in which each user's computer stores a part of the web model used for indexing and retrieving web resources in response to queries.
- Goal: complement centralized search engines to provide more humanized and personalized results by utilizing users' collaboration.

(a) Collaboration
- One may look for interesting web pages in the P2P knowledge repository consisting of shared web pages.
- A novel collaborative filtering technique called PeerRank is presented to rank pages proportionally to the votes from relevant peers.

(b) Humanization
- Coopeer uses a query-based representation for documents. The relevant words are not directly extracted from page content but are introduced by human users with high proficiency in their domains of expertise.

(c) Personalization
- Similar users are self-organized according to the semantic content of their search sessions. Thus, a requestor peer can extend routing paths along its neighbors rather than taking a blind shot. User-customized results can be obtained along personal routing paths, in contrast with centralized search engines (CSEs).

Coopeer System Overview
- The requestor forwards the query using semantic routing. Peers maintain a local index of the semantic content of remote peers.
- On receiving a query message from a remote peer, the current peer checks it against its local store. To facilitate this, a novel query-based representation of documents is introduced. Based on this representation, the cosine similarity between the new query and the documents can be computed; a document is considered relevant enough if the similarity exceeds a certain threshold. The matching results are returned to the requestor.
- On receiving the returned results, the requestor peer ranks them in terms of the preferences of its human owner, using the PeerRank method.

The Coopeer client consists of four main software agents:
1. The User Agent is responsible for interacting with the users. It provides a friendly user interface so that users can conveniently manage and manipulate whole search sessions.
2. The Web-searcher Agent is the resource of the P2P knowledge repository. It performs the user's individual searching with several search engines from the Internet.
3. The Collaborator Agent is the key component for performing users' real-time collaborative searching. It facilitates maintaining the P2P knowledge repository, including information sharing, searching, and fusion.
4. The Manager Agent coordinates and manages the other types of agents. It is also responsible for updating and maintaining data.

Coopeer PeerRank
- All the users are taken together as a "referrer network". PeerRank determines a page's relevance by examining a radiating network of "referrers"; documents with more referrers gain higher ranks.
- It obtains a better rank order, as collaborative evaluation by human users is much more precise than descriptions based on term frequency or link counts.
- It helps prevent spam, since it is difficult to fake evaluations from human users.

For a given search session, we first compute the similarity between the requestor's favorite list and each referrer's list; this similarity is used as the baseline recommending degree of the referrer. First, as shown in equation (1) of the paper, the similarity of the local list and a recommended list is given by the Kendall measure. Second, the rank of a given URL in its recommended list is converted to a moderate score. The symbols used:
- R(e): the weight (score) of URL e
- C(e): the set of e's referrers
- Z: a constant > 1
- p: the local peer; Pi: a remote peer
- Lp, LPi: the lists of p and Pi, respectively
- K(r)(Lp, LPi): the Kendall function measuring the distance between the local list and the recommended list; r: a decay factor
- SLPi(e): the score of e in the recommended list
- Re: the rank of e; RMax: the highest rank of Pi's list, equal to the length of the list

Coopeer Kendall Measure
- The Kendall measure is used to measure the distance between two lists of the same length; the paper extends it to measure two lists of different lengths.
- In the Kendall function, τ1 and τ2 are two lists composed of URLs, and Kr(τ1, τ2) is the distance between τ1 and τ2, where r is a fixed parameter with 0 ≤ r ≤ 1.
- The binomial coefficient C(2L, 2), the maximum possible value of the distance, is used for normalization.
- U(τ1, τ2) is the set consisting of all the URLs in τ1 and τ2, and K'r,i,j(τ1, τ2) is the penalty assigned to the URL pair (i, j).

Coopeer Query-Based Representation
- A novel type of representation based on relevant words introduced by human users with high proficiency in their domains of expertise.
- It is efficient on the P2P platform, as the user's evaluation can be utilized easily through the client application.
- It represents and organizes the local documents for responding to remote queries.

Each peer maintains an inverted index table that:
- represents local documents for responding to remote queries, storing the IDs of the documents that were returned for each query;
- uses as keys the terms extracted from previous queries.

Example: peer j issues two queries, "P2P Overlay" and "P2P Routing", and obtains two sets of documents, {d1, d2, d3} and {d3, d4}, respectively. The retrieved documents are updated with their corresponding query terms. When another peer issues a query for "Overlay Routing Algorithm", peer j looks up relevant documents in the inverted index using VSM cosine similarity as the ranking algorithm, and d3 gains the highest ranking.

Coopeer Semantic Routing Algorithm
- Each Coopeer client maintains a local Topic Neighbor Index. The index records the observed performance of remote peers that have topics similar to the local peer's.
- The queries of these search sessions are used to represent the peers' semantic content. For example, session 1 is the local peer, which has two topics (queries); the other sessions denote remote peers the local peer is interested in, in some respect. Sessions 2 and 3 are relevant to the "P2P Routing" topic of the local peer, while the others are about "Pattern Recognition".
- The peers on a given topic are kept in descending order of their rating.
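The peer-j example above can be worked through directly. In this sketch the vector-space model (VSM) weights are plain term counts, which is an assumption of the sketch rather than something the slides specify:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Query-based representation: each answered query credits its terms
# to the documents that were returned for it.
history = [("p2p overlay", ["d1", "d2", "d3"]),
           ("p2p routing", ["d3", "d4"])]
doc_vectors = defaultdict(lambda: defaultdict(int))
for query, docs in history:
    for term in query.split():
        for d in docs:
            doc_vectors[d][term] += 1

new_query = {"overlay": 1, "routing": 1, "algorithm": 1}
ranking = sorted(doc_vectors,
                 key=lambda d: cosine(doc_vectors[d], new_query),
                 reverse=True)
print(ranking[0])  # d3 — it matches both "overlay" and "routing"
```

Only d3 was returned for both earlier queries, so its vector overlaps the new query on two terms and it ranks first, as in the slides.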
- The peers providing more relevant resources move to the top of an individual's local index.

Coopeer Results
- With the query-based inverted index, the precision of matching results across different subjects was almost 100%.
- The system uses information coming from centralized search engines, so it is not aimed at replacing CSEs, but at complementing them.

Coopeer Summary
- The query-based representation is efficient in P2P because the user's evaluation can be utilized easily through the client application. It is inefficient in CSEs because gaining user evaluations through a web browser is impractical, as is storing and indexing documents for every user query.
- It helps prevent spam, since it is difficult to fake evaluations from human users.
- It uses human searching experience to obtain better results.

ODISSEA
- A distributed global indexing and query execution service.
- It maintains a global index structure under document insertions and updates, and under node joins and failures.
- The inverted index for a particular term (word) is located at a single node, or partitioned over a small number of nodes in some hybrid organizations.
- ODISSEA assumes a two-tier architecture and is implemented on top of an underlying global address space provided by a DHT structure.

ODISSEA System
The system provides the lower tier of the two-tier architecture. In the upper tier, two classes of clients interact with this P2P-based lower tier:
- Update clients insert new or updated documents into the system, which stores and indexes them. An update client could be a crawler inserting crawled pages, a web server pushing documents into the index, or a node in a file-sharing system.
- Query clients design optimized query execution plans, based on statistics about term frequencies and correlations, and issue them to the lower tier.

ODISSEA Global Index
- An inverted index for a document collection is a data structure that contains, for each word in the collection, a list of all its occurrences, or a list of postings.
- Each posting contains the document ID of the occurrence of the word, its position inside the document, and other information (e.g., whether the occurrence is in the title or in bold face).
- Each node holds a complete global postings list for a subset of the words, as determined by a hash function.

ODISSEA Query Processing
- A ranking function is a function F that, given a query consisting of a set of search terms q0, q1, ..., qm-1, assigns to each document d a score F(d, q0, q1, ..., qm-1).
- The top-k ranking problem is then the problem of identifying the k documents in the collection with the highest scores.
- We focus on two families of ranking functions. The first includes the common term-based ranking functions used in IR, where we add up the scores of each document with respect to all words in the query. The second adds a query-independent value g(d) to the score of each page.

ODISSEA: Fagin's Algorithm
- Consider the inverted lists for a search query with two terms q0 and q1. Assume they are located on the same machine, and that the postings in each list are pairs (d, f(d, qi)), i ∈ {0, 1}, where d is an integer identifying the document and f(d, qi) is real-valued.
- Assume each inverted list is sorted by the second attribute, so that the documents with the largest f(d, qi) are at the start of the list. Then the following algorithm, called FA, computes the top-k results:
  (1) Scan both lists from the beginning, reading one element from each list in every step, until there are k documents that have each been encountered in both lists.
  (2) Compute the scores of these documents. Also, for each document that was encountered in only one of the lists, perform a lookup into the other list to determine its score. Return the k documents with the highest scores.

Conclusion
Still no P2P web search engine has outperformed Google!
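The two FA steps above can be sketched as follows, assuming the additive ranking function from the first family. Lists are `(doc, score)` pairs sorted by descending score, and the dict lookups play the role of the random accesses in step 2:

```python
def fagin_topk(list0, list1, k):
    """Fagin's algorithm (FA) for the top-k problem over two
    score-sorted inverted lists, with an additive ranking function."""
    lookup0, lookup1 = dict(list0), dict(list1)
    seen0, seen1, in_both = set(), set(), set()
    depth = 0
    # Step 1: sorted access in lockstep until k docs appear in both lists.
    while len(in_both) < k and depth < max(len(list0), len(list1)):
        if depth < len(list0):
            d = list0[depth][0]
            seen0.add(d)
            if d in seen1:
                in_both.add(d)
        if depth < len(list1):
            d = list1[depth][0]
            seen1.add(d)
            if d in seen0:
                in_both.add(d)
        depth += 1
    # Step 2: score every doc seen so far; a doc absent from a list
    # contributes 0 for that term.
    candidates = seen0 | seen1
    scores = {d: lookup0.get(d, 0.0) + lookup1.get(d, 0.0)
              for d in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Because both lists are score-sorted, no document outside `candidates` can beat the k documents already seen in both lists, which is why the scan can stop early.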
- (+) P2P offers a lot of resources for complex data mining tasks and for crawling the whole surface web.
- (+) The emergence of semantic communities also has a positive impact on P2P web search performance.
- (-) Lack of global knowledge.
- (-) Smart crawling strategies beyond BFS are hard to implement in a P2P environment without a centralized scheduler.

Some Open Problems
- How can one uniformly sample the web pages on a web site without an exhaustive list of those pages? Bar-Yossef converted the web graph into an undirected, connected, and regular graph; the equilibrium distribution of a random walk on such a graph is the uniform distribution. It is not clear how many steps such a walk needs to perform. A more significant problem, however, is that there is no reliable way of converting the web graph into an undirected graph.

Some Open Problems: Data Streams
- The query logs of a web search engine contain all the queries issued at that search engine. The most frequent queries change only slowly over time. However, the queries with the largest increase or decrease from one time period to the next show interesting trends in user interests. We call them the top gainers and losers.
- Since the number of queries is huge, the top gainers and losers need to be computed by making only one pass over the query logs. This leads to the following data stream problem: given the query frequencies in two consecutive time periods, find the items whose counts changed the most, using a single pass and limited memory.
- Another interesting variant is to find all items above a certain frequency whose relative increase (i.e., their increase divided by their frequency in the first sequence) is the largest.
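One standard one-pass tool for this kind of problem is the Misra-Gries heavy-hitters summary; the slides do not name an algorithm, so this choice is an assumption of the sketch. It keeps at most k-1 counters, yet guarantees that any item occurring more than n/k times in a stream of length n survives in the summary:

```python
def misra_gries(stream, k):
    """One pass, at most k-1 counters; every item with true frequency
    above n/k is guaranteed to remain in the returned summary."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Stream item has no counter and the table is full:
            # decrement everything and drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Running one pass over each period's query log and comparing the surviving counts gives an approximate set of top gainers and losers without storing the full logs.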
References
- S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, 30(1-7), 1998.
- "Make It Fresh, Make It Quick: Searching a Network of Personal Webservers," International World Wide Web Conference, Budapest, Hungary, 2003.
- J. Zhou, K. Li, and L. Tang, "Towards a Fully Distributed P2P Web Search Engine," Proceedings of the 10th IEEE International Workshop on Future Trends of Distributed Computing Systems, 2004.
- T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasunderam, "ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval," 2003.
- B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, 13(7), pp. 422-426, 1970.
- en.wikipedia.org

Extra Slides

Bloom Filters
- A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
- False positives are possible, but false negatives are not. The more elements are added to the set, the larger the probability of false positives.
- An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.
- To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at all these positions to 1.
- To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is not in the set; if it were, all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits were set to 1 during the insertion of other elements.

Example: a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array to which each set element is mapped.
- The element w, which is not in the set, is detected as a non-member because it is mapped to a position containing a 0.

Bloom Filter Example
- Hash("Uncle John's Band") = {0, 3, 7}
- Hash("Box of Rain") = {1, 3, 8}
- After inserting both items into a bit array of width w = 9, the bits at positions {0, 1, 3, 7, 8} are set:

  index: 0 1 2 3 4 5 6 7 8
  bit:   1 1 0 1 0 0 0 1 1
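The add/query procedure from the extra slides fits in a small class. Deriving the k hash functions by salting SHA-256 with the function index is this sketch's own choice, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m           # bit array of m bits, all 0

    def _positions(self, item):
        # k hash functions simulated by salting one hash with the index
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
                % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1        # set all k positions to 1

    def __contains__(self, item):
        # any 0 bit proves absence; all 1s means "probably present"
        return all(self.bits[pos] for pos in self._positions(item))
```

As the slides note, false positives are possible (other insertions may have set all k bits) but false negatives are not: an inserted item always finds all of its bits set.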