Advanced Information Retrieval
Computer Engineering Department
Chapter 13 – Web IR
Spring 2005

Searching the web
■ Introduction (history, definitions, aims, statistics)
■ Tasks of search engines
   - Gathering
   - Indexing
   - "Searching" (querying and ranking algorithms)
   - Document and query management
■ Metasearch
■ Browsing and web directories
■ Users and web search
■ Research issues

Web Searching and Classical IR
■ [Timeline: classical IR research through the 1970s–1990s (TREC); then came the web, and web searching into the 2000s]

Terminology and Definitions
■ A web page corresponds to a document in traditional IR
■ Web pages differ in size and in the types of files that constitute them
   ◆ text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc.
■ IR on the web takes as its document collection the part of the web that is publicly indexable
   ◆ excludes pages that cannot be indexed (authorisation required, dynamic pages)
■ Location of web pages by navigation (browsing)
■ Location of web pages by searching (IR on the web)

Challenges for web search
■ Problems with the data
   ◆ distributed data
   ◆ high percentage of volatile data
   ◆ large volume
   ◆ unstructured data
   ◆ redundant data
   ◆ quality of data
   ◆ heterogeneous data
■ Problems faced by users
   ◆ how to specify a query?
   ◆ how to interpret the answers?

Portals
■ Search engines have evolved into portals, adding services such as
   ◆ identification of real names
   ◆ links to books from Amazon.com
   ◆ sending electronic postcards
   ◆ translation to other languages
   ◆ search of other media (metadata)
   ◆ language-specific searches
   ◆ weather, stock prices, traffic
■ Business model
   ◆ targeted advertising with revenue from clickthroughs
   ◆ reputation by word of mouth (since there is no industry-standard evaluation)
   ◆ fast response (< 1 s), 24 hours a day / 7 days a week
   ◆ filtering out spam

Examples
■ Search engines:
   Search engine    URL
   AltaVista        www.altavista.com
   Excite           www.excite.com
   Google           www.google.com
   Infoseek         www.infoseek.com
   Lycos            www.lycos.com
   NorthernLight    www.nlsearch.com
■ Only 25–55% of the web is covered!
■ Some search engines are powered by the same IR engine
■ Most are based in the US and oriented to English

Other search engines
■ Specialised in particular countries/languages (e.g., http://www.iol.it/)
■ Specific services
   ◆ ranking according to popularity (e.g., DirectHit)
   ◆ topic-oriented search (e.g., SearchBroker)
   ◆ personal or institutional pages, electronic mail addresses, images, software applets

Tasks of a web search engine
■ Document gathering
   ◆ select the documents to be indexed
■ Document indexing
   ◆ represent the content of the selected documents
   ◆ often two indices are maintained (a full index plus a small index for frequent queries)
■ Searching
   ◆ represent the user's information need as a query
   ◆ retrieval process (search algorithms, ranking of web pages)
■ Document and query management
   ◆ display the results
   ◆ virtual collection (documents discarded after indexing) vs. physical collection (documents kept after indexing)

Document gathering
■ Document gathering = crawling the web
■ Crawler
   ◆ also called robot, spider, wanderer, walker, knowbot, web search agent
   ◆ a program that traverses the web to send new or updated pages to be indexed
   ◆ runs on a local server and sends requests to remote servers

Crawling the web (1)
■ Crawling process (see the sketch after this slide)
   ◆ start with a set of URLs
      ✦ submitted by users or companies
      ✦ popular URLs
   ◆ traverse breadth-first or depth-first
   ◆ extract further URLs from the fetched pages
■ Up to 10 million pages per day
■ Several crawlers
   ◆ problem of redundancy
   ◆ web partition ⇒ one robot per partition
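The crawling process above is essentially a graph traversal. Below is a minimal, illustrative sketch of a breadth-first crawler, assuming a simple regular-expression link extractor; it omits politeness delays, robots-exclusion checks and the distributed partitioning mentioned above, so it is not how any particular engine's crawler works.

    # Minimal breadth-first crawler sketch (illustrative only).
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=100):
        """Traverse the web breadth-first, starting from a seed set of URLs."""
        frontier = deque(seed_urls)          # URLs waiting to be fetched (FIFO = breadth-first)
        seen = set(seed_urls)                # avoid fetching the same URL twice
        pages = {}                           # url -> raw HTML, to be handed to the indexer

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue                     # skip unreachable or non-text pages
            pages[url] = html
            # extract further URLs from the fetched page
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return pages

Switching the frontier from a FIFO queue to a stack would turn the same loop into a depth-first crawl.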
Crawling the web (2)
■ How up to date is the index?
   ◆ non-submitted pages can take up to 2 months to be indexed
   ◆ the search engine learns the change frequency of pages
   ◆ popular pages (those with many links to them) are crawled more frequently
■ Guidelines for robot behaviour
   ◆ a file placed at the root of the web server (robots.txt)
   ◆ indicates which web pages should not be indexed
   ◆ avoids overloading servers/sites

Document indexing
■ Document indexing = building the indices
■ The indices are variants of inverted files (a minimal weighting sketch follows this slide)
   ◆ meta-tag analysis
   ◆ stop-word removal + stemming
   ◆ position data (for phrase searches)
   ◆ weights
      ✦ tf × idf; downweight pages with long URLs (usually not important pages)
      ✦ upweight terms appearing at the top of the document, or emphasised terms
      ✦ use de-spamming techniques
   ◆ hyperlink information
      ✦ link-popularity counts
      ✦ anchor text from source links
      ✦ hub and authority value of a page
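To make the inverted-file idea concrete, here is a minimal sketch that builds postings with tf × idf weights, assuming whitespace tokenisation and a small, illustrative stop list; positions, meta tags, stemming and hyperlink evidence are deliberately left out.

    # Minimal inverted-file sketch with tf x idf weights (illustrative only).
    import math
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "of", "and", "to", "in"}   # illustrative stop list

    def build_index(docs):
        """docs: dict doc_id -> text. Returns {term: {doc_id: tf*idf weight}}."""
        postings = defaultdict(dict)                      # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            for term in text.lower().split():
                if term in STOP_WORDS:
                    continue
                postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

        n_docs = len(docs)
        index = defaultdict(dict)
        for term, tf_by_doc in postings.items():
            idf = math.log(n_docs / len(tf_by_doc))       # rarer terms weigh more
            for doc_id, tf in tf_by_doc.items():
                index[term][doc_id] = tf * idf            # tf x idf weight
        return index

    # Usage: index = build_index({"d1": "web search engines", "d2": "crawling the web"})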
Searching
■ Querying
   ◆ one word, or all words, must occur in the retrieved pages
   ◆ normalisation (stop-word removal, stemming, etc.)
   ◆ complex queries (date, structure, region, etc.)
   ◆ Boolean expressions (advanced search)
   ◆ metadata
■ Ranking algorithms
   ◆ use of web links
   ◆ web-page authority analysis
      ✦ HITS (Hyperlink Induced Topic Search)
      ✦ PageRank (Google)

Use of web links
■ A web link represents a relationship between the connected pages
■ The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links
   ◆ web links are a source of evidence, but also a source of noise
   ◆ the classical IR analogue is citation-based IR
   ◆ web track at TREC-9 (2000): Small Web task (2 GB of web data); Large Web task (100 GB of web data, 18.5 million documents)

Use of anchor text
■ Anchor text is used to represent the referenced document
   ◆ why?
      ✦ it provides a more accurate and concise description than the page itself
      ✦ it (probably) contains more significant terms than the page itself
   ◆ used by 'WWW Worm', one of the first search engines (1994)
   ◆ useful for representing images, programs, ...
■ Page descriptions can be generated from anchor text

Algorithms
■ Query-independent page quality
   ◆ global analysis
      ✦ PageRank (Google): simulates a random walk across the web and computes the "score" of a page as the probability of reaching that page
■ Query-dependent page quality
   ◆ local analysis
      ✦ HITS (Hyperlink Induced Topic Search): focuses on broad-topic queries that are likely to be answered with too many pages
         • the more a page is pointed to by other pages, the more popular the page
         • popular pages are more likely to contain relevant information than non-popular pages

PageRank (1)
■ Designed by Brin and Page at Stanford University
■ Used to implement Google
■ Main idea
   ◆ a page has a high rank if the sum of the ranks of its in-links is high
      ✦ in-link of page p: a link from another page to page p
      ✦ out-link of page p: a link from page p to another page
   ◆ a high-PageRank page has many in-links, or a few highly ranked in-links
■ Retrieval: combine a content score (e.g. cosine similarity over term weights and other features) with the PageRank value

PageRank (2)
■ Random Surfer Model: a user navigating at random
   ◆ initially the surfer is at a random page
   ◆ at each step the surfer proceeds
      ✦ to a randomly chosen web page, with probability d, the "damping factor" (e.g. probability of a random jump = 0.2)
      ✦ to a randomly chosen page linked from the current page, with probability 1 − d (e.g. probability of following a random outlink = 0.8)
■ The process is modelled by a Markov chain
   ◆ the PageRank PR of a page a is the probability that the surfer is at page a at a given time

      PR(a) = K·d + K·(1 − d) · Σ_{i=1..n} PR(a_i) / C(a_i)

   ◆ d is set by the system; K is a normalisation factor
   ◆ a_1, ..., a_n are the pages pointing to a
   ◆ C(a_i) is the number of outlinks of a_i
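A minimal power-iteration sketch of this computation is shown below. It follows the slide's convention that d is the random-jump probability (e.g. 0.2) and 1 − d the probability of following an outlink; dangling pages and the normalisation constant K are handled in the simplest possible way, so this is an illustration of the idea rather than a production implementation.

    # Minimal PageRank power-iteration sketch (illustrative only).
    def pagerank(links, d=0.2, iterations=50):
        """links: dict page -> list of pages it points to. Returns page -> PR value."""
        pages = set(links) | {q for out in links.values() for q in out}
        n = len(pages)
        pr = {p: 1.0 / n for p in pages}                 # start from a uniform distribution

        for _ in range(iterations):
            new_pr = {p: d / n for p in pages}           # random-jump contribution
            for p in pages:
                out = links.get(p, [])
                if not out:                              # dangling page: spread its rank uniformly
                    for q in pages:
                        new_pr[q] += (1 - d) * pr[p] / n
                else:
                    for q in out:                        # follow-a-link contribution PR(p)/C(p)
                        new_pr[q] += (1 - d) * pr[p] / len(out)
            pr = new_pr
        return pr

    # Usage: pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})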
HITS: Hyperlink Induced Topic Search (1)
■ Originated with Kleinberg, 1997
■ Also referred to as the "connectivity analysis" approach
■ Broad-topic queries produce large sets of retrieved results
   ◆ abundance problem ⇒ too many relevant documents
   ◆ a new type of quality measure is needed ⇒ distinguish the most "authoritative" pages ⇒ a high-quality response to a broad query
■ For a given topic, HITS identifies
   ◆ good authorities
      ✦ pages that contain relevant information (good sources of content)
   ◆ good hubs
      ✦ pages that point to useful pages (good sources of links)

HITS (2)
■ Intuition
   ◆ authority comes from in-links
   ◆ being a good hub comes from out-links
   ◆ better authority comes from in-links from good hubs
   ◆ being a better hub comes from out-links to good authorities
■ Mutual reinforcement between hubs and authorities
   ◆ a good authority page is pointed to by many hub pages
   ◆ a good hub page points to many authority pages

HITS (3)
■ Set S of retrieved pages
   ◆ typically the k (e.g. k = 200) top-ranked pages
■ Set T of pages that point to, or are pointed to by, the retrieved set S
   ◆ simply ranking pages by in_degree (number of in-links) is not effective
■ Within the set T
   ◆ authoritative pages relevant to the query should have a large in_degree
   ◆ there is considerable overlap in the sets of pages that point to them ⇒ hubs

Algorithm for HITS (general principle)
■ The hub and authority values of a page are computed through iterative propagation of "authority weight" and "hub weight" (a minimal iterative sketch is given at the end of this chapter)
■ Initially all values are equal to 1
■ Authority weight of a page, x(p)
   ◆ if p is pointed to by many pages with large y-values, then it should receive a large x-value
   ◆ x(p) = Σ_{q_i → p} y(q_i)
■ Hub weight of a page, y(p)
   ◆ if p points to many pages with large x-values, then it should receive a large y-value
   ◆ y(p) = Σ_{p → q_i} x(q_i)
■ After each iteration, the weights are normalised

Topic distillation
■ The process of finding 'quality' (authority/hub) web documents related to a query topic, given an initial user information need
■ Extensions of HITS
   ◆ ARC (Automatic Resource Compilation)
      ✦ distance-2 neighbourhood graph
      ✦ anchor (and surrounding) text used in the computation of hub and authority values
   ◆ SALSA, etc.
■ Problems with HITS
   ◆ mutually reinforcing relationships between hosts
   ◆ automatically generated links
   ◆ non-relevant but highly connected pages
   ◆ topic drift: generalisation of the query topic

Difference between PageRank and HITS
■ PageRank is computed for all web pages stored in the database, prior to the query; HITS is performed on the set of retrieved web pages, for each query
■ HITS computes authorities and hubs; PageRank computes authorities only
■ PageRank is non-trivial to compute; HITS is easy to compute, but hard to execute in real time
■ Implementation details of PageRank have been reported

Document and query management
■ Results
   ◆ usually screens of 10 pages
   ◆ clustering
   ◆ URL, size, date, abstract, etc.
   ◆ various sorting options
   ◆ "most similar documents" options
   ◆ query refinement
■ Virtual collection vs. physical collection
   ◆ a document can change over time
   ◆ a document may differ from the one that was indexed
   ◆ broken links

Metasearch (1)
■ Problems of web search engines
   ◆ limited coverage of the publicly indexable web
   ◆ they index different, overlapping sections of the web
   ◆ they are based on different IR models ➙ different results for the same query
   ◆ ⇒ users do not have the time or knowledge to select the most appropriate search engine for their information need
■ A possible solution: metasearch engines
   ◆ a web server that sends the query to
      ✦ several search engines, web directories, databases
   ◆ collects the results
   ◆ unifies (merges) them - data fusion
■ Aim: better coverage, increased effectiveness

Metasearch (2)
■ Divided into phases
   ◆ search engine selection
      ✦ topic-dependent, past queries, network traffic, ...
   ◆ document selection
      ✦ how many documents to take from each search engine?
   ◆ merging algorithm
      ✦ utilises rank positions, document retrieval scores, ...
■ Example metasearchers:
   Metasearcher   URL                   Sources used
   MetaCrawler    www.metacrawler.com   13
   Dogpile        www.dogpile.com       25
   SavvySearch    www.search.com        > 1000

MetaCrawler
■ [Example: screenshot of the MetaCrawler interface]

Browsing
■ Web directory
   ◆ also called catalog, yellow pages, subject categories
   ◆ many standard search engines now provide subject categories
■ A hierarchical taxonomy that classifies human knowledge, e.g.:
   Arts & Humanities, Automotive, Business & Economy, Computers & Internet, Education, Employment, Entertainment & Leisure, Games, Government, Investing, Kids & Family, Life & Style, News, People, Philosophy & Religion, Politics, Science & Technologies, Social Science, ...

Yahoo!
■ Around one million classified pages
■ More than 14 country-specific directories (Chinese, Danish, French, ...)
■ Manual classification (which has its problems)
■ Acyclic graph structure
■ Search within the classified pages/taxonomy
■ Problem of coverage

Users and the web (1)
■ Queries on the web: average values
   Measure                    Average value   Range
   Number of words            2.35            0 – 393
   Number of operators        0.41            0 – 958
   Repetitions of queries     3.97            1 – 1.5 million
   Queries per user session   2.02            1 – 173,325
   Screens per query          1.39            1 – 78,496

Users and the web (2)
■ Some statistics
   ◆ main purposes: research, leisure, business, education, ...
      ✦ products and services (e-commerce)
      ✦ people and company names and home pages
      ✦ factoids (answerable from any one of a number of documents)
      ✦ entire, broad documents
      ✦ mp3, image, video, audio
   ◆ 80% of users do not modify the query
   ◆ 85% look at the first screen of results only
   ◆ 64% of queries are unique
   ◆ 25% of users use single keywords (a problem for polysemous words and synonyms)
   ◆ 10% of queries are empty

Users and the web (3)
■ Results of search engines are identical, independent of
   ◆ the user
   ◆ the context in which the user made the request
■ Adding context information to improve search results ⇒ focus on the user's need and answer it directly
   ◆ explicit context
      ✦ query + category
   ◆ implicit context
      ✦ based on documents edited or viewed by the user
   ◆ personalised search
      ✦ previous requests and interests, user profile

Research issues
■ Modelling
■ Querying
■ Distributed architectures
■ Ranking
■ Indexing
■ Dynamic pages
■ Browsing
■ User interfaces
■ Duplicated data
■ Multimedia
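As promised in the "Algorithm for HITS" slide, here is a minimal sketch of the iterative hub/authority computation: x(p) sums the hub values of pages pointing to p, y(p) sums the authority values of pages p points to, and the weights are normalised after every iteration. The neighbourhood graph (the expanded set T) is assumed to be given, and normalising so the squared weights sum to 1 is just one common choice.

    # Minimal HITS hub/authority iteration sketch (illustrative only).
    import math

    def hits(links, iterations=20):
        """links: dict page -> list of pages it points to (the base-set graph).
        Returns (authority, hub) dicts."""
        pages = set(links) | {q for out in links.values() for q in out}
        auth = {p: 1.0 for p in pages}                   # x(p), initially 1
        hub = {p: 1.0 for p in pages}                    # y(p), initially 1

        for _ in range(iterations):
            # authority update: x(p) = sum of y(q) over pages q that point to p
            new_auth = {p: 0.0 for p in pages}
            for q, out in links.items():
                for p in out:
                    new_auth[p] += hub[q]
            # hub update: y(q) = sum of x(p) over pages p that q points to
            new_hub = {p: 0.0 for p in pages}
            for q, out in links.items():
                for p in out:
                    new_hub[q] += new_auth[p]
            # normalise so the squared weights sum to 1
            a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
            h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
            auth = {p: v / a_norm for p, v in new_auth.items()}
            hub = {p: v / h_norm for p, v in new_hub.items()}
        return auth, hub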