6. Markov chains: Web search
ECE 302, Spring 2012
Purdue University, School of ECE
Prof. Ilya Pollak

Early History of Web Search
• Pre-historic times (mid-1990s):
  – Excite, AltaVista, …
  – Search results are ranked by frequency and prominence of the search term
  – Bad results (e.g., a search for “university” would not rank MIT highly)
  – Easily exploitable
• 1996-1998: several ideas for ranking a page based on how many other highly ranked pages link to it.

The Web as a big graph

Google: Basic Idea
• PageRank(i) = steady-state probability of visiting page i.
• Keyword search results are presented according to their PageRank, from highest to lowest.
• Intuition:
  – Being pointed to by many pages helps
  – Being pointed to by popular pages helps

Simple model of Web surfing
• c_i = # pages that page i has links to.
• Assume that any surfer looking at page i will follow one of the links on page i, with probability 1/c_i for each link.
• I.e., one-step transition probabilities are p_ij = 1/c_i if a link i → j exists and p_ij = 0 if it does not.
• Define PageRank(i) as the steady-state probability for the surfer to be at page i after a large number of steps.
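As a minimal sketch of the simple surfing model above, the following Python builds the one-step transition probabilities p_ij = 1/c_i from a link list. The four-page link structure `links` and the helper name `simple_transition_matrix` are hypothetical and chosen only for illustration; the sketch assumes every page has at least one out-link, so that 1/c_i is defined.

```python
from fractions import Fraction


def simple_transition_matrix(links, num_pages):
    """Return P with P[i][j] = 1/c_i if page i links to page j, else 0.

    links: dict mapping a page index to the list of distinct pages it links to.
    Assumption (not from the slides): every page has at least one out-link.
    """
    P = [[Fraction(0)] * num_pages for _ in range(num_pages)]
    for i in range(num_pages):
        out_links = links[i]
        c_i = len(out_links)  # number of pages that page i links to
        for j in out_links:
            P[i][j] = Fraction(1, c_i)
    return P


# Hypothetical 4-page web: 0 -> 1, 2;  1 -> 2;  2 -> 0;  3 -> 0, 1, 2
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 1, 2]}
P = simple_transition_matrix(links, 4)

for row in P:
    print([str(x) for x in row])
```

Each row of `P` sums to 1, as required of a transition matrix. Note that no page links to page 3 in this hypothetical example, so a surfer who leaves it never returns.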
Might there be any problems with this definition?

Problems
• Steady-state probabilities do not necessarily exist.
• If they exist, they are not necessarily independent of where the surfing started.
[Figure: transition diagrams of four-state Markov chains (states 1 to 4).]

More problems: Absorption into a page that doesn’t have many in-links
[Figure: transition graph on states 1, 2, …, 101, with edge labels 1/2, 1/3, and 1/100, and probability 1 at state 101.]
• All states except for 101 are transient! (Because 101 is reachable from all other states, but no other state can be reached from 101.)
• State 101 is absorbing, with absorption probabilities a_i = 1 for any i.
• Therefore, PageRank(101) = 1 and PageRank(i) = 0 for all other states.

Modified model of Web surfing
• c_i = # pages that page i has links to.
• n+1 = total # pages on the Web.
• Any surfer looking at page i will:
  – if c_i = 0, choose one of the other n pages at random;
  – if c_i ≠ 0, flip a coin whose P(heads) = p (the coin is assumed to be independent of the surfing), and
    • if it’s heads, select one of the out-links at random;
    • if it’s tails, select one of the other n Web pages at random.
• One-step transition probabilities are:

\[
p_{ij} =
\begin{cases}
\dfrac{1}{n} & \text{if } c_i = 0 \\[4pt]
p \cdot \dfrac{1}{c_i} + (1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ exists} \\[4pt]
(1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ does not exist}
\end{cases}
\]

• Assuming that p < 1, the resulting Markov chain graph is fully connected, with p_ij ≠ 0 for all Web pages i ≠ j.
• Therefore, the entire graph forms a single recurrent class, with no periodic states.
• Define PageRank(i) as the steady-state probability for the surfer to be at page i after a large number of steps under this model.
• Then PageRank(i) exists and does not depend on the starting point.
• Retrieve pages based on word frequency and prominence, and perhaps other criteria, and sort by PageRank.

Comments
• Google’s original algorithm used word frequency, visual prominence (e.g., font size), and anchor text (text surrounding the link to page j in page i), in addition to PageRank.
• Google’s current page ranking algorithm has hundreds of other ingredients, which are kept secret and are changed over time, so as to both improve the algorithm and prevent people from taking advantage.
• Other concurrently developed algorithms for ranking websites were based on the idea that experts’ links to page i should count for more than non-experts’ links. “Experts” are identified by counting how many highly ranked search results they link to. This is the basis for the hubs-and-authorities (or HITS) algorithm of Jon Kleinberg and the SALSA algorithm of Lempel and Moran.

Information retrieval
• Web search is an example of information retrieval.
• Before the Web, information retrieval meant searching databases of newspaper articles, scientific papers, patents, legal abstracts, medical records, etc.
• An interesting application of text-based search to video is SnapStream, which is based on closed captions.
  – Used by government entities and the entertainment industry (e.g., the Daily Show).
• PageRank is akin to determining impact factors of scientific publications: being cited helps, especially being cited by important publications.
• Non-text-based search is more difficult but has wide applications:
  – forensics (fingerprint matching, footprint matching, face matching);
  – health care (matching an X-ray image against a database of lung cancer images, to aid in determining the diagnosis and treatment).
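The citation analogy above can be made concrete with a small computation under the modified surfing model. The following Python is only a sketch under stated assumptions: the function names `modified_transition_matrix` and `pagerank`, the value p = 0.85, the number of update steps, and the four-item graph are all hypothetical choices for illustration, not values from the lecture. It also assumes that a random jump always goes to one of the other n pages, which is what makes each row of probabilities sum to 1 when the jump probability is 1/n.

```python
def modified_transition_matrix(links, num_pages, p):
    """One-step transition probabilities of the modified surfing model.

    links: dict mapping a page index to its distinct out-links (no self-links).
    num_pages: total number of pages (n + 1 in the lecture's notation).
    p: probability of heads, i.e. of following an out-link (assumed value).
    """
    n = num_pages - 1  # number of *other* pages
    P = [[0.0] * num_pages for _ in range(num_pages)]
    for i in range(num_pages):
        out_links = set(links.get(i, []))
        c_i = len(out_links)
        for j in range(num_pages):
            if j == i:
                continue  # assumption: the surfer always moves to another page
            if c_i == 0:
                P[i][j] = 1.0 / n
            elif j in out_links:
                P[i][j] = p / c_i + (1.0 - p) / n
            else:
                P[i][j] = (1.0 - p) / n
    return P


def pagerank(P, num_steps=200):
    """Approximate the steady-state probabilities by repeated one-step updates."""
    m = len(P)
    pi = [1.0 / m] * m  # uniform starting distribution
    for _ in range(num_steps):
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi


# Hypothetical example: items 0, 1, 2 all cite item 3, which cites nothing.
links = {0: [3], 1: [3], 2: [3], 3: []}
P = modified_transition_matrix(links, num_pages=4, p=0.85)
ranks = pagerank(P)
print(ranks)
```

In this hypothetical graph the heavily cited item 3 receives the largest steady-state probability, which matches the intuition that being pointed to by many pages helps.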