6. Markov chains: Web search

ECE 302, Spring 2012
Purdue University, School of ECE
Prof. Ilya Pollak

Early History of Web Search
•  Prehistoric times (mid-1990s):
   –  Excite, AltaVista, …
   –  Search results are ranked by frequency and prominence of the search term
   –  Bad results (e.g., a search for “university” would not rank MIT highly)
   –  Easily exploitable
•  1996-1998: several ideas for ranking a page based on how many other highly ranked pages link to it.

The Web as a big graph

Google: Basic Idea
•  PageRank(i) = steady-state probability of visiting page i.
•  Keyword search results are presented according to their PageRank, from highest to lowest.
•  Intuition:
   –  Being pointed to by many pages helps
   –  Being pointed to by popular pages helps

Simple model of Web surfing
•  c_i = # pages that page i has links to.
•  Assume that any surfer looking at page i will follow one of the links on page i, with probability 1/c_i for each link.
•  I.e., one-step transition probabilities are p_ij = 1/c_i if a link i -> j exists and p_ij = 0 if it does not.
•  Define PageRank(i) as the steady-state probability for the surfer to be at page i after a large number of steps.
•  Might there be any problems with this definition?
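
The simple model can be tried out on a toy example. The sketch below is only an illustration: the four-page link structure `links` and the helper names `simple_transition_matrix` and `step` are hypothetical and do not come from the slides. It builds p_ij = 1/c_i from the links and then follows the surfer's distribution for many steps.

```python
# Hypothetical 4-page web: links[i] lists the pages that page i links to.
# Every page here has at least one out-link, as the simple model requires.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}


def simple_transition_matrix(links):
    """Return P with p_ij = 1/c_i if a link i -> j exists, and 0 otherwise."""
    pages = sorted(links)
    P = []
    for i in pages:
        c_i = len(links[i])  # number of pages that page i links to
        P.append([1.0 / c_i if j in links[i] else 0.0 for j in pages])
    return P


def step(dist, P):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * p_ij."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


P = simple_transition_matrix(links)
dist = [1.0, 0.0, 0.0, 0.0]  # the surfer starts at page 0
for _ in range(200):
    dist = step(dist, P)

print([round(x, 4) for x in dist])
```

For this particular toy graph the distribution settles near 0.4, 0.2, 0.4, 0 for pages 0 to 3. Page 3 has no in-links, so it ends up with rank 0.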

Problems
•  Steady-state probabilities do not necessarily exist.
•  If they exist, they are not necessarily independent of where the surfing started.
[Figure: two example Markov chains, each on states 1 to 4, with transition probabilities labeled on the edges.]
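
Both problems can be seen numerically. The two small chains below are hypothetical illustrations chosen for simplicity, not the four-state chains drawn on the slide: `periodic` is a two-state chain that alternates forever, and `two_classes` has two separate absorbing states. The helper names `step` and `run` are also assumptions.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * p_ij."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def run(dist, P, steps):
    """Apply `steps` transitions to the starting distribution `dist`."""
    for _ in range(steps):
        dist = step(dist, P)
    return dist


# Problem 1: a periodic chain. The surfer bounces between the two states,
# so the distribution never converges.
periodic = [[0.0, 1.0],
            [1.0, 0.0]]
after_100 = run([1.0, 0.0], periodic, 100)
after_101 = run([1.0, 0.0], periodic, 101)

# Problem 2: two recurrent classes (states 0 and 2 are both absorbing).
# The long-run distribution exists but depends on the starting state.
two_classes = [[1.0, 0.0, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.0, 1.0]]
from_state_0 = run([1.0, 0.0, 0.0], two_classes, 100)
from_state_1 = run([0.0, 1.0, 0.0], two_classes, 100)
from_state_2 = run([0.0, 0.0, 1.0], two_classes, 100)

print(after_100, after_101)
print(from_state_0, from_state_1, from_state_2)
```

In the first chain the distribution after an even number of steps differs from the one after an odd number of steps, so no limit exists. In the second chain each starting state leads to a different limit.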

More problems: Absorption into a page that doesn’t have many in-links
[Figure: transition graph on pages 1, 2, …, 98, 99, 100, 101, with edge probabilities 1/2, 1/3, and 1/100, and probability 1 at page 101.]
•  All states except for 101 are transient! (Because 101 is reachable from all other states, but no other state can be reached from 101.)
•  State 101 is absorbing, with absorption probabilities a_i = 1 for any i.
•  Therefore, PageRank(101) = 1 and PageRank(i) = 0 for all other states.
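
The same effect shows up on a much smaller chain. The sketch below does not reproduce the 101-page example from the slide. It uses a hypothetical four-page web in which page 3 has a single in-link (from page 2) and links only to itself, which makes it absorbing under the simple model. All names are assumptions made for illustration.

```python
# Hypothetical web: pages 0, 1, 2 link to each other; page 2 also links
# to page 3; page 3 links only to itself, so it is an absorbing state.
links = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [3]}


def simple_transition_matrix(links):
    """Return P with p_ij = 1/c_i if a link i -> j exists, and 0 otherwise."""
    pages = sorted(links)
    P = []
    for i in pages:
        c_i = len(links[i])
        P.append([1.0 / c_i if j in links[i] else 0.0 for j in pages])
    return P


def step(dist, P):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * p_ij."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


P = simple_transition_matrix(links)
dist = [1.0, 0.0, 0.0, 0.0]  # start at page 0
for _ in range(500):
    dist = step(dist, P)

# Nearly all probability has leaked into page 3, despite its single in-link.
print([round(x, 6) for x in dist])
```

Starting from any other page gives the same outcome, since page 3 is reachable from every page and cannot be left.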

Modified model of Web surfing
•  c_i = # pages that page i has links to.
•  n+1 = total # pages on the Web.
•  Any surfer looking at page i will:
   –  if c_i = 0, choose one of the other n pages at random;
   –  if c_i ≠ 0, flip a coin whose P(heads) = p (the coin is assumed to be independent of the surfing), and
      •  if it’s heads, select one of the out-links at random;
      •  if it’s tails, select one of the n Web pages at random.
•  One-step transition probabilities are:

\[
p_{ij} =
\begin{cases}
\dfrac{1}{n} & \text{if } c_i = 0 \\[2ex]
p \cdot \dfrac{1}{c_i} + (1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ exists} \\[2ex]
(1 - p) \cdot \dfrac{1}{n} & \text{if } c_i \neq 0 \text{ and link } i \to j \text{ does not exist}
\end{cases}
\]
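
As a sanity check on these transition probabilities, each row of the transition matrix should sum to 1. The short derivation below assumes, which the slide does not state explicitly, that j ranges over the n pages other than i and that the c_i out-links of page i point to c_i distinct pages among them.

```latex
\begin{align*}
c_i = 0:\quad & \sum_{j \neq i} p_{ij} = n \cdot \frac{1}{n} = 1, \\
c_i \neq 0:\quad & \sum_{j \neq i} p_{ij}
  = c_i \left( \frac{p}{c_i} + \frac{1-p}{n} \right) + (n - c_i)\,\frac{1-p}{n}
  = p + (1 - p) = 1.
\end{align*}
```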

•  Assuming that p < 1, the resulting Markov chain graph is fully connected, with p_ij ≠ 0 for all Web pages i and j.
•  Therefore, the entire graph forms a single recurrent class, with no periodic states.
•  Define PageRank(i) as the steady-state probability for the surfer to be at page i after a large number of steps under this model.
•  Then PageRank(i) exists and does not depend on the starting point.
•  Retrieve pages based on word frequency and prominence, and perhaps other criteria, and sort by PageRank.
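
A minimal sketch of the modified model, under stated assumptions: the four-page web `links`, the value p = 0.85, and the function names are all hypothetical choices made for illustration. The code also assumes that the surfer always moves to one of the n other pages, so p_ii = 0 and no page links to itself. It computes the long-run distribution from two different starting pages and compares them.

```python
# Hypothetical web of n + 1 = 4 pages. Page 3 has no out-links (c_3 = 0).
links = {0: [1, 2], 1: [2], 2: [0], 3: []}


def modified_transition_matrix(links, p):
    """Transition matrix of the modified surfing model.

    Assumption: j ranges over the n pages other than i, so p_ii = 0.
    """
    pages = sorted(links)
    n = len(pages) - 1  # number of pages other than the current one
    P = []
    for i in pages:
        c_i = len(links[i])
        row = []
        for j in pages:
            if j == i:
                row.append(0.0)
            elif c_i == 0:
                row.append(1.0 / n)                  # no out-links: jump at random
            elif j in links[i]:
                row.append(p / c_i + (1.0 - p) / n)  # link i -> j exists
            else:
                row.append((1.0 - p) / n)            # link i -> j does not exist
        P.append(row)
    return P


def step(dist, P):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * p_ij."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def pagerank(P, start, steps=500):
    """Approximate the steady-state distribution by repeated stepping."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, P)
    return dist


P = modified_transition_matrix(links, p=0.85)
rank_from_0 = pagerank(P, start=0)
rank_from_3 = pagerank(P, start=3)

print([round(x, 4) for x in rank_from_0])
print([round(x, 4) for x in rank_from_3])
```

Both runs should print the same numbers, and every page, including page 3 with no in-links, should receive a strictly positive rank. On a graph the size of the Web one would use sparse data structures instead of a dense matrix. The dense version is kept here only for readability.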

Comments
•  Google’s original algorithm used word frequency, visual prominence (e.g., font size), and anchor text (text surrounding the link to page j in page i), in addition to PageRank.
•  Google’s current page ranking algorithm has hundreds of other ingredients which are kept secret and are changed with time, so as to both improve the algorithm and prevent people from taking advantage.
•  Other concurrently developed algorithms for ranking websites were based on the idea that experts’ links to page i should count for more than non-experts’ links. “Experts” are identified by counting how many highly ranked search results they link to. This is the basis for the hubs-and-authorities (or HITS) algorithm of Jon Kleinberg and the SALSA algorithm of Lempel and Moran.
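
The hubs-and-authorities idea can be sketched as a mutual reinforcement loop. The code below follows the usual textbook formulation of HITS, which the slides do not spell out: a page's authority score is the total hub score of the pages linking to it, and its hub ("expert") score is the total authority score of the pages it links to. The link structure and all names are hypothetical.

```python
import math

# Hypothetical link structure: pages 0 and 1 act as "experts" that link
# to pages 2 and 3; page 2 links to page 3; page 3 links to nothing.
links = {0: [2, 3], 1: [2, 3], 2: [3], 3: []}


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries after a fixed number of rounds."""
    pages = sorted(links)
    hub = {i: 1.0 for i in pages}
    auth = {i: 1.0 for i in pages}
    for _ in range(iterations):
        # Authority of j: total hub score of the pages that link to j.
        auth = normalize({j: sum(hub[i] for i in pages if j in links[i])
                          for j in pages})
        # Hub score of i: total authority of the pages that i links to.
        hub = normalize({i: sum(auth[j] for j in links[i]) for i in pages})
    return hub, auth


hub, auth = hits(links)
print({k: round(v, 3) for k, v in hub.items()})
print({k: round(v, 3) for k, v in auth.items()})
```

In this toy graph page 3 gets the highest authority score, and pages 0 and 1 get the highest (and equal) hub scores.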

Information retrieval
•  Web search is an example of information retrieval.
•  Before the Web, information retrieval meant searching databases of newspaper articles, scientific papers, patents, legal abstracts, medical records, etc.
•  An interesting application of text-based search to video is SnapStream, which is based on closed captions.
   –  Used by government entities and the entertainment industry (e.g., the Daily Show).
•  PageRank is akin to determining impact factors of scientific publications: being cited helps, especially being cited by important publications.
•  Non-text-based search is more difficult but has wide applications:
   –  forensics (fingerprint matching, footprint matching, face matching);
   –  health care (matching an X-ray image against a database of lung cancer images, to aid in determining the diagnosis and treatment).