Web Data - Classes

CS 440
Database Management Systems
Web Data
How is the Web different from a database
of documents?
• Hypertext vs. text: a lot of additional clues
– graph vs. set
– anchor text vs. text: what do others say about you?
• Geographically distributed vs. centralized
– so you need to build a crawler
• Precision is valued more than recall
– quality is more important than quantity, especially for “broad” queries
• Spamming
• Hoaxes and more …
• Web scale is super-huge
– scalability is the key
Web data and query
• Data model
– directed graph
– nodes: Web pages
– links: hyperlinks
– all nodes belong to the same type.
• Query is a set of terms
• Answer
– ranked list of relevant and important pages
– quantifying a subjective quality
• This is the basic data/query model
– more complex models exist, e.g., assigning types to pages.
Web search before Google
• Web as a set of documents
• Relevance: content-based retrieval
– documents match queries by contents
– q: ‘clinton’ → pages with more occurrences of ‘clinton’ rank higher
• Importance???
– contents: what documents say about themselves
– much spam and unreliable information in the results.
• Directory services were used
– Yahoo! was one of the leaders
– Google co-founders were told “nobody will use a keyword interface”.
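The content-based ranking described above (pages with more occurrences of the query term rank higher) can be sketched in a few lines. This is only an illustration under assumptions: the function name `rank_by_term_count` and the toy corpus are hypothetical, and real engines of that era used more refined scoring. It also shows why such ranking is easy to spam.

```python
from collections import Counter

def rank_by_term_count(pages, query_terms):
    """Rank page names by how often the query terms occur in their text."""
    scores = {}
    for name, text in pages.items():
        counts = Counter(text.lower().split())
        scores[name] = sum(counts[t.lower()] for t in query_terms)
    # Keep only pages matching at least one term, highest count first.
    matches = [name for name in scores if scores[name] > 0]
    return sorted(matches, key=lambda name: scores[name], reverse=True)

# Hypothetical toy corpus: the spam page simply repeats the query term.
pages = {
    "news": "clinton signs the budget bill",
    "spam": "clinton clinton clinton clinton buy now",
    "sports": "local team wins the final",
}
ranking = rank_by_term_count(pages, {"clinton"})
print(ranking)
```

The spam page wins because the score relies only on what a page says about itself.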
Google: PageRank
• From the Stanford Digital Libraries project 1996-98
• Published the paper in 1998:
S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search
Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
• Tried to sell to Infoseek in 1997
• Google was founded in 1998 by Brin and Page
Web: Adjacency Matrix
• Web: G = {V, E}
– V = {x, y, z}, |V| = n
– E = {(x, x), (x, y), (x, z),
(y, z),
(z, x), (z, y) }
– A: n x n matrix: Aij = 1 if page i links to page j, 0 if not
A =          target node
             x  y  z
source   x   1  1  1
node     y   0  0  1
         z   1  1  0
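As a minimal sketch, the matrix A can be built directly from the edge set E given above. The variable names (`nodes`, `edges`, `index`) are illustrative choices, not part of the slides.

```python
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"),
         ("z", "x"), ("z", "y")]

# Map each page name to its row/column position.
index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# A[i][j] = 1 if page i links to page j, 0 if not.
A = [[0] * n for _ in range(n)]
for src, dst in edges:
    A[index[src]][index[dst]] = 1

for name, row in zip(nodes, A):
    print(name, row)
```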
Transposed Adjacency Matrix
• Adjacency matrix A:
– what does row j represent?
• Transpose At:
– what does row j represent?
A =      x  y  z
     x   1  1  1
     y   0  0  1
     z   1  1  0

At =     x  y  z
     x   1  0  1
     y   1  0  1
     z   1  1  0
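One way to explore the two questions above is to read a row of each matrix back as a list of pages. The sketch below uses the A from the previous slide; the helper `pages_in_row` is a hypothetical name. Under the convention Aij = 1 if page i links to page j, row j of A lists the pages j links to, and row j of At lists the pages that link to j.

```python
nodes = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

# Transpose: rows of At are the columns of A.
At = [list(col) for col in zip(*A)]

def pages_in_row(matrix, j):
    """Names of the pages whose entry in row j is 1."""
    return [nodes[k] for k, bit in enumerate(matrix[j]) if bit]

j = nodes.index("y")
out_links = pages_in_row(A, j)    # pages that y links to
back_links = pages_in_row(At, j)  # pages that link to y
print(out_links, back_links)
```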
PageRank: importance of pages
• PageRank (or importance) is defined recursively
– a page P is important if important pages link to it
– importance of P:
• contributed proportionally by the pages that link to P (its back links)
• Example (graph on pages x, y, z):
– vx = 1/2 vx + 1/2 vz
– vy = 1/2 vz
– vz = 1/2 vx + 1 vy
• Random-surfer interpretation:
– surfer randomly follows links to navigate
– PageRank = the prob. that surfer will visit the page
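The random-surfer interpretation can be checked with a small simulation. This is a sketch under assumptions: the link structure is inferred from the example equations (x links to x and z, y links to z, z links to x and y), and `simulate_surfer`, the step count, and the seed are illustrative choices. The visit frequencies should approach the solution of the equations, rescaled to sum to 1.

```python
import random

# Links inferred from the example equations (an assumption).
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def simulate_surfer(links, steps, seed=0):
    """Follow random links for `steps` moves; return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))
    visits = {p: 0 for p in links}
    for _ in range(steps):
        page = rng.choice(links[page])  # pick one outgoing link uniformly
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(links, steps=200_000)
print(freq)
```

The equations are satisfied by vx : vy : vz = 2 : 1 : 2, so the frequencies should be close to 0.4, 0.2, and 0.4.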
Computing PageRank
• Importance-propagation equation:
      [ 1/2   0   1/2 ]
v  =  [  0    0   1/2 ]  v
      [ 1/2   1    0  ]
• linked-from (At) or links-to matrix (A)?
• column-normalized:
– column x is all that x points to
– sum of column = 1
• Transition matrix
• Computation: by relaxation
iteration:  1     2     3    …   fixpoint
vx:         1     1    5/4   …   6/5
vy:         1    1/2   3/4   …   3/5
vz:         1    3/2    1    …   6/5
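The relaxation above can be reproduced with a short script. This is a minimal sketch: the link structure is read off the columns of the transition matrix, `transition_matrix` and `step` are hypothetical helper names, and exact fractions are used so the intermediate values can be compared with the iterations shown above.

```python
from fractions import Fraction

nodes = ["x", "y", "z"]
# Column x of the transition matrix is all that x points to.
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def transition_matrix(links, nodes):
    """Column-normalized matrix: M[i][j] = 1/outdegree(j) if j links to i."""
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, src in enumerate(nodes):
        for dst in links[src]:
            M[nodes.index(dst)][j] = Fraction(1, len(links[src]))
    return M

def step(M, v):
    """One relaxation step: v <- M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

M = transition_matrix(links, nodes)
v = [Fraction(1)] * 3
history = [v]
for _ in range(50):
    v = step(M, v)
    history.append(v)

print([str(f) for f in history[1]], [str(f) for f in history[2]])
print([float(f) for f in v])
```

Each column of M sums to 1, so the total importance stays at 3 in every iteration.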
Problems: Dead Ends
• Dead ends:
– page without successors has nowhere to send its importance
– eventually, what would happen to v?
• Example (a links to b; b is a dead end):
– va = 0 va + 0 vb
– vb = 1 va + 0 vb
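Iterating the two example equations shows what happens to v. The sketch below assumes the same starting point as the relaxation slide (every page starts at 1); `step` is a hypothetical helper name.

```python
# Transition matrix of the example: column a sends everything to b,
# column b is all zeros because b has no successors.
M = [[0, 0],
     [1, 0]]

def step(M, v):
    """One relaxation step: v <- M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [1, 1]
trace = [v]
for _ in range(3):
    v = step(M, v)
    trace.append(v)
print(trace)
```

All importance leaks out through the dead end, so v goes to zero.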
Problems: Spider Trap
• Spider traps:
– group of pages without out-of-group links will trap a spider inside
– what would happen to v?
• Example (a links to a and b; b links only to itself):
– va = 1/2 va + 0 vb
– vb = 1/2 va + 1 vb
• Solutions??
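The spider-trap example can be iterated in the same way to see where the importance ends up. A minimal sketch, again assuming every page starts at 1; `step` is a hypothetical helper name.

```python
# Transition matrix of the example: a splits its importance between
# a and b; b keeps everything it receives.
M = [[0.5, 0.0],
     [0.5, 1.0]]

def step(M, v):
    """One relaxation step: v <- M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [1.0, 1.0]
for _ in range(50):
    v = step(M, v)
print(v)
```

The total stays at 2, but va halves in every step, so the trap page b eventually absorbs all the importance.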
Solutions: surfer’s random jump
• Surfer can randomly jump to a new page
– without following links
v = d M v + (1-d) e / n
– M: transition matrix, e: a vector with all 1’s,
n: number of nodes in the graph
– d: damping factor (set to .85 in paper)
• (1-d)/n models the probability of randomly jumping to this page
• another interpretation:
– “tax” importance of each page and distribute to all pages
• The random jump is also called teleportation
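The formula v = d M v + (1-d) e / n can be sketched as a small iteration and applied to the spider-trap example. The assumptions here are the function name `pagerank`, the fixed iteration count, and the starting vector of 1/n per page, chosen so that v sums to 1.

```python
def pagerank(M, d=0.85, iterations=100):
    """Iterate v <- d M v + (1 - d) e / n, starting from the uniform vector."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [d * sum(m * x for m, x in zip(row, v)) + (1 - d) / n
             for row in M]
    return v

# Spider-trap example: a links to a and b; b links only to itself.
trap = [[0.5, 0.0],
        [0.5, 1.0]]
va, vb = pagerank(trap)
print(va, vb)
```

With the random jump, page a keeps a nonzero share (about 0.13) instead of losing everything to the trap.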
Anti-Spamming
• Spamming:
– attempt to create artifacts to “please” search engines
– so that ranking will be high
– e.g., commercial “search engine optimization service”
• Google anti-spam device:
– unlike other search engines, tends to believe what others say about you
• by links and anchor texts
– recursive importance also works:
• importance (not just links) propagates
– still, not a perfect solution
What you should know
• Web data and query model
• PageRank formula and algorithm
• Dead ends and spider traps
• Teleportation