p338b - CSE, IIT Bombay

The Structure of Broad Topics
on the Web
Soumen Chakrabarti
Mukul M. Joshi
Kunal Punera
Lab. for Intelligent Internet Research,
IIT Bombay
David M. Pennock
NEC Research Institute
Graph structure of the Web
▪ Over two billion nodes, 20 billion links
▪ Power-law degree distribution
• $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
▪ Looks like a “bow-tie” at large scale
[Figure: “This is the Web”: bow-tie with regions IN, strongly connected core (SCC), and OUT]
The need for content-based models
▪ Why does a radius-1 expansion help in topic distillation?
▪ Why does topic-specific focused crawling work?
▪ Why is a global PageRank useful for specific queries?
[Figure labels, topic distillation: Query, Search engine, Root set]
[Figure labels, focused crawling: Crawler, Classifier; check frontier topic, prune if irrelevant]
PageRank: $p(v) = \frac{d}{N} + (1 - d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$
(first term: uniform jump; second term: walk to out-neighbor)
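The PageRank equation on this slide (uniform jump with probability d, otherwise a walk to an out-neighbor) can be sketched as a simple power iteration. This is a minimal illustration, not the authors' code: the function name `pagerank`, the dictionary graph format, the iteration count, and the uniform handling of nodes without out-links are all assumptions.

```python
def pagerank(out_links, d=0.15, iters=50):
    """Power iteration for p(v) = d/N + (1-d) * sum_{u->v} p(u)/OutDegree(u).

    out_links maps every node to a list of its out-neighbors; every
    neighbor must also appear as a key (assumption of this sketch).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: d / n for v in nodes}          # uniform jump term
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:                # walk to an out-neighbor
                    nxt[v] += share
            else:
                # Assumption: a node with no out-links spreads its mass uniformly.
                share = (1 - d) * p[u] / n
                for v in nodes:
                    nxt[v] += share
        p = nxt
    return p


graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph, d=0.15)
```

In this toy graph, node "a" is linked from both other nodes and ends up with a higher score than "c".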
The need for content-based models
▪ How are different topics linked to each other?
• Application: crawling, classification, clustering
▪ Are URL collections representative of Web topic populations?
• Web directories: Dmoz, Yahoo!
• TREC Web track
[Figure: “This is the Web with topics!”]
How to characterize “topics”
▪ Web directories: a natural choice
▪ Start with http://dmoz.org
▪ Keep pruning until all leaf topics have enough (>300) samples
▪ Approx 120k sample URLs
▪ Flatten to approx 482 topics
▪ Train text classifier (Rainbow)
▪ Characterize new document d as a vector of probabilities $p_d = (\Pr(c \mid d)\ \forall c)$
[Figure: Test doc, Classifier, and a table of topic probabilities: Arts 0.1, Computers 0.3, Science 0.6]
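To make the idea of a topic probability vector concrete, here is a small sketch that trains a text classifier and returns Pr(c|d) for every topic c. It assumes a multinomial naive Bayes model with add-one smoothing; the slides only say that Rainbow was used, so the model choice, the function names `train` and `topic_vector`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def train(docs_by_topic):
    """docs_by_topic: {topic: [token lists]} -> {topic: (log prior, log likelihoods)}."""
    vocab = {t for docs in docs_by_topic.values() for doc in docs for t in doc}
    total_docs = sum(len(docs) for docs in docs_by_topic.values())
    model = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        log_lik = {t: math.log((counts[t] + 1) / (total + len(vocab))) for t in vocab}
        model[topic] = (math.log(len(docs) / total_docs), log_lik)
    return model


def topic_vector(model, tokens):
    """Return p_d as a dict {topic: Pr(topic | d)}; unseen tokens are ignored."""
    log_scores = {
        topic: log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
        for topic, (log_prior, log_lik) in model.items()
    }
    top = max(log_scores.values())               # subtract max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: x / z for c, x in unnorm.items()}


model = train({
    "Arts": [["music", "paint"]],
    "Science": [["atom", "physics"], ["atom", "lab"]],
})
p_d = topic_vector(model, ["atom", "lab"])
```

The resulting dictionary sums to 1 and plays the role of the vector p_d used on the later slides.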
Critique and defense
▪ Cannot capture fine-grained or emerging topics
• Emerging topics most often specialize existing broad topics, which rarely change
▪ Classifier may be inaccurate
• Adequate if much better than random guess
• Can compensate for errors using held-out validation data
▪ Results depend on one Web directory
• Can repeat with many others and compare
Background topic distribution
▪ What fraction of Web pages are about Health?
▪ Sampling via random walk
• PageRank walk (Henzinger et al.)
• Undirected regular walk (Bar-Yossef et al.)
▪ Make graph undirected (link:…)
▪ Add self-loops so that all nodes have the same degree
▪ Sample with large stride
▪ Collect topic histograms
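The undirected regular walk above can be sketched as follows. Padding every node with self-loops up to a common degree is simulated by staying in place with the leftover probability. This is an illustrative sketch only: the function name, the in-memory adjacency dictionary, and the tiny example graph are assumptions, and a real Web walk would discover neighbors on the fly.

```python
import random


def regular_walk_samples(adj, start, max_degree, stride, num_samples, seed=0):
    """Sample nodes from a random walk on an undirected graph made regular.

    adj maps each node to a list of its undirected neighbors, and
    max_degree must be at least the largest neighbor count.
    """
    rng = random.Random(seed)
    node = start
    samples = []
    steps = 0
    while len(samples) < num_samples:
        neighbors = adj[node]
        # With probability deg/max_degree follow a real edge; otherwise
        # take one of the padding self-loops and stay put.
        if rng.random() < len(neighbors) / max_degree:
            node = rng.choice(neighbors)
        steps += 1
        if steps % stride == 0:                  # keep only every stride-th node
            samples.append(node)
    return samples


adj = {0: [1], 1: [0, 2], 2: [1]}                # path graph 0 - 1 - 2
samples = regular_walk_samples(adj, start=0, max_degree=2, stride=5, num_samples=2000)
```

Because the padded graph is regular, the walk's stationary distribution is uniform over nodes, so the samples can then be classified and tallied into a topic histogram.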
Convergence
[Figure: “Distribution difference” curves for Stride=30k and Stride=75k; “Background distribution” histogram]
▪ Start from pairs of diverse topics
▪ Two random walks, sample from each walk
▪ Measure distance between topic distributions
• L1 distance $\|p_1 - p_2\|_1 = \sum_c |p_1(c) - p_2(c)|$, in [0, 2]
• Below 0.05 to 0.2 within 300 to 400 physical pages
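The L1 distance between two topic distributions is straightforward to compute. The sketch below represents each distribution as a dictionary from topic name to probability; that representation and the function name are assumptions made for illustration.

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; missing topics count as probability 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)


same = l1_distance({"Arts": 0.5, "Health": 0.5}, {"Arts": 0.5, "Health": 0.5})
disjoint = l1_distance({"Arts": 1.0}, {"Health": 1.0})
```

Identical distributions are at distance 0, and distributions with no topic in common are at the maximum distance of 2.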
[Figure axes: #hops; topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
Biases in topic directories
▪ Use Dmoz to train a classifier
▪ Sample the Web
▪ Classify samples
▪ Diff Dmoz topic distribution from Web sample topic distribution
▪ Report maximum deviation in fractions
▪ NOTE: Not exactly Dmoz
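The comparison between the directory's topic distribution and the Web sample's can be sketched as below. The function names and the toy label lists are hypothetical; the inputs are assumed to be one classifier-assigned topic label per URL.

```python
from collections import Counter


def distribution(labels):
    """Turn a list of topic labels into a {topic: fraction} distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def deviations(directory_dist, web_dist):
    """Per-topic difference (directory minus Web sample), largest first."""
    topics = set(directory_dist) | set(web_dist)
    diffs = {c: directory_dist.get(c, 0.0) - web_dist.get(c, 0.0) for c in topics}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)


directory = distribution(["Games"] * 3 + ["Sports"])
web = distribution(["Games"] + ["Sports"] * 3)
ranked = deviations(directory, web)
```

Topics at the head of `ranked` are over-represented in the directory relative to the sample, and topics at the tail are under-represented.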
Dmoz over-represents:
• Games.Video_Games
• Society.People
• Arts.Celebrities
• ...Education.Colleges
• ...Travel.Reservations
Dmoz under-represents:
• …WWW…Directories!
• Sports.Hockey
• Society.Philosophy
• Education…K12…
• Recreation…Camping
Topic-specific degree distribution
▪ Preferential attachment: connect u to v with probability proportional to the current degree of v, regardless of topic
▪ More realistic: u has a topic, and links to v with related topics
▪ Unclear if power-law should still hold
▪ Holds for large degree
[Figure: intra-topic linkage vs. inter-topic linkage]
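For reference, the topic-blind preferential attachment process mentioned on this slide can be simulated in a few lines. This sketch adds one link per new node and is only an illustration of the baseline model; the function name, the single-edge seed graph, and the one-link-per-node choice are assumptions.

```python
import random


def preferential_attachment(num_nodes, seed=0):
    """Grow a graph where each new node u links to one existing node v,
    chosen with probability proportional to v's current degree."""
    rng = random.Random(seed)
    endpoints = [0, 1]                 # each node appears once per unit of degree
    degree = {0: 1, 1: 1}              # seed graph: a single edge 0 - 1
    for u in range(2, num_nodes):
        v = rng.choice(endpoints)      # uniform over endpoints = proportional to degree
        degree[u] = 1
        degree[v] += 1
        endpoints.extend([u, v])
    return degree


degree = preferential_attachment(1000)
```

A histogram of `degree.values()` from such a run shows the heavy tail; a topic-aware variant would restrict or bias the choice of v by topic, which is the case the slide asks about.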
Random forward walk without jumps
[Figure: two panels, /Arts/Music and /Sports/Soccer; x-axis: wander hops (0 to 20); y-axis: L_1 distance; curves: from background, from hop0]
▪ Sampling walk is designed to mix topics well
▪ How about walking forward without jumping?
• Start from a page $u_0$ on a specific topic
• Sample many forward random walks $(u_0, u_1, \ldots, u_i, \ldots)$
• Compare $(\Pr(c \mid u_i)\ \forall c)$ with $(\Pr(c \mid u_0)\ \forall c)$ and with the background distribution
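The forward-walk experiment can be sketched as follows: sample walks that only follow out-links, then average the topic vectors of the pages reached at hop i. The function names, the in-memory link dictionary, and the toy topic vectors are assumptions for illustration; averaging the per-page vectors is one plausible way to summarize hop i, not necessarily the one used for the plots.

```python
import random


def forward_walks(out_links, start, hops, num_walks, seed=0):
    """Sample random walks that only follow out-links (no jumps)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        path = [start]
        for _ in range(hops):
            targets = out_links.get(path[-1], [])
            if not targets:            # dead end: stop this walk early
                break
            path.append(rng.choice(targets))
        walks.append(path)
    return walks


def mean_vector_at_hop(walks, topic_vec, i):
    """Average the topic vectors Pr(c | u_i) over all walks that reach hop i."""
    vecs = [topic_vec[w[i]] for w in walks if len(w) > i]
    if not vecs:
        return {}
    topics = {c for v in vecs for c in v}
    return {c: sum(v.get(c, 0.0) for v in vecs) / len(vecs) for c in topics}


out_links = {"a": ["b"], "b": ["c"], "c": []}
topic_vec = {
    "a": {"Music": 1.0},
    "b": {"Music": 0.5, "Shopping": 0.5},
    "c": {"Shopping": 1.0},
}
walks = forward_walks(out_links, "a", hops=5, num_walks=3)
hop1 = mean_vector_at_hop(walks, topic_vec, 1)
```

The hop-i vector can then be compared, for example by L1 distance, with the hop-0 vector and with the background distribution.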
Observations and implications
▪ Forward walks wander away from starting topic slowly
▪ But do not converge to the background distribution
▪ Global PageRank ok also for topic-specific queries
• Jump parameter d = 0.1 to 0.2
• Topic drift not too bad within path length of 5 to 10
• Prestige conferred mostly by same-topic neighbors
▪ Also explains why focused crawling works
[Figure: PageRank walk reaching a high-prestige node: w.p. d jump to a random node; w.p. (1-d) move to an out-neighbor u.a.r.]
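One way to relate the jump parameter d = 0.1 to 0.2 to path lengths of 5 to 10: in the PageRank walk, the number of steps L up to and including the next jump is geometrically distributed. This short derivation is an added illustration under that standard model, not something stated on the slides.

```latex
\Pr(L = k) = (1 - d)^{k-1}\, d, \qquad
\mathbb{E}[L] = \sum_{k \ge 1} k\,(1 - d)^{k-1}\, d = \frac{1}{d}
```

So d = 0.2 gives an expected run of 5 steps between jumps and d = 0.1 gives 10, which matches the range over which topic drift is described as "not too bad".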
Citation matrix
▪ Given a page is about topic i, how likely is it to link to topic j?
• Matrix C[i,j] = probability that page about topic i links to page about topic j
• Soft counting: for each link u → v, C[i,j] += Pr(i|u) Pr(j|v)
▪ Applications
• Classifying Web pages into topics
• Focused crawling for topic-specific pages
• Finding relations between topics in a directory
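The soft-counting rule C[i,j] += Pr(i|u) Pr(j|v) can be sketched as below. The function name and data layout are hypothetical, and the final row normalization is an assumption added so that each row reads as a probability distribution over target topics.

```python
def citation_matrix(links, topic_vec, topics):
    """Soft-count topic-to-topic citations over a list of (u, v) links.

    topic_vec[x] is a dict {topic: Pr(topic | x)} for page x.
    """
    C = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in links:
        for i in topics:
            p_i = topic_vec[u].get(i, 0.0)
            if p_i == 0.0:
                continue
            for j in topics:
                C[i][j] += p_i * topic_vec[v].get(j, 0.0)
    # Assumption: normalize each row so C[i][j] estimates Pr(link to j | source is i).
    for i in topics:
        row_sum = sum(C[i].values())
        if row_sum > 0:
            for j in topics:
                C[i][j] /= row_sum
    return C


topics = ["Arts", "Shopping"]
topic_vec = {"u": {"Arts": 1.0}, "v": {"Arts": 0.5, "Shopping": 0.5}}
C = citation_matrix([("u", "v")], topic_vec, topics)
```

Rows for topics that never occur as a source stay all zero in this sketch.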
Citation, confusion, correction
[Figure: citation matrices (axes: From topic, To topic) and the classifier's confusion matrix (axes: True topic, Guessed topic) over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
Classifier’s confusion on held-out documents can be used to correct the citation matrix
Fine-grained views of citation
Prominent off-diagonal entries (/Arts/Music to /Shopping/Music) raise design issues for taxonomy editors and maintainers
Clear block-structure derived from coarse-grain topics
Strong diagonal blocks reflect tightly-knit topic communities
Concluding remarks
▪ A model for content-based communities
• New characterization and measurement of topical locality on the Web
• How to set the PageRank jump parameter?
• Topical stability of topic distillation
• Better crawling and classification
▪ A tool for Web directory maintenance
• Fair sampling and representation of topics
• Block-structure and off-diagonals
• Taxonomy inversion