Order Out of Chaos
Analyzing the Link Structure of the Web for
Directory Compilation and Search.
Presented 19.11.2000
by Benjy Weinberger
The Search Problem
• The WWW is a vast collection of
information: over 1 billion text pages plus a
multitude of multimedia files. Over a
million new resources are added every day.
• How do we find the information we need in
such a large collection?
• Search is the most common activity on the
web after email.
Types of Solutions
• There are basically two types of solution to
the “search problem”:
– Directories are static lists of web pages,
ordered in a topic taxonomy.
– Search Engines attempt to match web sites to a
user’s query in real time.
• The distinction between the two types is
becoming more and more fuzzy.
Types of Searches
• Two main kinds of searches:
– Specific queries. E.g. “does Netscape 4.0
support the Java 1.1 code-signing API?”
– Broad-topic queries. E.g. “find out information
about Che Guevara.”
• Other types of searches exist, such as:
– Similar-page queries. E.g. “find pages ‘similar’
to www.cnn.com.”
Different Queries, Different Challenges
• Specific Queries: The difficulty is mostly
that there are very few pages with the
required information - the “needle in a
haystack” problem.
• Broad-topic Queries: The difficulty is that
there are too many relevant pages for
humans to digest.
Broad-topic Queries
• Center around the notion of an authority - a
page containing high-quality, authoritative
information on the relevant topic.
• Problem: we are asking a computer to make
a judgment on abstract and subjective
qualities of “relevance” and “authority”.
Manual Solutions
• Manually edited catalogues of pages,
ordered in a taxonomy of topics.
• Problems:
– Completely unscalable. Only a tiny fraction of
the web can be catalogued.
– Subjective. Decisions made by individual
editors.
Manual Solutions
• Nonetheless, this solution is highly popular.
• Yahoo is still the #1 site on the web in terms
of visits.
• The Open Directory Project has over
2,000,000 sites in 250,000 categories, edited
by tens of thousands of volunteers.
Automated Solutions - 1st Wave
• “Relevance” assumed to be associated with
existence of keywords.
• “Authority” assumed to be associated with
frequency and placement of keywords.
– The more a page mentions the word the higher
that page’s score.
– Font size, proximity to other keywords and
place in page may increase score.
Automated Solutions - 1st Wave
• Solution implemented by search engines
such as AltaVista, HotBot, Lycos, Northern
Light and many others.
• These solutions are inadequate:
– The determination of “relevance” is inaccurate.
– The determination of “authority” is inaccurate.
– Don’t work for non-textual resources.
• Result: users still end up wading through
thousands of results.
The Web as a Graph
• But ... the web is more than just a set of
pages. The hyperlinks between pages induce
a digraph structure on the web.
• Hyperlinks are created by human site
authors and therefore can imply some
human determination of authority.
• But how can we mine this information for
search?
Naive Approach
• Determine relevance by keyword matching.
• Determine authority by number of inbound
links.
• Heuristic:
– Find all pages matching keywords.
– Rank them by in-degree.
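The heuristic above can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the URLs, and the function name `rank_by_in_degree` are hypothetical, and keyword matching is done by plain substring search.

```python
from collections import Counter

def rank_by_in_degree(pages, links, query):
    """Return pages containing every query keyword, highest in-degree first.

    pages: dict mapping URL -> page text (hypothetical toy corpus)
    links: iterable of (source_url, target_url) hyperlinks
    """
    keywords = query.lower().split()
    # Step 1: naive relevance test, every keyword must occur as a substring.
    matches = [url for url, text in pages.items()
               if all(k in text.lower() for k in keywords)]
    # Step 2: authority estimate, the number of inbound links.
    in_degree = Counter(target for _, target in links)
    # Sort by in-degree (descending); ties are broken alphabetically.
    return sorted(matches, key=lambda url: (-in_degree[url], url))

pages = {
    "a.example": "cars and car manufacturers",
    "b.example": "a list of car manufacturers",
    "c.example": "cooking recipes",
}
links = [("c.example", "b.example"),
         ("a.example", "b.example"),
         ("b.example", "a.example")]
print(rank_by_in_degree(pages, links, "car manufacturers"))
```

Here b.example has two inbound links and a.example has one, so b.example is ranked first.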
Naive Approach - Problems
• Good authorities may not contain keywords.
– Example: honda.com, toyota.com, ford.com do
not contain the term “car manufacturers”.
• Doesn’t look at who is linking to the page.
• A highly popular page will be an authority
on ANY query string it contains.
– Example: cnn.com, yahoo.com, amazon.com.
Where Do We Stand
• The hyperlink structure of the web is
constructed by the deliberate annotations of
millions of web site authors.
• How can we mine this seemingly random
link structure for underlying meaning?
Short Mathematical Interlude
• Given a square matrix A, an eigenvalue is a
scalar $\lambda$ with the property that there exists a
vector $x \neq 0$ such that $Ax = \lambda x$. The vector
x is called an eigenvector.
• The normalized eigenvector corresponding
to the largest eigenvalue is called the
principal eigenvector.
Short Mathematical Interlude
• A stochastic matrix is a matrix with nonnegative entries such that the entries in each
row sum to 1.
• A stochastic matrix always has a principal
eigenvector (with eigenvalue 1):
$$A \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$
All eigenvalues are <= 1 in absolute value.
Short Mathematical Interlude
• For any matrix A, $AA^T$ and $A^TA$ are
symmetric.
• A symmetric real n-by-n matrix has n real
eigenvalues (with multiplicity).
• For any vector x not orthogonal to the
principal eigenvector y,
$$\frac{A^n x}{\|A^n x\|} \xrightarrow[n \to \infty]{} y$$
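The limit above is the basis of power iteration. A minimal pure-Python sketch, assuming a small symmetric matrix chosen for illustration (its eigenvalues are 3 and 1); the helper names are hypothetical:

```python
import math

def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def normalize(x):
    """Scale x to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def power_iteration(A, x, steps=100):
    """Repeatedly apply A and renormalize, approximating A^n x / ||A^n x||."""
    for _ in range(steps):
        x = normalize(mat_vec(A, x))
    return x

# Symmetric example: the principal eigenvector is (1, 1) / sqrt(2).
A = [[2.0, 1.0],
     [1.0, 2.0]]
y = power_iteration(A, [1.0, 0.0])  # start vector is not orthogonal to (1, 1)
print(y)
```

Renormalizing at every step rather than once at the end keeps the numbers from overflowing; the direction of the result is the same.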
Short Mathematical Interlude
• The adjacency matrix of a directed graph G
is the matrix A such that:
$$A_{i,j} = \begin{cases} 1 & (i,j) \in E(G) \\ 0 & (i,j) \notin E(G) \end{cases}$$
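As a small illustration of this definition, the sketch below builds the adjacency matrix of a hypothetical three-node graph from an edge list. Row sums then give out-degrees and column sums give in-degrees.

```python
def adjacency_matrix(n, edges):
    """Return the n-by-n 0/1 matrix with A[i][j] = 1 exactly when (i, j) is an edge."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A

# Hypothetical example graph on nodes 0, 1, 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(3, edges)

out_degree = [sum(row) for row in A]                            # links leaving each node
in_degree = [sum(A[i][j] for i in range(3)) for j in range(3)]  # links entering each node
print(A, out_degree, in_degree)
```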
Google
• High-quality, highly popular search engine.
• Developed by Sergey Brin and Larry Page
at Stanford. Later became a commercial
venture.
• Combines new ranking algorithm with
state-of-the-art engineering solutions for
scalability and performance.
Google
• Offline processing assigns each site on the
web a PageRank - a global ranking based
on link structure analysis.
• Relevance determined by keywords.
• Authority determined by PageRank.
PageRank
• Intuitive description: a page has high rank if
the sum of the ranks of pages pointing to it
is high.
• This is a circular definition, represented
mathematically by:
$$PR(p) = d \sum_{q \to p} \frac{PR(q)}{\deg(q)} + \frac{1-d}{n}$$
where n is the total number of pages, deg(q)
is q’s out degree and d is a damping factor.
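As a quick sanity check of the formula, consider a hypothetical two-page web in which p and q link only to each other, so n = 2 and deg(p) = deg(q) = 1. By symmetry PR(p) = PR(q), and assuming the ranks are normalized to sum to 1, each should equal 1/2. The formula agrees for any value of d:

```latex
PR(p) = d \cdot \frac{PR(q)}{\deg(q)} + \frac{1-d}{n}
      = d \cdot \frac{1/2}{1} + \frac{1-d}{2}
      = \frac{1}{2}
```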
PageRank
• PR is a probability vector on the web pages
(0 <= PR(p) <= 1 and the sum is 1).
• PR corresponds to a random surfer model:
– A surfer clicks at random through the web,
without hitting “back”. When she gets bored,
which happens with probability 1-d, she jumps
to a random page.
– PR(p) is the probability of hitting page p in this
“random surf”.
PageRank
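• We define a matrix A for which:
$$A_{i,j} = \begin{cases} 1/\deg(i) & i \to j \\ 0 & i \not\to j \end{cases}$$
Note that A is a stochastic matrix.
• The PageRank vector is an eigenvector of
$$B = dA^T + \frac{1-d}{n} J, \qquad J = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}$$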
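The two matrices can be built explicitly for a toy graph. The sketch below is illustrative only: the function names are hypothetical, d = 0.85 is an assumed damping value, and every page is assumed to have at least one outgoing link (otherwise its row of A would be all zeros and A would not be stochastic).

```python
def transition_matrix(n, edges):
    """A[i][j] = 1/deg(i) if i links to j, else 0 (assumes no duplicate edges)."""
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1.0 / out_degree[i]
    return A

def damped_matrix(A, d=0.85):
    """B = d * A^T + ((1 - d) / n) * J, where J is the all-ones matrix."""
    n = len(A)
    return [[d * A[j][i] + (1.0 - d) / n for j in range(n)] for i in range(n)]

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # hypothetical 3-page web
A = transition_matrix(3, edges)
B = damped_matrix(A)

row_sums_A = [sum(row) for row in A]                               # each row of A sums to 1
col_sums_B = [sum(B[i][j] for i in range(3)) for j in range(3)]    # each column of B sums to 1
print(row_sums_A, col_sums_B)
```

Because the columns of B sum to 1, applying B to a probability vector yields another probability vector.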
PageRank
• It is not hard to see, using the properties of
stochastic matrices, that B has a principal
eigenvector.
• We can initialize PR to all ones (rescaled so
that the entries sum to 1) and apply B
iteratively until it converges to this
eigenvector.
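The iteration can be sketched without forming B explicitly, by applying the PageRank formula to every page in each round. This is a minimal illustration and not Google's implementation: the function name, the toy graph, d = 0.85, and the fixed iteration count are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(p) = d * sum over q -> p of PR(q)/deg(q) + (1 - d)/n.

    out_links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # uniform start, entries sum to 1
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / n for p in pages}  # random-jump term
        for q, targets in out_links.items():
            share = d * pr[q] / len(targets)  # q splits its rank among its links
            for p in targets:
                new_pr[p] += share
        pr = new_pr
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical 3-page web
ranks = pagerank(graph)
print(ranks)
```

In this toy graph page c receives links from both a and b and ends up with the highest rank, followed by a and then b.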
Google
• Advantages:
– Superior results to other engines.
– Responds to any query, not just predetermined
topics.
• Disadvantages:
– PageRank is global, not contextual.
– Discriminates against “orphan” pages.
HITS
• HITS - Developed by Jon Kleinberg et al.
within the CLEVER project at IBM
Almaden.
• Introduced the twin notions of Hubs and
Authorities.
HITS - Hubs & Authorities
• An authority is a web site containing high-quality,
highly relevant information (with
respect to the given search query).
• A hub is a web site that points to many
related authorities.
• Intuitively, hubs help “pull” authorities
together into a dense cluster of related sites.
HITS
• Hub (resp. authority) weights determine
how good a hub (resp. authority) a site is.
– The idea is to first find a focused subgraph of
the web relevant to the specified topic.
– Then the link structure of this subgraph is
analyzed to find the hub and authority weights
for each site, thus identifying the top authorities
for this topic.
– Because this processing must be done for each
topic, HITS is used mainly for directory compilation.
HITS
• The focused subgraph is created by first
taking the t highest-ranked pages from a
text-based search engine as a root set R.
• R is expanded into the base set S by taking
all sites pointing to or pointed to by a site in
R.
• Note that while R may fail to contain some
“important” authorities, S will probably
contain them.
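The expansion from R to S can be sketched as a single pass over the link list. The site names and the function name `base_set` below are hypothetical, and the root set is simply given rather than taken from a text search engine.

```python
def base_set(root, links):
    """Expand the root set R into the base set S.

    S contains R plus every site that links to, or is linked from, a site in R.
    links: iterable of (source, target) pairs.
    """
    root = set(root)
    S = set(root)
    for src, dst in links:
        if src in root:
            S.add(dst)  # site pointed to by a root site
        if dst in root:
            S.add(src)  # site pointing to a root site
    return S

links = [("h1", "r1"), ("r1", "x"), ("h2", "y"), ("r2", "y")]
S = base_set({"r1", "r2"}, links)
print(sorted(S))
```

Note that h2 is not added: it links to y, which is in S but not in R, so the expansion goes only one link away from the root set.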
HITS
• We have a circular definition:
– A good authority is a site pointed to by many
good hubs.
– A good hub is a site that points to many good
authorities.
• This is the mutually reinforcing
relationship.
• How can we break this circularity?
HITS
• A static mathematical formulation:
– For a web site p let h(p) be its hub weight and
a(p) be its authority weight. We take the vectors
a, h to be normalized.
– We would like to find vectors a, h such that (up
to normalization):
$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
HITS
• If A is the adjacency matrix of the graph in
question (the focused subgraph) then the
constraints can be written as:
$$h = \frac{Aa}{\|Aa\|}, \qquad a = \frac{A^T h}{\|A^T h\|}$$
By substitution we see that a is defined by:
$$a = \frac{Ba}{\|Ba\|}, \qquad B = A^T A$$
HITS
• In other words, a is an eigenvector of B:
$$Ba = \lambda a, \qquad \lambda = \|Ba\|$$
• B is the co-citation matrix: B(i,j) is the
number of sites that jointly point to both i
and j.
• B is symmetric and has n orthogonal unit
eigenvectors. Which one do we take?
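To make the co-citation interpretation concrete, the sketch below computes B = A^T A for a hypothetical five-node graph in which nodes 0 and 1 act as hubs. The function name is illustrative.

```python
def cocitation(A):
    """Return B = A^T A, so B[i][j] counts the nodes k with k -> i and k -> j."""
    n = len(A)
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical graph: node 0 points to 2 and 3; node 1 points to 2, 3 and 4.
n = 5
A = [[0] * n for _ in range(n)]
for i, j in [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)]:
    A[i][j] = 1

B = cocitation(A)
print(B[2][3], B[2][4], B[2][2])
```

Nodes 2 and 3 are cited together by both hubs, nodes 2 and 4 by one hub only, and the diagonal entry B[2][2] is simply the in-degree of node 2.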
HITS
• Let’s look at the problem dynamically:
– We initialize a(p) = h(p) = 1 for all p.
– We iterate the following operations:
$$a(p) \leftarrow \sum_{q \to p} h(q), \qquad h(p) \leftarrow \sum_{p \to q} a(q)$$
$$a \leftarrow A^T h, \qquad h \leftarrow Aa$$
$$\Rightarrow \quad a \leftarrow Ba, \qquad B = A^T A$$
And renormalize after each iteration.
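The iteration can be sketched directly on an edge list, without building the matrices. This is a minimal illustration under stated assumptions: the graph, the site names, and the function names are hypothetical, weights are normalized to unit Euclidean length, and 20 rounds are used.

```python
import math

def unit(weights):
    """Scale a dict of weights to unit Euclidean length (left as-is if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}

def hits(nodes, edges, iterations=20):
    """Return (authority, hub) weights after the given number of rounds."""
    h = {p: 1.0 for p in nodes}
    a = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # a(p) <- sum of h(q) over all links q -> p
        new_a = {p: 0.0 for p in nodes}
        for q, p in edges:
            new_a[p] += h[q]
        a = unit(new_a)
        # h(p) <- sum of a(q) over all links p -> q, using the updated a
        new_h = {p: 0.0 for p in nodes}
        for p, q in edges:
            new_h[p] += a[q]
        h = unit(new_h)
    return a, h

nodes = ["h1", "h2", "h3", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
authority, hub = hits(nodes, edges)
top = max(authority, key=authority.get)
print(top, authority, hub)
```

In this toy graph a2 is pointed to by all three hubs and becomes the top authority, while h1 and h2, which point to both authorities, score higher as hubs than h3.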
HITS
• The eigenvectors of B are precisely the
stationary points of this process.
• For any stationary point a, $Ba = \lambda a$ and
therefore the authority weights are
reinforced by the factor $\lambda$.
• Conclusion: The principal eigenvector
represents the “densest cluster” within the
focused subgraph.
HITS
• As we have said, by initializing
a(p)=h(p)=1, a will converge to the
principal eigenvector of B.
– Initializing differently may lead to convergence
to a different eigenvector.
– In practice convergence is achieved after only
10-20 iterations.
HITS
• The non-principal eigenvectors may
represent other, less dense subclusters of the
focused subgraph.
• Other subclusters can occur when:
– A query string has several meanings
– A concept is relevant to several communities.
– The topic is highly polarized, involving groups
that are not likely to link to one another (such
as “abortion”).
HITS
• Unlike the principal eigenvector, the other
eigenvectors have both positive and
negative entries.
• Each non-principal eigenvector provides us
with two densely connected clusters: The
“most positive” and “most negative” entries
of the vector.
• The larger the eigenvalue, the more
“relevant” the corresponding eigenvector.
Spectral Filtering
• A generalization of HITS.
• Developed by Chakrabarti, Dom, Gibson,
Raghavan and others. Some of these worked
with Kleinberg on CLEVER at IBM
Almaden.
• Used commercially in Quiver to mine
community knowledge.
Spectral Filtering
• We have two kinds of entities: hubs and
authorities. These can both be of the same
type (such as web sites, cited documents) or
of different types (people and products,
bookmark folders and web sites).
• Note that we distinguish, a priori, between
hubs and authorities.
Spectral Filtering
• The link structure consists of directed links
from hubs to authorities.
• In mathematical terms we have a bipartite
digraph (H;A;E) with all edges directed
from H to A.
Spectral Filtering - Affinities
• For each hub i and authority j let a(i,j) >= 0
be the affinity of i for j.
• E.g: The percentage of common terms
between documents or the number of visits
a user makes to a bookmark.
• $A = (a(i,j))$ is the affinity matrix.
Spectral Filtering
• In SF the mutually reinforcing relationship
becomes:
$$a(p) = \sum_{q \to p} a_{q,p}\, h(q), \qquad h(p) = \sum_{p \to q} a_{p,q}\, a(q)$$
Or:
$$a = A^T h, \qquad h = Aa$$
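The weighted update can be sketched with an explicit affinity matrix whose rows are hubs and whose columns are authorities. The numbers below are a hypothetical example (say, visit counts of three users to three sites), and the function names and iteration count are assumptions.

```python
import math

def unit(x):
    """Scale a list of weights to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def spectral_filtering(affinity, iterations=50):
    """Iterate a <- A^T h, h <- A a with renormalization.

    affinity[i][j] >= 0 is the affinity of hub i for authority j.
    """
    n_hubs, n_auth = len(affinity), len(affinity[0])
    h = [1.0] * n_hubs
    for _ in range(iterations):
        a = unit([sum(affinity[i][j] * h[i] for i in range(n_hubs))
                  for j in range(n_auth)])
        h = unit([sum(affinity[i][j] * a[j] for j in range(n_auth))
                  for i in range(n_hubs)])
    return a, h

# Hubs 0 and 1 share interest in authorities 0 and 1; hub 2 only visits authority 2.
affinity = [[3, 1, 0],
            [2, 1, 0],
            [0, 0, 1]]
a, h = spectral_filtering(affinity)
print(a, h)
```

The weights concentrate on the more strongly linked cluster (authorities 0 and 1), while the weight of the isolated authority 2 decays toward zero.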
Spectral Filtering
• Note that HITS is simply SF with each web
site represented as both a hub and an
authority, and with the following simple
affinities:
$$a_{i,j} = \begin{cases} 1 & i \to j \\ 0 & i \not\to j \end{cases}$$
Spectral Filtering
• The spectral analysis used in HITS applies
here, and the principal eigenvector is used
to determine the highest ranked authorities.
Spectral Filtering
• Notes:
– The links from hubs to authorities represent the
opinions of many people regarding the
authorities.
– SF mines this information for results reflecting
the opinions of these people as a group.
– The SF algorithm “pulls” the densely linked
cluster together.
– SF is relatively insensitive to “noise”.
References
[1] J. Kleinberg. Authoritative Sources in a Hyperlinked Environment.
Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete
Algorithms, 1998.
[2] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan,
S. Rajagopalan, A. Tomkins. Spectral Filtering for Resource Discovery.
ACM SIGIR Workshop on Hypertext Information Retrieval on the Web,
1998.
[3] S. Brin, L. Page. The Anatomy of a Large-Scale Hypertextual Web
Search Engine. Proc. 7th International WWW Conference, 1998.
[4] S. Brin, R. Motwani, L. Page, T. Winograd. The PageRank Citation
Ranking: Bringing Order to the Web. Manuscript.
End
• Thank you for listening!