Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger The Search Problem • The WWW is a vast collection of information: over 1 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day. • How do we find the information we need in such a large collection? • Search is the most common activity on the web after email. Types of Solutions • There are basically two types of solution to the “search problem”: – Directories are static lists of web pages, ordered in a topic taxonomy. – Search Engines attempt to match web sites to a user’s query in real time. • The distinction between the two types is becoming more and more fuzzy. Types of Searches • Two main kinds of searches: – Specific queries. E.g. “does Netscape 4.0 support the Java 1.1 code-signing API?” – Broad-topic queries. E.g. “find out information about Che Guevara.” • Other types of searches exist, such as: – Similar-page queries. E.g. “find pages ‘similar’ to www.cnn.com.” Different Queries Different Challenges • Specific Queries: The difficulty is mostly that there are very few pages with the required information - the “needle in a haystack” problem. • Broad-topic Queries: The difficulty is that there are too many relevant pages for humans to digest. Broad-topic Queries • Center around the notion of an authority - a page containing high-quality, authoritative information on the relevant topic. • Problem: we are asking a computer to make a judgment on abstract and subjective qualities of “relevance” and “authority”. Manual Solutions • Manually edited catalogues of pages, ordered in a taxonomy of topics. • Problems: – Completely unscalable. Only a tiny fraction of the web can be catalogued. – Subjective. Decisions made by individual editors. Manual Solutions • Nonetheless, solution is highly popular. • Yahoo is still the #1 site on the web in terms of visits. • The Open Directory Project is has over 2,000,000 sites in 250,000 categories, edited by tens of thousands of volunteers. Automated Solutions - 1st Wave • “Relevance” assumed to be associated with existence of keywords. • “Authority” assumed to be associated with frequency and placement of keywords. – The more a page mentions the word the higher that page’s score. – Font size, proximity to other keywords and place in page may increase score. Automated Solutions - 1st Wave • Solution implemented by search engines such as AltaVista, HotBot, Lycos, Northern Light and many others. • These solutions are inadequate: – The determination of “relevance” is inaccurate. – The determination of “authority” is inaccurate. – Don’t work for non-textual resources. • Result: users still end up wading through thousands of results. The Web as a Graph • But ... the web is more than just a set of pages. The hyperlinks between pages induce a digraph structure on the web. • Hyperlinks are created by human site authors and therefore can imply some human determination of authority. • But how can we mine this information for search? Naive Approach • Determine relevance by keyword matching. • Determine authority by number of inbound links. • Heuristic: – Find all pages matching keywords. – Rank them by in-degree. Naive Approach - Problems • Good authorities may not contain keywords. – Example: honda.com, toyota.com, ford.com do not contain the term “car manufacturers”. • Doesn’t look at who is linking to the page. • A highly popular page will be an authority on ANY query string it contains. – Example: cnn.com, yahoo.com, amazon.com. Where Do We Stand • The hyperlink structure of the web is constructed by the deliberate annotations of millions of web site authors. • How can we mine this seemingly random link structure for underlying meaning? Short Mathematical Interlude • Given a square matrix A, an eigenvalue is a scalar with the property that there exists a vector x 0 such that Ax x . The vector x is called an eigenvector. • The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector. Short Mathematical Interlude • A stochastic matrix is a matrix with nonnegative entries such that the entries in each row sum to 1. • A stochastic matrix always has a principal eigenvector (with eigenvalue 1): 1 1 A 1 1 1 All eigenvalues are <= 1 in absolute value. Short Mathematical Interlude • For any matrix A, AA T and A T A are symmetric. • A symmetric real n-by-n matrix has n real eigenvalues (with multiplicity). • For any vector x not orthogonal to the principal eigenvector y, An x n y n A x Short Mathematical Interlude • The adjacency matrix of a directed graph G is the matrix A such that: 1 (i, j) E(G ) Ai, j 0 (i, j) E(G ) Google • High-quality, highly popular search engine. • Developed by Sergey Brin and Larry Page at Stanford. Later became a commercial venture. • Combines new ranking algorithm with state-of-the-art engineering solutions for scalability and performance. Google • Offline processing assigns each site on the web a PageRank - a global ranking based on link structure analysis. • Relevance determined by keywords. • Authority determined by PageRank. PageRank • Intuitive description: a page has high rank if the sum of the ranks of pages pointing to it is high. • This is a circular definition, represented mathematically by: PR (q) 1 d PR (p) d n q p deg( q ) where n is the total number of pages, deg(q) is q’s out degree and d is a damping factor. PageRank • PR is a probability vector on the web pages (0 <= PR(p) <= 1 and the sum is 1). • PR corresponds to a random surfer model: – A surfer clicks at random through the web, without hitting “back”. When she gets bored, which happens with probability 1-d, she jumps to a random page. – PR(p) is the probability of hitting page p in this “random surf”. PageRank • We define a matrix A for which: 1 deg( i) i j Ai, j i j 0 Note that A is a stochastic matrix. • The PageRank vector is an eigenvector of B dA T 1d n J , J 1 1 1 1 PageRank • It is not hard to see, using the properties of stochastic matrices, that B has a principal eigenvector. • We can initialize PR to all ones and apply B iteratively until it converges to this eigenvector. Google • Advantages: – Superior results to other engines. – Responds to any query, not just predetermined topics. • Disadvantages: – PageRank is global, not contextual. – Discriminates against “orphan” pages. HITS • HITS - Developed by Jon Kleinberg et al within the CLEVER project at IBM Almaden. • Introduced the twin notions of Hubs and Authorities. HITS - Hubs & Authorities • An authority is a web site containing highquality, highly-relevant information (with respect to the given search query). • A hub is a web site that points to many related authorities. • Intuitively, hubs help “pull” authorities together into a dense cluster of related sites. HITS • Hub (resp. authority) weights determine how good a hub (resp. authority) a site is. – The idea is to first find a focused subgraph of the web relevant to the specified topic. – Then the link structure of this subgraph is analyzed to find the hub and authority weights for each site, thus identifying the top authorities for this topic. – Processing requirement per topic implies that uses are mainly for directory compilation. HITS • The focused subgraph is created by first taking the t highest-ranked pages from a text-based search engine as a root set R. • R is expanded into the base set S by taking all sites pointing to or pointed at by a site in R. • Note that while R may fail to contain some “important” authorities, S will probably contain them. HITS • We have a circular definition: – A good authority is a site pointed to by many good hubs. – A good hub is a site that points to many good authorities. • This is the mutually reinforcing relationship. • How can we break this circularity? HITS • A static mathematical formulation: – For a web site p let h(p) be its hub weight and a(p) be its authority weight. We take the vectors a, h to be normalized. – We would like to find vectors a , h such that (up to normalization): a ( p) h (q ) q p , h ( p) a (q ) p q HITS • If A is the adjacency matrix of the graph in question (the focused subgraph) then the constraints can be written as: Aa h Aa T A h , a T A h By substitution we see that a is defined by: Ba a , B AT A Ba HITS • In other words, a is an eigenvector of B: Ba a , Ba • B is the co-citation matrix: B(i,j) is the number of sites that jointly point to both i and j. • B is symmetric and has n orthogonal unit eigenvectors. Which one do we take? HITS • Let’s look at the problem dynamically: – We initialize a(p) = h(p) = 1 for all p. – We iterate the following operations: a ( p ) h (q ) h ( p) a (q ) q p p q a ATh h Aa a Ba , B A T A And renormalize after each iteration. HITS • The eigenvectors of B are precisely the stationary points of this process. • For any stationary point a, Ba a and therefore the authority weights are reinforced by the factor . • Conclusion: The principal eigenvector represents the “densest cluster” within the focused subgraph. HITS • As we have said, by initializing a(p)=h(p)=1, a will converge to the principal eigenvector of B. – Initializing differently may lead to convergence to a different eigenvector. – In practice convergence is achieved after only 10-20 iterations. HITS • The non-principal eigenvectors may represent other, less dense subclusters of the focused subgraph. • Other subclusters can occur when: – A query string has several meanings – A concept is relevant to several communities. – The topic is highly polarized, involving groups that are not likely to link to one another (such as “abortion”). HITS • Unlike the principal eigenvector, the other eigenvectors have both positive and negative entries. • Each non-principal eigenvector provides us with two densely connected clusters: The “most positive” and “most negative” entries of the vector. • The larger the eigenvalue, the more “relevant” the corresponding eigenvector. Spectral Filtering • A generalization of HITS. • Developed by Chakrabarti, Dom, Gibson, Raghavan and others. Some of these worked with Kleinberg on CLEVER at IBM Almaden. • Used commercially in Quiver to mine community knowledge. Spectral Filtering • We have two kinds of entities: hubs and authorities. These can both be of the same type (such as web sites, cited documents) or of different types (people and products, bookmark folders and web sites). • Note that we distinguish, a priori, between hubs and authorities. Spectral Filtering • The link structure consists of directed links from hubs to authorities. • In mathematical terms we have a bipartite digraph (H;A;E) with all edges directed from H to A. Spectral Filtering - Affinities • For each hub i and authority j let a(i,j) >= 0 be the affinity of i for j. • E.g: The percentage of common terms between documents or the number of visits a user makes to a bookmark. • A=a(i,j) is the affinity matrix. Spectral Filtering • In SF the mutually reinforcing relationship becomes: a ( p ) a q , p h (q ) , h ( p ) a p , q a (q ) q p p q Or: a A T h , h Aa Spectral Filtering • Note that HITS is simply SF with each web site represented as both a hub and an authority, and with the following simple affinities: 1 i j a i, j j 0 i Spectral Filtering • The spectral analysis used in HITS applies here, and the principal eigenvector is used to determine the highest ranked authorities. Spectral Filtering • Notes: – The links from hubs to authorities represent the opinions of many people regarding the authorities. – SF mines this information for results reflecting the opinions of these people as a group. – The SF algorithm “pulls” the densely linked cluster together. – SF is relatively insensitive to “noise”. References [1] J.Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998. [2] S.Chakrabarti, B.Dom, D.Gibson, R.Kumar, P.Raghavan, S.Rajagopalan, A.Tomkins. Spectral Filtering for Resource Discovery. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998. [3] S.Brin, L.Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. 7th International WWW Conference, 1998. [4] S.Brin, R.Motwani,L.Page,T.Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript. End • Thank you for listening!
© Copyright 2026 Paperzz