Improving Rank Algorithm of Search Engine with
Ontology and Categorization
Research Proposal
By
Qiaowei Dai
A thesis submitted for the degree of
Master of Science (Computer and Information Science)
School of Computer and Information Science
Division of Information Technology, Engineering and the Environment
Supervisor
Dr. Jiuyong Li
1st June 2009
University of South Australia
Contents
INTRODUCTION .................................................................................................................................. 5
1.1 BACKGROUND ................................................................................................................................ 5
1.2 MOTIVATION ................................................................................................................................... 8
1.3 RESEARCH AIM ............................................................................................................................... 8
1.4 SCOPE ............................................................................................................................................. 9
TRADITIONAL LINK STRUCTURE-BASED RANK ALGORITHMS ....................................... 10
2.1 OBJECTIVE AND USAGE OF HYPERLINK EXISTING IN WEB............................................................ 10
2.2 PAGERANK ALGORITHM ............................................................................................................... 11
2.2.1 Simplified PageRank Algorithm ........................................................................................... 13
2.2.2 Improved PageRank Algorithm ............................................................................................ 14
2.3 HITS ALGORITHM ........................................................................................................................ 16
2.3.1 Analysis of HITS Algorithm.................................................................................................. 17
2.3.2 Analysis of HITS Link .......................................................................................................... 18
2.4 SUMMARY ..................................................................................................................... 20
LITERATURE REVIEW .................................................................................................................... 21
3.1 RESEARCH ON TRADITIONAL RANK ALGORITHMS OF SEARCH ENGINES ...................................... 21
3.1.1 Problems and Improvements of PageRank Algorithm .......................................................... 21
3.1.2 Problems and Improvements of Hits Algorithm ................................................................... 25
3.2 TRADITIONAL DOMAIN ONTOLOGY-BASED CONCEPT SEMANTIC SIMILARITY COMPUTATION ...... 28
3.2.1 Ontology............................................................................................................................... 28
3.2.2 Three Main Semantic Similarity Computation Models ........................................................ 29
3.3 SUMMARY ..................................................................................................................................... 29
METHODOLOGY ............................................................................................................................... 31
4.1 RESEARCH QUESTIONS ................................................................................................................. 31
4.2 RESEARCH STRATEGY ................................................................................................................... 31
4.3 EVALUATION TOOLS ...................................................................................................................... 34
4.3.1 The Ontology Tool ................................................................................................................ 34
CONCLUSION ..................................................................................................................................... 35
5.1 EXPECTED OUTCOMES .................................................................................................................. 35
5.2 CONTRIBUTIONS ........................................................................................................................... 35
REFERENCES ..................................................................................................................................... 36
Abstract
The emergence and rapid development of the Internet have greatly changed the
environment of information retrieval. The ranking algorithms used by Internet search
engines directly determine the experience users have when they retrieve information
in this new environment.
The existing rank algorithms for search engines are mainly based on the link structure
of Web pages, and the two main representative algorithms are the PageRank algorithm
and the HITS algorithm. Many scholars and research institutions have made new
explorations and improvements based on these two algorithms, and some mature
integrated rank models suitable for search engines have been developed.
In this paper, we study the shortcomings of search engines and provide further
analysis of the PageRank and HITS algorithms. Besides, we discuss the existing
improved algorithms based on link structure and analyse the ideas behind existing
improvements to search engine rank technology. Moreover, research on traditional
concept semantic similarity computation models based on domain ontology is
presented as well.
According to the characteristics and shortcomings of existing models and algorithms,
we first propose an improved concept semantic similarity computation model. Then,
building on it, we present an improved rank algorithm that integrates categorization
technology with traditional link analysis; it improves the HITS algorithm in two
aspects, the pre-processing of Web pages and the analysis of the link structure of Web
pages. At last, the evaluations are provided as well.
CHAPTER 1
Introduction
1.1 Background
Search engines have gradually become a highly efficient and convenient way for
people to query data and acquire information. With the continuous development of
search engine technology, the current mature commercial search engines have gone
through several generations of evolution. Meanwhile, Web information retrieval
technology, which is the essence of search engines, including commercial products,
has existed for about 20 years. In this period, great progress has been made in
retrieval key technology, system structure design, query algorithms and so on, and a
lot of commercial search engine services are in use on the Web. Compared with this
progress, the rapid growth of data on the Web weakens the achievements obtained in
the field of Web search to some degree, and the massive data volume and frequent
updates have brought completely new challenges as well. Currently, according to my
research, the shortcomings of Web information retrieval are mainly shown in the
following aspects:
Low query quality
Low query quality means that although a large number of result pages are returned,
the number that really accords with users' requirements is low. Moreover, most of the
relevant links do not appear at the top of the query results. Users have to keep trying
and turning pages in order to find valuable information, and a lot of time is consumed
by this process. In an age in which the amount of Web information keeps increasing,
this problem has become particularly prominent. Improving Web query quality is the
most critical subject of current intelligent information retrieval research; after Web
mining technology is integrated, the query quality of search engines can be greatly
improved.
Low query update speed
There are two reasons for the low update speed of Web query results. One is the low
efficiency of the Crawler system of search engines: the collection period of
documents is too long, so by the time the index is completed, differences have
emerged between the acquired content and the newest pages. The other is that the
update speed of Web documents has become faster and faster. Currently, many
Websites include dynamic pages, which are generated from a background database,
so a change to the database will directly change these dynamic pages. The update
speed of some static pages is increasing as well. Between two consecutive visits by
the Web Crawler, many Web pages change far more than twice, so users cannot obtain
the content of these changes through a query.
Lack of effective information categorization
Currently, most search engines present query results as a paged list in which all the
relevant and irrelevant links are put together without any association. This is quite
inconvenient for users with an explicit query objective, because they have to keep
jumping or selecting between various links. Categorizing and clustering query pages
is an effective way to improve the quality of user navigation: it allows users to select
a category quickly and further refine the query targets within that category. For
example, if we input "mining" into Vivisimo, several categories such as "data mining",
"gold" and "Mining and Metallurgy" emerge, and users can make further queries in
each category.
Keyword-based Web query lacks understanding of user behavior
From the viewpoint of the development of Web retrieval technology, keyword-based
query will remain the most important retrieval method for quite a long time to come.
Keyword-based query is a retrieval mechanism implemented by the Boolean
combination of keywords. However, the query functions provided by current search
engines are quite limited: most search engines provide only the most basic Boolean
connections between keywords. For instance, Yahoo only provides two logical
operators, "AND" and "OR", and forces one logical operator to apply to all keywords.
In many cases, it is quite difficult to construct an effective query combination.
On the other hand, even for the same keywords, the search objectives of different
users may differ; they are closely related to factors such as the user's personal
preference, the context of the current search, previous search history and so on. After
these parameters are fully considered, a search engine that better accords with users'
requirements can be designed. In Lawrence and Giles's (1998) paper, they proposed a
context-based Web retrieval and query correction method.
Low index coverage rate of Web search engines
Currently, the coverage rate of search engines over the Web is lower than 50%; it is
quite difficult to completely index the whole Web because of resource restrictions.
Under such a low index coverage rate, many search services adopt the same download
priority for every page when collecting documents, which leaves many pages with
low reference value in the index database while some relatively important pages are
not indexed. To solve this problem, the quality of resources needs to be discriminated
during Crawler traversal: pages with high quality should be downloaded first, and the
index database should be constructed according to priority. In Chakrabarti, van den
Berg and Dom's (1999) paper, they proposed an algorithm that analyzes Web
document quality in real time and determines download priority by means of focused
crawling, which makes up for the shortcoming of low coverage rate to some degree.
1.2 Motivation
According to the Intelligent Surfer Model, we can consider that users' behaviour in
browsing Web pages is not absolutely random or blind, but related to topic. That is,
among the numerous outbound links of a Web page, the outbound links that belong to
the same or a similar Web page category will have a higher click rate.
Neither the PageRank algorithm nor the HITS algorithm considers this: they
objectively describe the essential characteristics of the relations between Web pages,
but rarely consider the topic relativity of users' surfing habits. Link structure-based
algorithms can be integrated with other technologies very well in order to improve
their adaptability. Categorization technology can simulate users' topic-related habits,
and so can improve this kind of link structure-based algorithm. Categorization
technology overcomes the unreliability brought by the assumption that users'
behaviour in visiting Web pages is absolutely random, and distinguishes the direction
relation between Web pages according to category attributes; thus categorization
technology can be regarded as an important supplement to traditional algorithms.
1.3 Research Aim
My research aim is to establish an improved rank algorithm for search engines based
on domain ontology and categorization technology, so that the rank algorithm
simulates users' actual behaviour in browsing Web pages more accurately. To achieve
this research aim, we have three objectives.
The first objective is to analyze the traditional link structure-based rank algorithms
for search engines in order to gain an insight into their principles as a basis for further
study. Through the research and analysis of traditional domain ontology-based concept
semantic similarity computation, we can gain a full understanding of the principles
and weaknesses of the three common computation models. Therefore, our second
objective is to analyze and improve the decision factors of concept semantic similarity
computation, and then develop an improved concept semantic similarity computation
model in order to determine the relation between two categories in the categorization
process.
The third objective is, building on the study of the category-integrated PageRank
algorithm, to first perform a pre-categorization process on Web pages and keywords
based on the improved concept semantic similarity computation model, and then
develop a category-based HITS algorithm to satisfy the final aim of this thesis.
1.4 Scope
This thesis will focus on a rank algorithm for search engines that adapts to the link
structure of the network and gives accurate feedback on the information queried by
the user. A good rank algorithm should be able to filter the content of Web pages,
reject irrelevant Web pages, and move the Web pages that are most relevant and
closest to the query condition to the top of the list. Meanwhile, the waiting time for
this kind of rank computation should be within a range acceptable to users.
CHAPTER 2
Traditional Link Structure-based Rank Algorithms
The hyperlink is a very important component of the Web. Through the hyperlinks in a
page, users can move from a page on any WWW server in the world to a page on
another WWW server. Hyperlinks not only provide convenient information navigation,
but also constitute an information organization method that carries rich and effective
information for Web information retrieval.
2.1 Objective and Usage of Hyperlink Existing in Web
To assist users with intra-Website navigation, the internal hyperlinks of a Website
allow users to jump freely between its pages, avoiding the need to use the "back"
button of the Web browser. In a well-designed Website, it should be possible to reach
any other page of the Website from an arbitrary page through a small number of links.
The main function of this kind of hyperlink is to help users visit the content of the
whole Website in an orderly way.
The other kind of hyperlink is the extra-Website hyperlink, which is the most
important form of hyperlink in Web hyperlink mining research. Generally, an
extra-Website hyperlink represents the page creator's attention to, and preference for,
some Website or content; in other words, some potential relation exists. For example,
adding a hyperlink to Yahoo on a page represents the author's recommendation and
preference for Yahoo; adding a hyperlink to Kdnuggets, a famous data mining Website,
represents that the page author is interested in data mining and that the page itself is
possibly related to data mining research. If the URL of some page is linked to many
times on the Web, it indicates that the quality of its content is high; on the contrary, its
importance is lower. This kind of evaluation mechanism is similar to citation in
scientific papers: a paper referenced more times by other people is considered more
important than one referenced less often. In Web retrieval, besides the number of
times a document is linked to by other documents, the quality of the source document
is also a factor in evaluating the quality of linked documents: a document linked to or
recommended by a high-quality document always has higher authority. In hyperlink
analysis, the Web can be considered as a graph structure, and analyzing the link
relations between nodes can help solve the difficult problem that text content-based
retrieval cannot achieve content quality evaluation.
Compared with traditional search engines, which rank query results based on word
frequency statistics, the advantage of hyperlink analysis-based algorithms is that they
provide an objective and cheat-resistant Web resource evaluation method (some Web
documents cheat traditional search engines by adding invisible strings). Currently,
link analysis algorithms are used in many Web information acquisition tasks,
including ranking search engine results, finding related documents, and ordering the
URLs to be crawled by a Web Crawler (Dean, J & Henzinger, RM 1999). Compared
with the word frequency statistics-based methods used by traditional search engines,
Web retrieval algorithms based on hyperlink analysis such as the PageRank algorithm
bring a great improvement in retrieval precision (Haveliwala, HT 1999).
2.2 PageRank Algorithm
PageRank is a global link analysis algorithm proposed by S. Brin and L. Page (Brin, S
& Page, L 1998). It collects statistics on the URL link structure of the whole Web, and
calculates a weight for every URL, called the PageRank value of that page, according
to factors such as the number of incoming links. This PageRank value is fixed and
does not change with the query keyword, which distinguishes it from the local link
analysis algorithm HITS.
Figure 1 Directed link graph G (nodes a, b, c, u, v and w)
For example, in Figure 1, page u includes a hyperlink referring to page v, so there
exists a link (u, v). The hyperlinks between pages compose a directed graph G: each
page is a node of G, and there is a directed edge (u, v) from u to v if and only if u
includes a hyperlink referring to page v.
For node v, nodes b, c and u contribute to the weight value of v, because all three
nodes have directed edges to v. The more directed edges refer to some node, the
higher the quality of that node (page). The main shortcoming of this kind of
evaluation is that only the link quantity is considered, which means all links are
treated as equivalent, while the quality of the source node itself is not considered. In
fact, high-quality pages on the Web tend to contain high-quality links, and for
evaluating the quality of linked documents, the impact of the quality of the source
node is usually higher than that of the quantity. For example, the links appearing on
Yahoo always have a certain reference value, because Yahoo itself is a relatively
authoritative Website, just as papers published in top venues always have higher
academic value.
The PageRank algorithm is recursive in form: the value of a page depends on the
number of times it is linked and on the PageRank values of its source links (Brin, S &
Page, L 1998).
2.2.1 Simplified PageRank Algorithm
The simplified PageRank algorithm implements the basic recursive relation between
link counts and source PageRank. Let the pages on the Web be 1, 2, ..., m, let N(i) be
the number of outbound links of page i, and let B(i) be the set of pages referring to
page i. Assume the Web is a strongly connected graph (actually it is not; this problem
is discussed in the next section). Then the PageRank value of page i can be expressed
as:

r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}

The expression above can be written as r = A^T r, where r is an m x 1 vector and the
element a_{ij} of matrix A equals 1/N(i) if page i refers to page j, and 0 otherwise.
Thus, vector r is an eigenvector of matrix A^T. Because the Web is assumed to be
strongly connected, the corresponding eigenvalue of A^T is 1.
From the definition above, we can see that PageRank accords with the Random Surfer
Model (Page, L, Brin, S & Motwani, R 1998). The Random Surfer Model can be
understood in this way: assume a user visits Web pages by randomly clicking
hyperlinks, never uses the "back" function, and keeps clicking continuously. The
PageRank of page i is essentially the probability of reaching page i in the process of
browsing the whole Web by such random surfing. Motwani and Raghavan (1995)
made further studies of random walks, and this work can also be used to analyze the
link properties of the Web.
The simplified PageRank algorithm can be computed iteratively; the iteration stops
when the PageRank values have converged, that is, when the deviation between
successive iterations is small enough. For example, for the graph in Figure 2, the
computation procedure is as follows:
1 Select an arbitrary initial vector s;
2 Compute r = A^T s;
3 If |r - s| < ε (ε is the selected iteration threshold), stop the iteration; r is the
PageRank vector;
4 Otherwise set s = r and go back to step 2.
Figure 2 shows the rank value of every node in a small graph structure computed by
the simplified PageRank algorithm; a short code sketch of this iteration is given after
the figure. According to the Random Surfer Model interpretation of PageRank, the
PageRank values of all nodes sum to 1.
Figure 2 Simplified PageRank algorithm on a five-node graph (r1 = 0.286, r2 = 0.286,
r3 = 0.143, r4 = 0.143, r5 = 0.143)
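To make the iterative procedure concrete, the following short Python sketch runs the
simplified PageRank iteration on a small graph. The adjacency lists, the threshold and
the function name are illustrative assumptions; the edges are chosen to be consistent
with the values shown in Figure 2, but the exact graph used in the figure is not given
in the text.

```python
# Simplified PageRank: r = A^T r computed by iteration.
# Assumes a strongly connected graph in which every page has at least one
# out-link; the example edges below are an assumption for illustration.

def simplified_pagerank(out_links, eps=1e-8, max_iter=100):
    """out_links maps each page to the list of pages it links to."""
    pages = list(out_links)
    m = len(pages)
    r = {p: 1.0 / m for p in pages}           # arbitrary starting vector s
    for _ in range(max_iter):
        new_r = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = r[j] / len(targets)        # r(j) / N(j)
            for i in targets:
                new_r[i] += share              # contribution of j to r(i)
        # stop when |r - s| is below the chosen threshold
        if sum(abs(new_r[p] - r[p]) for p in pages) < eps:
            return new_r
        r = new_r
    return r

# A small strongly connected example graph (an assumption for illustration).
graph = {1: [2], 2: [1, 3], 3: [1, 4], 4: [5], 5: [1, 4]}
print(simplified_pagerank(graph))
```

With these assumed edges the iteration converges to approximately r1 = r2 = 0.286 and
r3 = r4 = r5 = 0.143, matching the values reported in Figure 2.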
2.2.2 Improved PageRank Algorithm
The simplified PageRank algorithm is only suitable for the ideal strongly connected
environment, but in fact the Web is not a strongly connected structure. Broder, Kumar
and Maghoul's (2000) paper shows that only 28% of pages on the Web are strongly
connected, 44% are one-way connected, and the remaining part forms information
isolated islands, which neither are linked by, nor link to, other pages. For the
simplified PageRank algorithm, a non-strongly-connected Web raises two intractable
problems: rank sink and rank leak. A rank sink is a locally strongly connected Web
subgraph that contains no link referring to the outside. A rank leak is a page that
contains no external hyperlink at all; it is in fact the special case of a rank sink in
which the strongly connected subgraph has only one node. Both cause deviations
when analyzing the graph structure. For example, if we discard the link from node 5
to node 1 in Figure 2, nodes 4 and 5 form a rank sink. If we simulate with the Random
Surfer Model, we eventually fall into the endless loop between 4 and 5; the rank
values of nodes 1, 2 and 3 tend to 0, and nodes 4 and 5 share the total rank value 1 of
the whole graph. If we remove node 5 and its related links from Figure 2, node 4
becomes a leak node: once this node is visited, the surfing procedure stops there, and
the rank values of all nodes converge to 0. Therefore, Page and Brin (Brin, S & Page,
L 1998) proposed two methods: one is discarding all leak nodes whose outdegree is 0,
and the other is introducing a damping factor d (0 < d < 1) into the simplified
PageRank algorithm. With d, a page's PageRank is contributed not only to the nodes it
links to, but also to the other pages on the Web. The expression of the improved
PageRank algorithm is shown below:
r(i) = d \cdot \sum_{j \in B(i)} \frac{r(j)}{N(j)} + \frac{1 - d}{m}

where m is the total number of nodes in the Web subgraph visited by the Web Crawler.
As we can see from the expression, the simplified PageRank algorithm is the special
case where d = 1.
Figure 3 shows the PageRank value of every node, computed by the improved
algorithm after removing the hyperlink from node 5 to node 1. Every node is adjusted
by the parameter d, which makes all the values converge to nonzero values; a short
code sketch of this damped iteration is given after the figure.
Figure 3 Improved PageRank algorithm on the same graph with the link from node 5
to node 1 removed (r1 = 0.154, r2 = 0.142, r3 = 0.101, r4 = 0.313, r5 = 0.290)
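A minimal sketch of the damped update follows. It reuses the structure of the previous
sketch and only changes the update rule so that the (1 - d)/m term redistributes rank
lost to sinks. The graph (the assumed Figure 2 graph with the link from node 5 to
node 1 removed) and the value d = 0.8 are illustrative assumptions, so the printed
values need not match Figure 3 exactly.

```python
# Improved (damped) PageRank: r(i) = d * sum_{j in B(i)} r(j)/N(j) + (1-d)/m.
# Graph and parameter values are illustrative assumptions.

def damped_pagerank(out_links, d=0.8, eps=1e-8, max_iter=100):
    pages = list(out_links)
    m = len(pages)
    r = {p: 1.0 / m for p in pages}
    for _ in range(max_iter):
        new_r = {p: (1.0 - d) / m for p in pages}    # teleportation share
        for j, targets in out_links.items():
            share = d * r[j] / len(targets)           # damped contribution
            for i in targets:
                new_r[i] += share
        if sum(abs(new_r[p] - r[p]) for p in pages) < eps:
            return new_r
        r = new_r
    return r

# Figure 2 graph with the rank-sink-inducing removal of the link 5 -> 1.
graph = {1: [2], 2: [1, 3], 3: [1, 4], 4: [5], 5: [4]}
print(damped_pagerank(graph))
```

Even with the rank sink formed by nodes 4 and 5, every node now converges to a
nonzero value, which is the point of the damping factor.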
For example, for Figure 1, the PageRank value of each node is shown in the table
below (parameter 0.2):

Table 1 PageRank of each node in Figure 1

Node      a         b         c         u         v         w
PageRank  0.060210  0.071004  0.094177  0.047534  0.097881  0.125839
PageRank can be computed by an iterative algorithm. For the PageRank of each node
in Figure 1, about 15 iterations are needed; in practical computation, about 100
iterations are generally enough for convergence (Haveliwala, HT 1999). The
PageRank algorithm is currently applied by the Google search engine, which provides
a high-quality Web retrieval service.
2.3 HITS Algorithm
The HITS (Hypertext Induced Topic Search) algorithm is a rank algorithm that
analyzes Web resources based on local links, proposed by Kleinberg in 1998
(Kleinberg, J 1999). The difference between PageRank and HITS is that HITS is
query-dependent, while PageRank is query-independent. As mentioned in the section
above, the PageRank algorithm gives each page a single rank value that is unrelated to
the query keyword, whereas the HITS algorithm gives each page two values, an
Authority value and a Hub value.
Authority pages and Hub pages are two important concepts in the HITS algorithm,
and both are defined relative to the query keywords. An Authority page is a page that
is most relevant to the query keyword or keyword combination (Kleinberg, J 1999).
For example, when querying "University of South Australia", the homepage of UniSA,
http://www.unisa.edu.au/, is the page with the highest Authority value for this query,
and the Authority values of other pages should theoretically be lower. A Hub page is a
page that links to multiple Authority pages (Kleinberg, J 1999). A Hub page itself may
have no direct relation to the query content, but through it, Authority pages with a
direct relation can be reached. For example, for the query combination "Australian
university", the homepage of the Australian Education Network,
http://www.australian-universities.com/, is a Hub page, which includes links referring
to each university in Australia. Hub pages can be used as an auxiliary reference when
computing Authority pages, and can themselves be returned to the user as query
results (Chakrabarti, S, Dom, B & Raghavan, P 1998).
2.3.1 Analysis of HITS Algorithm
The central idea of the HITS algorithm is as follows. First, a text-based retrieval
algorithm is used to obtain a Web subset whose pages are all relevant to the user query.
Then, HITS performs link analysis on this subset and finds the Authority pages and
Hub pages related to the query within it. The subset used by HITS is acquired by
keyword matching and is defined as the root set R; link analysis then expands the root
set R into a set S, which contains the Authority pages and ultimately satisfies the
query requirement. The process from R to S is called "Neighborhood Expand", and
the procedure for computing S is as follows:
1 Use text keyword matching to acquire the root set R, which contains thousands of
URLs or more;
2 Initialize S to R, that is, S and R are equal;
3 For each page p in R, put the pages linked to by p into set S, and put the pages
referring to p into set S;
4 S is the resulting expanded neighbourhood set.
The HITS algorithm needs three parameters: the query keywords, the maximum size
of the root set R, and the maximum size of the expanded neighborhood set S. After
applying the procedure above, S contains more of the Authority pages and Hub pages
that match the query keywords. A short code sketch of this expansion step is given
below.
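The following brief Python sketch illustrates the neighbourhood expansion step. The
data structures, the size cap and the function name are assumptions, since the thesis
describes the procedure only abstractly.

```python
# Sketch of the "Neighborhood Expand" step of HITS: grow the root set R into
# the expanded set S by adding pages linked from, and pages linking to, each
# page in R. in_links/out_links and the size cap are illustrative assumptions.

def expand_root_set(root_set, out_links, in_links, max_size=5000):
    """root_set: pages returned by keyword matching.
    out_links[p]: pages that p links to; in_links[p]: pages linking to p."""
    s = set(root_set)                       # step 2: S starts equal to R
    for p in root_set:                      # step 3
        s.update(out_links.get(p, []))      # pages referred to by p
        s.update(in_links.get(p, []))       # pages referring to p
        if len(s) >= max_size:              # respect the maximum size of S
            break
    return s                                # step 4: expanded neighbourhood
```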
2.3.2 Analysis of HITS Link
The HITS link analysis process uses the mutually reinforcing relationship between
Authorities and Hubs to identify them within the expanded set S. Assume the pages in
the expanded set S are 1, 2, ..., n. B(i) represents the set of pages referring to page i,
and F(i) represents the set of pages referred to by page i. HITS generates an authority
value a_i and a hub value h_i for each page in S. The initial values of a_i and h_i can
be arbitrary; similar to PageRank, HITS uses an iterative method to reach convergence.
Each iteration consists of two steps, step I and step O. In step I, the authority value of
each page is set to the sum of the hub values of the pages referring to it. In step O, the
hub value of each page is set to the sum of the authority values of the pages it refers
to. That is:

I: a_i = \sum_{j \in B(i)} h_j

O: h_i = \sum_{j \in F(i)} a_j

The two steps, I and O, are based on the fact that an Authority page is always referred
to by many Hub pages, and a Hub page links to many Authority pages. The HITS
algorithm iteratively computes the two steps, I and O, until they converge. Finally, a_i
and h_i are the Authority and Hub values of page i. The procedure is shown below:
1 Initialize a_i and h_i;
2 Perform iteration step I, then iteration step O; normalize the values of a and h so
that

\sum_i a_i^2 = 1, \quad \sum_i h_i^2 = 1

3 Repeat step 2 until convergence, then complete the iteration.
Figure 4 shows the application of the HITS algorithm to a subgraph of six nodes.
Figure 4 HITS algorithm on a six-node graph (hub values: h1 = 0.816, h2 = 0.408,
h3 = 0.408 with a = 0; authority values: a4 = 0.408, a5 = 0.816, a6 = 0.408 with h = 0)
As shown in Figure 4, the authority value of node 5 equals the sum of the hub values
of nodes 1 and 3, which refer to it; after normalization, its value is 0.816.
Assume A is the n x n adjacency matrix of the subgraph, where the entry in position
(i, j) of A equals 1 if page i refers to page j, and 0 otherwise. Let a be the authority
value vector [a_1, a_2, ..., a_n] and h the hub value vector [h_1, h_2, ..., h_n]. Then
iterations I and O can be expressed as a = A^T h and h = A a. After the iteration
converges, the authority and hub values respectively satisfy a = c_1 A^T A a and
h = c_2 A A^T h, where c_1 and c_2 are constants required by the normalization
condition. Thus, vector a and vector h are respectively eigenvectors of matrix A^T A
and matrix A A^T. This feature is similar to the PageRank algorithm; the convergence
speed of both is determined by the eigenvalues of the corresponding matrices.
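The I/O iteration with normalization can be sketched in Python as follows. The
adjacency-list representation, the convergence test and the six-node example graph are
assumptions; the example edges are merely chosen to be consistent with the hub and
authority values reported in Figure 4.

```python
# Sketch of the HITS I/O iteration on the expanded set S.
# Step I: a_i = sum of h_j over pages j linking to i.
# Step O: h_i = sum of a_j over pages j that i links to.
# After each round, a and h are normalized so their squared sums equal 1.
import math

def hits(out_links, max_iter=100, eps=1e-8):
    pages = list(out_links)
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_a = {p: 0.0 for p in pages}
        for j, targets in out_links.items():      # step I via out-links of j
            for i in targets:
                new_a[i] += h[j]
        new_h = {p: sum(new_a[q] for q in out_links[p]) for p in pages}  # step O
        norm_a = math.sqrt(sum(v * v for v in new_a.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in new_h.values())) or 1.0
        new_a = {p: v / norm_a for p, v in new_a.items()}
        new_h = {p: v / norm_h for p, v in new_h.items()}
        if (sum(abs(new_a[p] - a[p]) for p in pages)
                + sum(abs(new_h[p] - h[p]) for p in pages)) < eps:
            return new_a, new_h
        a, h = new_a, new_h
    return a, h

# Assumed six-node graph: pages 1-3 act as hubs, pages 4-6 as authorities.
graph = {1: [4, 5, 6], 2: [5], 3: [5], 4: [], 5: [], 6: []}
print(hits(graph))
```

With this assumed graph the iteration converges to approximately the hub and
authority values shown in Figure 4 (h1 = 0.816, h2 = h3 = 0.408; a5 = 0.816,
a4 = a6 = 0.408).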
2.4 Summary
PageRank and HITS are currently the representative Web retrieval algorithms based
on hyperlink mining. Through the analysis of Web hyperlink relations, we can greatly
improve the accuracy of Web retrieval and overcome the disadvantages of purely
content matching-based methods. Currently, many search engines have begun to use
similar algorithms to improve query precision; for example, Google uses PageRank,
and Teoma and Altavista also adopt similar technology.
On the other hand, both PageRank and HITS have defects. PageRank is independent
of the query, so its online computation cost is relatively small, but it loses some
performance in content matching. Although HITS is query-related, its link analysis is
limited to a Web subgraph with thousands of nodes, so it cannot reflect the link
condition of the whole Web.
CHAPTER 3
Literature Review
3.1 Research on Traditional Rank Algorithms of Search
Engines
3.1.1 Problems and Improvements of PageRank Algorithm
The PageRank algorithm favours old pages, because the probability of an old page
being linked by other pages is much higher, whereas in fact new pages may contain
more valuable information.
Another problem it may cause is the "topic drift" phenomenon. Consider the following
situation: portal Websites on the Web tend to cover all aspects of information, so their
homepages contain hyperlinks to Websites on all kinds of topics. Meanwhile, many
pages regard them as a guide for further information and include them in their own
links. When searching for some keywords, if these portal Websites are within the
scope of consideration, there is no doubt that they will acquire the highest authority,
and thus the topic drift phenomenon arises. These portal Websites can always be
found at the top of the search results; although they also contain the information
required by users, the content they contain is usually much more general than what
the users expect, which is far from satisfactory. In contrast, some professional
Websites are more authoritative in describing these topics. The PageRank algorithm is
not able to distinguish whether the hyperlinks in a page are related to its topic or not,
that is to say, it is not able to judge the similarity of page content. Thus, it easily
causes the topic drift problem. For example, Google and Yahoo are among the most
popular Web pages on the Internet and have very high PageRank values. Thus, when
users input a query keyword, these Web pages often appear in the result set of the
query and occupy the front positions, while in fact they are sometimes not even
related to the users' query topic.
In his HITS paper, Kleinberg (1999) explicitly pointed out that links that point back to
the same Website should not be counted in the Web graph and should instead be
discarded; they are nepotistic links and obviously do not carry any authority
information. After the publication of Kleinberg's paper, Bharat and Henzinger (1998)
pointed out that another kind of nepotistic link exists, namely the nepotistic link
between two different Websites, and that this kind of link is increasing rapidly.
Moreover, this kind of nepotistic link may be generated accidentally during the
construction of Websites; for instance, all the sub-Websites of Yahoo have links
referring to the main Website. The nepotistic links between two Websites make their
authorities keep increasing during the iterative process, for both the PageRank
algorithm and the HITS algorithm.
To solve the problem that the PageRank algorithm favours old pages too much, Ling
and Fanyuan (2004) proposed an accelerated evaluation algorithm. This algorithm lets
valuable new content on the network spread faster, while the evaluation value of
pages containing old data drops more quickly. The core idea of this algorithm is to
predict the expected value of a certain URL at a future time by analyzing how its
PageRank value changes over a time series, and to regard this prediction as an
effective parameter of the retrieval service provided by the search engine. The
algorithm defines a URL acceleration factor AR, given by:

AR = PR \times sizeof(D)

where sizeof(D) is the number of documents in the whole page set. The expression of
the accelerated PageRank is:

PR_{accelerate} = \frac{AR_{last} + B \cdot D}{M_{last}}

where AR_last is the most recent AR value of the URL, B is the slope of the quadratic
curve fitted to the PageRank values of this URL over a period of time, D is the number
of days since the page was last downloaded, and M_last is the number of documents
in the most recently downloaded document set. When a user searches, the search
engine decides the position of the URL in the retrieval results according to the
predicted PageRank value.
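The core of this accelerated evaluation, fitting the recent PageRank history of a URL
and extrapolating from its slope, can be sketched as below. The quadratic fit via
numpy, the sample history and the way the reconstructed expression above is
evaluated are all assumptions for illustration, not the authors' implementation.

```python
# Sketch: predict an "accelerated" PageRank for a URL from its recent history,
# following the reconstructed expression above. All inputs are illustrative.
import numpy as np

def accelerated_pagerank(days, pr_history, collection_size, m_last,
                         days_since_download):
    """days, pr_history: time series of this URL's PageRank values."""
    # AR = PR * sizeof(D), using the most recent PageRank value
    ar_last = pr_history[-1] * collection_size
    # quadratic fit of the PageRank time series; B is its slope at the last day
    c2, c1, c0 = np.polyfit(days, pr_history, 2)
    b = 2 * c2 * days[-1] + c1
    # PR_accelerate = (AR_last + B * D) / M_last  (reconstructed form)
    return (ar_last + b * days_since_download) / m_last

# Hypothetical PageRank history over five crawl days
print(accelerated_pagerank([1, 2, 3, 4, 5],
                           [0.010, 0.012, 0.015, 0.019, 0.024],
                           collection_size=100000, m_last=100000,
                           days_since_download=3))
```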
The WebGather search engine (Ming, L, Jianyong, W & Baojue, C 2001), developed
at Peking University, takes another approach to this weakness, which is to give
compensation to new Web pages. The number of links pointing to a page (its LHN)
can be divided into links from the same Website and links from different Websites.
Links from different Websites form the pure LHN, and the gross LHN contains both;
only the pure LHN is considered here.
New Web pages are not yet linked by other pages, so compensation is given:

W_{LT}(P) = \log(T_{now} - T_{st}) - \log(T_{now} - T(p)), \quad if T(p) > T_{st}

W_{LT}(P) = 0, \quad if T(p) \le T_{st}

where T_now is the current time, T_st is the compensation cut-off time, and T(p) is
the time when the Web page was published.
After the compensation weight is introduced, the new link weight is:

W_L'(P) = \log(LHN(p) + 1) + W_{LT}(p)

After normalization:

W_L(p) = \frac{W_L'(p)}{W_{L,max}'}
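The compensation scheme can be sketched directly from the formulas above; treating
times as day counts, the example page set and all numeric values are assumptions.

```python
# Sketch of WebGather's new-page compensation (formulas above).
# Times are expressed as day numbers; the values below are illustrative only.
import math

def link_weight(lhn_pure, t_page, t_now, t_st):
    """lhn_pure: number of links from other Websites; t_page: publication time."""
    if t_page > t_st:                       # new page: compensate
        w_lt = math.log(t_now - t_st) - math.log(t_now - t_page)
    else:                                   # old enough: no compensation
        w_lt = 0.0
    return math.log(lhn_pure + 1) + w_lt    # W_L'(p)

# Normalize by the maximum W_L' over the page set.
pages = {"new_page": (2, 995), "old_page": (50, 800)}   # (pure LHN, publish day)
t_now, t_st = 1000, 900
raw = {p: link_weight(n, t, t_now, t_st) for p, (n, t) in pages.items()}
w_max = max(raw.values())
normalized = {p: w / w_max for p, w in raw.items()}
print(normalized)
```

The sketch shows the intended effect: a recently published page with only two
incoming links can receive a normalized weight comparable to an old page with many
links, because of the compensation term.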
Haveliwala (2002) proposed a topic-sensitive PageRank algorithm to address the topic
drift phenomenon. This algorithm considers that pages thought to be important in
some field are not necessarily important in other fields. Therefore, the algorithm first
lists 16 basic topics according to the Open Directory (the Open Directory Project, a
Web directory of over 2.5 million URLs), and then, for every Web page, computes the
PageRank values with respect to these basic topics offline. At query time, according to
the query topic or query context input by the user, the algorithm computes the
similarity between this topic and the known basic topics, and chooses the closest topic
from the basic topic set to stand in for the user's query topic. The formal expression of
the algorithm is as follows:

\vec{R}(u) = M' \vec{R}(u) = c M \vec{R}(u) + (1 - c) \vec{p}_u

where p_u is the topic-sensitive bias vector. This algorithm can effectively avoid some
obvious topic drift phenomena; for example, when querying "jaguar", if a context
indication is available, the algorithm can explicitly distinguish whether the user is
trying to search for:
1 the Jaguar car;
2 the Jaguars football team;
3 Jaguar products;
4 the jaguar, which is a kind of mammal,
and thus provide a high-quality recommendation result set.
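A minimal sketch of the topic-biased iteration is shown below. It only illustrates how
the bias vector p replaces uniform teleportation in the expression above; the graph, the
topic membership and the value c = 0.85 are assumptions.

```python
# Sketch of a topic-sensitive PageRank iteration: R = c * M * R + (1 - c) * p,
# where p is nonzero only on pages belonging to the chosen basic topic.
# Graph, topic membership and c are illustrative assumptions.

def topic_sensitive_pagerank(out_links, topic_pages, c=0.85, iters=100):
    pages = list(out_links)
    p = {q: (1.0 / len(topic_pages) if q in topic_pages else 0.0) for q in pages}
    r = dict(p)                                    # start from the bias vector
    for _ in range(iters):
        new_r = {q: (1.0 - c) * p[q] for q in pages}
        for j, targets in out_links.items():
            if targets:
                share = c * r[j] / len(targets)
                for i in targets:
                    new_r[i] += share
        r = new_r
    return r

graph = {1: [2, 3], 2: [3], 3: [1], 4: [3], 5: [1, 4]}
print(topic_sensitive_pagerank(graph, topic_pages={1, 2}))
```

In an offline phase, one such vector would be computed per basic topic; at query time
the vector of the topic closest to the query context is used for ranking.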
3.1.2 Problems and Improvements of Hits Algorithm
Multiple Websites may possess links that recursively refer to each other for some
reason, which causes "faction attacks" between these Websites. For instance, some
enterprise Websites are designed by the same Website design company for different
companies, so it is quite possible that there are friendly links between them. The
impact brought by faction attacks is similar to nepotistic links, but it is more difficult
to detect, because a larger region of the Web graph needs to be inspected. Another
type of problem is the "mixed hub page" phenomenon, in which a hub page
simultaneously possesses links referring to several completely different topic
categories. For example, a hub page about a movie award usually includes many links
referring to movie companies. Mixed hub pages are more difficult for computers to
detect than faction attacks, and the probability that they occur is higher as well. Mixed
hub pages can mix Web pages with different topics together; in the HITS algorithm in
particular, it is quite easy to involve Web pages irrelevant to the current topic in the
process of constructing the extended set, and because these Web pages have a large
number of links referring to pages with higher authority, they cannot be discarded
from the results. In Google's PageRank algorithm, this impact can be reduced by
adjusting the random surfer probability d.
In the HITS algorithm, a basic set is constructed first, then extended into an extended
set, and finally the Web subgraph is formed. The reason for doing this is that the result
acquired by the information retrieval system in the first step may not include the
pages that users really want. For instance, when querying with the keyword "browser",
the pages returned by the information retrieval system usually do not contain the
pages of Netscape Navigator or Microsoft Internet Explorer, because those pages
usually avoid using words such as "browser" for product promotion. Furthermore,
some personal pages use phrases such as "best viewed with a frames-capable
browser...", which means the genuinely important Netscape and Microsoft pages are
not included in the first-step results. This problem can be solved by the extended set,
because the required Web pages can be reached through hub pages. Due to this
characteristic, however, the HITS algorithm can be affected by the nepotistic links,
faction attacks and mixed hub pages mentioned above: when constructing the
extended set, too many pages irrelevant to the topic are involved, and they also obtain
high authority because they possess links referring to each other. If we restrict the
radius when constructing the extended set, we may not get enough pages; the really
relevant pages can be acquired only if the radius is big enough, but by then too many
irrelevant pages have been involved, causing a "topic pollution" phenomenon.
Besides, similar to the PageRank algorithm, the HITS algorithm is also affected by
topic drift. After portal Websites are included through the extended set, the HITS
algorithm faces the same difficulty as PageRank.
Bharat and Henzinger (1998) improved the computation of authority and hub weights
by introducing relevance weights for hyperlinks: if the relevance weight of a hyperlink
is smaller than a certain threshold, the impact of this hyperlink on the page weight is
considered negligible, and the hyperlink is discarded from the subgraph. Besides,
Chakrabarti, Dom and Gibson (1999) proposed the idea of splitting big hub pages into
smaller units. A page always includes many links, which are possibly not all relevant
to the same topic. In this situation, in order to get a good result, it is better to divide
the hub page into contiguous subsets, called pagelets, and process them separately. A
single pagelet focuses on a topic more narrowly than the whole hub page, so better
retrieval results can be acquired by computing a weight for every pagelet. In the
Clever system, an application of the HITS algorithm, the authors computed the weight
of a hyperlink by matching the query keyword against the text around the hyperlink
and computing the word frequency, and then replaced the corresponding value in the
adjacency matrix with the computed weight, thus introducing semantic information
(Chakrabarti, S, Dom, B & Raghavan, P 1998).
A time parameter has also been introduced to improve the HITS algorithm. For a
reference from one Web page to another, i.e., node P referring to node Q, the visiting
time reflects, to a great extent, whether the referenced node is authoritative or not. In
reality, the visiting time of the authority pages that users really want to visit is
relatively long, while for visits that merely act as navigation or serve some other
purpose, the visiting time is relatively short. In other words, if a user's visiting time on
a certain page is relatively long, we can consider this page as a page the user wants to
visit, that is, a target page. If this information is applied in the computation of
authority weights in the HITS algorithm, the accuracy of the HITS algorithm can be
greatly improved.
Xuelong, Xuemei and Xiangwei (2006) proposed a time-parameter control model
described as follows. Define the hyperlink weight related to keyword K, for a link
referring from page P to page Q, as W(P, Q, K, T). The final value of W is determined
by three factors: the link referring from P to Q; the number of occurrences of the
query keyword K in the hyperlink text; and the visiting time T that P spends visiting Q.
In order to control the result more precisely, a coefficient is introduced to control the
proportion of the hyperlink weight contributed by the semantic information of the
characters surrounding K, and the parameter \sqrt{t} is introduced to control the
impact of the visiting time on the weight. The weight control model with a time
parameter is then given by:
W(P, Q, K, T) = 1 + \varepsilon(k) + a \cdot \varepsilon(k) \cdot \sqrt{t}
where a can be adjusted according to different page sets. The value of W will
continuously increase during the iterative computation of authority weights, but we
only care about the relative values between pages, not the absolute values. The term
\sqrt{t} reflects the nonlinear increase of the authority weight with the increase of
visiting time; other functions can be constructed to control the proportion of visiting
time in the weight as well, and the function above is the simplest form.
Figure 5 Function image of \sqrt{t}
3.2 Traditional Domain Ontology-based Concept Semantic
Similarity Computation
3.2.1 Ontology
Ontology was originally a term in philosophy, denoting a systematic and
comprehensive explanation of objective existence whose core is to represent the
abstract essence of objective reality (Zhihong, D & Shiwei, T 2002). In recent years,
ontology research has been maturing, but in various publications the definition of
ontology and the usage of related terminology are not completely consistent. Neches
et al. (1991) introduced the concept of ontology into artificial intelligence and gave
the earliest definition: an ontology consists of the basic terminology of a related
knowledge domain, the relations between these terms, and the rules for extending
them. Gruber (1993) gave the most popular definition of ontology, namely that an
ontology is an explicit specification of a conceptualization. Later, Studer, Benjamins
and Fensel (1998) made further studies of ontology and gave the most complete
definition, namely that an ontology is a formal, explicit specification of a shared
conceptualization. Four aspects are included here: conceptual model, explicitness,
formality and sharing (Berners-Lee, T, Hendler, J & Lassila, O 2001).
3.2.2 Three Main Semantic Similarity Computation Models
Concept semantic similarity computation is widely applied in information retrieval,
information recommendation and filtering, data mining, machine translation and other
fields, and has become a hot topic of current information technology research (Sujian,
L 2002). Currently, research on the semantic similarity between concepts has mainly
proceeded from three different perspectives. Leacock (2005) proposed a
distance-based semantic similarity computation model. This kind of model is simple
and intuitive, but it relies heavily on an ontology hierarchical network established in
advance, and the network structure directly influences the semantic similarity
computation. Lin (2000) proposed an information content-based semantic similarity
computation model. This kind of model is more persuasive theoretically, because it
makes full use of information theory and probability and statistics when computing
concept semantic similarity; however, it can only roughly quantify the semantic
similarity between concepts and cannot distinguish concept similarities in finer detail.
Tversky et al. (2004) proposed an attribute-based semantic similarity computation
model. This kind of model can simulate the way people normally understand and
discriminate between things in the real world, but it considers attributes only, so every
attribute of the objective things must be described in detail and comprehensively,
which is quite difficult.
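To illustrate the distance-based idea in code (this is a generic sketch, not the improved
model developed later in this proposal), the following computes similarity from the
shortest path between two concept nodes in a tiny hand-built hierarchy; the taxonomy
and the 1 / (1 + distance) form are assumptions.

```python
# Distance-based concept similarity sketch: the closer two concepts are in the
# ontology hierarchy, the more similar they are. Taxonomy and the similarity
# form sim = 1 / (1 + path_length) are illustrative assumptions.
from collections import deque

# child -> parent edges of a tiny concept hierarchy (assumed for illustration)
parent = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "sparrow": "bird", "bird": "animal", "animal": "thing"}

def neighbours(c):
    up = [parent[c]] if c in parent else []
    down = [child for child, par in parent.items() if par == c]
    return up + down

def path_length(c1, c2):
    """Breadth-first search over the undirected hierarchy."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def similarity(c1, c2):
    d = path_length(c1, c2)
    return 0.0 if d is None else 1.0 / (1.0 + d)

print(similarity("cat", "dog"), similarity("cat", "sparrow"))
```

The sketch shows the main weakness noted above: the result depends entirely on how
the hierarchy was built, since only edge counts, not edge semantics, are used.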
3.3 Summary
In this chapter, we first analysed the problems of the PageRank and HITS algorithms
and made some comparisons between them. Meanwhile, the ideas of several
improvements to the classical algorithms were presented. Then, some basic
information about ontology and domain ontology-based concept similarity
computation was introduced; further research and analysis will be given in future
work.
CHAPTER 4
Methodology
4.1 Research Questions
My main research objective is to improve rank algorithms so that they better reflect
users' surfing behaviour. To achieve this target and structure my research, the
following research questions are posed:
What factors influence the relation between two concept nodes in the hierarchical
network structure of a domain ontology?
A Web page may be related to several topics; how should the category it belongs to be
determined? If it is categorized into a certain category, how can the other categories it
is related to be considered as well?
How can categorization of Web pages and keywords be implemented?
Because the number of Web pages and keywords is huge, do I need to introduce any
mechanism to reduce the unnecessary volume?
4.2 Research Strategy
The whole process of my research strategy is shown in Figure 6 and can be divided
into two main steps: the first step is to build an improved domain ontology-based
concept similarity computation model, and the second step is to integrate the rank
algorithm with categorization technology.
In the first step, we will first discuss the three traditional domain ontology-based
concept similarity computation models in order to get a full understanding of their
ideas, computing processes, advantages and disadvantages. Then, we will discuss and
improve the decision factors that affect the directed edge weights in the ontology
hierarchical network. Finally, the improved domain ontology-based concept similarity
computation model, including five decision factors, will be modelled and evaluated.
In the second step, we will first discuss an existing categorization-integrated rank
algorithm, which combines PageRank with categorization technology, in order to
provide theoretical support. Secondly, the basic idea of categorization in this paper is
given, describing how categorization is implemented and the processes of
pre-categorization based on the improved model constructed in step one. Then, a
screening mechanism is introduced to filter the massive amount of data. Finally, the
improvement and evaluation of the HITS algorithm integrated with categorization will
be provided.
Figure 6 Research strategy framework. Step one: research on the three traditional
domain ontology-based concept semantic similarity computation models
(distance-based, content-based, attribute-based); improvement of the decision factors
(category, depth, density, strength, attribute); modelling and evaluating the improved
model; constructing a categorization similarity table from the improved model. Step
two: research on the combination of PageRank and categorization; defining the basic
idea of categorization; pre-categorization of Web pages and keywords according to
the categorization similarity table; introduction of the screening mechanism;
modelling and evaluating the HITS algorithm integrated with categorization.
4.3 Evaluation Tools
4.3.1 The Ontology Tool
The ontology tool adopted in this paper to evaluate the improved concept semantic
similarity computation model is Protégé 3.4, an ontology modelling tool. Protégé was
designed at Stanford University for editing instances and acquiring knowledge, and it
is currently the most popular ontology development tool. It shields users from the
shortcomings of many current ontology creation languages and provides a friendly
GUI, shown in Figure 7, which makes it much easier to edit classes, instances and
attributes.
Figure 7 Main interface of Protégé 3.4
CHAPTER 5
Conclusion
In this paper, through the research and analysis of classical link structure-based
algorithms and their related improvements, we will propose an improved HITS
algorithm based on categorization technology.
5.1 Expected Outcomes
The semantic similarity between concept nodes in the ontology hierarchical network
can be quantified more comprehensively and accurately.
The accuracy of the HITS algorithm can be improved after user behaviours are
considered by means of integrating categorization.
The computation overhead of the HITS algorithm can be reduced after the screening
mechanism is introduced.
5.2 Contributions
Improve domain ontology-based concept similarity computation based on five
decision factors.
Improve HITS algorithm based on categorization technology.
References
Berners-Lee, T, Hendler, J & Lassila, O 2001, 'The semantic web', Scientific American,
284(5), pp. 34-43.
Bharat, K & Henzinger, MR 1998, 'Improved Algorithm for Topic Distillation in a
Hyperlinked Environment', In Proceedings of SIGIR-98, 21st ACM International
Conference on Research and Development in Information Retrieval.
Brin, S & Page, L 1998, 'The anatomy of a large-scale hypertextual web search
engine', In Proceedings of the WWW7 Conference, pp. 107-117.
Broder, A, Kumar, R & Maghoul, F 2000, 'Graph structure in the web: experiments
and models', In Proceedings of the Ninth International World-Wide Web Conference.
Budanitsky, A & Hirst, G 2004, 'Evaluating WordNet-based measures of lexical
semantic relatedness', Computational Linguistics, 1(1), pp. 1-49.
Chakrabarti, S, Dom, B & Gibson, D 1999, 'Mining the Link Structure of the World
Wide Web', IEEE Computer, 32(8).
Chakrabarti, S, Dom, B & Raghavan, P 1998, 'Automatic resource compilation by
analyzing hyperlink structure and associated text', In Proceedings of the Seventh
International World-Wide Web Conference.
Chakrabarti, S, van den Berg, M & Dom, B 1999, ‘Focused crawling: A new
approach to topic-specific web resource discovery’, In Proceedings of the Eighth
International World-Wide Web Conference.
Dean, J & Henzinger, RM 1999, 'Finding related pages on the Web', In Proceedings of
the WWW8 Conference, pp. 389-401.
Gan, KW & Wong, PW 2000, 'Annotating information structures in Chinese texts
using HowNet', Hong Kong: Second Chinese Language Processing Workshop, pp. 85-92.
Gruber, T 1993, ‘Ontolingua: A translation approach to portable ontology
specification’, Knowledge Acquisition 5(2), pp.199-200.
Haveliwala, HT 1999, 'Efficient computation of PageRank', Stanford Database Group
Technical Report.
Haveliwala HT 2002, ‘Topic-sensitive PageRank’, Proceedings of the Eleventh
International World Wide Web Conference.
Kleinberg, J 1999, 'Authoritative sources in a hyperlinked environment', Journal of
the ACM, 46(5), pp. 604-632.
Lawrence, S & Giles, CL 1998, 'Context and page analysis for improved Web
search', IEEE Internet Computing, pp. 38-46.
Ling, Z & Fanyuan, M 2004, ‘Accelerated evaluation algorithm: a new method to
improve Web structure mining quality’, Computer Research and Development,
41(1):98-103.
Mendelzon, OA & Rafiei, D 2000, ‘What do the neighbors think? Computing web
page reputations’, IEEE Data Engineering Bulletin, Page 9-16.
Ming, L, Jianyong, W & Baojue, C 2001, 'Improved Relevance Ranking in
WebGather', J. Comput. Sci. & Technol., vol. 16, no. 5.
Motwani, R & Raghavan, P 1995, Randomized Algorithms, Cambridge University
Press.
Neches, R, Fikes, R, Finin, T, Gruber, T, Patil, R, Senator, T & Swartout, WR 1991,
‘Enabling technology for knowledge sharing’, AI Magazine 12(3), pp.36-56
Page, L, Brin, S & Motwani, R 1998, 'The PageRank citation ranking: Bringing order
to the Web', Technical report, Computer Science Department, Stanford University.
Qianhong, P & Ju, W 1999, ‘Attribute theory-based text similarity computation’,
Computer Journal, 22(6):651-655.
Qun, L & Sujian, L 2002, ‘CNKI-based word semantic similarity computation’,
Computer Linguistics and Chinese Information Processing, 2002(7):59-76
Steichen, O & Daniel-Le Bozec, C 2005, 'Computation of semantic similarity within
an ontology of breast pathology to assist inter-observer consensus', Computers in
Biology and Medicine, (4), pp. 1-21.
Studer, R, Benjamins, VR & Fensel, D 1998, 'Knowledge engineering: principles
and methods', Data and Knowledge Engineering 25, pp. 161-197.
Sujian, L 2002, ‘Research on semantic computation-based sentence similarity’,
Computer Engineering and Application, 38(7):75-76.
Weizhu, C, Ying, C & Yan, W 2005, ‘Categorization technology-based rank
algorithm for search engine – Category Rank’, Computer Application, 2005(5).
Xiaofeng, Z, Xinting, T & Yongsheng, Z 2006, ‘Ontology technology-based Internet
intelligent search research’, Computer Engineering and Design, 27(7):1194-1197.
Xuelong, W, Xuemei, Z & Xiangwei, L 2006, 'Application and improvement of time
parameter in Hits algorithm', Modern Computer.
Zhihong, D & Shiwei, T 2002, 'Ontology research review', Beijing University
Journal (Natural Science Edition), (5), pp. 730-738.