Brigham Young University, BYU ScholarsArchive, All Faculty Publications, 2005-07-01

Original Publication Citation: Rajiv Yerra and Yiu-Kai Ng, "Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach." In Proceedings of the IEEE International Conference on Granular Computing (GrC'05), pp. 693-699, July 2005, Beijing, China. http://scholarsarchive.byu.edu/facpub/368

Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach

Rajiv Yerra and Yiu-Kai Ng
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: {[email protected], rajiv [email protected]}

Abstract- Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering unique information more time-consuming and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a new approach for detecting similar Web documents, especially HTML documents.
Our detection approach determines the odds ratio of any two documents, which makes use of the degrees of resemblance of the documents, and graphically displays the locations of similar (not necessarily identical) sentences detected in the documents after (i) eliminating non-representative words in the sentences using stopword-removal and stemming algorithms, (ii) computing the degree of similarity of sentences using a fuzzy set information retrieval approach, and (iii) matching the corresponding hierarchical content of the two documents using a simple tree matching algorithm. The proposed method for detecting similar documents handles a wide range of Web pages of varying size and does not require static word lists; it is thus applicable to different Web (especially HTML) documents in different subject areas, such as sports, news, science, etc.

Index Terms- fuzzy set model, Web information retrieval, copy detection, HTML document, odds ratio

I. INTRODUCTION

Unlike images, it is not obvious to the naked eye whether two text documents are similar, let alone distinguishing text from tags in HTML documents. Since tags are intermixed with the actual content of an HTML document, they further complicate the problem of visualizing which portions of different HTML documents match, because matching content in HTML documents requires careful examination of the documents. Document analysis techniques are required in order to automate the entire process and provide a metric that shows the extent of similarity of two HTML documents. We propose a unique method for detecting similar HTML documents, which (i) performs copy detection on HTML documents, instead of the plain text documents for which existing copy detection methods are designed, (ii) detects HTML documents with similar¹ sentences, and (iii) specifies the overlapping locations of any two HTML documents graphically.
In this paper, we first introduce a sentence-based copy detection approach, which analyzes the contents of HTML documents to determine their degrees of similarity. Our copy detection approach eliminates stopwords and reduces each non-stop word in a document to its stem; the resulting words are compared with the non-stop, stemmed words in another document to determine their similarities using the fuzzy set information retrieval (IR) approach [1]. Similar sentences in different documents are detected, and the relative locations of matched sentences in the documents are depicted by dotplot views [2].

We proceed to present our results as follows. In Section II, we discuss related work in copy detection and existing methods for detecting similar documents. In Section III, we introduce our copy detection approach, present dotplot views that display similar sentences in two HTML documents, and include the experimental results of our copy detection approach. In Section IV, we propose our method for calculating the odds ratios among different HTML documents to quantify their similarity. In Section V, we give concluding remarks.

One of the problems on the Internet these days is redundant information, which exists due to replicated pages archived at different locations, such as mirror sites. As a result, the burden is on the user to sort through all the documents retrieved from the Web to identify non-redundant ones, which is a tedious and tiring process. These documents can be found in different forms, such as documents in different versions, small documents combined with others to form a larger document, and large documents co-existing with documents that are split from them. One classic example of such documents is news articles, where new information is added to an original article and is republished as a new (updated) article. Since the amount of information available on the Web increases on a daily basis, filtering redundant documents becomes an ever more difficult task for the user.
Fortunately, copy detection can be used as a filter, which identifies and excludes documents that overlap in content with other documents. We are particularly interested in copy detection of HTML documents, since they are abundant on the Web, even though there are different challenges in handling copy detection of HTML documents. HTML is a markup language, which means that the author of an HTML document must follow the HTML syntax to specify formatting commands that intermix with the document's text content.

¹From now on, whenever we use the term similar sentences (documents), we mean sentences (documents) that are semantically the same but differ in terms of the words used in the sentences (documents).

0-7803-9017-2/05/$20.00 ©2005 IEEE

II. RELATED WORK

Many efforts have been made in the past to solve the copy detection problem, which finds similarities among documents. Diff, a UNIX command, shows differences between two textual documents, down to spaces. SCAM [3] is a copy detection approach geared towards small documents. Different from SCAM, SIF [4] finds similar files in a file system by using the fingerprinting² scheme to characterize documents. SIF, however, cannot gauge the extent of document overlap nor display the location of overlap like SCAM, and the notion of similarity as defined in SIF is completely syntactic, e.g., files containing the same information but using different sentence structures will not be considered similar. COPS [5], which was developed specifically to detect plagiarism among documents, uses a hash-based scheme for copy detection. It compares hash values of test documents with those in the database for copy detection. COPS has several limitations: (i) it uses a hash function that produces a large number of collisions, (ii) documents to be compared by COPS must have at least 10 sentences, and (iii) it has problems selecting correct sentence boundaries.
KOALA [6], which is like COPS, is specifically designed for plagiarism detection and is a compromise between the random fingerprinting of SIF and the exhaustive fingerprinting of COPS, in that it selects substrings of a document based on their usage. This results in lower memory usage and increased accuracy. Neither KOALA nor COPS, however, can report the location of overlap between two documents or handle documents of varying size. More recently, [7] present an approach to identify duplicated HTML documents that differs significantly from ours, since they consider only the HTML tags, but not the actual contents, of HTML documents. Two HTML documents are treated as the same by [7] if the numbers of occurrences of their HTML tags are the same, which is inadequate. [8], whose copy detection approach is closer to ours, is restricted to textual documents and detects only identical sentences.

Our copy detection approach overcomes most of the limitations of existing approaches. Since our approach is not line-oriented, it overcomes the problems posed by Diff. Unlike KOALA and COPS, our approach is more general, since ours is not particularly targeted at plagiarism, even though we can handle that problem. Compared with SIF and SCAM, which cannot measure the extent of overlap, our approach can measure and specify the locations of overlap in (HTML) documents and the relative sentence positions in the documents. Overall, our approach handles HTML documents, for which existing copy detection approaches are not designed.

Other methods for detecting similar documents have also been proposed. [9] use the fingerprint approach to represent a given document, which then plays the role of a query to a search engine to retrieve documents for further comparison using the shingles and Patricia tree methods. As discussed earlier, the fingerprint approach is either completely syntactic or suffers from collisions.
[10] introduce a statistical method to identify multiple text databases for retrieving similar documents, in which documents, along with queries, are represented as vectors in the vector-space model (VSM) for comparison. VSM is a well-known and widely used IR model [11]; however, its reliance on term frequency, without considering a thesaurus of index terms in documents, can be a drawback in detecting similar documents. [12] characterize documents using multiword terms, in which each term is reduced to its canonical form, i.e., its stem, and is assigned a measure (called IQ) based on term frequency for ranking, which is essentially the VSM approach. Besides using VSM, [13] also consider users' profiles [11], which describe users' preferences, in retrieving documents, an approach that has not been widely used and proven. In contrast, we develop an approach for detecting similar, as well as exactly matching, sentences in HTML documents to determine the degrees of similarity among different documents.

III. OUR COPY DETECTION APPROACH

In our copy detection approach, we first apply the semantic hierarchy construction tool [14] to extract text, i.e., data content, and remove tags from an HTML document, which yields a tag-less tree structure that precisely captures the hierarchical relationships of the document text.³ Extracting the content of an HTML document is essential, since information is often presented in the text of the HTML document, whereas tags, in general, are constructs mainly used for identifying the type, format, and/or structure of text in the document and are not useful for representing data content. Hereafter, the hierarchy is passed through a stopword⁴-removal [11] and stemming [15] process, which removes all the stopwords and replaces the words in the sentences residing at each node in the hierarchy by their stems. For example, the stem of computing, computer, computation, computational, and computers is comput.
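The stopword-removal and stemming step can be sketched in a few lines. This is our own minimal illustration, not the authors' implementation: the stopword list is a tiny stand-in for the full list of [11], and the suffix-stripping stemmer is a rough stand-in for the stemming algorithm of [15].

```python
# Sketch of the preprocessing step: stopword removal followed by stemming.
# The stopword set and the naive suffix stripper are illustrative stand-ins
# for the full algorithms cited in the paper ([11], [15]).

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "on"}

def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a real system would use a full stemmer."""
    for suffix in ("ational", "ation", "ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [naive_stem(w) for w in words if w and w not in STOPWORDS]

# The paper's example: all five variants reduce to the stem "comput".
print(preprocess("computing, computer, computation, computational, and computers"))
```

Even this naive stemmer reproduces the paper's example, mapping all five word forms to `comput`, while the stopword filter drops "and".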
This process reduces the total number of words in a document to be considered and thus reduces the size and complexity of document comparison. Since our copy detection approach is a sentence-based approach,⁵ the words in each stemmed, stopword-free sentence in one document are compared with the non-stop, stemmed words in a sentence in another document to determine whether the sentences are similar, using the fuzzy set IR approach. Using a simple tree matching algorithm, similar sentences in different nodes of two semantic hierarchies are linked together, and the relative locations of matched sentences in the documents are graphically depicted by dotplot views [2] and in the hierarchies. We discuss the details below.

A. Capturing document content using semantic hierarchies

Data in HTML documents are hierarchical and semistructured in nature. In order to extract the content of an HTML document, the document is passed through our semantic hierarchy construction tool, from which the text (not the tags) of the document is extracted. The tool takes an HTML document D as input and returns a tag-less, hierarchical structure of the data in D according to the HTML grammar.

²Fingerprints of a document yield the set of all possible document substrings of a certain length, called fingerprints, and fingerprinting is the process of generating fingerprints.
³The semantic hierarchy construction tool, which is fully automated, does not require the internal structure of an HTML document to be known.
⁴Stopwords in a sentence, such as articles, conjunctions, prepositions, punctuation marks, numbers, non-alphabetic characters, etc., often do not play a significant role in representing the meaning of the sentence.
⁵We use the sentence-boundary disambiguation algorithm in [16], which has accuracy in the 90% range and operates in linear time, to determine sentence boundaries.

TABLE I
DOCUMENTS USED FOR CONSTRUCTING THE TERM-TERM CORRELATION MATRIX IN OUR FUZZY SET IR APPROACH

Sources   | Size    | No. of Pages | No. of Sentences | No. of Non-Stop Stemmed Words
TREC      | 1.29 GB | 4,390        | 4,829,653        | 28,983,918
Gutenberg | 1.53 GB | 5,643        | 8,935,215        | 72,681,213
Web News  | 0.63 GB | 39,349       | 701,520          | 5,429,643
Etext.org | 1.13 GB | 3,124        | 6,239,154        | 41,624,048
Total     | 4.58 GB | 52,506       | 20,705,542       | 148,718,822

During the process of constructing the semantic hierarchy portion for non-table data, HTML tags are divided into different levels, which include headings, block-level, text-level (font, phrase, form, etc.), and addresses. These tags determine the relative locations of different portions of data content in the hierarchy, and the data content are clustered, merged, and promoted in the hierarchy during the construction process. To create the semantic hierarchy portion for HTML table data, if they exist, the hierarchical dependencies (e.g., row and column order) among the data content in the table are determined using various HTML table tags. (See [14] for details.)

B. Copy detection using the fuzzy set IR approach

We construct the term-term (i.e., word-word) correlation matrix, with rows and columns associated with words, to define the degree of similarity among words in the documents of the collection shown in Table I. In this matrix, c_{i,j} (Equation 1, below) is the correlation factor between words i and j, where n_{i,j} is the number of documents in the collection (the data set shown in Table I) containing both words i and j, and n_i (n_j, respectively) is the number of documents in the collection containing word i (word j, respectively). After converting two HTML documents into their semantic hierarchies and performing stopword removal and stemming on their text in each node, we adopt and modify the fuzzy set
Using the degrees of similarity among IR model [1] to determine sentences in any two documents in words two we obtain the degree of similarity sentences, (i.e., in the nodes of their semantic hierarchies) that are either of the sentences, which measures the extent to which the similar or different. It has been observed that words that are sentences match. To obtain the degree of similarity between common to a number of documents discuss the same subject. two sentences and Sl Sj, we first compute the word-sentence 1) Degree of similarity in fuzzy set IR model: At a high correlation of word i in Sl with all the words in factor ILi,j level, a sentence can be treated as a group of words arranged which measures the degree of similarity between i and (all Si, in a particular order. In English language, two sentences can the words as in) Sj, be semantically the same but differ in structure or usage of words in the sentences, and thus matching two sentences is (2) Aij = 1 JJ (1 Ci,k), approximate or vague. This can be modeled by considering kESj that each word in a sentence is associated with a fuzzy set that contains words with similar meaning, and there is a where k is one of the words in Sj and Ci,k is the correlation degree of similarity (usually less than 1) between (words in) a factor between words i and k as defined in Equation 1. Using the the 1u-value of each word in sentence Si as defined sentence and the set. This interpretation in fuzzy theory is the in Equation 2, which is computed against sentence Si, we fundamental concept of various fuzzy set models in IR [11]. 
define the degree of similarity of S_i with respect to S_j as follows:

Sim(S_i, S_j) = (μ_{w_1,j} + μ_{w_2,j} + · · · + μ_{w_n,j}) / n,     (3)

where w_k (1 ≤ k ≤ n) is a word in S_i and n is the total number of words in S_i. Sim(S_i, S_j) is a normalized value. Likewise, Sim(S_j, S_i), which is the degree of similarity of S_j with respect to S_i, is defined accordingly.

We adopt the fuzzy set IR model for copy detection, since (i) identification and measurement of similar sentences, apart from exact matches, further enhances the capability of a copy detection approach, and (ii) the fuzzy set model is designed and proven to work well for partially related semantic content, which suits the problem of copy detection of similar sentences.

In the fuzzy set IR model, a term-term correlation matrix, which is constructed in a pre-processing step of our copy detection approach, consists of words⁶ and their corresponding correlation factors that measure the degrees of similarity among different words, such as "automobile" and "car." We modify the fuzzy set IR model to determine the degrees of similarity among different sentences by computing the correlation factors between any pair of words from two different sentences in their respective documents. The following word-word correlation factor, c_{i,j}, defines the extent of similarity between any two words i and j:

c_{i,j} = n_{i,j} / (n_i + n_j − n_{i,j}),     (1)

with n_{i,j}, n_i, and n_j as defined above.

Using Equation 3, we determine whether two sentences S_i and S_j should be treated as the same, i.e., EQ, as follows:

EQ(S_i, S_j) = 1, if MIN(Sim(S_i, S_j), Sim(S_j, S_i)) ≥ 0.825 and |Sim(S_i, S_j) − Sim(S_j, S_i)| ≤ 0.15; 0, otherwise,     (4)

where 0.825 is called the permissible threshold value and 0.15 is called the variation threshold value. The permissible threshold is a value set to obtain the minimal similarity between any two sentences S_1 and S_2 in our copy detection approach, which is used partially to determine whether S_1 and S_2 should be treated as equal (EQ). Along with the permissible threshold value, the variation threshold value can be used to decrease the number of false positives (i.e., sentences that are different but are treated as the same) and false negatives (i.e., sentences that are the same but are treated as different). The variation threshold value sets the maximal allowable difference in sentence sizes between S_1 and S_2, which is computed by calculating the difference between Sim(S_1, S_2) and Sim(S_2, S_1). The threshold values 0.825 and 0.15 thus provide the necessary and sufficient conditions for estimating the equality of two sentences and are determined by testing the documents in the collection, as shown in Table I and Figure 1. We explain below how to compute the threshold values, which ensure that the number of common words in S_i and S_j is significantly high to at least a certain degree, i.e., the sentences are at least 82.5% alike.

⁶From now on, unless stated otherwise, whenever we use the term word, we mean a non-stop, stemmed word.

[Fig. 1. Determination of the permissible and variation threshold values for similar documents using the fuzzy set IR approach: (a) permissible threshold value; (b) variation threshold value.]

2) Determining the permissible and variation threshold values: A total of 200 documents, randomly sampled from the set of 52,506 documents shown in Table I, were used in part to determine the permissible and variation threshold values. Out of the 200 sampled documents, 12.5%, 17.5%, 32.5%, and 37.5% were chosen from the archive sites TREC (http://trec.nist.gov/data/), Gutenberg (ftp://ftp.archive.org/pub/etext/), news articles on the Web, and Etext.org (ftp://ftp.etext.org/pub/), a text archive site, respectively, and 33% of the 200 documents are HTML documents. Using the 200 documents, we determined the permissible and variation threshold values.
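Equations 1-4 condense into a short sketch. This is our own minimal illustration, not the authors' implementation: `docs` is a three-document toy stand-in for the 52,506-document collection of Table I, and the thresholds are the paper's 0.825 and 0.15.

```python
# Minimal sketch of Equations 1-4 (our illustration, not the authors' code).
# docs: toy stand-in for the collection used to build the term-term matrix.
docs = [{"car", "engin"}, {"automobil", "engin"}, {"car", "automobil", "bus"}]

def corr(i, j):
    """Equation 1: c_{i,j} = n_{i,j} / (n_i + n_j - n_{i,j})."""
    n_i = sum(1 for d in docs if i in d)
    n_j = sum(1 for d in docs if j in d)
    n_ij = sum(1 for d in docs if i in d and j in d)
    denom = n_i + n_j - n_ij
    return n_ij / denom if denom else 0.0

def mu(i, s_j):
    """Equation 2: mu_{i,j} = 1 - prod over k in S_j of (1 - c_{i,k})."""
    prod = 1.0
    for k in s_j:
        prod *= 1.0 - corr(i, k)
    return 1.0 - prod

def sim(s_i, s_j):
    """Equation 3: normalized sum of the mu-values of S_i's words vs. S_j."""
    return sum(mu(w, s_j) for w in s_i) / len(s_i)

def eq(s_i, s_j, permissible=0.825, variation=0.15):
    """Equation 4: both directed similarities high, and close to each other."""
    a, b = sim(s_i, s_j), sim(s_j, s_i)
    return 1 if min(a, b) >= permissible and abs(a - b) <= variation else 0

print(eq(["car", "engin"], ["engin", "car"]))  # 1: same words, different order
print(eq(["car"], ["bus"]))                    # 0: too weakly correlated
```

Note that a word's correlation with itself is 1 under Equation 1, so sentences that share all their stems reach Sim = 1 in both directions regardless of word order, which is exactly the behavior the sentence-based approach needs.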
According to each predefined threshold value V (0.5 ≤ V ≤ 1.0, with an increment of 0.05), we categorized each pair of sentences S_1 and S_2 in the 200 documents as equal or different, depending on V and the minimum of the two degrees of similarity computed on S_1 and S_2: S_1 and S_2 are treated as equal if MIN(Sim(S_1, S_2), Sim(S_2, S_1)) ≥ V; otherwise, they are different. Hereafter, we manually examined each pair of sentences to determine the accuracy of the conclusion (i.e., equal or different) drawn on the sentences, computed the numbers of false positives and false negatives, and plotted the outcomes in a graph. According to the graph shown in Figure 1(a), we set the permissible threshold value to 0.825, which is considered to be the optimal value and the "balance" point of false positives and false negatives, i.e., neither the false positives nor the false negatives dominate the copy detection errors, and the value is thus the compromise between the two. The curves in Figure 1(a) show that as the (permissible) threshold value increases, the percentage of false positives decreases, whereas the percentage of false negatives increases. This is because as the threshold value increases, more documents are filtered and the number of retrieved documents decreases. Thus, more documents that are similar are not detected as the threshold increases; as a result, the false negatives increase, and the opposite holds for false positives. To obtain the variation threshold value, we examined a set PS of pairs of sentences whose minimal degree of similarity exceeds or equals the permissible threshold value. The differences between the degrees of similarity of the sentence pairs in PS are first calculated and then sorted. The percentages of false positives and false negatives are then determined manually and plotted at regular intervals in sorted order.
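The permissible-threshold sweep described above can be sketched as follows. The judged pairs below are hypothetical placeholders for the 200 manually examined documents; only the sweep mechanics (label at each V, count false positives and false negatives against the manual judgment) follow the paper.

```python
# Sketch of the permissible-threshold sweep: for each candidate value V,
# label sentence pairs equal/different and count false positives/negatives
# against a manually judged gold standard.

def sweep(pairs, gold, sim_values, lo=0.50, hi=1.00, step=0.05):
    """pairs: pair ids; gold[p]: True if the pair is truly equal;
    sim_values[p]: MIN of the two directed Sim values for the pair.
    Returns one (V, false_positives, false_negatives) triple per V."""
    results = []
    v = lo
    while v <= hi + 1e-9:
        fp = sum(1 for p in pairs if sim_values[p] >= v and not gold[p])
        fn = sum(1 for p in pairs if sim_values[p] < v and gold[p])
        results.append((round(v, 2), fp, fn))
        v += step
    return results

# Hypothetical judged data; the crossing point of the FP and FN curves is
# the "balance" point the paper picks as the permissible threshold.
pairs = ["p1", "p2", "p3", "p4"]
gold = {"p1": True, "p2": True, "p3": False, "p4": False}
sims = {"p1": 0.95, "p2": 0.80, "p3": 0.70, "p4": 0.60}
for v, fp, fn in sweep(pairs, gold, sims):
    print(v, fp, fn)
```

As V grows, false positives fall and false negatives rise, mirroring the curves of Figure 1(a).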
Since the minimum permissible threshold value is set to 0.825, the difference in similarity between two sentences can only range between 0 and 0.175. The greater the difference in similarity, the greater the chance of error. For example, if the difference is 0.175, one of the similarity measures can be Sim(S_1, S_2) = 1.0 while the other is Sim(S_2, S_1) = 0.825. This happens when the set of words in S_1 is a subset of the words in S_2, e.g., when sentence S_1 is "The boy goes to school every day on the bus," whereas sentence S_2 is "The bus going to school passing by the lake stops for the boy at the corner every day." In this example, S_1 is a part of S_2, but not vice versa, and the two should be treated as different. According to Figure 1(b), the ideal variation threshold value in the range 0 to 0.175 is 0.15, which is the intersection of the false positive and false negative curves and is dominated by neither false positives nor false negatives.

IV. DETECTING SIMILAR HTML DOCUMENTS

In this section, we present our approach for determining to what extent the contents of different HTML documents are in common, based on the copy detection approach presented in Section III. We first discuss the odds ratio, which is used to precisely capture the degree of overlap between two HTML documents according to the degrees of resemblance (to be defined below) between the documents. Hereafter, we introduce our tree matching approach, which determines and depicts the overlapping portions of two HTML documents graphically, using their corresponding semantic hierarchies and a dotplot view.

A. Odds ratios

We can compute the number of similar sentences appearing in any two documents using the formula EQ in Equation 4.
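The document-level measures developed in the remainder of this subsection (Equations 5-7 below) reduce to a few lines. The sketch below is our own minimal preview, not the authors' implementation; `eq_test` stands in for the sentence-equality function EQ of Equation 4, and the cap of 100 mirrors the paper's convention for representing an infinite odds ratio.

```python
# Sketch of Equations 5-7: degree of resemblance via pairwise EQ counts,
# then the Dempster-style product p and the odds ratio p / (1 - p).

def resemblance(doc1, doc2, eq_test):
    """Equation 5: double sum of EQ over sentence pairs, divided by |doc1|."""
    return sum(eq_test(s1, s2) for s1 in doc1 for s2 in doc2) / len(doc1)

def odds_ratio(rs12, rs21, cap=100.0):
    """Equations 6-7: p / (1 - p) with p = RS(d1,d2) * RS(d2,d1);
    capped at 100 to stand in for infinity, as in the paper."""
    p = rs12 * rs21
    return cap if p >= 1.0 else p / (1.0 - p)

# The paper's example: resemblance degrees (0.9, 0.9) give odds ratio 4.26,
# while (0.1, 0.9) gives roughly 0.099.
print(round(odds_ratio(0.9, 0.9), 2))  # 4.26
print(round(odds_ratio(0.1, 0.9), 3))  # 0.099
```

Because the product of the two directed resemblance degrees plays the role of p, the odds ratio grows without bound as both degrees approach 1, which is why a cap is needed for identical documents.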
Using EQ, we define the degree of resemblance (RS) of document doc_1 with respect to doc_2, which counts the number of similar sentences in doc_1 that appear in doc_2, as follows:

RS(doc_1, doc_2) = (∑_{i=1}^{m} ∑_{j=1}^{n} EQ(S_{d1,i}, S_{d2,j})) / m,     (5)

where S_{d1,i} (1 ≤ i ≤ m) is a sentence in doc_1, S_{d2,j} (1 ≤ j ≤ n) is a sentence in doc_2, and m (n, respectively) is the total number of sentences in doc_1 (doc_2, respectively). RS(doc_2, doc_1) is defined accordingly.

[Fig. 2. Odds ratios among documents]

TABLE II
ODDS RATIOS AMONG SIX DOCUMENTS IN A COLLECTION

     | Doc0  | Doc1  | Doc3  | Doc4  | Doc5  | Doc7
Doc0 | 100   | 9.21  | 0.064 | 0.058 | 0.01  | 0
Doc1 | 9.21  | 100   | 0.09  | 0.051 | 0.001 | 0.021
Doc3 | 0.064 | 0.09  | 100   | 8.242 | 0.1   | 0.015
Doc4 | 0.058 | 0.051 | 8.242 | 100   | 0     | 0.009
Doc5 | 0.01  | 0.001 | 0.1   | 0     | 100   | 12.4
Doc7 | 0     | 0.021 | 0.015 | 0.009 | 12.4  | 100

After obtaining the degrees of resemblance between two documents doc_1 and doc_2, we represent these two degrees with a single value, which depicts the relative degree of resemblance between doc_1 and doc_2. These single values for different pairs of documents, e.g., between doc_1 and doc_2, and between doc_1 and doc_3, can be used to compare the relative degrees of similarity among different documents. We obtain this single value using the odds ratio, which comes with the property that the higher the odds ratio between two documents is, the more the two documents resemble each other. The odds ratio, which is also simply called odds in the literature, is defined as the ratio of the probability (p) that an event occurs to the probability (1 − p) that it does not, i.e.,

Odds ratio = p / (1 − p).     (6)

The reasons for adopting the odds ratio are threefold. First, it is easy to compute. Second, it is a natural, intuitively acceptable way to express the magnitude of an association. Third, it can be linked to other statistical methods, such as Bayesian statistical modeling, the Dempster-Shafer theory of evidence, and probability analysis. In Equation 6, p and (1 − p) are the odds. The ratio gives the positive versus negative value, given that p is positive and (1 − p) is negative. As p approaches one, the odds ratio approaches infinity.

To obtain p as defined in the odds ratio, we apply the Dempster-Shafer theory of evidence, which combines two or more evidences or propositions to obtain a single proposition; in our case, the evidences or propositions are the degrees of resemblance between two documents. Dempster's combination rule computes a measure of agreement between two bodies of evidence concerning various propositions discerned from a common frame of discernment [17]. According to this rule, which is based on independent items of evidence, if the probability for evidence E_1 to be reliable is P_1 and the probability for evidence E_2 to be reliable is P_2, then the probability that both E_1 and E_2 are reliable is P_1 × P_2. Applying the Dempster-Shafer rule to the degrees of resemblance RS(doc_1, doc_2) and RS(doc_2, doc_1) between two documents doc_1 and doc_2, we obtain RS(doc_1, doc_2) × RS(doc_2, doc_1), which plays the role of p in the odds ratio. Hereafter, we compute the odds ratio of doc_1 and doc_2 as follows:

Odds ratio = (RS(doc_1, doc_2) × RS(doc_2, doc_1)) / (1 − (RS(doc_1, doc_2) × RS(doc_2, doc_1))).     (7)

The odds ratio, along with the Dempster-Shafer rule of combining evidence, serves the purpose of deriving a single measure between two documents that reflects how similar the documents are. The higher the odds ratio of the two documents is, the smaller the difference between the documents is, and the greater their similarity is. For example, if the degrees of resemblance of two documents are 0.9 and 0.9, then the odds ratio is 4.26, whereas if the degrees of resemblance of the two documents are 0.1 and 0.9 instead, then the odds ratio is 0.098, which is further distant
Thus, the odds ratio accurately reflects the "closeness" and "difference" of the two corresponding documents. If the product of similarities is one, we assign a value 100 as odds ratio to represent infinity. Example 1: We compute odds ratios on a collection of ten Web documents, which show the overall similarity between any pair of documents, to illustrate the accuracy of the odd ratios. The ten documents were retrieved from the Internet by the Google search engine, using keywords such as NAT, IP Address, ONS, etc. The computed odds ratios among all these documents are shown in Figure 2, where we intentionally set the odd ratios of identical pairs of documents to zero (which are supposed to be 100 instead) so that the graph is more visible. We choose six out of the ten documents, i.e., Doco, Doc,, Doc3, Doc4, Doc5, and DoC7, to demonstrate their relative odd ratios as shown in Table II, which indicates that Doco and Doc1 (DoC3 and Doc4, and Doc5 and DoC7, respectively) are more similar to each other than to the other documents. Consider the two documents Doco and Doc1 in the collection and as shown in Table II. Doco (http://www.cisco. com/warp/public/473/lan-switch-cisco.pdf), entitled "How LAN Switches work:' discusses networks, switches, switching topologies, bridging, spanning trees and VLANs, whereas Doc, (http://www.howstuffworks.com/lanswitch.htm/printable), entitled "How Switches work," which introduces switches from a beginner's point of view, explains the functioning of switches, networking basics, traffic, fully switched networks, switch configurations, and other topics related to switches. The degrees of similarity between Doco and Doc1 are Sim(Doco, Doc,) = 0.91 and Sim(Docl, 697 should have already been converted into binary trees, since a semantic hierarchy is not necessary a binary tree. 
Hereafter, the Tree-Matching algorithm will be called using the binary trees S and T, which is followed by calling algorithm GraphicaliDisplay to display the result, i.e., matched nodes in the semantic hierarchies and matched sentences in a dotplot view. In the TreeMatching algorithm, pointers to the roots of the source tree S and target tree T are passed as input arguments. The algorithm traverses trees S and T in inorder, and calls procedure Node-Comparison to compare sentences in any node n, of S with sentences in any node n2 of T. If n, and n2 contain similar sentences, the total number of similar sentences found in n, (n2, respectively), as well as the NodeJD of ni (Node-D of n2, respectively), are recorded in the node data structure of n2 (nl, respectively). Algorithm Update-Sentence-Pos(T, *Pos) Input: T, a semantic hierarchy in its binary tree format, and a sentence position index *Pos. Output: Updated SPos list in the node data structure of each node in T. BEGIN 1. IF T has a LEFT SUBTREE, call Update-SentenceIPos(T->LEFT, Pos). 2. FOR each sentence (in the given order) in T, DO (a) T.SPos[i++] = *Pos /* i is set to 0 before the Loop */ (b) *Pos = *Pos + 1 3. IF T has a RIGHT SUBTREE, call Update-SentencePos(T--RIGHT, Pos). END Doco) = 0.98, and the odds ratio is 8.242, which is treated as moderately high. B. Tree Matching Algorithm We present a simple tree matching algorithm to match nodes in one semantic hierarchy with nodes in another semantic hierarchy such that the matched nodes contain at least one similar sentence. The matching captures common data content of the corresponding HTML documents, running in n log n time complexity, where log is the logarithmic base 2 function and n is the number of nodes in a semantic hierarchy. The proposed tree matching algorithm is (i) efficient in terms of time complexity, (ii) easy to implement, and (iii) intuitive and natural in expressing matching nodes of two different semantic hierarchies. 
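The position-assignment and matching traversals can be sketched in executable form. This is our own illustration, not the authors' code: the Node class mirrors the <D, SPos, MNInfo, LEFT, RIGHT> node data structure described below, with MNInfo simplified to a Python list instead of a linked list, and `eq_test` standing in for EQ of Equation 4. Note that, as written, the sketch compares every node pair and so runs in quadratic time in the number of nodes.

```python
# Minimal sketch (ours) of inorder position assignment and tree matching.

class Node:
    def __init__(self, sentences, left=None, right=None):
        self.data = sentences          # D: sentences stored in this node
        self.spos = []                 # SPos: global sentence positions
        self.mninfo = []               # MNInfo: (matched node, sentence pos)
        self.left, self.right = left, right

def update_sentence_pos(t, pos):
    """Inorder pass assigning consecutive positions; pos is a one-element
    list acting as the *Pos index of algorithm Update-Sentence-Pos."""
    if t is None:
        return
    update_sentence_pos(t.left, pos)
    for _ in t.data:
        t.spos.append(pos[0])
        pos[0] += 1
    update_sentence_pos(t.right, pos)

def node_comparison(s, n, eq_test):
    """Inorder walk of source tree s, recording sentence matches with node n."""
    if s is None:
        return
    node_comparison(s.left, n, eq_test)
    for i, mi in enumerate(s.data):
        for j, nj in enumerate(n.data):
            if eq_test(mi, nj) == 1:
                n.mninfo.append((s, s.spos[i]))
                s.mninfo.append((n, n.spos[j]))
    node_comparison(s.right, n, eq_test)

def tree_matching(s, t, eq_test):
    """Inorder walk of target tree t, comparing each node against all of s."""
    if t is None:
        return
    tree_matching(s, t.left, eq_test)
    node_comparison(s, t, eq_test)
    tree_matching(s, t.right, eq_test)

# Example: one shared sentence between a two-node source and one-node target.
s = Node(["the boy rides the bus"], left=Node(["a lake"]))
t = Node(["the boy rides the bus"])
update_sentence_pos(s, [0])
update_sentence_pos(t, [0])
tree_matching(s, t, lambda a, b: 1 if a == b else 0)
```

After the run, the target node records a match against the source root at inorder sentence position 1 (the source's left-subtree sentence occupies position 0), which is exactly the bookkeeping the dotplot display consumes.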
Intuitively, given two semantic hierarchies, which are called the source tree and the target tree, we match nodes in the two trees by traversing the trees inorder. Assume that the two semantic hierarchies to be matched are S and T, where S is the source tree and T is the target tree. Nodes in T are considered one by one according to the order of the inorder traversal. For each node n in T, S is parsed inorder to match n, i.e., determining the (total number of) sentences in each node in S that are similar to the sentences in n. When a match of n is found in S, the node data structure of n is updated with the necessary information of the matching node m in S, which includes the Node-ID of m and the sentence positions of the similar sentences in m. The node structure information of m is also updated accordingly. This process continues until the entire tree T is parsed.

Algorithm Tree-Matching(S, T)
Input: Two semantic hierarchies, source tree S and target tree T, which are binary trees.
Output: Matched nodes and sentences in S and T.
BEGIN
1. IF T has a LEFT SUBTREE, call Tree-Matching(S, T->LEFT).
2. Call Node-Comparison(S, T).
3. IF T has a RIGHT SUBTREE, call Tree-Matching(S, T->RIGHT).
END

We adopt the following node data structure <D, SPos, MNInfo, LEFT, RIGHT> for each node N in S (T, respectively) in our tree matching algorithm:
• Data (D): Sentences in N.
• Sentence positions (SPos): The positions of sentences S_i, ..., S_j in N are maintained in the structure (L_{S_i}, ..., L_{S_j}), and are obtained according to (i) the relative order of N in the inorder traversal of the tree and (ii) the relative order of the sentences in N.
• MatchedNodeInfo (MNInfo): Information on each node M in the other tree that matches N, stored in a linked list L; each component of L is <NodeID, SP, NextLink>, where NodeID is the physical address of M, SP is the sentence position of a similar sentence in M, and NextLink is the pointer to the next component in L.
Procedure Node-Comparison(S, N)
Input: A semantic hierarchy S and a node N.
Output: Similar sentences detected between the nodes of S and node N.
BEGIN
1. IF S has a LEFT SUBTREE, call Node-Comparison(S->LEFT, N).
2. Determine the similar sentences in S and N using the fuzzy set IR approach.
   (a) Let M be S. /* the node currently pointed to by S */
   (b) Let the sentences in M be m1, ..., mk, and let the sentences in N be n1, ..., np.
   (c) FOR i = 1..k, DO
         FOR j = 1..p, DO
           IF EQ(mi, nj) = 1, THEN
             i. N.MNInfo->NodeID = M and N.MNInfo->SP = M.SPos[i].
             ii. M.MNInfo->NodeID = N and M.MNInfo->SP = N.SPos[j].
3. IF S has a RIGHT SUBTREE, call Node-Comparison(S->RIGHT, N).
4. RETURN
END

Algorithm Graphical-Display(T, S)
Input: Semantic hierarchies T and S.
Output: Graphical display of the matched sentences of S and T in a dotplot view and of the matched nodes in T.
BEGIN
1. IF T has a LEFT SUBTREE, call Graphical-Display(T->LEFT).
2. Let N be T.
   (a) MNInfo := N->MNInfo.
   (b) WHILE MNInfo->NextLink ≠ NULL /* perform the graphical display */
       i. M := MNInfo->NodeID
          (I) Cnt := Cnt + 1. /* Cnt is a similar-sentences counter, initialized to zero */
          (II) Create an edge from M to N, if there is no edge connecting M and N.
          (III) Plot the pair (M->MNInfo.SP, N->MNInfo.SP) on a dotplot view.
       ii. MNInfo := MNInfo->NextLink
       END WHILE
   (c) Shade the region of M with the percentage computed by Cnt / |M.SPos|.
3. IF T has a RIGHT SUBTREE, call Graphical-Display(T->RIGHT).
4. RETURN
END
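The two quantities that Graphical-Display derives before drawing can be checked with a short sketch (helper names are illustrative, not the paper's code): the shading percentage of a node is Cnt / |SPos|, and each matched sentence-position pair from MNInfo contributes one dot to the dotplot view.

```python
def shading_percentage(cnt: int, num_sentences: int) -> int:
    """Cnt matched sentences out of |SPos| total, as a whole-number percentage."""
    return round(100 * cnt / num_sentences)

def dotplot_points(matches):
    """Each (SP in S, SP in T) pair becomes one dot in the dotplot view."""
    return sorted(matches)

# The two shaded nodes discussed in Example 2:
print(shading_percentage(1, 3))           # → 33  (one of three sentences matched)
print(shading_percentage(3, 4))           # → 75  (three of four sentences matched)
print(dotplot_points([(5, 2), (1, 0)]))   # → [(1, 0), (5, 2)]
```

Sorting the pairs only affects plotting order, not the resulting picture; similar passages appear as diagonal runs of dots, as in Helfman's dotplot technique [2].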
Example 2: Consider two HTML documents, Sistani1 (S) and Sistani2 (T), which are first converted into binary trees before Tree-Matching is applied to determine the overlap between the documents. In Figure 3(a), which shows portions of the two trees, 33% of the area of Node 45 of T is shaded, since one of the three sentences in Node 45 is similar to the corresponding sentence in Node 46 of S, whereas Node 49 of T has a 75% shaded area, since three of the four sentences in that node match sentences in various nodes of S. The dotplot view in Figure 3(b) depicts the relative positions of the thirteen similar sentences in the documents.

Fig. 3. Matching nodes and sentences in documents S and T: (a) node matching of S and T; (b) sentence matching in S and T.

V. CONCLUSIONS

We have presented a new approach that detects similar Web documents, especially HTML documents. We first apply a fuzzy set IR copy detection approach to detect similar sentences in two documents and then compute the odds ratio of the documents to reflect their degree of common content. Our copy detection approach is unique, since it (i) performs copy detection among HTML documents, which can act as a filter for various Web search engines and Web crawlers, improving efficiency by indexing fewer documents and ignoring similar ones, (ii) detects similar sentences, apart from identical sentences, by applying the fuzzy set IR approach to the stemmed, non-stopword terms in HTML documents, and (iii) captures the similar sentences in any two HTML documents graphically, displaying the locations of the overlapping portions of the documents. Not only does our copy detection approach handle a wide range of documents of varying sizes, but, since it requires no static word lists, it is also applicable to (HTML) Web documents in different subject areas, such as sports, news, and science. Experimental results on our copy detection and similar-document detection approaches look promising.

REFERENCES

[1] Y. Ogawa, T. Morita, and K. Kobayashi, "A fuzzy document retrieval system using the keyword connection matrix and a learning method," Fuzzy Sets and Systems, vol. 39, pp. 163-179, 1991.
[2] J. Helfman, "Dotplot patterns: A literal look at pattern languages," Theory and Practice of Object Systems, vol. 2, no. 1, pp. 31-41, 1996.
[3] N. Shivakumar and H. Garcia-Molina, "SCAM: A copy detection mechanism for digital documents," D-Lib Magazine, 1995, http://www.dlib.org.
[4] U. Manber, "Finding similar files in a large file system," in USENIX Winter Technical Conference, Jan. 1994.
[5] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," in Proceedings of the 1995 ACM SIGMOD, 1995, pp. 398-409.
[6] N. Heintze, "Scalable document fingerprinting," in Proceedings of the 2nd USENIX Workshop on Electronic Commerce, 1996, pp. 191-200.
[7] G. Di Lucca, M. Di Penta, and A. Fasolino, "An approach to identify duplicated web pages," in Proceedings of COMPSAC, 2002, pp. 481-486.
[8] D. Campbell, W. Chen, and R. Smith, "Copy detection systems for digital documents," in Proceedings of IEEE Advances in Digital Libraries, 2000, pp. 78-88.
[9] A. Pereira and N. Ziviani, "Retrieving similar documents from the web," Journal of Web Engineering, vol. 2, no. 4, pp. 247-261, 2004.
[10] C. Yu, K. Liu, W. Wu, W. Meng, and N. Rishe, "Finding the most similar documents across multiple text databases," in Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, 1999, pp. 150-162.
[11] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[12] J. Cooper, A. Coden, and E. Brown, "Detecting similar documents using salient terms," in Proceedings of CIKM'02, 2002, pp. 245-251.
[13] J. Rabelo, E. Silva, F. Fernandes, S. Meira, and F. Barros, "ActiveSearch: An agent for suggesting similar documents based on user's preferences," in Proceedings of the Intl. Conf. on Systems, Man, and Cybernetics, 2001, pp. 549-554.
[14] S. Lim and Y.-K. Ng, "Constructing hierarchical information structures of sub-page level HTML documents," in Proceedings of the 5th Intl. Conf. on Foundations of Data Organization (FODO'98), 1998, pp. 66-75.
[15] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[16] D. Campbell, "A sentence boundary recognizer for English sentences," 1997, unpublished work.
[17] I. Ruthven and M. Lalmas, "Experimenting on Dempster-Shafer's theory of evidence in information retrieval," Journal of Intelligent Information Systems, vol. 19, no. 3, pp. 267-302, 2002.