Brigham Young University, BYU ScholarsArchive, All Faculty Publications, 2005-07-01

Original Publication Citation: Rajiv Yerra and Yiu-Kai Ng, "Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach." In Proceedings of the IEEE International Conference on Granular Computing (GrC'05), pp. 693-699, July 2005, Beijing, China. http://scholarsarchive.byu.edu/facpub/368

Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach

Rajiv Yerra and Yiu-Kai Ng
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: {[email protected], rajiv [email protected]}

Abstract- Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering unique information more time-consuming and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a new approach for detecting similar Web documents, especially HTML documents.
Our detection approach determines the odds ratio of any two documents, which makes use of the degrees of resemblance of the documents, and graphically displays the locations of similar (not necessarily identical) sentences detected in the documents after (i) eliminating non-representative words in the sentences using stopword-removal and stemming algorithms, (ii) computing the degree of similarity of sentences using a fuzzy set information retrieval approach, and (iii) matching the corresponding hierarchical content of the two documents using a simple tree matching algorithm. The proposed method for detecting similar documents handles a wide range of Web pages of varying size and does not require static word lists; it is thus applicable to different Web (especially HTML) documents in different subject areas, such as sports, news, science, etc.

Index Terms- fuzzy set model, Web information retrieval, copy detection, HTML document, odds ratio

I. INTRODUCTION

Unlike images, it is not obvious to the naked eye whether two text documents are similar, let alone distinguishing text from tags in HTML documents. Since tags are intermixed with the actual content of an HTML document, they further complicate the problem of visualizing which portions of different HTML documents match, because matching content in HTML documents requires careful examination of the documents. Document analysis techniques are required in order to automate the entire process and provide a metric that shows the extent of similarity of two HTML documents. We propose a unique method for detecting similar HTML documents, which (i) performs copy detection on HTML documents, instead of the plain text documents for which existing copy detection methods are designed, (ii) detects HTML documents with similar¹ sentences, and (iii) specifies the overlapping locations of any two HTML documents graphically.
In this paper, we first introduce a sentence-based copy detection approach, which analyzes the contents of HTML documents to determine their degrees of similarity. Our copy detection approach eliminates stopwords and reduces each non-stop word in a document to its stem; the resulting words are compared with the non-stop, stemmed words in another document to determine their similarities using the fuzzy set information retrieval (IR) approach [1]. Similar sentences in different documents are detected, and the relative locations of matched sentences in the documents are depicted by dotplot views [2].

We proceed to present our results as follows. In Section II, we discuss related work in copy detection and existing methods for detecting similar documents. In Section III, we introduce our copy detection approach, present dotplot views that display similar sentences in two HTML documents, and include the experimental results of our copy detection approach. In Section IV, we propose our method for calculating the odds ratios among different HTML documents to quantify their similarity. In Section V, we give concluding remarks.

One of the problems on the Internet these days is redundant information, which exists due to replicated pages archived at different locations, such as mirror sites. As a result, the burden is on the user to sort through all the documents retrieved from the Web to identify non-redundant ones, which is a tedious and tiring process. These documents can be found in different forms, such as documents in different versions, small documents combined with others to form a larger document, and large documents co-existing with documents that are split from them. One classic example of such documents is news articles, where new information is added to an original article and is republished as a new (updated) article. Since the amount of information available on the Web increases on a daily basis, filtering redundant documents becomes an ever more difficult task for the user.
Fortunately, copy detection can be used as a filter, which identifies and excludes documents that overlap in content with other documents. We are particularly interested in copy detection of HTML documents, since they are abundant on the Web, even though there are different challenges in handling copy detection of HTML documents. HTML is a markup language, which means that the author of an HTML document must follow the HTML syntax to specify formatting commands that intermix with the document's text content.

¹From now on, whenever we use the term similar sentences (documents), we mean sentences (documents) that are semantically the same but differ in terms of the words used in the sentences (documents).

0-7803-9017-2/05/$20.00 ©2005 IEEE

II. RELATED WORK

Many efforts have been made in the past to solve the copy detection problem, which finds similarities among documents. Diff, a UNIX command, shows differences between two textual documents, down to spaces. SCAM [3] is a copy detection approach geared towards small documents. Different from SCAM, SIF [4] finds similar files in a file system by using the fingerprinting² scheme to characterize documents. SIF, however, cannot gauge the extent of document overlap nor display the location of overlap like SCAM, and the notion of similarity as defined in SIF is completely syntactic, e.g., files containing the same information but using different sentence structures will not be considered similar. COPS [5], which was developed specifically to detect plagiarism among documents, uses a hash-based scheme for copy detection. It compares hash values of test documents with those in the database for copy detection. COPS has several limitations: (i) it uses a hash function that produces a large number of collisions, (ii) documents to be compared by COPS must have at least 10 sentences, and (iii) it has problems selecting correct sentence boundaries.
KOALA [6], which is like COPS, is specifically designed for plagiarism detection and is a compromise between the random fingerprinting of SIF and the exhaustive fingerprinting of COPS, in that it selects substrings of a document based on their usage. This results in lower memory usage and increased accuracy. Neither KOALA nor COPS, however, can report the location of overlap between two documents or handle documents of varying size. More recently, [7] present an approach to identify duplicated HTML documents that differs significantly from ours, since they consider only the HTML tags, but not the actual contents, of HTML documents. Two HTML documents are treated as the same by [7] if the numbers of occurrences of their HTML tags are the same, which is inadequate. [8], whose copy detection approach is closer to ours, is restricted to textual documents and detects only identical sentences.

Our copy detection approach overcomes most of the limitations of existing approaches. Since our approach is not line-oriented, it overcomes the problems posed by Diff. Unlike KOALA and COPS, our approach is more general, since ours is not particularly targeted at plagiarism, even though we can handle that problem. Compared with SIF and SCAM, which cannot measure the extent of overlap, our approach can measure and specify the locations of overlap in (HTML) documents and the relative sentence positions in the documents. Overall, our approach handles HTML documents, for which existing copy detection approaches are not designed.

Other methods for detecting similar documents have also been proposed. [9] use the fingerprint approach to represent a given document, which then plays the role of a query to a search engine to retrieve documents for further comparison using the shingles and Patricia tree methods. As discussed earlier, the fingerprint approach is either completely syntactic or suffers from collisions.
[10] introduce a statistical method to identify multiple text databases for retrieving similar documents, in which documents, along with queries, are represented as vectors in the vector-space model (VSM) for comparison. VSM is a well-known and widely used IR model [11]; however, its reliance on term frequency, without considering a thesaurus of index terms in documents, can be a drawback in detecting similar documents. [12] characterize documents using multiword terms, in which each term is reduced to its canonical form, i.e., its stem, and is assigned a measure (called IQ) based on term frequency for ranking, which is essentially the VSM approach. Besides using VSM, [13] also consider users' profiles [11], which describe users' preferences, in retrieving documents, an approach that has not been widely used and proven. In contrast, we develop an approach for detecting similar, as well as exactly matching, sentences in HTML documents to determine the degrees of similarity among different documents.

III. OUR COPY DETECTION APPROACH

In our copy detection approach, we first apply the semantic hierarchy construction tool [14] to extract text, i.e., data content, and remove tags from an HTML document, which yields a tag-less tree structure that precisely captures the hierarchical relationships of the document text.³ Extracting the content of an HTML document is essential, since information is often presented in the text of the HTML document, whereas tags, in general, are constructs mainly used for identifying the type, format, and/or structure of text in the document and are not useful for representing data content. Hereafter, the hierarchy is passed through a stopword⁴-removal [11] and stemming [15] process, which removes all the stopwords and replaces the words in the sentences residing at each node in the hierarchy by their stems. For example, the stem of computing, computer, computation, computational, and computers is comput.
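The stopword-removal and stemming step can be sketched in a few lines. This is our own minimal illustration, not the authors' implementation: the stopword list is a tiny stand-in for the full list of [11], and the suffix-stripping stemmer is a rough stand-in for the stemming algorithm of [15].

```python
# Sketch of the preprocessing step: stopword removal followed by stemming.
# The stopword set and the naive suffix stripper are illustrative stand-ins
# for the full algorithms cited in the paper ([11], [15]).

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "on"}

def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a real system would use a full stemmer."""
    for suffix in ("ational", "ation", "ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [naive_stem(w) for w in words if w and w not in STOPWORDS]

# The paper's example: all five variants reduce to the stem "comput".
print(preprocess("computing, computer, computation, computational, and computers"))
```

Even this naive stemmer reproduces the paper's example, mapping all five word forms to `comput`, while the stopword filter drops "and".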
This process reduces the total number of words in a document to be considered and thus reduces the size and complexity of document comparison. Since our copy detection approach is a sentence-based approach,⁵ the words in each stemmed, stopword-free sentence in one document are compared with the non-stop, stemmed words in a sentence in another document to determine whether the sentences are similar, using the fuzzy set IR approach. Using a simple tree matching algorithm, similar sentences in different nodes of two semantic hierarchies are linked together, and the relative locations of matched sentences in the documents are graphically depicted by dotplot views [2] and in the hierarchies. We discuss the details below.

A. Capturing document content using semantic hierarchies

Data in HTML documents are hierarchical and semistructured in nature. In order to extract the content of an HTML document, the document is passed through our semantic hierarchy construction tool, from which the text (not the tags) of the document is extracted. The tool takes an HTML document D as input and returns a tag-less, hierarchical structure of the data in D according to the HTML grammar.

²Fingerprints of a document yield the set of all possible document substrings of a certain length, called fingerprints, and fingerprinting is the process of generating fingerprints.
³The semantic hierarchy construction tool, which is fully automated, does not require the internal structure of an HTML document to be known.
⁴Stopwords in a sentence, such as articles, conjunctions, prepositions, punctuation marks, numbers, non-alphabetic characters, etc., often do not play a significant role in representing the meaning of the sentence.
⁵We use the sentence-boundary disambiguation algorithm in [16], which has accuracy in the 90% range and operates in linear time, to determine sentence boundaries.

TABLE I
DOCUMENTS USED FOR CONSTRUCTING THE TERM-TERM CORRELATION MATRIX IN OUR FUZZY SET IR APPROACH

Sources   | Size    | No. of Pages | No. of Sentences | No. of Non-Stop Stemmed Words
TREC      | 1.29 GB | 4,390        | 4,829,653        | 28,983,918
Gutenberg | 1.53 GB | 5,643        | 8,935,215        | 72,681,213
Web News  | 0.63 GB | 39,349       | 701,520          | 5,429,643
Etext.org | 1.13 GB | 3,124        | 6,239,154        | 41,624,048
Total     | 4.58 GB | 52,506       | 20,705,542       | 148,718,822

During the process of constructing the semantic hierarchy portion for non-table data, HTML tags are divided into different levels, which include headings, block-level, text-level (font, phrase, form, etc.), and addresses. These tags determine the relative locations of different portions of data content in the hierarchy, and the data content are clustered, merged, and promoted in the hierarchy during the construction process. To create the semantic hierarchy portion for HTML table data, if they exist, the hierarchical dependencies (e.g., row and column order) among the data content in the table are determined using various HTML table tags. (See [14] for details.)

B. Copy detection using the fuzzy set IR approach

We construct the term-term (i.e., word-word) correlation matrix, with rows and columns associated with words, to define the degree of similarity among words in the documents of the collection shown in Table I. In this matrix, c_{i,j} (Equation 1, below) is the correlation factor between words i and j, where n_{i,j} is the number of documents in the collection (the data set shown in Table I) containing both words i and j, and n_i (n_j, respectively) is the number of documents in the collection containing word i (word j, respectively). After converting two HTML documents into their semantic hierarchies and performing stopword removal and stemming on their text in each node, we adopt and modify the fuzzy set
Using the degrees of similarity among IR model [1] to determine sentences in any two documents in words two we obtain the degree of similarity sentences, (i.e., in the nodes of their semantic hierarchies) that are either of the sentences, which measures the extent to which the similar or different. It has been observed that words that are sentences match. To obtain the degree of similarity between common to a number of documents discuss the same subject. two sentences and Sl Sj, we first compute the word-sentence 1) Degree of similarity in fuzzy set IR model: At a high correlation of word i in Sl with all the words in factor ILi,j level, a sentence can be treated as a group of words arranged which measures the degree of similarity between i and (all Si, in a particular order. In English language, two sentences can the words as in) Sj, be semantically the same but differ in structure or usage of words in the sentences, and thus matching two sentences is (2) Aij = 1 JJ (1 Ci,k), approximate or vague. This can be modeled by considering kESj that each word in a sentence is associated with a fuzzy set that contains words with similar meaning, and there is a where k is one of the words in Sj and Ci,k is the correlation degree of similarity (usually less than 1) between (words in) a factor between words i and k as defined in Equation 1. Using the the 1u-value of each word in sentence Si as defined sentence and the set. This interpretation in fuzzy theory is the in Equation 2, which is computed against sentence Si, we fundamental concept of various fuzzy set models in IR [11]. 
define the degree of similarity of S_i with respect to S_j as follows:

Sim(S_i, S_j) = (μ_{w_1,j} + μ_{w_2,j} + · · · + μ_{w_n,j}) / n,     (3)

where w_k (1 ≤ k ≤ n) is a word in S_i and n is the total number of words in S_i. Sim(S_i, S_j) is a normalized value. Likewise, Sim(S_j, S_i), which is the degree of similarity of S_j with respect to S_i, is defined accordingly.

We adopt the fuzzy set IR model for copy detection, since (i) identification and measurement of similar sentences, apart from exact matches, further enhances the capability of a copy detection approach, and (ii) the fuzzy set model is designed and proven to work well for partially related semantic content, which suits the problem of copy detection of similar sentences.

In the fuzzy set IR model, a term-term correlation matrix, which is constructed in a pre-processing step of our copy detection approach, consists of words⁶ and their corresponding correlation factors that measure the degrees of similarity among different words, such as "automobile" and "car." We modify the fuzzy set IR model to determine the degrees of similarity among different sentences by computing the correlation factors between any pair of words from two different sentences in their respective documents. The following word-word correlation factor, c_{i,j}, defines the extent of similarity between any two words i and j:

c_{i,j} = n_{i,j} / (n_i + n_j − n_{i,j}),     (1)

with n_{i,j}, n_i, and n_j as defined above.

Using Equation 3, we determine whether two sentences S_i and S_j should be treated as the same, i.e., EQ, as follows:

EQ(S_i, S_j) = 1, if MIN(Sim(S_i, S_j), Sim(S_j, S_i)) ≥ 0.825 and |Sim(S_i, S_j) − Sim(S_j, S_i)| ≤ 0.15; 0, otherwise,     (4)

where 0.825 is called the permissible threshold value and 0.15 is called the variation threshold value. The permissible threshold is a value set to obtain the minimal similarity between any two sentences S_1 and S_2 in our copy detection approach, which is used partially to determine whether S_1 and S_2 should be treated as equal (EQ). Along with the permissible threshold value, the variation threshold value can be used to decrease the number of false positives (i.e., sentences that are different but are treated as the same) and false negatives (i.e., sentences that are the same but are treated as different). The variation threshold value sets the maximal allowable difference in sentence sizes between S_1 and S_2, which is computed by calculating the difference between Sim(S_1, S_2) and Sim(S_2, S_1). The threshold values 0.825 and 0.15 thus provide the necessary and sufficient conditions for estimating the equality of two sentences and are determined by testing the documents in the collection, as shown in Table I and Figure 1. We explain below how to compute the threshold values, which ensure that the number of common words in S_i and S_j is significantly high to at least a certain degree, i.e., the sentences are at least 82.5% alike.

⁶From now on, unless stated otherwise, whenever we use the term word, we mean a non-stop, stemmed word.

[Fig. 1. Determination of the permissible and variation threshold values for similar documents using the fuzzy set IR approach: (a) permissible threshold value; (b) variation threshold value.]

2) Determining the permissible and variation threshold values: A total of 200 documents, randomly sampled from the set of 52,506 documents shown in Table I, were used in part to determine the permissible and variation threshold values. Out of the 200 sampled documents, 12.5%, 17.5%, 32.5%, and 37.5% were chosen from the archive sites TREC (http://trec.nist.gov/data/), Gutenberg (ftp://ftp.archive.org/pub/etext/), news articles on the Web, and Etext.org (ftp://ftp.etext.org/pub/), a text archive site, respectively, and 33% of the 200 documents are HTML documents. Using the 200 documents, we determined the permissible and variation threshold values.
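Equations 1-4 condense into a short sketch. This is our own minimal illustration, not the authors' implementation: `docs` is a three-document toy stand-in for the 52,506-document collection of Table I, and the thresholds are the paper's 0.825 and 0.15.

```python
# Minimal sketch of Equations 1-4 (our illustration, not the authors' code).
# docs: toy stand-in for the collection used to build the term-term matrix.
docs = [{"car", "engin"}, {"automobil", "engin"}, {"car", "automobil", "bus"}]

def corr(i, j):
    """Equation 1: c_{i,j} = n_{i,j} / (n_i + n_j - n_{i,j})."""
    n_i = sum(1 for d in docs if i in d)
    n_j = sum(1 for d in docs if j in d)
    n_ij = sum(1 for d in docs if i in d and j in d)
    denom = n_i + n_j - n_ij
    return n_ij / denom if denom else 0.0

def mu(i, s_j):
    """Equation 2: mu_{i,j} = 1 - prod over k in S_j of (1 - c_{i,k})."""
    prod = 1.0
    for k in s_j:
        prod *= 1.0 - corr(i, k)
    return 1.0 - prod

def sim(s_i, s_j):
    """Equation 3: normalized sum of the mu-values of S_i's words vs. S_j."""
    return sum(mu(w, s_j) for w in s_i) / len(s_i)

def eq(s_i, s_j, permissible=0.825, variation=0.15):
    """Equation 4: both directed similarities high, and close to each other."""
    a, b = sim(s_i, s_j), sim(s_j, s_i)
    return 1 if min(a, b) >= permissible and abs(a - b) <= variation else 0

print(eq(["car", "engin"], ["engin", "car"]))  # 1: same words, different order
print(eq(["car"], ["bus"]))                    # 0: too weakly correlated
```

Note that a word's correlation with itself is 1 under Equation 1, so sentences that share all their stems reach Sim = 1 in both directions regardless of word order, which is exactly the behavior the sentence-based approach needs.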
According to each predefined threshold value V (0.5 ≤ V ≤ 1.0, with an increment of 0.05), we categorized each pair of sentences S_1 and S_2 in the 200 documents as equal or different, depending on V and the minimum of the two degrees of similarity computed on S_1 and S_2: S_1 and S_2 are treated as equal if MIN(Sim(S_1, S_2), Sim(S_2, S_1)) ≥ V; otherwise, they are different. Hereafter, we manually examined each pair of sentences to determine the accuracy of the conclusion (i.e., equal or different) drawn on the sentences, computed the numbers of false positives and false negatives, and plotted the outcomes in a graph. According to the graph shown in Figure 1(a), we set the permissible threshold value to 0.825, which is considered to be the optimal value and the "balance" point of false positives and false negatives, i.e., neither the false positives nor the false negatives dominate the copy detection errors, and the value is thus the compromise between the two. The curves in Figure 1(a) show that as the (permissible) threshold value increases, the percentage of false positives decreases, whereas the percentage of false negatives increases. This is because as the threshold value increases, more documents are filtered and the number of retrieved documents decreases. Thus, more documents that are similar are not detected as the threshold increases; as a result, the false negatives increase, and the opposite holds for false positives. To obtain the variation threshold value, we examined a set PS of pairs of sentences whose minimal degree of similarity exceeds or equals the permissible threshold value. The differences between the degrees of similarity of the sentence pairs in PS are first calculated and then sorted. The percentages of false positives and false negatives are then determined manually and plotted at regular intervals in sorted order.
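The permissible-threshold sweep described above can be sketched as follows. The judged pairs below are hypothetical placeholders for the 200 manually examined documents; only the sweep mechanics (label at each V, count false positives and false negatives against the manual judgment) follow the paper.

```python
# Sketch of the permissible-threshold sweep: for each candidate value V,
# label sentence pairs equal/different and count false positives/negatives
# against a manually judged gold standard.

def sweep(pairs, gold, sim_values, lo=0.50, hi=1.00, step=0.05):
    """pairs: pair ids; gold[p]: True if the pair is truly equal;
    sim_values[p]: MIN of the two directed Sim values for the pair.
    Returns one (V, false_positives, false_negatives) triple per V."""
    results = []
    v = lo
    while v <= hi + 1e-9:
        fp = sum(1 for p in pairs if sim_values[p] >= v and not gold[p])
        fn = sum(1 for p in pairs if sim_values[p] < v and gold[p])
        results.append((round(v, 2), fp, fn))
        v += step
    return results

# Hypothetical judged data; the crossing point of the FP and FN curves is
# the "balance" point the paper picks as the permissible threshold.
pairs = ["p1", "p2", "p3", "p4"]
gold = {"p1": True, "p2": True, "p3": False, "p4": False}
sims = {"p1": 0.95, "p2": 0.80, "p3": 0.70, "p4": 0.60}
for v, fp, fn in sweep(pairs, gold, sims):
    print(v, fp, fn)
```

As V grows, false positives fall and false negatives rise, mirroring the curves of Figure 1(a).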
Since the minimum permissible threshold value is set to 0.825, the difference in similarity between two sentences can only range between 0 and 0.175. The greater the difference in similarity, the greater the chance of error. For example, if the difference is 0.175, one of the similarity measures can be Sim(S_1, S_2) = 1.0 while the other is Sim(S_2, S_1) = 0.825. This happens when the set of words in S_1 is a subset of the words in S_2, e.g., when sentence S_1 is "The boy goes to school every day on the bus," whereas sentence S_2 is "The bus going to school passing by the lake stops for the boy at the corner every day." In this example, S_1 is a part of S_2, but not vice versa, and the two should be treated as different. According to Figure 1(b), the ideal variation threshold value in the range 0 to 0.175 is 0.15, which is the intersection of the false positive and false negative curves and is dominated by neither false positives nor false negatives.

IV. DETECTING SIMILAR HTML DOCUMENTS

In this section, we present our approach for determining to what extent the contents of different HTML documents are in common, based on the copy detection approach presented in Section III. We first discuss the odds ratio, which is used to precisely capture the degree of overlap between two HTML documents according to the degrees of resemblance (to be defined below) between the documents. Hereafter, we introduce our tree matching approach, which determines and depicts the overlapping portions of two HTML documents graphically, using their corresponding semantic hierarchies and a dotplot view.

A. Odds ratios

We can compute the number of similar sentences appearing in any two documents using the formula EQ in Equation 4.
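The document-level measures developed in the remainder of this subsection (Equations 5-7 below) reduce to a few lines. The sketch below is our own minimal preview, not the authors' implementation; `eq_test` stands in for the sentence-equality function EQ of Equation 4, and the cap of 100 mirrors the paper's convention for representing an infinite odds ratio.

```python
# Sketch of Equations 5-7: degree of resemblance via pairwise EQ counts,
# then the Dempster-style product p and the odds ratio p / (1 - p).

def resemblance(doc1, doc2, eq_test):
    """Equation 5: double sum of EQ over sentence pairs, divided by |doc1|."""
    return sum(eq_test(s1, s2) for s1 in doc1 for s2 in doc2) / len(doc1)

def odds_ratio(rs12, rs21, cap=100.0):
    """Equations 6-7: p / (1 - p) with p = RS(d1,d2) * RS(d2,d1);
    capped at 100 to stand in for infinity, as in the paper."""
    p = rs12 * rs21
    return cap if p >= 1.0 else p / (1.0 - p)

# The paper's example: resemblance degrees (0.9, 0.9) give odds ratio 4.26,
# while (0.1, 0.9) gives roughly 0.099.
print(round(odds_ratio(0.9, 0.9), 2))  # 4.26
print(round(odds_ratio(0.1, 0.9), 3))  # 0.099
```

Because the product of the two directed resemblance degrees plays the role of p, the odds ratio grows without bound as both degrees approach 1, which is why a cap is needed for identical documents.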
Using EQ, we define the degree of resemblance (RS) of document doc_1 with respect to doc_2, which counts the number of similar sentences in doc_1 that appear in doc_2, as follows:

RS(doc_1, doc_2) = (∑_{i=1}^{m} ∑_{j=1}^{n} EQ(S_{d1,i}, S_{d2,j})) / m,     (5)

where S_{d1,i} (1 ≤ i ≤ m) is a sentence in doc_1, S_{d2,j} (1 ≤ j ≤ n) is a sentence in doc_2, and m (n, respectively) is the total number of sentences in doc_1 (doc_2, respectively). RS(doc_2, doc_1) is defined accordingly.

[Fig. 2. Odds ratios among documents]

TABLE II
ODDS RATIOS AMONG SIX DOCUMENTS IN A COLLECTION

     | Doc0  | Doc1  | Doc3  | Doc4  | Doc5  | Doc7
Doc0 | 100   | 9.21  | 0.064 | 0.058 | 0.01  | 0
Doc1 | 9.21  | 100   | 0.09  | 0.051 | 0.001 | 0.021
Doc3 | 0.064 | 0.09  | 100   | 8.242 | 0.1   | 0.015
Doc4 | 0.058 | 0.051 | 8.242 | 100   | 0     | 0.009
Doc5 | 0.01  | 0.001 | 0.1   | 0     | 100   | 12.4
Doc7 | 0     | 0.021 | 0.015 | 0.009 | 12.4  | 100

After obtaining the degrees of resemblance between two documents doc_1 and doc_2, we represent these two degrees with a single value, which depicts the relative degree of resemblance between doc_1 and doc_2. These single values for different pairs of documents, e.g., between doc_1 and doc_2, and between doc_1 and doc_3, can be used to compare the relative degrees of similarity among different documents. We obtain this single value using the odds ratio, which comes with the property that the higher the odds ratio between two documents is, the more the two documents resemble each other. The odds ratio, which is also simply called odds in the literature, is defined as the ratio of the probability (p) that an event occurs to the probability (1 − p) that it does not, i.e.,

Odds ratio = p / (1 − p).     (6)

The reasons for adopting the odds ratio are threefold. First, it is easy to compute. Second, it is a natural, intuitively acceptable way to express the magnitude of an association. Third, it can be linked to other statistical methods, such as Bayesian statistical modeling, the Dempster-Shafer theory of evidence, and probability analysis. In Equation 6, p and (1 − p) are the odds. The ratio gives the positive versus negative value, given that p is positive and (1 − p) is negative. As p approaches one, the odds ratio approaches infinity.

To obtain p as defined in the odds ratio, we apply the Dempster-Shafer theory of evidence, which combines two or more evidences or propositions to obtain a single proposition; in our case, the evidences or propositions are the degrees of resemblance between two documents. Dempster's combination rule computes a measure of agreement between two bodies of evidence concerning various propositions discerned from a common frame of discernment [17]. According to this rule, which is based on independent items of evidence, if the probability for evidence E_1 to be reliable is P_1 and the probability for evidence E_2 to be reliable is P_2, then the probability that both E_1 and E_2 are reliable is P_1 × P_2. Applying the Dempster-Shafer rule to the degrees of resemblance RS(doc_1, doc_2) and RS(doc_2, doc_1) between two documents doc_1 and doc_2, we obtain RS(doc_1, doc_2) × RS(doc_2, doc_1), which plays the role of p in the odds ratio. Hereafter, we compute the odds ratio of doc_1 and doc_2 as follows:

Odds ratio = (RS(doc_1, doc_2) × RS(doc_2, doc_1)) / (1 − (RS(doc_1, doc_2) × RS(doc_2, doc_1))).     (7)

The odds ratio, along with the Dempster-Shafer rule of combining evidence, serves the purpose of deriving a single measure between two documents that reflects how similar the documents are. The higher the odds ratio of the two documents is, the smaller the difference between the documents is, and the greater their similarity is. For example, if the degrees of resemblance of two documents are 0.9 and 0.9, then the odds ratio is 4.26, whereas if the degrees of resemblance of the two documents are 0.1 and 0.9 instead, then the odds ratio is 0.098, which is further distant
Thus, the odds ratio accurately reflects the "closeness" and "difference" of the two corresponding documents. If the product of similarities is one, we assign a value 100 as odds ratio to represent infinity. Example 1: We compute odds ratios on a collection of ten Web documents, which show the overall similarity between any pair of documents, to illustrate the accuracy of the odd ratios. The ten documents were retrieved from the Internet by the Google search engine, using keywords such as NAT, IP Address, ONS, etc. The computed odds ratios among all these documents are shown in Figure 2, where we intentionally set the odd ratios of identical pairs of documents to zero (which are supposed to be 100 instead) so that the graph is more visible. We choose six out of the ten documents, i.e., Doco, Doc,, Doc3, Doc4, Doc5, and DoC7, to demonstrate their relative odd ratios as shown in Table II, which indicates that Doco and Doc1 (DoC3 and Doc4, and Doc5 and DoC7, respectively) are more similar to each other than to the other documents. Consider the two documents Doco and Doc1 in the collection and as shown in Table II. Doco (http://www.cisco. com/warp/public/473/lan-switch-cisco.pdf), entitled "How LAN Switches work:' discusses networks, switches, switching topologies, bridging, spanning trees and VLANs, whereas Doc, (http://www.howstuffworks.com/lanswitch.htm/printable), entitled "How Switches work," which introduces switches from a beginner's point of view, explains the functioning of switches, networking basics, traffic, fully switched networks, switch configurations, and other topics related to switches. The degrees of similarity between Doco and Doc1 are Sim(Doco, Doc,) = 0.91 and Sim(Docl, 697 should have already been converted into binary trees, since a semantic hierarchy is not necessary a binary tree. 
Hereafter, the Tree-Matching algorithm will be called using the binary trees S and T, which is followed by calling algorithm GraphicaliDisplay to display the result, i.e., matched nodes in the semantic hierarchies and matched sentences in a dotplot view. In the TreeMatching algorithm, pointers to the roots of the source tree S and target tree T are passed as input arguments. The algorithm traverses trees S and T in inorder, and calls procedure Node-Comparison to compare sentences in any node n, of S with sentences in any node n2 of T. If n, and n2 contain similar sentences, the total number of similar sentences found in n, (n2, respectively), as well as the NodeJD of ni (Node-D of n2, respectively), are recorded in the node data structure of n2 (nl, respectively). Algorithm Update-Sentence-Pos(T, *Pos) Input: T, a semantic hierarchy in its binary tree format, and a sentence position index *Pos. Output: Updated SPos list in the node data structure of each node in T. BEGIN 1. IF T has a LEFT SUBTREE, call Update-SentenceIPos(T->LEFT, Pos). 2. FOR each sentence (in the given order) in T, DO (a) T.SPos[i++] = *Pos /* i is set to 0 before the Loop */ (b) *Pos = *Pos + 1 3. IF T has a RIGHT SUBTREE, call Update-SentencePos(T--RIGHT, Pos). END Doco) = 0.98, and the odds ratio is 8.242, which is treated as moderately high. B. Tree Matching Algorithm We present a simple tree matching algorithm to match nodes in one semantic hierarchy with nodes in another semantic hierarchy such that the matched nodes contain at least one similar sentence. The matching captures common data content of the corresponding HTML documents, running in n log n time complexity, where log is the logarithmic base 2 function and n is the number of nodes in a semantic hierarchy. The proposed tree matching algorithm is (i) efficient in terms of time complexity, (ii) easy to implement, and (iii) intuitive and natural in expressing matching nodes of two different semantic hierarchies. 
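The position-assignment and matching traversals can be sketched in executable form. This is our own illustration, not the authors' code: the Node class mirrors the <D, SPos, MNInfo, LEFT, RIGHT> node data structure described below, with MNInfo simplified to a Python list instead of a linked list, and `eq_test` standing in for EQ of Equation 4. Note that, as written, the sketch compares every node pair and so runs in quadratic time in the number of nodes.

```python
# Minimal sketch (ours) of inorder position assignment and tree matching.

class Node:
    def __init__(self, sentences, left=None, right=None):
        self.data = sentences          # D: sentences stored in this node
        self.spos = []                 # SPos: global sentence positions
        self.mninfo = []               # MNInfo: (matched node, sentence pos)
        self.left, self.right = left, right

def update_sentence_pos(t, pos):
    """Inorder pass assigning consecutive positions; pos is a one-element
    list acting as the *Pos index of algorithm Update-Sentence-Pos."""
    if t is None:
        return
    update_sentence_pos(t.left, pos)
    for _ in t.data:
        t.spos.append(pos[0])
        pos[0] += 1
    update_sentence_pos(t.right, pos)

def node_comparison(s, n, eq_test):
    """Inorder walk of source tree s, recording sentence matches with node n."""
    if s is None:
        return
    node_comparison(s.left, n, eq_test)
    for i, mi in enumerate(s.data):
        for j, nj in enumerate(n.data):
            if eq_test(mi, nj) == 1:
                n.mninfo.append((s, s.spos[i]))
                s.mninfo.append((n, n.spos[j]))
    node_comparison(s.right, n, eq_test)

def tree_matching(s, t, eq_test):
    """Inorder walk of target tree t, comparing each node against all of s."""
    if t is None:
        return
    tree_matching(s, t.left, eq_test)
    node_comparison(s, t, eq_test)
    tree_matching(s, t.right, eq_test)

# Example: one shared sentence between a two-node source and one-node target.
s = Node(["the boy rides the bus"], left=Node(["a lake"]))
t = Node(["the boy rides the bus"])
update_sentence_pos(s, [0])
update_sentence_pos(t, [0])
tree_matching(s, t, lambda a, b: 1 if a == b else 0)
```

After the run, the target node records a match against the source root at inorder sentence position 1 (the source's left-subtree sentence occupies position 0), which is exactly the bookkeeping the dotplot display consumes.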
Intuitively, given two semantic hierarchies, which are called the source tree and the target tree, we match nodes in the two trees by traversing the trees inorder. Assume that the two semantic hierarchies to be matched are S and T, where S is the source tree and T is the target tree. Nodes in T are considered one by one according to the order of the inorder traversal. For each node n in T, S is parsed inorder to match n, i.e., determining the (total number of) sentences in each node in S that are similar to the sentences in n. When a match of n is found in S, the node data structure of n is updated with the necessary information of the matching node m in S, which includes the Node-ID of m and the sentence positions of the similar sentences in m. The node structure information of m is also updated accordingly. This process continues until the entire tree T is parsed.

Algorithm Tree-Matching(S, T)
Input: Two semantic hierarchies, source tree S and target tree T, which are binary trees.
Output: Matched nodes and sentences in S and T.
BEGIN
1. IF T has a LEFT SUBTREE, call Tree-Matching(S, T->LEFT).
2. Call Node-Comparison(S, T).
3. IF T has a RIGHT SUBTREE, call Tree-Matching(S, T->RIGHT).
END

We adopt the following node data structure <D, SPos, MNInfo, LEFT, RIGHT> for each node N in S (T, respectively) in our tree matching algorithm:
• Data (D): Sentences in N.
• Sentence positions (SPos): The positions of sentences S_i, ..., S_j in N are maintained in the structure (L_{S_i}, ..., L_{S_j}), and are obtained according to (i) the relative order of N in the inorder traversal of the tree and (ii) the relative order of the sentences in N.
• MatchedNodeInfo (MNInfo): Information on each node M in the other tree that matches N, stored in a linked list L; each component of L is <NodeID, SP, NextLink>, where NodeID is the physical address of M, SP is the sentence position of a similar sentence in M, and NextLink is the pointer to the next component in L.
Procedure Node-Comparison(S, N)
Input: A semantic hierarchy S and a node N.
Output: Similar sentences detected between the nodes of S and node N.
BEGIN
1. IF S has a LEFT SUBTREE, call Node-Comparison(S->LEFT, N).
2. Determine the similar sentences in S and N using the fuzzy set IR approach.
   (a) Let M be S. /* the node currently pointed to by S */
   (b) Let the sentences in M be m1, ..., mk, and let the sentences in N be n1, ..., np.
   (c) FOR i = 1..k, DO
         FOR j = 1..p, DO
           IF EQ(mi, nj) = 1, THEN
             i. N.MNInfo->NodeID = M and N.MNInfo->SP = M.SPos[i].
             ii. M.MNInfo->NodeID = N and M.MNInfo->SP = N.SPos[j].
3. IF S has a RIGHT SUBTREE, call Node-Comparison(S->RIGHT, N).
4. RETURN
END

Algorithm Graphical-Display(T, S)
Input: Semantic hierarchies T and S.
Output: Graphical display of the matched sentences of S and T in a dotplot view and of the matched nodes in T.
BEGIN
1. IF T has a LEFT SUBTREE, call Graphical-Display(T->LEFT).
2. Let N be T.
   (a) MNInfo := N->MNInfo.
   (b) WHILE MNInfo->NextLink ≠ NULL /* perform the graphical display */
       i. M := MNInfo->NodeID
          (I) Cnt := Cnt + 1. /* Cnt is a similar-sentences counter, initialized to zero */
          (II) Create an edge from M to N, if there is no edge connecting M and N.
          (III) Plot the pair (M->MNInfo.SP, N->MNInfo.SP) on a dotplot view.
       ii. MNInfo := MNInfo->NextLink
       END WHILE
   (c) Shade the region of M with the percentage computed by Cnt / |M.SPos|.
3. IF T has a RIGHT SUBTREE, call Graphical-Display(T->RIGHT).
4. RETURN
END
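The two quantities that Graphical-Display derives before drawing can be checked with a short sketch (helper names are illustrative, not the paper's code): the shading percentage of a node is Cnt / |SPos|, and each matched sentence-position pair from MNInfo contributes one dot to the dotplot view.

```python
def shading_percentage(cnt: int, num_sentences: int) -> int:
    """Cnt matched sentences out of |SPos| total, as a whole-number percentage."""
    return round(100 * cnt / num_sentences)

def dotplot_points(matches):
    """Each (SP in S, SP in T) pair becomes one dot in the dotplot view."""
    return sorted(matches)

# The two shaded nodes discussed in Example 2:
print(shading_percentage(1, 3))           # → 33  (one of three sentences matched)
print(shading_percentage(3, 4))           # → 75  (three of four sentences matched)
print(dotplot_points([(5, 2), (1, 0)]))   # → [(1, 0), (5, 2)]
```

Sorting the pairs only affects plotting order, not the resulting picture; similar passages appear as diagonal runs of dots, as in Helfman's dotplot technique [2].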
Example 2: Consider two HTML documents, Sistani1 (S) and Sistani2 (T), which are first converted into binary trees before Tree-Matching is applied to determine the overlap between the documents. In Figure 3(a), which shows portions of the two trees, 33% of the area of Node 45 of T is shaded, since one of the three sentences in Node 45 is similar to the corresponding sentence in Node 46 of S, whereas Node 49 of T has a 75% shaded area, since three of the four sentences in that node match sentences in various nodes of S. The dotplot view in Figure 3(b) depicts the relative positions of the thirteen similar sentences in the documents.

Fig. 3. Matching nodes and sentences in documents S and T: (a) node matching of S and T; (b) sentence matching in S and T.

V. CONCLUSIONS

We have presented a new approach that detects similar Web documents, especially HTML documents. We first apply a fuzzy set IR copy detection approach to detect similar sentences in two documents and then compute the odds ratio of the documents to reflect their degree of common content. Our copy detection approach is unique, since it (i) performs copy detection among HTML documents, which can act as a filter for various Web search engines and Web crawlers, improving efficiency by indexing fewer documents and ignoring similar ones, (ii) detects similar sentences, apart from identical sentences, by applying the fuzzy set IR approach to the stemmed, non-stopword terms in HTML documents, and (iii) captures the similar sentences in any two HTML documents graphically, displaying the locations of the overlapping portions of the documents. Not only does our copy detection approach handle a wide range of documents of varying sizes, but, since it requires no static word lists, it is also applicable to (HTML) Web documents in different subject areas, such as sports, news, and science. Experimental results on our copy detection and similar-document detection approaches look promising.

REFERENCES

[1] Y. Ogawa, T. Morita, and K. Kobayashi, "A fuzzy document retrieval system using the keyword connection matrix and a learning method," Fuzzy Sets and Systems, vol. 39, pp. 163-179, 1991.
[2] J. Helfman, "Dotplot patterns: A literal look at pattern languages," Theory and Practice of Object Systems, vol. 2, no. 1, pp. 31-41, 1996.
[3] N. Shivakumar and H. Garcia-Molina, "SCAM: A copy detection mechanism for digital documents," D-Lib Magazine, 1995, http://www.dlib.org.
[4] U. Manber, "Finding similar files in a large file system," in USENIX Winter Technical Conference, Jan. 1994.
[5] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," in Proceedings of the 1995 ACM SIGMOD, 1995, pp. 398-409.
[6] N. Heintze, "Scalable document fingerprinting," in Proceedings of the 2nd USENIX Workshop on Electronic Commerce, 1996, pp. 191-200.
[7] G. Di Lucca, M. Di Penta, and A. Fasolino, "An approach to identify duplicated web pages," in Proceedings of COMPSAC, 2002, pp. 481-486.
[8] D. Campbell, W. Chen, and R. Smith, "Copy detection systems for digital documents," in Proceedings of IEEE Advances in Digital Libraries, 2000, pp. 78-88.
[9] A. Pereira and N. Ziviani, "Retrieving similar documents from the web," Journal of Web Engineering, vol. 2, no. 4, pp. 247-261, 2004.
[10] C. Yu, K. Liu, W. Wu, W. Meng, and N. Rishe, "Finding the most similar documents across multiple text databases," in Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, 1999, pp. 150-162.
[11] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[12] J. Cooper, A. Coden, and E. Brown, "Detecting similar documents using salient terms," in Proceedings of CIKM'02, 2002, pp. 245-251.
[13] J. Rabelo, E. Silva, F. Fernandes, S. Meira, and F. Barros, "ActiveSearch: An agent for suggesting similar documents based on user's preferences," in Proceedings of the Intl. Conf. on Systems, Man, and Cybernetics, 2001, pp. 549-554.
[14] S. Lim and Y.-K. Ng, "Constructing hierarchical information structures of sub-page level HTML documents," in Proceedings of the 5th Intl. Conf. on Foundations of Data Organization (FODO'98), 1998, pp. 66-75.
[15] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[16] D. Campbell, "A sentence boundary recognizer for English sentences," 1997, unpublished work.
[17] I. Ruthven and M. Lalmas, "Experimenting on Dempster-Shafer's theory of evidence in information retrieval," Journal of Intelligent Information Systems, vol. 19, no. 3, pp. 267-302, 2002.