Journal of Chinese Language and Computing 16 (3): 157-168

An unsupervised framework for robust web-based Information Extraction

Qiaoming Zhu, Zhengxian Gong, Peifeng Li, Guodong Zhou
School of Computer Science & Technology, Soochow University, 215006, Suzhou, China
[email protected], [email protected], [email protected], [email protected]

Abstract. Web-based information extraction is challenging and has become a hot research topic during the last decade. This paper proposes an unsupervised framework for robust web-based information extraction to mine useful information from given and related webpages. For the given webpages, a webpage mining algorithm is first presented to retrieve similar webpages from the Web. Then, a noise filtering algorithm is presented to filter out noisy information in each webpage. Finally, a novel wrapper induction algorithm is proposed to extract rules from the webpages by first identifying data-intensive segments and then inducing a wrapper from each of them through identifying iterative and optional items. This is based on the observation that iterative items and possible optional items inside them can normally be detected from a webpage or segment itself and do not depend on tag mismatches between two webpages or segments. Evaluation shows that our system works well on data-intensive websites with the popular paging technique. It also shows that our system is more robust and generalized than other popular web-based IE systems on data-intensive websites.

Keywords: Web pages, RoadRunner, similar pages, clean web noise, fit fan-out node, extraction rule

1. Introduction

With the dramatic increase in the amount of textual information available on the Web, there has been growing interest in web-based information extraction (IE). In some sense, the Web can be regarded as the largest "knowledge base" ever developed and made available to the public. During the last decade, web-based IE has become a hot research topic. However, web-based IE is challenging due to the complexity and diversity of the huge Web.

Web-based IE is usually performed by wrappers (Kushmerick 1997; Crescenzi et al 1998; Muslea et al 2001; Crescenzi et al 2001). When decoding, wrappers are applied to parse the given webpages and extract useful information, e.g. the list of job records (with job title, company name, location, salary and date) shown in Figure 1. In the literature, various categories of approaches have been proposed for generating wrappers: manual writing, supervised wrapper induction learning and unsupervised wrapper induction learning.

Early approaches are based on human expertise (Crescenzi et al 1998; Sahuguet et al 1999). A key problem with manually coded wrappers is that writing them is usually a difficult and labor-intensive task, and that they tend to be brittle and difficult to maintain.

[Figure 1: An example: a list of job records in a hyperlink group]

Supervised wrapper induction learning approaches depend on the availability of annotated training examples. They have the potential to automate the development process and achieve high performance in a supervised way. Inducing wrappers from a set of webpages corresponds to looking at the annotated examples and trying to generalize them over the page code.
The Rapier system described in Califf et al (1997) used relational learning to construct unbounded pattern-match rules for information extraction. The Stalker system described in Muslea et al (2001) proposed a hierarchical wrapper induction approach by decomposing the hard problem of extracting data from a complex webpage into a series of simpler extraction tasks. Both systems generate high-accuracy extraction rules based on annotated training examples. However, annotating training examples is itself highly labor-intensive and has become the major bottleneck in supervised learning approaches.

The current trend is to apply unsupervised wrapper induction learning approaches, which automate the wrapper induction process to a large extent. That is, they do not rely on annotated training examples and do not require any human interaction during the wrapper induction process. As a representative, RoadRunner, described in Crescenzi et al (2001), relied on a set of similar webpages to extract rules automatically. In order to differentiate meaningful information from meaningless information, it worked with two similar webpages at a time and tried to generalize wrappers by finding the commonalities and the differences between them.

However, there exist some disadvantages in current unsupervised wrapper induction learning approaches, such as the one in RoadRunner. Firstly, their success normally depends on the availability of structure-similar webpages, and they only work well on webpages with simple structures. Secondly, they often lack effective and efficient noise filtering mechanisms. Finally, they lack robustness, and the generated wrappers are vulnerable to small changes in the webpage structure. As a representative, RoadRunner relied on both tag and string mismatches to find iterative and optional items by backtracking. Although these two kinds of mismatches occur widely, the occurrence of tag mismatches in popular data-intensive websites is quite limited. This largely reduces the generalization power of its wrapper induction, which makes RoadRunner lack robustness and leaves it vulnerable to small changes in the webpage structure.

In order to solve the above problems, this paper proposes an unsupervised framework for robust web-based information extraction. For the given webpages, a webpage mining algorithm is first presented to automatically retrieve similar webpages from the Web. Then, a noise filtering algorithm is presented to filter out noisy information in each webpage. Finally, a novel wrapper induction algorithm is proposed to extract rules from the webpages by first identifying data-intensive segments and then inducing a wrapper from each of them through identifying iterative and optional items.

The rest of this paper is organized as follows. Section 2 presents the similar webpage mining algorithm, while the noise filtering algorithm is described in Section 3. Section 4 proposes the novel wrapper induction algorithm. Finally, we evaluate our unsupervised framework on data-intensive websites in Section 5 and conclude this paper in Section 6.

2. Similar Webpage Mining

To automatically induce well-behaved wrappers in an unsupervised way, it is critical to have a set of "good" similar webpages.
Here, we mean "good" in two aspects: 1) the webpages should be representative of the task, so that the induced wrappers achieve acceptable recall in decoding; 2) the webpages should be similar enough in both content and structure. Normally, a balance between the two should be achieved. However, state-of-the-art systems, such as RoadRunner in Crescenzi et al (2001), avoid this critical issue by assuming that such "good" similar webpages are already provided. This makes such systems difficult to deploy to end users, since it is very difficult and impractical for them to provide such "good" webpages, especially representative ones, which means a reasonable number of them with a fair distribution. It is more practical for users to provide one or a very few representative webpages for a task and let the computer do the rest. Moreover, most state-of-the-art systems only work well on a set of very similar webpages. This makes the induced wrappers fail to achieve good coverage. For example, RoadRunner assumed that the given webpages share the same webpage structure. Therefore, automatically obtaining enough "good" similar webpages is critical to the success of an automatic wrapper induction system.

Given the above considerations, the first step of our system is to retrieve enough "good" similar webpages for the given webpages. The basic idea is to first retrieve a large number of webpages from the Web, e.g. via a public search engine (such as Google), using terms occurring in the given webpages to form a query, and then to filter out "bad" webpages. For filtering, we can first cluster the retrieved webpages according to their content and structure similarity, then eliminate small clusters, and finally pick a few representatives from each of the remaining major clusters. In this way, a set of "good" similar webpages with content and structure similarity can be obtained. Generally, the similarity between webpages can be measured through their structures and contents. Reis et al (2004) adopted tree edit distance to cluster pages, based on the observation that each well-formed webpage forms a tree, so the similarity between two webpages can be calculated between their counterpart trees. A sketch of this clustering-based filtering is given at the end of this section.

Especially for data-intensive websites, "good" similar webpages can also be effectively and efficiently retrieved from:
- related webpages in the paging list of a given webpage, when the paging technique is applied (normally, such pages are so similar to the given webpage that high quality can be assumed);
- related hyperlinks in the same directory as a given webpage.
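To make the filtering step concrete, the sketch below is our own illustration, not the authors' code. It assumes each retrieved page has already been reduced to its sequence of HTML tag names, and it uses a simple tag-set overlap as a cheap stand-in for the tree edit distance of Reis et al (2004); the minimum cluster size of 3 is likewise an illustrative assumption.

    import java.util.*;

    // A minimal sketch of the "good" similar-page filtering of Section 2:
    // cluster the retrieved pages by a structural similarity score, drop the
    // small clusters, and keep the major clusters' members as "good" pages.
    public class SimilarPageFilter {

        // Jaccard overlap of the pages' tag sets -- an assumed, cheap proxy for
        // the content/structure similarity (e.g. tree edit distance) in the text.
        static double similarity(List<String> tagsA, List<String> tagsB) {
            Set<String> inter = new HashSet<>(tagsA);
            inter.retainAll(new HashSet<>(tagsB));
            Set<String> union = new HashSet<>(tagsA);
            union.addAll(tagsB);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        // Greedy one-pass clustering: a page joins the first cluster whose seed
        // page is similar enough, otherwise it starts a new cluster.
        static List<List<Integer>> cluster(List<List<String>> pages, double threshold) {
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < pages.size(); i++) {
                boolean placed = false;
                for (List<Integer> c : clusters) {
                    if (similarity(pages.get(c.get(0)), pages.get(i)) >= threshold) {
                        c.add(i);
                        placed = true;
                        break;
                    }
                }
                if (!placed) {
                    List<Integer> fresh = new ArrayList<>();
                    fresh.add(i);
                    clusters.add(fresh);
                }
            }
            clusters.removeIf(c -> c.size() < 3); // eliminate small ("bad") clusters
            return clusters;
        }
    }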
3. Noise Filtering

Normally, webpages are full of "noisy" (data-unrelated) information that deters a web-based information extraction system from automatically inducing wrappers. Therefore, noise filtering should be done before wrapper induction; otherwise, the robustness of a web-based information extraction system is largely reduced. For example, RoadRunner extracted rules via strict tag and string mismatches in the given webpages, which makes it error-prone in the presence of noisy information and strongly affects its robustness. Moreover, the size of webpages can be largely reduced via proper noise filtering. Basically, there are two categories of noisy information:
- Layout noisy information. For a webpage in HTML, there exist many tags, such as SCRIPT and STYLE, purely for maintaining the webpage style. Generally, such tags account for much of the webpage size and have no relationship with the content. Moreover, such noisy information can be easily detected and filtered out.
- Assistant noisy information for browsing, such as navigation bars and friendly hyperlinks. Normally, this category of noisy information is much more complicated and sometimes difficult to detect.

In the literature, different ways have been applied to deal with noisy information. For example, Yi et al (2003) discovered that noisy information changes little between similar pages and tried to filter out noise by constructing a style-tree based on information entropy. However, their algorithm for constructing the style-tree is too complex to be useful in practice. Instead, Wang (2004) ignored possible changes in noisy information and used a simple way to eliminate noise by deleting identical nodes in similar pages, at the expense of precision.

To filter out noisy information in webpages effectively and efficiently, this paper proposes a two-step strategy: first eliminating the easy layout noisy information, and then the remaining, more complicated assistant noisy information for browsing. In this paper, the layout noisy information is easily filtered out by maintaining a list of possible layout noisy tags, such as the one in Table 1. Due to the limited tag set of a web markup language such as HTML, this list can be easily maintained to the users' needs.

Tag      | Description
---------|--------------------------------------------------------------
SCRIPT   | Page script, such as JavaScript, VBScript
STYLE    | Page style
INPUT    | Webforms including radiobox, checkbox, textbox, submit button
SELECT   | Webform: dropdown list
TEXTAREA | Webform: multi-line textbox

Table 1: A list of layout noisy tags
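As an illustration of this first step, the following is a minimal sketch, ours rather than the paper's, that prunes the Table 1 tags from a parsed page; it assumes the JDOM2 Element API, which the pseudocode in Figure 3 below resembles, and the method name is our own.

    import java.util.Iterator;
    import java.util.Set;
    import org.jdom2.Element;

    // A sketch of layout-noise filtering: detach every element whose tag name
    // appears in the Table 1 list, then recurse into the surviving children.
    public class LayoutNoiseFilter {

        static final Set<String> LAYOUT_NOISE =
                Set.of("SCRIPT", "STYLE", "INPUT", "SELECT", "TEXTAREA");

        static void removeLayoutNoise(Element node) {
            // getChildren() returns a live list in JDOM2, so removing through
            // its iterator detaches the element from the document.
            for (Iterator<Element> it = node.getChildren().iterator(); it.hasNext(); ) {
                if (LAYOUT_NOISE.contains(it.next().getName().toUpperCase())) {
                    it.remove();
                }
            }
            for (Element child : node.getChildren()) {
                removeLayoutNoise(child);
            }
        }
    }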
The major remaining problem in noise filtering lies in the assistant noisy information for browsing, such as navigation bars and friendly hyperlinks. Normally, such noisy information occurs together and forms a hyperlink group with the same structure. Therefore, filtering out such noisy information amounts to first identifying possible hyperlink groups and then determining their nature. Due to the complexity and diversity of webpages, identifying possible hyperlink groups is not as easy as it seems. Figure 2 shows three possible types of hyperlink groups:
- a flat hyperlink group, such as type A1, with several <a> tags inside a <div> tag;
- a table hyperlink group, such as type A2, with each hyperlink embedded in another tag, forming a sequential hyperlink group;
- a recursive hyperlink group, such as type A3, which consists of several similar units, each of which in turn consists of several tags recursively.

[Figure 2: Possible types of hyperlink groups, shown as a DOM tree with each node annotated by its offspring count — A1: a flat group, a div (3) with three <a> children; A2: a table group, a table (6) whose tr (1) rows each contain an <a>; A3: a recursive group, a div (12) with three sdiv (3) units, each containing <a>, <I> and <B> nodes]

Among them, flat and table hyperlink groups do not contain data-related information and can be filtered out directly, while recursive hyperlink groups are much more complex. For example, in Figure 1, each item in the first column (Job Title) is a hyperlink with more details about the corresponding job, and together they compose a recursive hyperlink group. In this case, the data in the first column should be extracted instead.

The algorithm HyperlinkGroup(...) in Figure 3 identifies possible hyperlink groups and determines whether they are noisy or not. In this paper, this is done by traversing the webpage nodes with the depth-first search algorithm to find a hyperlink node (with label <A>). If such a node is found, the algorithm finds its nearest ancestor with no fewer than 2 children (since a hyperlink group should have at least 2 members) and calls the function IsNoisyNode(...) to determine whether all of that ancestor's children are hyperlinks or not. If yes, the ancestor is labeled as a noisy node and deleted. Through noise filtering, the size of a webpage can be largely reduced. This largely decreases the burden on the wrapper induction algorithm as well as increasing its robustness.

    Algorithm: HyperlinkGroup(Element node)
    BEGIN
      List children = node.getChildren();           // get all of node's children
      if (children.size() == 0) {                   // node is a leaf
        if (node.getName().equals("A")) {           // the leaf is a hyperlink
          Element pt = find_nearest_parent(node);   // nearest ancestor with >= 2 children
          IsNoisyNode(pt);
        }
      }
      else {                                        // non-leaf node: recurse
        for (int i = 0; i < children.size(); i++) {
          Element tempnode = (Element) children.get(i);
          if (visited(tempnode)) continue;          // skip nodes already examined
          if (tempnode.getName().equals("A")) {
            Element pt = find_nearest_parent(tempnode);
            IsNoisyNode(pt);
          }
          else HyperlinkGroup(tempnode);
        } // end for
        if (all_children_isgarbage(node)) set_noisy_tag(node);
      }
    END // end Algorithm: HyperlinkGroup

    Algorithm: boolean IsNoisyNode(Element node)
    BEGIN
      List hyperlinks = node.getChildren("A");      // node's direct <A> children
      List children = node.getChildren();
      if (hyperlinks.size() == children.size()) {   // all children are hyperlinks:
        node.setAttribute("noisy", "true");         // a flat hyperlink group
        return true;
      }
      int i;
      for (i = 0; i < children.size(); i++) {       // check each child in turn
        Element t = (Element) children.get(i);
        if (t.getAttribute("noisy") != null) continue;  // already marked noisy
        if (t.getName().equals("A")) continue;          // a hyperlink itself
        if (IsNoisyNode(t)) continue;                   // a noisy group itself
        else break;                                     // a data-related child
      } // end for
      if (i >= children.size()) {                   // every child passed:
        node.setAttribute("noisy", "true");         // a table hyperlink group
        return true;
      }
      return false;                                 // a recursive hyperlink group
    END // end Algorithm: IsNoisyNode

Figure 3: The algorithm for identifying hyperlink groups

4. Wrapper Induction

In the literature, there are several unsupervised wrapper induction algorithms. As the most representative one, the ACME kernel algorithm proposed in RoadRunner (Crescenzi et al 2001) depends on tag and string mismatches via pair-wise webpage comparison. If a tag mismatch occurs, it tries to find iterative items by backtracking and, when this fails, processes the mismatch as an optional item. If a string mismatch occurs, it treats the mismatched string as data and generalizes it using a common macro #pcdata. Although these two kinds of mismatches exist widely, tag mismatches may not occur in webpages adopting the paging technique, which dominates data-intensive websites nowadays. Take the webpage segment in Figure 1 as an example. When a user clicks the "next page" hyperlink, the server picks up the next 20 records from the database and fills a new webpage. In this case, there normally are no tag mismatches among different webpages or webpage segments; that is, those webpages or segments may only have string mismatches. This makes the ACME algorithm fail to generalize over iterative items, due to the lack of tag mismatches. Since then, some researchers have extended the ACME algorithm from character streams to tree structures (Wang et al 2004; Li et al 2004). However, none of them has tried to deal with the above problem in the ACME algorithm.
In order to solve the above problem, this paper proposes a novel approach that first identifies "fit fan-out" nodes as possible data-intensive segments in a webpage, and then adopts a tree similarity measure to find iterative and optional items. This is based on the observation that iterative items and possible optional items inside them can normally be detected from a webpage or segment itself and do not depend on tag mismatches between two webpages or segments.

4.1 Extracting Data-intensive Segments by Identifying "Fit Fan-out" Nodes

In this paper, data-intensive segments are found through "fit fan-out" nodes by automatically constraining their structures and offspring numbers as follows:

Step 1: Create a structure tree for each webpage and traverse the tree with the depth-first search algorithm to find the offspring number of each node. For example, in Figure 2, the root node HTML has 24 offspring. This can be done during the noise filtering step, so no additional pass is needed.

Step 2: Find nodes that have a number of children with similar structures. First, the algorithm searches from the root node for nodes whose number of children is not less than a threshold. Then, for each such node X, it calculates the standard deviation (sd) among X's children using Formula (1), where n is the total number of children of node X, x_i is the offspring number of the i-th child of node X, and \bar{x} is the average offspring number of the n children of node X:

    sd = \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) }    (1)

sd measures the variation of the offspring numbers among X's children. The smaller the standard deviation of X, the more similar X's children are, and the more probable it is that X covers a data-intensive segment. If the standard deviation of a node X is less than a threshold, X is selected as a "fit fan-out" node. In this paper, this threshold is set to 3.

Step 3: Filter out those "fit fan-out" nodes that are covered by other "fit fan-out" nodes, and return all the remaining "fit fan-out" nodes. That is, our algorithm can effectively extract multiple data-intensive segments. Here, each sub-tree covered by a "fit fan-out" node represents a data-intensive segment. For example, if all three sub-trees A1, A2 and A3 in Figure 2 survived noise filtering and were selected as "fit fan-out" nodes in Step 2, all three would be returned as data-intensive segments. However, since A1 and A2 are noisy hyperlink groups and are filtered out by the noise filtering algorithm described in Section 3, only A3 in Figure 2 is returned as a data-intensive segment.

In the literature, Zhou et al (2004) needed user interaction to find data-intensive segments, while our algorithm does so in a completely automatic way. Compared with Embley et al (1999), which identified "max fan-out" nodes to locate data-intensive segments, our algorithm identifies "fit fan-out" nodes via the standard deviation measure, which makes it more robust to abnormal phenomena. Moreover, our algorithm also applies to webpages that contain multiple data-intensive segments. In this paper, in order to filter out unlikely noisy segments, we set the threshold for the child number in Step 2 to 10.
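To make Step 2 concrete, here is a minimal sketch, ours and not the authors' code, of the "fit fan-out" test for a single node. The method name and the int[] of per-child offspring counts are illustrative assumptions, while the two thresholds are the values stated above.

    // A sketch of the "fit fan-out" test of Section 4.1, Step 2.
    static final int MIN_CHILDREN = 10;  // child-number threshold from the text
    static final double SD_MAX = 3.0;    // standard-deviation threshold from the text

    static boolean isFitFanOut(int[] childOffspring) {
        int n = childOffspring.length;
        if (n < MIN_CHILDREN) return false;       // too few children
        double mean = 0;
        for (int x : childOffspring) mean += x;
        mean /= n;
        double ss = 0;                            // sum of squared deviations
        for (int x : childOffspring) ss += (x - mean) * (x - mean);
        double sd = Math.sqrt(ss / (n - 1));      // Formula (1)
        return sd < SD_MAX;                       // similar children => fit fan-out
    }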
4.2 Wrapper Induction for Data-Intensive Segments

In this paper, one wrapper is induced for each data-intensive segment by identifying and generalizing iterative and optional items. The key here is how to model the different exposures of a data-intensive segment and induce the structures they share. For example, in Figure 4(a) there are two iterative items (tr, tr), while in Figure 4(b) there seem to be two iterative items (p, x). However, p(2) in Figure 4(b) is not followed by an x node. This phenomenon is very common in webpages.

Assume the data-intensive segment for a "fit fan-out" node C has n children C1, C2, ..., Ci, ..., Cn, with C1 the first child and Cn the last child.

Firstly, for each child node, the algorithm looks for another node Cx to its right whose tree similarity with the given node is larger than a given threshold. If none is found, no iterative item exists. In this paper, the similarity threshold is set to 80%, to balance the number of iterative items against the number of optional items. In theory, any tree kernel can be used to measure the similarity between two trees. In this paper, the tree similarity is based on the Simple_Tree_Matching algorithm described in Yang (1991), which returns the number M of matching nodes between two trees: the bigger M is, the more matching nodes the two trees have. Obviously, the Simple_Tree_Matching algorithm is biased towards bigger trees. To resolve this problem, we normalize M by the offspring number of the smaller tree (a sketch of this normalized measure is given after Figure 4).

Secondly, assuming C1 and Cx are two iterative items, the algorithm tries to find bigger iterative items recursively. For example, in Figure 4(a), assume that tr(1) is similar to tr(2). We still cannot be sure that the iterative item is a single tr, since (tr(1), tr(2)) may be similar to (tr(3), tr(4)); in that case, the iterative item should be two trs, which can be expressed as (tr, tr)+. This can be decided by checking whether the gap between C1 and Cx is bigger than 2. If not, the iterative item is C1; if yes, the algorithm first builds virtual trees V1 from C1 to Cx-1 and V2 from Cx to Cx+(x-1), and then compares the similarity between V1 and V2. If they are similar, the iterative item is (C1, C2, ..., Cx-1); if not, the algorithm deletes the rightmost child from V1 and V2 and compares them again. For example, since p(1) and p(2) in Figure 4(b) are similar, the algorithm compares the virtual tree of (p(1), x(1)) with that of (p(2), p(3)). Since they are not similar, the algorithm narrows down the scope and identifies node p as an iterative item and node x as an optional item. In this way, Figure 4(b) can be represented as (p(x?))+.

Thirdly, the algorithm finds more iterative and optional items among the children of the iterative items. On the one hand, iterative items can occur recursively. On the other hand, as stated above, the tree similarity threshold is set to 80% in this paper, so there may exist some mismatches among the children of the iterative items.

Finally, the algorithm compares the iterative items, locates the positions where the contained data change, and marks those positions with #pcdata. Moreover, related information about the data-intensive segment of a "fit fan-out" node (such as node name, position and offspring number) and its included iterative/optional items is kept as a reference for the induced wrapper.

[Figure 4: Data-intensive segments, two examples — (a) a block node with four similar children tr(1), tr(2), tr(3), tr(4); (b) a block node with children p(1), x(1), p(2), p(3), x(2)]
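To illustrate the similarity test used above, here is a minimal sketch, ours rather than the paper's code, of the Simple_Tree_Matching measure (Yang 1991) over an assumed Node structure, together with the normalization by the smaller tree's size and the 80% threshold of Section 4.2.

    import java.util.ArrayList;
    import java.util.List;

    public class TreeSimilarity {
        // An assumed minimal ordered-tree representation for illustration.
        static class Node {
            String label;
            List<Node> children = new ArrayList<>();
            Node(String label) { this.label = label; }
        }

        // Simple_Tree_Matching (Yang 1991): the number of matching nodes
        // between two ordered labeled trees.
        static int stm(Node a, Node b) {
            if (!a.label.equals(b.label)) return 0;     // roots must match
            int m = a.children.size(), n = b.children.size();
            int[][] M = new int[m + 1][n + 1];          // DP over child sequences
            for (int i = 1; i <= m; i++)
                for (int j = 1; j <= n; j++)
                    M[i][j] = Math.max(Math.max(M[i - 1][j], M[i][j - 1]),
                            M[i - 1][j - 1]
                                    + stm(a.children.get(i - 1), b.children.get(j - 1)));
            return 1 + M[m][n];
        }

        static int size(Node t) {
            int s = 1;
            for (Node c : t.children) s += size(c);
            return s;
        }

        // Normalize by the smaller tree to remove the bias towards bigger trees,
        // then apply the 80% threshold used in Section 4.2.
        static boolean similar(Node a, Node b) {
            return (double) stm(a, b) / Math.min(size(a), size(b)) >= 0.8;
        }
    }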
5. Experimentation

We have evaluated our system on several data-intensive websites, such as sina.com.cn and ebay.com.cn. Table 2 shows some statistics about these websites. Due to the nature of these websites, we simply used the two ad-hoc rules described in Section 2 to retrieve similar webpages. The table shows that our simple retrieval algorithm succeeds in extracting many similar webpages from these data-intensive websites. It also shows that such similar webpages are quite complex (normally nested) and have many iterative items and some optional items.

#r: the number of records per webpage; #sim: the number of similar webpages; #pcd: the number of attributes per record; ?nest: whether the records are nested; #opt: the number of optional items per webpage.

Site         | Description | #r | #sim | #pcd | ?nest | #opt
-------------|-------------|----|------|------|-------|-----
sina.com.cn  | News        | -  | 10   | -    | -     | -
sina.com.cn  | House       | 20 | 147  | 6    | Y     | 0
taobao.com   | Computer    | 40 | 2065 | 6    | Y     | 0
taobao.com   | Bicycle     | 40 | 75   | 6    | Y     | 0
ebay.com.cn  | Child suit  | 50 | 239  | 4    | Y     | 2
ebay.com.cn  | Cell phone  | 45 | 9    | 4    | Y     | 1
dangdang.com | Novel       | 40 | 8    | 4    | Y     | 0

Table 2: Statistics about the test websites

Table 3 shows the overall performance and a comparison with the state-of-the-art system RoadRunner, as described in Crescenzi et al (2001). For comparison, both systems use the same sets of similar webpages shown in Table 2 and the same noise filtering algorithm of Section 3 before wrapper induction. The results show that our system can successfully extract information from data-intensive websites with the record list structure, and that the induced wrappers are normally well generalized. However, it fails to do so on pages without the record list structure, such as the similar news webpages of sina.com.cn. This indicates that our system is very effective and robust on data-intensive webpages with the paging technique. In comparison, although RoadRunner can successfully extract information from data-intensive websites with or without the record list structure, it fails to generalize over the iterative items on those with the record list structure and generalizes only partially on those without it. Detailed evaluation shows that, compared with RoadRunner, our system reduces the size of the extracted wrappers by about 40% on average, largely due to better generalization of iterative and optional items in our system. This weaker generalization makes RoadRunner vulnerable to small changes in the webpage structure when decoding; for example, it fails when the number of iterative items in a webpage changes. Table 3 also shows that our system is much faster than RoadRunner, especially when a large number of similar webpages is available. The reason is that RoadRunner depends on pair-wise matching of similar webpages, while our system processes similar webpages (more precisely, data-intensive segments) one by one.

?DE: whether it can extract data from the webpages; ?GW: whether it can induce generalized wrappers over iterative/optional items; T: processing time, excluding noise filtering, in milliseconds; P: partial.

             |             |      Our system     |     RoadRunner
Site         | Description | ?DE | ?GW | T(ms)   | ?DE | ?GW | T(ms)
-------------|-------------|-----|-----|---------|-----|-----|------
sina.com.cn  | News        | N   | N   | -       | Y   | P   | 1156
sina.com.cn  | House       | Y   | Y   | 343     | Y   | N   | 812
taobao.com   | Computer    | Y   | Y   | 391     | Y   | N   | 968
taobao.com   | Bicycle     | Y   | Y   | 392     | Y   | N   | 970
ebay.com.cn  | Child suit  | Y   | Y   | 325     | Y   | N   | 3656
ebay.com.cn  | Cell phone  | Y   | Y   | 394     | Y   | N   | 3667
dangdang.com | Novel       | Y   | Y   | 298     | Y   | N   | 2281

Table 3: Comparison of our system vs. RoadRunner
Table 4 shows the effect of noise filtering on webpage size reduction. It shows that noise filtering almost halves the webpage sizes on average, which indicates that our noise filtering algorithm is very effective. It also shows that filtering out layout noisy information reduces the webpage sizes by about 30% on average, while filtering out the remaining assistant noisy information for browsing, such as navigation bars and friendly hyperlinks, further reduces the webpage sizes by about 35% on average.

Site         | Description | Original size (kb) | After filtering layout noisy info (kb) | After filtering layout and assistant noisy info (kb)
-------------|-------------|--------------------|----------------------------------------|-----------------------------------------------------
sina.com.cn  | News        | 81                 | 51                                     | 36
sina.com.cn  | House       | 45                 | 38                                     | 22
taobao.com   | Computer    | 98                 | 70                                     | 44
taobao.com   | Bicycle     | 101                | 72                                     | 52
ebay.com.cn  | Child suit  | 72                 | 40                                     | 30
ebay.com.cn  | Cell phone  | 78                 | 47                                     | 31
dangdang.com | Novel       | 136                | 121                                    | 59

Table 4: Effectiveness of noise filtering on size reduction

Table 5 shows the effect of noise filtering on the wrapper induction time of both our system and RoadRunner. It shows that our noise filtering algorithm is very efficient: its processing time is negligible compared with that of wrapper induction. It also shows that noise filtering largely reduces the wrapper induction time of our system, by about 30%, while its effect on RoadRunner is even more significant (~40%) when a large number of similar webpages is available.

t_NF: the time for noise filtering.

             |             |           |   Our system (ms)      |   RoadRunner (ms)
Site         | Description | t_NF (ms) | w/ NF | w/o NF | Less  | w/ NF | w/o NF | Less
-------------|-------------|-----------|-------|--------|-------|-------|--------|-----
sina.com.cn  | News        | 16        | -     | -      | -     | 1156  | 2672   | 1516
sina.com.cn  | House       | 31        | 343   | 483    | 140   | 812   | 1500   | 688
taobao.com   | Computer    | 62        | 391   | 601    | 210   | 968   | 1625   | 657
taobao.com   | Bicycle     | 62        | 392   | 602    | 210   | 970   | 1635   | 665
ebay.com.cn  | Child suit  | 47        | 325   | 435    | 110   | 3656  | 4250   | 594
ebay.com.cn  | Cell phone  | 48        | 394   | 510    | 116   | 3667  | 4260   | 593
dangdang.com | Novel       | 62        | 298   | 496    | 198   | 2281  | 3547   | 1266

Table 5: Effectiveness of noise filtering on time reduction in wrapper induction

6. Conclusion

Web-based information extraction on data-intensive websites has become a hot research topic during the last decade. This paper proposes an unsupervised framework for robust web-based information extraction. Evaluation shows that our system works well on data-intensive websites with the popular paging technique and achieves better robustness, with more generalized wrappers, than other popular web-based IE systems.

Due to time and space limitations, this paper mainly evaluates our system on data-intensive websites using the paging technique. In future work, we will extend the system to other types of data-intensive websites and explore more effective and efficient algorithms, such as tree kernels, to mine similar webpages with both wider coverage and better representativeness. Moreover, this paper induces one wrapper from each extracted data-intensive segment. It would be possible to cluster these automatically induced wrappers and further generalize them per cluster to achieve better generalization. Finally, it would be of great value to further extend our system to deal with non-data-intensive websites, towards a universal and robust web-based information extraction system.

Acknowledgements

The authors would like to thank the anonymous reviewers for their comments on this paper. We would also like to thank Li Junhui for providing code that improved the system. This research was supported by the High Technology Plan of Jiangsu Province, China under Grant No. BG2005020.

References
Califf M.E. & Mooney R.J., 1997, Relational Learning of Pattern-Match Rules for Information Extraction, CoNLL'1997, pp. 9-15.
Crescenzi V. & Mecca G., 1998, Grammars have exceptions, Information Systems, vol. 23, no. 8, pp. 539-565.
Crescenzi V. & Mecca G., 2001, RoadRunner: Towards Automatic Data Extraction from Large Web Sites, VLDB'2001.
Embley D.W., Jiang Y.S. & Ng Y.K., 1999, Record-Boundary Discovery in Web Documents, SIGMOD'1999.
Kushmerick N., 1997, Wrapper Induction for Information Extraction, Ph.D. Thesis, Univ. of Washington, Tech Report UW-CSE-97-11-04.
Li W.Q. & Zhang Z.N., 2004, An improved algorithm for automatic webpage wrapper generation, Computer Engineering and Application (Chinese), vol. 40, no. 22, pp. 113-115.
Muslea I., Minton S. & Knoblock C.A., 2001, Hierarchical Wrapper Induction for Semistructured Information Sources, Autonomous Agents and Multi-Agent Systems, vol. 4, no. 1/2, pp. 93-114.
Reis D.C., Golgher P.B. & Silva A.S., 2004, Automatic web news extraction using tree edit distance, WWW'2004 (poster), pp. 504-505.
Sahuguet A. & Azavant F., 1999, Web ecology: Recycling HTML pages as XML documents using W4F, WebDB'1999.
Wang J.Y., 2004, Information extraction and integration for Web databases, Ph.D. Thesis, HKUST, http://repository.ust.hk/dspace/handle/1783.1/2058.
Wang R., Song H.T. & Lu Y.C., 2004, An automatic extraction system from the Web, Computer Engineering and Application (Chinese), vol. 40, no. 19, pp. 135-138.
Yang W., 1991, Identifying syntactic differences between two programs, Software - Practice and Experience, vol. 21, no. 7, pp. 739-755.
Yi L., Liu B. & Li X.L., 2003, Eliminating Noisy Information in Web Pages for Data Mining, SIGKDD'2003.
Zhou J., Zhu M. & Wang S., 2004, XML-based automatic web information extraction, Computer Application (Chinese), vol. 24, no. s1, pp. 225-227.