
Journal of Chinese Language and Computing 16 (3): 157-168
An unsupervised framework for robust web-based
Information Extraction
Qiaoming Zhu, Zhengxian Gong, Peifeng Li, Guodong Zhou
School of Computer Science & Technology
Soochow University
Suzhou 215006, China
[email protected], [email protected], [email protected], [email protected]
_________________________________________________________________________
Abstract.
Web-based information extraction is challenging and has become a hot research topic
during the last decade. This paper proposes an unsupervised framework for robust
web-based information extraction to mine useful information from given and related
webpages. For the given webpages, a webpage mining algorithm is first presented to
retrieve similar webpages from the Web. Then, a noise filtering algorithm is presented
to filter out noisy information in each webpage. Finally, a novel wrapper induction
algorithm is proposed to extract rules from the webpages by first identifying
data-intensive segments and then inducing a wrapper from each of them through
identifying iterative and optional items. This is based on the observation that
iterative items, and possible optional items inside them, can normally be detected
from a webpage or segment itself and do not depend on tag mismatches between two
webpages or segments. Evaluation shows that our system works well on data-intensive
websites with the popular paging technique. It also shows that our system is more
robust and generalized than other popular web-based IE systems on data-intensive
websites.
Keywords
Web pages, RoadRunner, similar pages, web noise cleaning, fit fan-out node, extraction rule
_______________________________________________________________________
1. Introduction
With the dramatic increase in the amount of textual information available in the Web, there
has been growing interest in web-based information extraction (IE). In some sense, the Web
can be regarded as the largest “knowledge base” ever developed and made available to the
public. During the last decade, web-based IE has become a hot research topic. However,
web-based IE is challenging due to the complexity and diversity of the huge Web.
Web-based IE is usually performed by wrappers (Kushmerick 1997; Crescenzi et al 1998;
Muslea et al 2001; Crescenzi et al 2001). When decoding, wrappers are applied to parse the
given webpages and extract useful information, e.g. the list of job records (with job title,
company name, location, salary and date) as shown in Figure 1. In the literature, various
categories of approaches have been proposed for generating wrappers: manual writing,
supervised wrapper induction learning and unsupervised wrapper induction learning.
Early approaches are based on human expertise (Crescenzi et al 1998; Sahuguet et al
1999). A key problem with manually coded wrappers is that writing them is usually a
difficult and labor-intensive task, and that they tend to be brittle and difficult to maintain.
Figure 1: An example: a list of job records in a hyperlink group
Supervised wrapper induction learning approaches depend on the availability of
annotated training examples. They have the potential to automate the development process
and achieve high performance in a supervised way. Inducing wrappers from a set of
webpages corresponds to looking at the annotated examples and trying to generalize them
to unseen webpage code. The Rapier system described in Califf et al (1997) used relational
learning to construct unbounded pattern-match rules for information extraction. The Stalker
system described in Muslea et al (2001) proposed a hierarchical wrapper induction approach
by decomposing the hard problem of extracting data from a complex webpage into a series of
simpler extraction tasks. Both systems generate high-accuracy extraction rules based on
annotated training examples. However, annotating training examples is also very
labor-intensive and becomes the major bottleneck of supervised learning approaches.
The current trend is to apply unsupervised wrapper induction learning approaches, which
automate the wrapper induction process to a large extent. That is, they do not rely on
annotated training examples and do not require any human interaction during the wrapper
induction process. As a representative, RoadRunner as described in Crescenzi et al (2001)
relied on a set of similar webpages to extract rules automatically. In order to differentiate
meaningful information from meaningless information, it worked with two similar
webpages at a time and tried to generalize wrappers by finding the commonalities and
differences between them.
However, current unsupervised wrapper induction learning approaches, such as
RoadRunner, have some disadvantages. Firstly, their success normally depends on the
availability of structure-similar webpages, and they only work well on webpages with
simple structures. Secondly, they often lack effective and efficient noise filtering
mechanisms. Finally, they lack robustness, and the generated wrappers are vulnerable
to small changes in the webpage structure. As a representative, RoadRunner was
based on both tag and string mismatches to find iterative and optional items by
backtracking. Although these two mismatches occur widely, the occurrence of tag
mismatches in popular data-intensive websites is quite limited. This largely limits the
generalization power of its wrapper induction, which makes RoadRunner lack robustness
and leaves it vulnerable to small changes in the webpage structure.
In order to solve the above problems, this paper proposes an unsupervised framework for
robust web-based information extraction. For given webpages, a webpage mining algorithm
is first presented to automatically retrieve similar webpages from the Web. Then, a noise
filtering algorithm is presented to filter out noisy information in each webpage. Finally, a
novel wrapper induction algorithm is proposed to extract rules from the webpages by first
identifying data-intensive segments and then inducing a wrapper from each of them through
identifying iterative and optional items.
The rest of this paper is organized as follows. Section 2 presents the similar webpage
mining algorithm, while the noise filtering algorithm is described in Section 3. Section 4
proposes the novel wrapper induction algorithm. Finally, we evaluate our unsupervised
framework on data-intensive websites in Section 5 and conclude this paper in Section 6.
2. Similar Webpage Mining
To automatically induce well-behaved wrappers in an unsupervised way, it is critical to
have a set of “good” similar webpages. Here, we mean “good” in two aspects: 1) The
webpages should be representative of the task to achieve acceptable recall when using the
induced wrappers in decoding; 2) The webpages should be similar enough in both content
and structure. Normally, a balance between the two should be achieved. However,
state-of-the-art systems, such as RoadRunner (Crescenzi et al 2001), avoid this critical issue
by assuming that such "good" similar webpages are already provided. This makes such
systems difficult to deploy to end users, since it is very difficult and impractical for users
to provide such "good" webpages, especially representative ones, which requires a
reasonable number of them with a fair distribution. It is more practical for users to
provide one or a very few representative webpages for a task and let the computer do all
the rest. Moreover, most state-of-the-art systems only work well on a set of very similar
webpages, which makes the induced wrappers fail to achieve good coverage. For example,
RoadRunner assumed that the given webpages share the same webpage structure. Therefore,
automatically obtaining enough "good" similar webpages is critical to the success of an
automatic wrapper induction system.
Given the above considerations, the first step of our system is to retrieve enough "good"
similar webpages for the given webpages. The basic idea is to first retrieve a large number
of webpages from the Web, e.g. via a public search engine (such as Google), using terms
occurring in the given webpages to form a query, and then to filter out "bad" webpages.
For filtering, we can first cluster the retrieved webpages according to their content and
structure similarity, then eliminate small clusters, and finally pick a few representatives
from each of the remaining major clusters. In this way, a set of "good" similar webpages
with content and structure similarity can be obtained. Generally, the similarity between
webpages can be measured through their structures and contents. Reis et al (2004) adopted
tree edit distance to cluster pages, based on the observation that each well-formed webpage
forms a tree, so the similarity between two webpages can be calculated between their
counterpart trees.
Especially for data-intensive web sites, “good” similar webpages can also be effectively
and efficiently retrieved from:
 Related webpages in the paging list of a given webpage when the paging technique
is applied. Normally, such pages are so similar with the given webpage that high
quality can be assumed.
 Related hyperlinks in the same directory of a given webpage.
3. Noise Filtering
Normally, webpages are full of "noisy" (data-unrelated) information that deters a
web-based information extraction system from automatically inducing wrappers. Therefore,
noise filtering should be done before wrapper induction; otherwise, the robustness of a
web-based information extraction system will be largely reduced. For example,
RoadRunner extracted rules via strict tag and string mismatches in the given webpages.
This makes it prone to errors from noisy information, which strongly affects its robustness.
Moreover, the size of webpages can be largely reduced via proper noise filtering. Basically,
there are two categories of noisy information:
 Layout noisy information. For a webpage in HTML, there exist many tags, such as
SCRIPT and STYLE, purely for maintaining the webpage style. Generally, such
tags occupy much in the webpage size and have no relationship with the content.
Moreover, such noisy information can be easily detected and filtered out.
 Assistant noisy information for browsing, such as navigations and friendly
hyperlinks. Normally, this category of noisy information is much complicated and
sometimes difficult to detect.
In the literature, different ways have been applied to deal with noisy information. For
example, Yi et al (2003) discovered that noisy information changes little between
similar pages and tried to filter out noise by constructing a style-tree based on
information entropy. However, their algorithm for constructing the style-tree is too complex
to be useful in practice. Instead, Wang (2004) ignored possible changes of noisy information
and used an easy way to eliminate noise by deleting identical nodes in similar pages, at the
expense of precision.
To filter out noisy information in webpages effectively and efficiently, this paper
proposes a two-step strategy: first eliminating the easy layout noisy information and then
the remaining, more complicated assistant noisy information for browsing. In this paper, the
layout noisy information is easily filtered out by maintaining a list of possible layout
noisy tags, such as the one in Table 1. Due to the limited tag set of a web markup language
such as HTML, this list can be easily maintained to the users' needs.
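As an illustration, this first step can be sketched in a few lines (ours, not the
authors' code), here using the jsoup HTML parser as an assumed stand-in for whatever
DOM library is used; the tag list is the one from Table 1.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class LayoutNoiseFilter {
        // Drop every node whose tag is in the layout-noise list of Table 1.
        public static Document filterLayoutNoise(String html) {
            Document doc = Jsoup.parse(html);
            doc.select("script, style, input, select, textarea").remove();
            return doc;
        }
    }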
The major problem in noise filtering lies in the remaining assistant noisy information for
browsing, such as navigation bars and friendly hyperlinks. Normally, such noisy information
occurs together and forms a hyperlink group with the same structure. Therefore, filtering
out such noisy information consists in first identifying possible hyperlink groups and then
determining their nature. Due to the complexity and diversity of webpages, identifying
possible hyperlink groups is not as easy as it seems.
Figure 2 shows three possible types of hyperlink groups:
• Flat hyperlink group, such as type A1, with several <a> tags inside a <div> tag;
• Table hyperlink group, such as type A2, with each hyperlink embedded in
another tag, forming a sequential hyperlink group;
• Recursive hyperlink group, such as type A3, which consists of several similar
units, each of them recursively consisting of several tags.
Tag        Description
SCRIPT     Page script, such as JavaScript, VBScript
STYLE      Page style
INPUT      Web form controls: radio box, checkbox, textbox, submit button
SELECT     Web form: drop-down list
TEXTAREA   Web form: multi-line textbox

Table 1: A list of layout noisy tags
[Figure 2 shows the three group types as trees under an HTML root (24 offspring), with
each node labeled by its tag name and offspring count: A1 is a div (3 offspring) with three
<a> leaves; A2 is a sequence of three tr nodes, each holding one <a> leaf; A3 is a div
(12 offspring) containing a table whose three div units each recursively hold <a>, <I> and
<B> leaves.]
Figure 2: Possible types of hyperlink groups
Among them, flat and table hyperlink groups do not contain data-related information and
can be filtered out directly, while recursive hyperlink groups are much more complex. For
example, in Figure 1, each item in the first column (Job Title) is a hyperlink with more
details about the corresponding job, and all of them compose a recursive hyperlink group. In
this case, the data in the first column should be extracted instead. The algorithm
HyperlinkGroup(…) in Figure 3 identifies possible hyperlink groups and determines
whether they are noisy or not. In this paper, this is done by traversing the webpage nodes
with the depth-first search algorithm to find a hyperlink node (with tag <A>). If such a node
is found, the algorithm finds its nearest ancestor with no less than 2 children (since a
hyperlink group should have at least 2 members) and calls the function IsNoisyNode(…) to
determine whether all of its children are hyperlinks or not. If yes, it labels that ancestor
as a noisy node and deletes it.
Through noise filtering, the size of a webpage can be largely reduced. This largely
decreases the burden of the wrapper induction algorithm as well as increases its robustness.
    Algorithm: HyperlinkGroup(Element node)
    BEGIN
        List children = node.getChildren();           // get all of node's children
        if (children.size() == 0) {                   // node is a leaf
            if (node.getName().equals("A")) {         // the leaf is a hyperlink
                Element pt = find_nearest_parent(node);  // nearest ancestor with >= 2 children
                IsNoisyNode(pt);
            }
        }
        else {                                        // non-leaf node: recurse
            for (int i = 0; i < children.size(); i++) {
                Element tempnode = (Element) children.get(i);
                if (visited(tempnode)) continue;
                if (tempnode.getName().equals("A")) {
                    Element pt = find_nearest_parent(tempnode);
                    IsNoisyNode(pt);
                }
                else HyperlinkGroup(tempnode);
            }
            if (all_children_isgarbage(node)) set_noisy_tag(node);
        }
    END  // end Algorithm: HyperlinkGroup

    Algorithm: boolean IsNoisyNode(Element node)
    BEGIN
        List hyperlinks = node.getChildren("A");
        List children = node.getChildren();
        if (hyperlinks.size() == children.size()) {   // flat hyperlink group
            node.setAttribute("noisy", "true");
            return true;
        }
        else {
            int i;
            for (i = 0; i < children.size(); i++) {
                Element t = (Element) children.get(i);
                if (t.getAttribute("noisy") != null) continue;  // already marked noisy
                if (t.getName().equals("A")) continue;          // a hyperlink child
                if (IsNoisyNode(t)) continue;                   // a noisy subgroup
                else break;                                     // data-related child found
            }
            if (i >= children.size()) {               // table hyperlink group
                node.setAttribute("noisy", "true");
                return true;
            }
            else return false;                        // recursive hyperlink group: keep it
        }
    END  // end Algorithm: IsNoisyNode

Figure 3: The algorithm for identifying hyperlink groups
4. Wrapper Induction
In the literature, there are several unsupervised wrapper induction algorithms. As the most
representative one, the ACME kernel algorithm proposed in RoadRunner (Crescenzi et
al 2001) depends on tag and string mismatches via pair-wise webpage comparison. If a tag
mismatch occurs, it tries to find iterative items by backtracking and, when that fails,
processes it as an optional item. If a string mismatch occurs, it treats the mismatched
string as data and generalizes it using a common macro #pcdata. Although these two
kinds of mismatches exist widely, tag mismatches may not occur in webpages adopting
the paging technique, which dominates data-intensive websites nowadays. Take the
webpage segment in Figure 1 as an example. When a user clicks the "next page" hyperlink,
the server will pick up the next 20 records from the database and fill a
new webpage. In this case, there normally do not exist tag mismatches among different
webpages or webpage segments; those webpages or segments may only have string
mismatches. This makes the ACME algorithm fail to generalize iterative items due to the
lack of tag mismatches.
Since then, some researchers have extended the ACME algorithm from the character stream
to the tree structure (Wang et al 2004; Li et al 2004). However, none of them has tried to
deal with the above problem in the ACME algorithm.
In order to solve the above problem, this paper proposes a novel approach that first
identifies "fit fan-out" nodes as possible data-intensive segments in a webpage and then
adopts a tree similarity measure to find iterative and optional items. This is based on the
observation that iterative items, and possible optional items inside them, can normally be
detected from a webpage or segment itself and do not depend on tag mismatches between
two webpages or segments.
4.1 Extracting Data-intensive Segments by Identifying “Fit Fan-out” Nodes
In this paper, data-intensive segments are found through "fit fan-out" nodes by
automatically constraining their structures and offspring numbers as follows:
Step 1: Create a structure tree for each webpage and traverse the tree with the depth-first
search algorithm to find the offspring number of each node. For example, in Figure 2,
the root node HTML has 24 offspring. This processing can be done during the noise
filtering step, so no additional pass is needed.
Step 2: Find out nodes with a number of children and with similar child structures. First,
the algorithm searches from the root node for nodes whose number of children is not less
than a threshold. Then, for each such node X, it calculates the standard deviation (sd)
among X's children using Formula (1), where n is the total number of children of node X,
x_i is the offspring number of the i-th child of node X, and \bar{x} is the average
offspring number of the n children of node X:
sd = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}    (1)
sd measures the spread of the offspring numbers among X's children. The smaller the
standard deviation of X, the more similar X's children are and the more probable it is
that X covers a data-intensive segment. If the standard deviation of a node X is less than
a threshold, X is selected as a "fit fan-out" node. In this paper, this threshold is
set to 3.
Step 3: Filter out those "fit fan-out" nodes that are covered by other "fit fan-out" nodes,
and return all the remaining "fit fan-out" nodes. That is, our algorithm can effectively
extract multiple data-intensive segments. Here, each sub-tree covered by a "fit fan-out"
node represents a data-intensive segment. For example, if all the three sub-trees A1, A2 and
A3 in Figure 2 survived noise filtering and were selected as "fit fan-out" nodes in Step 2,
all three would be returned as data-intensive segments. However, since A1 and A2
are noisy hyperlink groups and will be filtered out by the noise filtering algorithm described
in Section 3, only A3 in Figure 2 will be returned as a data-intensive segment.
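The following sketch gives our reading of Steps 1-3 on a generic tree; the Node class and
helper names are our own illustration, not the paper's code, while the thresholds (at
least 10 children, standard deviation below 3) follow the values given in this section.

    import java.util.*;

    class Node {
        String tag;
        String text;                       // leaf text content (used in Section 4.2's sketches)
        List<Node> children = new ArrayList<>();
        int offspring;                     // filled in by countOffspring()
        Node(String tag) { this.tag = tag; }
    }

    public class FitFanOut {
        // Step 1: depth-first pass recording each node's offspring number.
        static int countOffspring(Node n) {
            int total = 0;
            for (Node c : n.children) total += 1 + countOffspring(c);
            return n.offspring = total;
        }

        // Formula (1): standard deviation of the children's offspring numbers.
        static double sd(Node n) {
            int k = n.children.size();
            double mean = 0;
            for (Node c : n.children) mean += c.offspring;
            mean /= k;
            double sum = 0;
            for (Node c : n.children) sum += (c.offspring - mean) * (c.offspring - mean);
            return Math.sqrt(sum / (k - 1));
        }

        // Steps 2-3: collect "fit fan-out" nodes top-down; a qualifying node
        // covers its whole subtree, so its descendants are not searched.
        static void findFitFanOut(Node n, List<Node> result) {
            if (n.children.size() >= 10 && sd(n) < 3.0) { result.add(n); return; }
            for (Node c : n.children) findFitFanOut(c, result);
        }
    }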
In the literature, Zhou et al (2004) needed user interaction to find data-intensive segments,
while our algorithm does so in a completely automatic way. Compared with the approach in
Embley et al (1999), which identified "max fan-out" nodes to locate data-intensive
segments, our algorithm identifies "fit fan-out" nodes via the standard deviation measure,
which makes it more robust to abnormal phenomena. Moreover, our algorithm also applies
to a webpage with multiple data-intensive segments. In this paper, in order to filter out
unlikely noisy segments, we set the threshold for the child number in Step 2 to 10.
4.2 Wrapper Induction for Data-Intensive Segments
In this paper, one wrapper is induced for each data-intensive segment by identifying and
generalizing iterative and optional items. The key here is how to model different exposures
of a data-intensive segment and induce the structures shared among them. For example, in
Figure 4(a) there are two iterative items (tr, tr) while in Figure 4(b) there seem to be
two iterative items (p, x). However, p(2) in Figure 4(b) is not followed by an x node. This
phenomenon is very common in webpages. Assume the data-intensive segment for a "fit
fan-out" node C has n children C1, C2, ..., Cn, with C1 the first child and Cn the last.
Firstly, for each child node, the algorithm finds another node Cx on its right side whose
tree similarity with the given node is larger than a given threshold. If no such node is
found, no iterative item exists. In this paper, the similarity threshold is set to 80% to
balance the numbers of iterative and optional items. In theory, any tree kernel can be used
to measure the similarity between trees. In this paper, the tree similarity is based on the
Simple_Tree_Matching algorithm described in Yang (1991), which returns the maximal
number M of matching nodes between two trees: the bigger M is, the more matching nodes
the two trees have. Obviously, the Simple_Tree_Matching algorithm is biased towards
bigger trees. In order to resolve this problem, we normalize M by the offspring number of
the smaller tree.
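This similarity test can be sketched as follows, under the same hypothetical Node class as
in Section 4.1's sketch: Simple_Tree_Matching is the dynamic program of Yang (1991), and
the normalization by the smaller tree's size is the one proposed above.

    // Simple_Tree_Matching (Yang 1991): size of the maximal matching between
    // two trees, computed by dynamic programming over the child sequences.
    static int simpleTreeMatching(Node a, Node b) {
        if (!a.tag.equals(b.tag)) return 0;           // roots must match
        int m = a.children.size(), n = b.children.size();
        int[][] M = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++)
                M[i][j] = Math.max(Math.max(M[i][j - 1], M[i - 1][j]),
                        M[i - 1][j - 1]
                        + simpleTreeMatching(a.children.get(i - 1), b.children.get(j - 1)));
        return M[m][n] + 1;                           // +1 for the matched roots
    }

    // Normalize M by the size (offspring number + 1) of the smaller tree;
    // a pair (C1, Cx) is taken as iterative when this value reaches 80%.
    static double similarity(Node a, Node b) {
        double smaller = Math.min(a.offspring, b.offspring) + 1;
        return simpleTreeMatching(a, b) / smaller;
    }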
Secondly, assuming C1 and Cx are two iterative items, the algorithm tries to find bigger
iterative items recursively. For example, in Figure 4(a), assume that tr(1) is similar to
tr(2). However, we are not yet sure that the iterative item is a single tr, since
(tr(1), tr(2)) may be similar to (tr(3), tr(4)); in that case, the iterative item should be
two trs, which can be expressed as (tr, tr)+. This can be done by checking whether the
distance between C1 and Cx is bigger than 2. If not, the iterative item is C1; if yes, the
algorithm first builds virtual trees V1 from C1 to Cx-1 and V2 from Cx to Cx+(x-1), and
then compares the similarity between V1 and V2. If they are similar, the iterative item
should be (C1, C2, ..., Cx-1); if not, the algorithm deletes the rightmost child from V1
and V2 and compares them again. For example, since p(1) and p(2) in Figure 4(b) are
similar, the algorithm compares the virtual tree of (p(1), x(1)) with that of (p(2), p(3)).
Since they are not similar, the algorithm narrows down the scope and identifies node p as
an iterative item and node x as an optional item. In this way, Figure 4(b) can be
represented as (p(x?))+.
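This span-growing step can be sketched as follows, reusing Node, similarity and
countOffspring from the earlier sketches; the index handling is our own reading, and the
children left over in the gap after shrinking are the candidates for optional items.

    // C1's similar sibling was found at 0-based index g, so the candidate
    // repeat unit is g children long; shrink the span until the two virtual
    // trees match, or fall back to a single-child unit.
    static int repeatUnitLength(List<Node> children, int g, double threshold) {
        for (int len = g; len >= 1; len--) {
            Node v1 = virtualTree(children, 0, len);      // C1 .. Clen
            Node v2 = virtualTree(children, g, len);      // Cx .. Cx+len-1
            if (similarity(v1, v2) >= threshold) return len;
        }
        return 1;
    }

    static Node virtualTree(List<Node> children, int start, int len) {
        Node v = new Node("virtual");                     // shared dummy root
        for (int i = start; i < Math.min(start + len, children.size()); i++)
            v.children.add(children.get(i));
        countOffspring(v);                                // refresh sizes for similarity()
        return v;
    }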
Thirdly, the algorithm finds more iterative and optional items among the children of
the iterative items. On the one hand, iterative items can occur recursively. On the other
hand, as stated above, the tree similarity threshold is set to 80% in this paper.
Therefore, it is possible that there exist some mismatches among the children of the
iterative items.
Finally, the algorithm compares the iterative items, locates the positions where the
contained data change, and marks those positions with #pcdata. Moreover, related
information about the data-intensive segment of a "fit fan-out" node (such as its node name,
position and offspring number) and its included iterative/optional items are kept as a
reference for the induced wrapper.
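A sketch of this final generalization step, again on the hypothetical Node class (here
using its text field for leaf content): aligned positions whose text differs across two
matched iterative items become #pcdata slots.

    // Walk two matched iterative items in parallel and generalize positions
    // whose text differs into #pcdata data slots.
    static void markPcdata(Node a, Node b) {
        if (a.text != null && b.text != null && !a.text.equals(b.text))
            a.text = "#pcdata";                       // varying content = data slot
        int k = Math.min(a.children.size(), b.children.size());
        for (int i = 0; i < k; i++)
            markPcdata(a.children.get(i), b.children.get(i));
    }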
[Figure 4 shows two example segments as trees: (a) a block node with four children
tr(1), tr(2), tr(3), tr(4); (b) a block node with five children p(1), x(1), p(2), p(3), x(2).]
Figure 4: Data-intensive segments: two examples
5. Experimentation
We have evaluated our system on several data-intensive websites, such as sina.com.cn and
ebay.com.cn. Table 2 shows some statistics about these websites. Due to the nature of these
websites, we simply use the two ad-hoc rules described in Section 2 to retrieve similar
webpages. Table 2 shows that our simple retrieval algorithm succeeds in extracting many
similar webpages from these data-intensive websites. It also shows that such similar
webpages are quite complex (normally nested) and have many iterative items and some
possible optional items.
Table 3 shows the overall performance and its comparison with the state-of-the-art
system, RoadRunner, as described in Crescenzi et al (2001). For comparison, both systems
use the same sets of similar webpages as shown in Table 2 and the same noise filtering
algorithm of Section 3 in wrapper induction. The table shows that our system can
successfully extract information from data-intensive websites with the record list structure
and that the induced wrappers are normally well generalized. However, it fails to do so on
those without the record list structure, such as the similar news webpages in sina.com.cn.
This indicates that our system is very effective and robust on data-intensive webpages with
the paging technique. In comparison, although RoadRunner can successfully extract
information from data-intensive websites with or without the record list structure, it fails
to generalize the iterative items for those with the record list structure and only
generalizes partially for those without it. Detailed evaluation shows that, compared with
RoadRunner, our system reduces the size of the extracted wrappers by about 40% on
average, largely due to better generalization of iterative and optional items. This lack of
generalization makes RoadRunner vulnerable to small changes in the webpage structure
when decoding; for example, it fails when the number of iterative items in a webpage
changes. Table 3 also shows that our system is much faster than RoadRunner, especially
when there exist a large number of similar webpages. The reason is that RoadRunner
depends on pair-wise matching of similar webpages while our system works on similar
webpages (more precisely, data-intensive segments) one by one.
#r: the record number in the webpages; #sim: the number of similar webpages;
#pcd: the attribute number per record; ?nest: whether the record is nested; #opt: the
number of optional items in each webpage

Site           Description   #r   #sim   #pcd   ?nest   #opt
sina.com.cn    News          -    10     -      -       -
sina.com.cn    House         20   147    6      Y       0
taobao.com     Computer      40   2065   6      Y       0
taobao.com     Bicycle       40   75     6      Y       0
ebay.com.cn    Child suit    50   239    4      Y       2
ebay.com.cn    Cell phone    45   9      4      Y       1
dangdang.com   Novel         40   8      4      Y       0

Table 2: Statistics about the test websites
?DE: whether it can extract data from the webpages; ?GW: whether it can induce
generalized wrappers on iterative/optional items; T: processing time, excluding
noise filtering, in milliseconds; P: partial

                             |     Our system      |     RoadRunner
Site           Description   | ?DE   ?GW   T(ms)   | ?DE   ?GW   T(ms)
sina.com.cn    News          | N     N     -       | Y     P     1156
sina.com.cn    House         | Y     Y     343     | Y     N     812
taobao.com     Computer      | Y     Y     391     | Y     N     968
taobao.com     Bicycle       | Y     Y     392     | Y     N     970
ebay.com.cn    Child suit    | Y     Y     325     | Y     N     3656
ebay.com.cn    Cell phone    | Y     Y     394     | Y     N     3667
dangdang.com   Novel         | Y     Y     298     | Y     N     2281

Table 3: Comparison of our system vs. RoadRunner
Table 4 shows the effect of noise filtering on webpage size reduction. It shows that noise
filtering almost halves the webpage sizes on average, which indicates that our noise
filtering algorithm is very effective. It also shows that filtering out layout noisy
information reduces the webpage sizes by about 30% on average, while filtering out the
remaining assistant noisy information for browsing, such as navigations and friendly
hyperlinks, further reduces them by about 35% on average.
Table 5 shows the effect of noise filtering on the wrapper induction time of both our
system and RoadRunner. It shows that our noise filtering algorithm is very efficient: its
processing time is negligible compared with that of wrapper induction. It also shows that
noise filtering largely reduces the wrapper induction time, by about 30% for our system,
while its effect on RoadRunner is even more significant (~40%) when a large number of
similar webpages are available.
Site name      Description   Original     After filtering        After filtering layout and
                             size (KB)    layout noise (KB)      assistant noise (KB)
sina.com.cn    News          81           51                     36
sina.com.cn    House         45           38                     22
taobao.com     Computer      98           70                     44
taobao.com     Bicycle       101          72                     52
ebay.com.cn    Child suit    72           40                     30
ebay.com.cn    Cellphone     78           47                     31
dangdang.com   Novel         136          121                    59

Table 4: Effectiveness of noise filtering on size reduction
t_NF: the time for noise filtering.

                                         |   Our system (ms)     |   RoadRunner (ms)
Site name      Descr.        t_NF (ms)   | w/ NF  w/o NF  Less   | w/ NF  w/o NF  Less
sina.com.cn    News          16          | -      -       -      | 1156   2672    1516
sina.com.cn    House         31          | 343    483     140    | 812    1500    688
taobao.com     Computer      62          | 391    601     210    | 968    1625    657
taobao.com     Bicycle       62          | 392    602     210    | 970    1635    665
ebay.com.cn    Child suit    47          | 325    435     110    | 3656   4250    594
ebay.com.cn    Cellphone     48          | 394    510     116    | 3667   4260    593
dangdang.com   Novel         62          | 298    496     198    | 2281   3547    1266

Table 5: Effectiveness of noise filtering on time reduction in wrapper induction
6. Conclusion
Web-based information extraction on data-intensive websites has become a hot research
topic during the last decade. This paper proposes an unsupervised framework for robust
web-based information extraction. Evaluation shows that our system works well on
data-intensive websites with the popular paging technique and achieves better robustness,
with more generalized wrappers, than other popular web-based IE systems.
Due to time and space limitations, this paper mainly evaluates our system on
data-intensive websites using the paging technique. In future work, we will extend it to
other types of data-intensive websites and explore more effective and efficient algorithms,
such as tree kernels, to mine similar webpages with both wider coverage and better
representation ability. Moreover, this paper induces a wrapper from each extracted
data-intensive segment. It would be possible to cluster these automatically induced
wrappers and further generalize them per cluster to achieve better generalization. Finally,
it would be of great impact to further extend our system to deal with non-data-intensive
websites and achieve a universal and robust web-based information extraction system.
Acknowledgement
The authors would like to thank the anonymous reviewers for their comments on this paper.
We would also like to thank Li Junhui for providing us with code that helped improve the system.
This research was supported by the High Technology Plan of Jiangsu Province, China
under Grant No.BG2005020.
References
Califf M.E. & Mooney R.J. (1997), Relational Learning of Pattern-Match Rules for
Information Extraction, CoNLL'1997, pp. 9-15.
Crescenzi V. & Mecca G. (1998), Grammars have exceptions, Information Systems, vol. 23,
no. 8, pp. 539-565.
Crescenzi V. & Mecca G. (2001), RoadRunner: Towards Automatic Data Extraction from
Large Web Sites, VLDB'2001.
Embley D.W., Jiang Y.S. & Ng Y.K. (1999), Record-Boundary Discovery in Web
Documents, SIGMOD'1999.
Kushmerick N. (1997), Wrapper Induction for Information Extraction, Ph.D. Thesis,
University of Washington, Tech Report UW-CSE-97-11-04.
Li W.Q. & Zhang Z.N. (2004), An improved algorithm for automatic webpage wrapper
generation, Computer Engineering and Application (Chinese), vol. 40, no. 22, pp. 113-115.
Muslea I., Minton S. & Knoblock C.A. (2001), Hierarchical Wrapper Induction for
Semi-structured Information Sources, Autonomous Agents and Multi-Agent Systems, vol. 4,
no. 1/2, pp. 93-114.
Reis D.C., Golgher P.B. & Silva A.S. (2004), Automatic web news extraction using tree edit
distance, WWW'2004 (poster), pp. 504-505.
Sahuguet A. & Azavant F. (1999), Web ecology: Recycling HTML pages as XML documents
using W4F, WebDB'1999.
Wang J.Y. (2004), Information extraction and integration for Web databases, Ph.D. Thesis,
HKUST. http://repository.ust.hk/dspace/handle/1783.1/2058
Wang R., Song H.T. & Lu Y.C. (2004), An automatic extraction system from the Web,
Computer Engineering and Application (Chinese), vol. 40, no. 19, pp. 135-138.
Yang W. (1991), Identifying syntactic differences between two programs, Software Practice
and Experience, vol. 21, no. 7, pp. 739-755.
Yi L., Liu B. & Li X.L. (2003), Eliminating Noisy Information in Web Pages for Data
Mining, SIGKDD'2003.
Zhou J., Zhu M. & Wang S. (2004), XML-based automatic web information extraction,
Computer Application (Chinese), vol. 24, no. s1, pp. 225-227.