Prudent Schema Matching For Web Forms

Thanh Nguyen, Hoa Nguyen, Juliana Freire
School of Computing, University of Utah
[email protected]
[email protected]
[email protected]
ABSTRACT
There is an increasing number of data sources on the Web
whose contents are hidden and can only be accessed through
form interfaces. Several applications have emerged that aim
to automate and simplify the access to such content, from
hidden-Web crawlers and meta-searchers to Web information integration systems. Since for any given domain there are many different sources, an important requirement for these
applications is the ability to automatically understand the
form interfaces and determine correspondences between elements from different forms. While the problem of form
schema matching has received substantial attention recently,
existing approaches have important limitations. Notably,
they assume that element labels can be reliably extracted
from the forms and normalized—most adopt manually extracted data for experiments, and their effectiveness is reduced in the presence of incorrectly extracted as well as rare
labels. In large collections of forms, however, not only can there be a substantial number of rare attributes, but automated approaches to label extraction are also required, which
invariably lead to errors and noise in the data. In this paper,
we propose PruSM, an approach for matching form schemas
which prudently determines matches among form elements.
PruSM does not require any manual pre-processing of forms,
and it effectively handles both noisy data and rare labels. It
does so by: carefully selecting matches with the highest confidence; and then using these to resolve uncertain matches
and to identify additional matches that include infrequent
labels. We have carried out an extensive experimental evaluation using over 2,000 forms in multiple domains and the
results show that our approach is highly effective and able
to accurately determine matches in large form collections.
1. INTRODUCTION
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand.
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

It is estimated that there are millions of databases on the Web [27]. Because the contents of these databases are hidden and are only exposed on demand, as users fill out and submit Web forms, they are out of reach for traditional search engines, whose crawlers are only able to
crawl through explicit HTML links. Several applications
have emerged which attempt to uncover hidden-Web information and make it more easily accessible, including:
metasearchers [16, 17, 39, 40], hidden-Web crawlers [1, 4,
31, 28], and Web information integration systems [10, 21,
38]. These applications face several challenges, from locating relevant forms [6] and determining their domains [8, 7],
to understanding the semantics of the form elements [41]. In
this paper, we focus on the problem of form schema matching, i.e., identifying matches among elements across different forms, a step that is essential to form understanding.
Consider the forms in Figure 1. To build a meta-searcher
for these forms, we need to identify the correspondences
among their elements so that a query posed through the
meta-searcher's unified interface can be translated into a series of queries that are fired against these forms [16, 17, 40,
10]. For instance, if a user is interested in cars manufactured by Honda, the value Honda should be used as input
for the elements that correspond to auto make. But finding
these correspondences is challenging due to the great variability in how forms are designed, even within well-defined
domains like auto search. Different element labels can represent the same semantic concept and similar labels can represent different concepts. As illustrated in Figure 1, there are
several distinct labels used to represent the concept auto
make: make, please select an auto make, manufacturer,
by brand, start here-choose make, car make-model. Also,
there is a wide variation in the frequency of some labels
within a domain (Figure 6): while some are frequent others
are very rare. And sometimes, elements that have no apparent similarity represent the same concept (i.e., they are
synonyms). In forms for searching for books, {First name,
Last name} is often a synonym for Author, and in airfare
search, {Children, Adult, Senior} is often a synonym for
Passenger.
This problem is compounded when one is faced with large
form collections. Before the form elements can be matched,
their labels must be extracted [41, 29]. But although effective and automated approaches have been developed for
label extraction tasks, they are not perfect. For example,
the accuracy of LabelEx [29] for extracting element labels
varies between 86% and 94%. As a result, the matching process must take into account not only heterogeneity in form
design but also noise introduced by incorrectly extracted
labels.
Several approaches have been proposed for form-schema
matching [38, 32, 19, 35, 27, 36, 37]. However, all of these
Figure 1: Matching form schemas in the Auto domain. Different labels are used to represent the same concept.
approaches assume that labels for the form elements are correctly extracted. And some require additional manual preprocessing of forms. For example, He and Chang [19] require
that different variations of a label be grouped together (all variations of make, e.g., select a car make and vehicle make, must be normalized to "make"); and Wu et al. [38]
require that a hierarchical representation of the form be accurately extracted. In the presence of noisy (incorrectly extracted) labels, the effectiveness of these techniques can
be greatly reduced. Incorrectly extracted labels can lead
to initial incorrect matches which are subsequently propagated to other matches. As we discuss in Section 4, noisy
form schemas can lead to up to a 30% reduction in the matching accuracy of previous approaches. Another limitation of existing approaches is related to rare attributes, i.e., labels (or
groups of labels) that have low frequency in a form collection. Such attributes tend to confuse statistical approaches
which require a sufficient number of samples [19].
Our Approach. In this paper, we propose PruSM (Prudent Schema Matching), a new, scalable approach for schema
matching in large form collections. PruSM is designed to deal
with and reduce the effect of the noise present in such collections. A novel feature of PruSM is how it combines the
various sources of similarity information for form elements.
Each element is associated with a label (i.e., a textual description), a name defined in the HTML standard, the element type (e.g., text box, selection list), a data type (e.g.,
integer, string) and in some cases, a set of domain values.
But not all elements contain all these pieces of information,
and as we discussed above, the information can be incorrect. Furthermore, if considered in isolation, an individual
component can be insufficient for matching purposes. Consider the following example.
Example 1. In the Auto domain, using domain values can
result in an incorrect matching between mileage and price,
since they sometimes have a similar value range. On the
other hand, if only label similarity is considered, an incorrect
match can be derived for model and model year. Although
they correspond to different concepts, they share a term that
is important in the collection—the term model occurs as a
label in most forms in this domain.
Besides element similarity, co-occurrence statistics can be
used to identify mappings [19, 35]. For example, by observing that manufacturer and brand co-occur with a similar set
of attributes but rarely co-occur together, it is possible to
infer that they are synonyms. However, when used in isolation, attribute correlation can lead to incorrect matches.
In particular, correlation matching scores can be artificially
high for rare attributes, since rare attributes seldom cooccur with (all) other attributes.
The matcher we propose prudently combines and validates matches across elements using multiple feature spaces,
including label similarity, domain-value similarity, and attribute correlation. This is in contrast to other approaches that use only a subset of the available information: some just consider the labels associated with elements, while others concentrate on internal structure, attribute type, or the relationships with other entities [34]. It is worth noting that match quality does not monotonically increase with the number of information sources (using all the available information is not always a good solution), and that combining this information in an effective manner is a non-trivial task, since the importance of the different components can vary from form to form, and some form elements do not contain all components. As we discuss later in the paper, by combining the different components prudently, we obtain reliable and confident matches, including matches for rare attributes.

Figure 2: Web form matching framework
Contributions. Our contributions can be summarized as
follows:
• We propose PruSM, a new schema matching approach that
prudently combines different sources of similarity information. PruSM is also prudent in deriving matches: it first
derives matches for frequent attributes and uses these to
both resolve uncertain matches as well as to identify new
matches that include rare labels. As a result, PruSM is robust to noise and rare attributes, and it is also able to deal
with attribute fragmentation.
• We present a detailed experimental evaluation, using a
collection with over 2,000 forms, which shows that PruSM
obtains high precision and recall without any manual preprocessing of form data. Furthermore, PruSM has higher
accuracy (between 10% and 57%) than other form-schema
matching approaches. These results indicate that PruSM is
scalable and effective for large collections of automatically
gathered forms.
Outline. The rest of this paper is organized as follows.
The problem of Web-form schema matching is defined in
Section 2 and we present the PruSM approach in Section 3.
In Section 4, we discuss our experimental evaluation. Section 5 describes the related work. We conclude in Section 6,
where we outline directions for future work.
2. MATCHING WEB FORMS
In this paper, we propose a form-schema matching approach that is scalable and effective for large form collections. The specific scenario we consider is illustrated in Figure 2. Forms are automatically gathered by a Web crawler [6]
and stored in a form repository. These forms are then parsed, their element labels extracted [29], and associated schemas derived. Finally, matches are derived between elements in different form schemas.

Figure 3: Web forms components
2.1 Problem Definition
A Web form F contains a set of elements E = {e1, e2, ..., en}.
Each element ei is represented by a tuple (li , vi ), where li is
the label of ei (i.e., a textual description) and vi is a vector
that contains a list of possible values for ei. An element label l is a bag of terms {t1, t2, ..., tm}, where each term ti is associated with a weight wi. For example, element labels of the
form in Figure 3 are Make, Model, Maximum price, Search
within and Your ZIP; domain values for element Model are
{All, MDX, RDX, RL, TL, TSX}. The composite label Your
ZIP consists of two terms, and intuitively, ZIP is the most
important term since it conveys the meaning of the element.
In our model, a higher weight is associated with the term
ZIP than with Your.
Definition 1. The form-schema matching problem can be stated as follows: given a set of Web forms F = {f1, f2, ..., fn} in a given domain D, a schema matching process identifies all the correspondences (matches) among elements across forms fi ∈ F, and groups them into a set of clusters C = {C1, C2, ..., Ck}, where each cluster contains only elements that share the same meaning.
2.2 Computing Similarities
In order to compute the similarity between form elements,
we first group elements that have the same label. Let Ai be
an attribute which corresponds to the set of elements e that
have the same label l. Given two attributes (Ai , Aj ), we
quantify the similarity between them using three different
measures: label similarity, domain-value similarity, and correlation.
Label Similarity. Because forms are designed for human
consumption, labels are descriptive and are often the most
important source for identifying similar elements.
We define the label similarity between two attributes (Ai, Aj) as the cosine distance [2] between the term vectors for their labels:

lsim(Ai, Aj) = cos(li, lj)                                (1)

where

cos(x, y) = (Σi xi·yi) / (√(Σi xi²) · √(Σi yi²))          (2)
To determine term weight, we can use TF-IDF. TF-IDF
provides a measure of term importance within a collection:
the importance of a term increases proportionally to the
number of times that word appears in a document but it
is offset by the frequency of that word in other documents
in the collection [2]. In addition to term frequency (TF),
we also use the singular token frequency (STF) [29] to capture the importance of terms. The intuition behind STF
comes from the fact that generic terms such as "Please",
"Select", "of", "available" rarely appear alone in a label, and thus are unlikely to represent an important concept—
they do not have a complete meaning by themselves. These
terms usually have high frequency, since they appear in
many composite labels. For instance, "select" appears in
"Select a car make", "Select a State", "Select a model",
etc. We use STF to distinguish between high-frequency
terms that appear alone (which are likely to be important)
and high-frequency terms that always appear with other labels. Term weight wi is computed according to the equations
below:
w(ti) = TF(ti) · √STF(ti)                                 (3)

TF(ti) = n(ti) / Σt n(t)                                  (4)

STF(ti) = n(ti appears alone) / n(ti)                     (5)
However, initially, since terms in the labels are not normalized, the above weighting scheme may fail to accurately
represent term importance. As we discuss in Section 3.3,
after identifying a set of matches, we simplify labels from
long-form into short-form, and this boosts the STF score
and the weight of important terms. This boosting is very
important to obtain additional matches which include rare
labels (see Example 8).
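As an illustration of Equations 3-5, the sketch below computes term weights over a small set of tokenized labels; the function name and the toy labels are ours:

```python
import math
from collections import Counter
from typing import Dict, List

def term_weights(labels: List[List[str]]) -> Dict[str, float]:
    """w(t) = TF(t) * sqrt(STF(t)) (Equations 3-5): TF is the term's share of
    all term occurrences in the collection; STF is the fraction of the term's
    occurrences in which it appears alone in a label."""
    n = Counter()       # n(t): total occurrences of term t
    alone = Counter()   # n(t appears alone)
    for label in labels:
        n.update(label)
        if len(label) == 1:
            alone[label[0]] += 1
    total = sum(n.values())
    return {t: (n[t] / total) * math.sqrt(alone[t] / n[t]) for t in n}

labels = [["year"], ["select", "year"], ["year", "range"], ["select", "model"]]
w = term_weights(labels)
# "year" sometimes appears alone, so it keeps a positive weight; "select"
# never does, so its weight is zero under this deliberately simple sketch.
```

In PruSM the STF counts are later updated from discovered matches (Section 3.3), which boosts the weight of anchor terms such as "year".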
Domain-Value Similarity. We compute the similarity between the domain values of two elements as the cosine distance between their corresponding value vectors. To compute the domain similarity between two attributes, we first need to aggregate all the domain values for each attribute. Given an attribute Ak, we build a vector that contains all occurrences of values associated with the label l for Ak, together with their frequencies: Dk = ∪i=1..n {vi : frequency}, where li = l. The cosine distance (Equation 2) is then used to measure the similarity between the vectors for the two attributes:

dsim(Ai, Aj) = cos(Di, Dj)                                (6)
Example 2. Consider the following scenario: element e1
and e2 have the same label, and so do e3 and e4 . The values
associated with these elements are, respectively: v1 ={a,b,d},
v2 ={a,d}, v3 ={a,b,c}, v4 ={a,b}. These four elements are
aggregated into two attributes A1 = {e1 , e2 }, A2 = {e3 , e4 }
with D1 ={a:2,b:1,d:2}, D2 ={a:2,b:2,c:1}. The similarity
between A1 and A2 is: dsim(A1 , A2 ) = cos(D1 , D2 ) = 0.67.
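Example 2 can be checked with a short script; cosine and dsim are our helper names for Equations 2 and 6:

```python
import math
from collections import Counter
from typing import List

def cosine(d1: Counter, d2: Counter) -> float:
    """Cosine similarity between two frequency vectors (Equation 2)."""
    dot = sum(d1[k] * d2[k] for k in d1.keys() & d2.keys())
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def dsim(values_i: List[List[str]], values_j: List[List[str]]) -> float:
    """Aggregate each attribute's element values into one frequency vector
    D_k, then compare the vectors (Equation 6)."""
    Di = Counter(v for vs in values_i for v in vs)
    Dj = Counter(v for vs in values_j for v in vs)
    return cosine(Di, Dj)

# A1 = {e1, e2}, A2 = {e3, e4} from Example 2:
sim = dsim([["a", "b", "d"], ["a", "d"]], [["a", "b", "c"], ["a", "b"]])
# D1 = {a:2, b:1, d:2}, D2 = {a:2, b:2, c:1}; sim = 6 / (3 * 3) ≈ 0.67
```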
The domain values are a good source of similarity information. However, they are not always available (see e.g.,
the elements Your ZIP or Title in Figure 3). Therefore, we
should consider values as supporting information to validate
(or reinforce) a match.
Correlation. By holistically looking at the form schemas
and considering them all simultaneously, we can leverage
an implicit source of similarity information: attribute correlation. Correlation is a statistical measure which indicates the strength and direction of a linear relationship between two variables—it can be positive or negative. For
Algorithm 1 Prudent Schema Matching
1: Input: Set of attributes A of a certain domain, configuration Conf, grouping threshold Tg

Figure 4: Prudent schema matching framework
Web forms, we exploit the fact that synonym attributes are
semantic alternatives and rarely co-occur in the same query
interface (push away)—they are negatively correlated (e.g.,
Make and Brand ). On the other hand, grouping attributes
are semantic complements and often co-occur in the same
query interfaces (pull together)—they are positively correlated (e.g., First name and Last name, Departure date and
Return date).
Although there are several correlation measures (e.g., Jaccard, cosine, Gini index, Laplace, Kappa), none is universally good [30]. In the context of Web forms, two such measures have been proposed: the H-measure [20] and the X/Y measure [35]. For PruSM, we use the latter:

X(Ap, Aq) = 0                                        if p, q ⊂ form F
            (Cp − Cpq)(Cq − Cpq) / (Cp + Cq)         otherwise             (7)

Y(Ap, Aq) = Cpq / min(Cp, Cq)                                              (8)
The matching score X captures negative correlation, while the grouping score Y captures positive correlation. To understand these scores, consider the contingency table for the attributes Ap and Aq, which encodes the
co-location patterns of these attributes [9]. The cells in this
matrix are f00 , f01 , f10 , f11 , and the subscripts indicate the
presence (1) or absence (0) of an attribute. For example, f11
corresponds to the number of interfaces that contain both
Ap and Aq and f10 corresponds to the number of interfaces
that contain Ap but not Aq . In the equations above, Cqp
corresponds to f11 , Cp to (f10 + f11 ), and Cq to (f01 + f11 ).
Note that correlation is not always an accurate measure,
in particular, when insufficient instances are available. However, as we describe below, correlation is very powerful and
effective when combined with other measures.
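The two scores can be computed directly from the contingency counts; the sketch below (function name ours) evaluates the main branch of Equation 7 and Equation 8:

```python
def xy_scores(f11: int, f10: int, f01: int):
    """X/Y correlation scores (Equations 7-8) from contingency counts:
    f11 = forms containing both attributes; f10 (f01) = forms containing
    only the first (second). C_pq = f11, C_p = f10 + f11, C_q = f01 + f11."""
    c_pq, c_p, c_q = f11, f10 + f11, f01 + f11
    x = (c_p - c_pq) * (c_q - c_pq) / (c_p + c_q)        # matching (negative)
    y = c_pq / min(c_p, c_q) if min(c_p, c_q) else 0.0   # grouping (positive)
    return x, y

# Synonym-like pair (e.g., Make/Brand, never co-occurring): high X, zero Y.
x1, y1 = xy_scores(f11=0, f10=2, f01=3)    # x1 = (2*3)/5 = 1.2, y1 = 0
# Grouping pair (e.g., First name/Last name, always co-occurring): zero X, maximal Y.
x2, y2 = xy_scores(f11=4, f10=0, f01=0)    # x2 = 0, y2 = 1
```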
3. PRUDENT MATCHING
The high-level architecture of PruSM is depicted in Figure 4. Given a set of form schemas, the Aggregation module groups together similar schema attributes and outputs a set of frequent attributes (S1) to the Matching Discovery module, and a set of infrequent attributes (S2) to
the Matching Growth module. Matching Discovery finds
complex matchings among frequent attributes. These complex matchings, together with infrequent attributes, are used
by Matching Growth to obtain additional matchings that
include infrequent attributes. Algorithm 1 describes the
matching process in detail.
2:  Output: Set of attribute clusters C
3:  begin
4:    /* 1. Aggregation */
5:    Aggregate similar elements
6:    Create sets of frequent and infrequent attributes S1, S2
7:    Let Xpq be the set of all attribute pairs in S1, and compute the label similarity lsim, value similarity dsim, and correlation score X between them
8:    /* 2. Matching Discovery for frequent attributes */
9:    M ← ∅
10:   while Xpq ≠ ∅ do
11:     Choose {Ap, Aq} that has the highest Xpq
12:     if prudent_check(Conf, Ap, Aq) then
13:       M ← ExploreAMatch(Ap, Aq, M)
14:     else
15:       /* buffer uncertain matches to revise later */
16:       B ← {Ap, Aq}
17:     end if
18:     Remove {Ap, Aq} from Xpq
19:   end while
20:   Resolve uncertain matches in buffer B
21:   /* 3. Matching Growth for rare attributes */
22:   Update STF and term weights
23:   Create a set of finer clusters Ci according to Mi
24:   Use 1NN + co-location check to assign rare attributes to their closest frequent attribute
25:   Cluster remaining attributes by HAC and add them to C
26: end

3.1 Aggregation

The aggregation module groups together similar elements by stemming terms [2] and removing stop words (e.g., "the", "an", "in"; we use the stop-word list at http://members.unine.ch/jacques.savoy/clef/englishST.txt). For example, the labels "Select a make" and
"Select makes" are aggregated as a new attribute "select
make"; children and child are aggregated as child. Note
that, unlike DCM [20] and HSM [35], we do not need to detect and remove domain specific stop words (e.g., "car",
"vehicle", "auto", "books") or terms that are generic
and occur frequently in multiple domains (e.g., select, choose,
find, enter, please). Both DCM and HSM require manual pre-processing to simplify the labels and remove these
irrelevant terms. For instance, Search for book titles is
simplified to title, all variations of “make” e.g., select a
car make and vehicle make are simplified to make. However, automatically simplifying these labels is challenging,
and simplification errors can be propagated and result in
incorrect matches. As our experiments show, PruSM is effective even without label simplification. For example, it is
able to correctly find the match year(44) ∼ select year(15) ∼ year rang(16), which contains the stop words select and range.
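A minimal sketch of the label normalization performed during Aggregation; the crude suffix-stripping stemmer below stands in for a real stemmer (e.g., Porter [2]), and the tiny stop-word list is illustrative only:

```python
import re

STOP_WORDS = {"a", "an", "the", "in", "of"}  # generic stop words only;
# domain-specific terms ("car", "vehicle") are deliberately NOT removed.

def crude_stem(token: str) -> str:
    """Toy stand-in for a real stemmer: strips a couple of plural suffixes."""
    for suffix in ("ren", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize_label(label: str) -> str:
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", label.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOP_WORDS)

# "Select a make" and "Select makes" aggregate to the same attribute,
# and "children" is aggregated with "child".
```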
An attribute is considered frequent if it occurs above a
frequency threshold Tc (i.e., the number of occurrences of
the attribute over the total number of schemas). Note that whereas we set Tc = 5%, DCM and HSM set Tc to 20% [20, 35].
3.2 Matching Discovery (MD)
As part of our prudent strategy, only frequent attributes
participate in the Matching Discovery. We consider frequent attributes first because they contain less noise and
contribute to finding correct matches early (high precision),
thus avoiding error propagation. For each pair in
the frequent attribute list, we compute label and value similarity (Equations 1 and 6), and correlation matching and
grouping scores (Equations 7 and 8).
As shown in Example 1, it is not easy to determine whether
two attributes correspond to the same concept. Weak matchers are less reliable and can negatively impact the overall
accuracy. Thus, strong matchers should take precedence,
while weak matchers are used later in the process, possibly
combined with additional evidence. For example, correlation can be effective if we prudently combine it with additional evidence, e.g., a strong label similarity or value similarity.
In Matching Discovery, we iteratively find the most confident match by combining different information sources using a prudent check, i.e., a configuration Conf used to validate a match. Conf consists of a set of constraints and thresholds over the correlation score, label similarity, and value similarity, and is defined as:

X(Ap, Aq) > T_matching_score AND [dsim(Ap, Aq) > T_value_sim OR lsim(Ap, Aq) > T_label_sim]
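The check itself is a simple predicate; in the sketch below the threshold values are placeholders (the actual thresholds are learned, as discussed in Section 3.4):

```python
T_MATCHING = 0.5    # placeholder thresholds; PruSM learns these from a
T_VALUE_SIM = 0.6   # small set of highly correlated pairs
T_LABEL_SIM = 0.6

def prudent_check(x: float, dsim: float, lsim: float) -> bool:
    """High correlation alone is not enough: it must be backed by similar
    domain values OR similar labels."""
    return x > T_MATCHING and (dsim > T_VALUE_SIM or lsim > T_LABEL_SIM)

# High correlation + similar domain values -> accepted.
# High correlation with no supporting evidence -> buffered as uncertain.
```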
The intuition behind this prudent check is that although
attributes Ap and Aq may have a high correlation score, they
do not constitute a good match unless additional evidence
is available. Only attribute pairs passing the prudent check
are considered by ExploreAMatch (Algorithm 1, line 13),
where they will be checked and integrated with the existing
matches. If the attribute pair is an uncertain match (high
correlation but one or both of the attributes do not have
similar domain values), it is added to the buffer B to be
revisited later (line 16).
ExploreAMatch (Algorithm 2) is adapted from [35]. Instead of selecting the attribute pair with the highest correlation, it selects the most confident pair, i.e., the pair with the highest correlation that passes the prudent check. ExploreAMatch supports complex matches. We denote M = {Mj, j = 1..k} as a set of complex matches, where each Mj comprises a set of grouping attributes G that are matched together: Mj = {Gj1 ∼ Gj2 ∼ ... ∼ Gjw} (each group G can contain one or more grouping attributes). The algorithm proceeds as a sequential floating process, where only the highest pair emerges at each point in time. If attributes Ap and Aq do not appear in M and they are not close to any of the existing matches, a completely new match is created (line 6). If they are close, the attribute pair becomes a new matching component in an existing match Mj (lines 8 and 13), or a grouping component of the existing match is created (line 15) if two attributes that match the same set of other attributes have a high grouping score. Note that the new attribute needs to be negatively correlated with at least one attribute in every group of Mj; otherwise it is discarded.
A nice property of the Matching Discovery is that by iteratively choosing the most confident match, it automatically merges negatively-correlated attributes together and
gradually groups positively-correlated attributes while, at the
same time, it pushes away non-negative-correlated ones. As
time progresses, floating matches become richer and bigger
to form the final complex matchings. The example below
illustrates this process.
Example 3. Initially, the set of matches is null. Suppose we
are considering the Airfare domain, and X(depart, departure
date) has the highest matching score. These attributes will
form a new match Mj = (depart ∼ departure date). The
next highest pair is (depart, return date), however the
matching score X(return date, departure date)=0 and
grouping score Y(return date, departure date) is above
Tg; therefore return date is added as a grouping component with departure date: Mj = (depart ∼ {departure date, return date}). The next highest pair is (return date, return); similarly, Mj = ({depart, return} ∼ {departure date, return date}).

Algorithm 2 ExploreAMatch
1:  Input: a candidate pair (Ap, Aq), set of current matches M, grouping threshold Tg
2:  Output: set of matches M
3:  begin
4:  if neither Ap nor Aq appears in M then
5:    if neither Ap nor Aq is close to any Mj ∈ M then
6:      M ← M + {{Ap} ∼ {Aq}}
7:    else
8:      Mj ← Mj + ({Ap} ∼ {Aq})
9:    end if
10: else if only one of Ap and Aq appears in M then
11:   /* suppose Ap appears and Aq does not */
12:   if for each Ai in Mj, Xqi > 0 then
13:     Mj ← Mj + (∼ {Aq})
14:   else if for each Am ⊂ {Mj \ Gjk}, Xqm > 0 and for each Al ⊂ Gjk, Yql > Tg then
15:     Gjk ← Gjk + {Aq}
16:   end if
17: end if
18: end

Figure 5: No validation can lead to incorrect matchings and consequent errors
As the next example illustrates, systems that fail to perform validation can make incorrect decisions.
Example 4. In Figure 5, let attributes A, D, E and F be correct matchings. Because X(A,B) > X(A,D), HSM [35] will match A with B, and then with C. Because D is not matched with B, attribute A will never be matched with attributes D, E, and F. Thus, if a bad decision is made in an early step, it may not be corrected and can negatively affect the following steps. One concrete example that we encountered when not using validation in the auto domain is the incorrect matching of year and body style. This match eliminates the chance of a match between type and body style, because type and year are negatively correlated. By identifying highly confident matches first, it is possible to avoid a potentially large number of incorrect matches.
Attribute fragmentation happens when attributes co-occur
with different sets of attributes that belong to the same concept. A consequence of fragmentation is that it makes the
grouping and matching scores between attributes lower.
Example 5. Let S be a small set of schemas S={{A, C1 },
{A, C1 , D}, {B1 , C2 }, {B1 }, {B2 , C1 }}. Attributes B1 and
B2 belong to concept B, C1 and C2 belong to concept C.
The matching score of A and B1 is thus lower than the
matching score of A and B; and the grouping score of B1 and C1 is lower than the grouping score of B and C. For example, X(A,B)=1.2 while X(A,B1 )=1; and Y(B,C)=1 while
Y(B1 , C1 )=0.
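The matching scores in Example 5 can be reproduced over the schema set S; x_score (our name) evaluates Equation 7 at either concept or fragment granularity:

```python
def x_score(schemas, a, b):
    """Matching score X over a schema collection, where an 'attribute' is a
    set of label fragments naming one concept (Equation 7, with co-occurrence
    counts computed at the set level)."""
    c_a = sum(1 for s in schemas if a & s)                # C_p
    c_b = sum(1 for s in schemas if b & s)                # C_q
    c_ab = sum(1 for s in schemas if a & s and b & s)     # C_pq
    return (c_a - c_ab) * (c_b - c_ab) / (c_a + c_b)

S = [{"A", "C1"}, {"A", "C1", "D"}, {"B1", "C2"}, {"B1"}, {"B2", "C1"}]
x_concept = x_score(S, {"A"}, {"B1", "B2"})   # X(A, B)  = (2*3)/5 = 1.2
x_fragment = x_score(S, {"A"}, {"B1"})        # X(A, B1) = (2*2)/4 = 1.0
```

As the example states, fragmentation lowers the fragment-level score X(A, B1) relative to the concept-level score X(A, B).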
Because of validation, we can afford to use a low matching-score threshold and yet obtain high accuracy. To overcome the shortcoming of low grouping scores, we apply a group-reinforcement process to propagate the correlation and reinforce the grouping chance of two attributes even when their grouping score is not sufficiently high.
Example 6. Suppose we have the match {make, model}
∼ {select make}. Because of fragmentation, the grouping score Y(select make, select model) is low. However,
the matching scores X(make, select make) and X(model,
select model) are high, and so is Y(make, model). Therefore, through reinforcement, select make and select model
can be grouped together.
The last step in MD is to resolve the buffer B which contains uncertain but potential matches (Algorithm 1, line 20).
Here, we apply ExploreAMatch (Algorithm 2) again to resolve each uncertain pair. By buffering these pairs and revisiting them later, we take advantage of new constraints. It is worthwhile to perform this relaxation because some attributes do not have any domain values (e.g., departure city and arrival city in the Airfare domain) and some domain values are too coarse, leading to a low similarity between them (e.g., category and subject in the Book domain). However, this relaxation could affect the matching accuracy, so we are careful to choose only highly correlated pairs (80% of the maximum matching score).
Example 7. Let (A1, A2) be an uncertain match in buffer B. After finding the certain match A1 ∼ A3 ∼ A4, we can take advantage of the additional constraints to reject (A1, A2), because there is no connection between A3 and A2.
3.3 Matching Growth (MG)
The second phase of PruSM is Matching Growth, which finds additional matches for rare attributes. Initially, based on the certain matches found in the MD phase, we update the STF scores (Section 2.2) to improve the term weights.
Consider the following example.
Example 8. In the Matching Discovery step, we can find
the match: year(44) ∼ select year(15) ∼ year rang(16)
which contains two stop words select and range. Using this
match, we can update the weight of token year and downgrade the weight of the tokens select and range. Identifying anchor terms like "year", "make", "model", etc.
is very helpful for the later phase of Matching Growth, where
greater variability is present in attribute labels.
Fragmentation also affects the quality of correlation signal and leads to incorrect ordering of complex matches, as
shown in Example 9. We can use the attribute proximity
information to break the tie and find finer matches for m:n
complex matches and create a set of clusters corresponding
to the finer matching set.
Example 9. Suppose the correlation score X(departure date, return on) > X(departure date, leave on). In
this case, domain values do not help because they are similar. Proximity information or label similarity can help to
break the tie: {departure date, return date} ∼ {return,
depart} ∼ {return on, leave on} will become {departure
date, return date} ∼ {depart, return} ∼ {leave on,
return on}.
Next, we use 1-Nearest-Neighbor Clustering (1NN) to assign rare attributes to their most similar frequent attributes.
Moreover, using the form context by looking at the list of resolved (matched) and unresolved (unmatched) elements of
each form, we ensure that two elements in the same form
cannot be in the same cluster (co-location check). Besides,
data type compatibility is used to prevent incorrect matches.
Example 10 illustrates how additional matches are derived
for infrequent attributes. Note that improving the weight of
important terms like "price","model" is very important to
correctly identify these matches. Finally, for the remaining
unmatched attributes, we run the HAC algorithm (Hierarchical
Agglomerative Clustering) to group similar attributes into
new clusters and add these new clusters to the current set.
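The growth step can be sketched as a 1NN assignment with a co-location check; grow_matches, the token-overlap similarity, and the acceptance threshold are all ours, chosen for illustration:

```python
from typing import Callable, List, Set

def grow_matches(rare: List[str],
                 clusters: List[Set[str]],
                 forms: List[Set[str]],
                 sim: Callable[[str, str], float],
                 t_assign: float = 0.3) -> List[Set[str]]:
    """Assign each rare attribute to its most similar frequent cluster (1NN),
    unless it co-occurs in some form with a member of that cluster
    (co-location check: two elements of one form cannot match)."""
    for r in rare:
        best = max(clusters, key=lambda c: max(sim(r, a) for a in c))
        score = max(sim(r, a) for a in best)
        colocated = any(r in f and f & best for f in forms)
        if score >= t_assign and not colocated:
            best.add(r)
    return clusters

def token_sim(a: str, b: str) -> float:
    """Jaccard overlap on label tokens (a stand-in attribute similarity)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

clusters = grow_matches(rare=["price up to"],
                        clusters=[{"price"}, {"year"}],
                        forms=[{"price up to", "year"}],
                        sim=token_sim)
# "price up to" joins the "price" cluster (similar label, no co-location).
```

Data-type compatibility, omitted here, would add one more rejection test before the assignment.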
Example 10. Additional matches in the Auto domain derived by 1NN include: price up to ∼ price, price range
in euro ∼ price range, model example mustang ∼ model,
approximate mileage ∼ mileage, color of corvett ∼ color,
if so, what is your trade year ∼ year. HAC derives
also adds new clusters, for example: {within one month,
within one week, within hour}, {dealer list, dealer
name, omit dealer list}.
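The growth step described above can be sketched as follows; the similarity function, the threshold, and the helper names are illustrative assumptions, not PruSM's exact implementation:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two labels."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def assign_rare(rare_attrs, clusters, form_of, type_of, sim, threshold=0.25):
    """1NN growth: attach each rare attribute to the cluster of its most
    similar frequent attribute, subject to the co-location check (no two
    elements from the same form in one cluster) and data-type compatibility."""
    for rare in rare_attrs:
        best, best_score = None, threshold
        for cluster in clusters:
            # co-location check: skip clusters that already contain an
            # element coming from the same form as the rare attribute
            if any(form_of(a) == form_of(rare) for a in cluster):
                continue
            for freq in cluster:
                if type_of(rare) != type_of(freq):
                    continue  # incompatible data types
                score = sim(rare, freq)
                if score > best_score:
                    best, best_score = cluster, score
        if best is not None:
            best.append(rare)
    return clusters

# toy run: "price up to" (a rare label) joins the {price, price rang} cluster
clusters = [["price", "price rang"], ["model"]]
forms = {"price": "f1", "price rang": "f2", "model": "f3", "price up to": "f9"}
assign_rare(["price up to"], clusters, forms.get, lambda a: "text", jaccard)
print(clusters[0])  # -> ['price', 'price rang', 'price up to']
```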
3.4 Discussion
Comparison against other clustering methods. We
could create homogeneous clusters by applying correlation clustering algorithms [3] to produce an optimal partition of attributes that minimizes the number of disagreements inside
clusters and maximizes disagreements between clusters. Although correlation clustering does not require specifying the number of clusters k, the problem is NP-complete [3].
As our experiments show, our ExploreAMatch algorithm is
both fast and effective for the data we considered.
With clustering methods such as K-means [22] and HAC [14],
the number of clusters or the stopping threshold must be decided beforehand. PruSM, in contrast, is a data-driven process
and naturally reveals the shape of the clusters
according to the attribute distribution and their internal interactions.
Noise and rare attributes. Noise can negatively impact
the correlation signal [20], while rare attributes can artificially exacerbate it [35]. Both lead
to cascading errors that are irreversible and can be propagated and magnified. PruSM is designed to
work with a low correlation signal and still find accurate
matches: strong matchers and strong matches take precedence, finding the most confident matches first and helping
prevent errors during the initial steps.
To find accurate matches, the validating configuration
need not be perfect. It suffices to favor relatively
high thresholds (strict behavior) to obtain high-precision
matches first, which can be extended later. The label and value
similarity thresholds in the configuration are learned from a
small set of highly correlated pairs that have a clear similarity
signal [33]. By using labels and values to validate a match,
as we discuss later in the experimental evaluation, PruSM
is robust to a wide variation of correlation thresholds.
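For illustration, the validation step can be sketched as follows; the function names, toy similarity measures, and threshold values are our assumptions, not PruSM's actual configuration:

```python
def validate(pair, label_sim, value_sim, label_t=0.6, value_t=0.5):
    """Prudent validation: accept a correlated candidate pair only if
    independent evidence (label or domain-value similarity) clears a
    strict threshold; otherwise defer it instead of rejecting it."""
    a, b = pair
    if label_sim(a, b) >= label_t or value_sim(a, b) >= value_t:
        return "accept"
    return "defer"  # revisited later, during Matching Growth

# toy similarity measures over a tiny domain-value catalog
jac = lambda x, y: len(x & y) / len(x | y) if x | y else 0.0
values = {"year": {"2008", "2009"}, "select year": {"2008", "2009"},
          "make": {"ford", "honda"}}
label_sim = lambda a, b: jac(set(a.split()), set(b.split()))
value_sim = lambda a, b: jac(values.get(a, set()), values.get(b, set()))

print(validate(("year", "select year"), label_sim, value_sim))  # accept
print(validate(("year", "make"), label_sim, value_sim))         # defer
```

Note that uncertain pairs are deferred rather than discarded, which is what allows a low correlation threshold without sacrificing precision.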
4. EXPERIMENTAL EVALUATION
We empirically evaluate the performance of PruSM over a
large number of Web forms in multiple domains. We run
the experiments on two datasets: a small, clean, manually gathered dataset, and a large, noisy, automatically gathered set of forms. Our experiments show that
PruSM is superior to other schema-matching approaches on
both datasets, even without any manual pre-processing of the data. We also study how the different
PruSM components (i.e., Matching Discovery and Matching
Growth) contribute to the overall accuracy. Last, but not
least, we compare our approach with other holistic match-
Table 1: Comparing different approaches

Target:
- DCM, HSM: find synonym attributes.
- PSM: find all attribute correspondences.

Pre-processing required:
- DCM, HSM: yes.
- PSM: no.

Information used:
- DCM, HSM: correlation.
- PSM: labels, domain values, and correlation.

Rare attributes:
- DCM, HSM: performance degrades when working with low-frequency attributes; only attributes that appear frequently can be matched.
- PSM: robust to rare attributes; can identify matches of infrequent attributes.

Strategy:
- DCM: find all possible combinations of positively and negatively correlated sets, then combine and rank them to choose the best one.
- HSM: iteratively find the highest-correlated match first and integrate it with the set of existing matches.
- PSM: combine multiple sources of information to iteratively find the most confident match first, then grow from these certain matches to extend the result.

Limitations:
- DCM, HSM: performance decreases when working with rare attributes or non-preprocessed data.
- DCM: huge search space with many possible combinations of positively and negatively correlated sets.
- HSM: the correlation-centric strategy leads to incorrect matches and consequent errors; impacted by attribute fragmentation.
(a) TEL8

Domain     Number of forms  Number of elements
Airfare    49               334
Auto       98               533
Book       68               403
Movies     74               393
Music      79               420
Hotel      46               281
CarRental  25               191
Job        49               267

(b) WebDB

Domain   Number of forms  Sample size  Number of elements
Auto     1254             136          811
Book     747              165          1285
Airfare  678              70           766

Table 2: Database domains
Figure 6: WebDB label histogram
ing approaches. Since, to the best of our knowledge, HSM has
the best performance among existing form-schema matching
approaches, we implemented it and use it as our baseline.

4.1 Experimental Setup

Datasets. We conduct the experiments on the TEL8 and WebDB
datasets. Table 2 summarizes the characteristics
of these two datasets. The TEL8 dataset
(http://metaquerier.cs.uiuc.edu/repository) contains manually extracted schemas for 447 deep-Web sources from 8
domains. The WebDB dataset (http://www.deeppeep.org) contains 2,884 Web forms.
These forms were harvested using a focused crawler [5, 6]
and automatically classified into different domains. We use
LabelEx [29] to extract all the mappings between labels and
elements in these forms. Note that the data in this dataset is
representative and reflects the characteristics and distribution of form labels, thus enabling us to evaluate the robustness
of our approach. Figure 6 shows a histogram of the top 35
attribute labels in this dataset; note the wide
variability in the frequency distribution of these labels. In
particular, there is a large number of rare attributes, which
tend to confuse statistical matching approaches.

Effectiveness measure. To evaluate PruSM's performance,
we use precision and recall. Precision can be seen as a measure of fidelity, whereas recall is a measure of completeness.
The F-measure is the harmonic mean of precision and recall; a high F-measure means that both recall and precision
are high, and a perfect F-measure is 1. We also implemented a GUI tool to support the manual creation of the gold
data for both the TEL8 and WebDB datasets. Since there are
many clusters, we measure PruSM's performance as the
average precision and recall weighted by the size of each
cluster [19]. The average precision, recall, and F-measure
are defined as:

Precision_avg = Σ_{C_i} (|C_i| / Σ_j |C_j|) × P_{C_i}    (9)

Recall_avg = Σ_{C_i} (|C_i| / Σ_j |C_j|) × R_{C_i}    (10)

F-measure_avg = (2 × Precision × Recall) / (Precision + Recall)    (11)

4.2 Evaluating the Effectiveness of PruSM
In this section, we compare the effectiveness of PruSM
against HSM on both datasets, TEL8 and WebDB. Although
PruSM outperforms HSM on both, the improvement is much larger on the WebDB dataset, e.g., 73% in the Book
domain. This shows that our approach is more robust to
variability and noise in the data than HSM.
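Throughout this section, accuracy is reported as cluster-size-weighted averages. A minimal sketch of the computation (per-cluster precision and recall values are assumed given; the function name is ours):

```python
def weighted_metrics(clusters):
    """clusters: list of (size, precision, recall), one tuple per
    discovered cluster. Returns the size-weighted average precision
    and recall, plus the harmonic-mean F-measure of the two averages."""
    total = sum(size for size, _, _ in clusters)
    p = sum(size * prec for size, prec, _ in clusters) / total
    r = sum(size * rec for size, _, rec in clusters) / total
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# e.g., two clusters of sizes 3 and 1:
p, r, f = weighted_metrics([(3, 1.0, 0.75), (1, 0.5, 1.0)])
print(p, r, f)
```

Weighting by cluster size keeps a few large, well-matched clusters from being drowned out by many tiny ones.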
Effectiveness of PruSM in the TEL8 dataset. Figure 7
summarizes the accuracy of HSM and PruSM on the TEL8 dataset.
Overall, both the precision and recall of PruSM are much higher
than those of HSM in all five domains: PruSM outperforms HSM
by 10% in the Book domain and by up to 40% in the Auto domain. We
should note that HSM performance is low because we do
not apply the Syntactic Merging process. HSM has the lowest
accuracy in the Car Rental domain (F-measure of 0.61) due to
the sparseness of labels, while it obtains rather high accuracy in other, cleaner domains such as Book (F-measure of
0.86). PruSM gains more in sparse
domains (e.g., 40% in Auto and 36% in Car Rental)
and less in clean domains (e.g., 10% in Book
and 12% in Airfare).
Domain Matchings
Books
Author(54) ∼ [Last Name(6), First Name(6)]
Subject(17) ∼ Category(7)
Format(12) ∼ Binding(6)
Airfare [return(12), depart(11)] ∼ [departure date(21), return
date(24)]
[Adult(31)Children(25)Infant(13)Senior(10)]∼Passenger(8)
[from(23) to(22)] ∼ [departure city(5) arrival city(5)]∼
[leave from(5) go to(4)]
cabin(5) ∼ class(11)
Autos
year rang(13) ∼ year(34)
make(72) ∼ manufacture(9)∼vehicle make(5)∼vehicle(4)
vehicle model(5) ∼ model(63)
[max price(9),max mileage(4)]∼price rang(29)∼price(15)
type(6) ∼ body type(4)
vehicle body style(3) ∼ body style(4)
exterior color(3) ∼ color(6)
number of door(3) ∼ cylinder(6)
location(5) ∼ state(7)
Movies category(13) ∼ genre(18)
cast crew(6) ∼ director(38) star(16) ∼ keyword(19)
Music
title(49) ∼ album(16)
catalog #(10) ∼ song(31)
artist(54) ∼ conductor(6) composer(8)
category(9) ∼ genre(5)
movie title(3) ∼ movie(3)
perform(7) ∼ artist perform(3)
Hotel
[check in date(16), check out date(12)] ∼ [check in(9), check
out(7)] ∼ arrival(4)
number of room(3) ∼ room(11)
guest(3) ∼ adult(19)
Car
pick up city(6) ∼ pick up location(9) ∼ where to pick up(3)
rental
car type(6) ∼ car class(3)
[pick up date(14), pick up time(12), drop off date(11), drop
off time(9)] ∼ [drop off(4), pick up(4)]
drop off city(5) ∼ drop off location(5)
Job
location(18) ∼ state(22)
category(6) ∼ job type(12) ∼ industry(6)
job title(4) ∼ title(4)
Table 3: Matching results of PruSM MD in TEL8 dataset
Figure 7: PruSM performance in the TEL8 dataset
Because PruSM does not perform syntactic merging and
uses a low frequency threshold, the correlation signal might
be weaker. Nevertheless, as shown in Table 3, the PruSM Matching Discovery identifies more than 80% of all the good matches.
Besides complex matches, e.g., Passenger ∼ {Children, Adult,
Senior}, PruSM finds syntactic matches that HSM
and DCM did not find, e.g., make ∼ vehicle make ∼ vehicle,
vehicle model ∼ model, and job title ∼ title.
As reported in [35], the synonym accuracy of DCM
and HSM varies with the attribute frequency threshold. After syntactic merging, they consider only attributes occurring above a frequency threshold Tc = 20%. When
more rare attributes are taken into consideration
(Tc = 5%), the occurrence patterns of infrequent attributes are
less pronounced, and the accuracy decreases significantly:
the average HSM precision decreased from
100% to 70%, while the average DCM precision decreased from
95% to 48% and its recall from 98% to 58%. Because the prudent matching process avoids many
incorrect matches involving rare attributes, and the
errors they would propagate, PruSM works robustly and accurately
even with a low attribute frequency threshold (Tc = 5%).

Figure 8: PruSM performance in the WebDB dataset
Another problem is that both DCM and HSM require
pre-processing. As reported in [20], DCM performance decreased seriously when syntactic merging was not conducted; for example, in the Hotel domain, precision went
down from 86% to 47% and recall decreased from 87% to
46%. Again, PruSM does not encounter this problem because it does not require syntactic merging. By considering only frequent attributes first, the certain matches
found in the Matching Discovery step are used to resolve rare
attribute variants. PruSM thus requires no pre-processing
and is still little affected by rare attributes.
Effectiveness of PruSM in the WebDB dataset. The WebDB
dataset is a large, heterogeneous, automatically crawled
dataset. A key benefit of PruSM is that we need neither to
manually clean the data nor to perform the syntactic merging process (which is extensive work for a large real dataset). All
we need is a simple aggregation step (word
stemming and stop-word removal) to group similar elements together. It is worth noting that PruSM works
effectively with automatically crawled data despite its heterogeneous vocabulary (Figure 6) and the incorrect label-element
mappings introduced by the label extraction process.
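The aggregation step can be as simple as the following sketch; the stop-word list and the crude suffix stemmer are illustrative, not the exact implementation:

```python
STOP = {"a", "an", "the", "of", "your", "please", "select", "choose"}

def normalize(label):
    """Lowercase, drop stop words, and apply naive suffix stripping so
    that variants such as "Select Make:" and "make" aggregate together."""
    tokens = []
    for t in label.lower().replace(":", " ").split():
        if t in STOP:
            continue
        for suffix in ("ing", "es", "s"):  # crude stemmer
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        tokens.append(t)
    return " ".join(tokens)

print(normalize("Select Make:"))   # -> make
print(normalize("price ranges"))   # -> price rang
```

Stemmed outputs like "price rang" are exactly the normalized label forms that appear in the match tables above.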
Figure 8 shows the overall performance of PruSM on the WebDB
dataset. Overall, PruSM leads to a substantial performance
improvement in all three domains (42% in Auto, 38% in Airfare, 57% in Book), with both precision and recall
higher than the baseline. PruSM performance is notably high in Auto (90%) and Airfare (88%), which are
well-defined domains. Precision is higher because of the prudent matching process; recall is higher because MG helps
find additional matches for infrequent attributes. Recall
for MG in the Airfare domain does not increase much (10%,
compared to 25% in Auto) because there are few infrequent
attributes in this domain, as shown by the low tail in Figure 6. The Book domain is more heterogeneous and
thus yields a lower result. This can be explained by its
lowest Simpson index, as in [29], and by the many complex
labels in the Book domain, such as “word in title abstract”, “title to
display per page”, “journal title abbreviation”, “title series
keyword”, or “title author ISBN keyword”, which are easily
confused with terms like “title”, “abstract”, “journal”, “series”, or “keyword”.
Taking a closer look at the Auto domain: the precision and recall of HSM are 65% and 63%, whereas in the PruSM Matching
Discovery they are 98% and 69%. Although the recall is still
low, the matching set includes all the basic seeds, as shown in
Table 4. Based on these good seeds, the PruSM MG significantly improves recall from 69% to 87% (precision slightly
decreases), raising the final F-measure to 90%. Overall,
PruSM yields a 41% gain in F-measure in the Auto domain.
Note that the performance of both DCM and HSM is affected
by schema-extraction errors, which reduce the negative
correlation signal, thereby affecting the ranking and leading to cascading errors; their performance decreased by up to 30%
due to the extraction errors reported in [20]. PruSM
is not affected much because it can tolerate low grouping
and matching scores and still obtain good results.

select make model(9) ∼ make(63) model(59) ∼ select model(8) select make(7)
year(44) ∼ select year(15) ∼ year rang(16) ∼ select rang of model year(6)
price(14) ∼ price rang(23) ∼ price rang is(6)
zip(13) ∼ zip code(7)
valid email(6) ∼ email(8)
type(5) ∼ body style(6)
search within(7) ∼ distance(4)
mile(12) ∼ mileage(4)

Table 4: Basic seeds

Figure 9: Contribution of different components in multiple domains

Figure 10: Different combination strategies in MD
4.3 Evaluating the Effectiveness of the PruSM Components
Figure 9 shows the contribution of the different
components of PruSM. With prudent matching, F1 gains
41% in Auto, 33% in Airfare, and 18% in Book. With 1NN,
F1 gains 11% in Auto and 6% in Airfare and Book. Clustering the remainders increases recall (F1 gains 9% in Book, but
not significantly in Auto and Airfare). Update STF helps
increase F1 by about 3.5% in all three domains. Reinforce
and Buffering help F1 gain more than 2% in Airfare and
Book and 1% in Auto. With TieBreak, F1 gains 20% in
Airfare. As we observed, prudent matching significantly improves precision, while 1NN and HAC improve recall (and
slightly decrease precision). Update STF and TieBreak
help improve precision, while Reinforce and Buffering help
increase recall. In all three domains, prudent matching
significantly improves precision; 1NN plays an important
role in Auto, as do HAC in Book and TieBreak in Airfare.
To evaluate the effect of syntactic merging in PruSM, we
manually selected and removed domain stop words and specific stop words; the overall performance increased by about 2%,
which indicates that those stop words do not affect PruSM
very much.
As mentioned earlier, validation is very important: without it, precision is very low in the MD phase and,
consequently, in the MG phase. As shown in Figure 9, precision degrades by up to
45% without validation, which is why we need high precision in the Matching Discovery phase.

Figure 11: MD performance with a wide variation of the Correlation Threshold: (a) without validation; (b) with validation

Figure 10 illustrates the performance of the different combination strategies.
As we observed, prudent matching has the highest precision
(compared with single matchers or a linear combination of
different matchers) and contains all the good seeds for the
next MG step.
The next experiment (Figure 11) shows that, with
prudent matching, PruSM is robust to a wide variation of the
correlation threshold. Without validation, precision is low
when the Correlation Threshold is low, which leads to lower
precision in MG and lower F-measures. Figure 11(a) shows that, without validation, the correlation threshold must be
very high to obtain a clear signal and acceptable
precision; however, recall is then very low (30%) and we do not
obtain all the good seeds. This matches the observation
in [15] that a higher similarity measure indicates
a more precise mapping. On the other hand, with validation, we obtain high precision (the most important goal
of the MD phase) even with a very low Correlation Threshold,
as shown in Figure 11(b), and precision remains high across
a wide range of Correlation Thresholds.
5. RELATED WORK
Although this problem is related to database schema matching [18, 11, 23, 26, 34], there are fundamental differences [24].
First and foremost, whereas database schemas include information about attribute names, data types, and constraints (keys,
foreign keys), for Web form schemas only the association between a label and an element is known, and this schema
may not correspond to the schema of the data hidden behind the form. On the other hand, because Web forms are
designed for human consumption, there are not as many
acronyms as in database schemas, and the vocabulary used for
labels must be descriptive so that users can easily understand their semantics. Another important difference is that
whereas pair-wise matching approaches have been used to
match database schemas, these do not scale well to the large
number of form schemas in a given hidden-Web domain.
Three distinct classes of approaches have been proposed
for form-schema matching: clustering [38, 32, 21], instance-based [36, 37], and holistic [19, 20, 35,
25]. Clustering approaches [38, 21, 32] need to define a precise similarity function that combines different similarity
components between any two form elements. While Wu et
al. [38] used pre-defined coefficients, Pei et al. [32] leveraged
the distribution of Domain Clusters (DC) and Syntactic Clusters (SC) to automatically determine the weights of linguistic
and domain similarity. He et al. [21] used pre-defined coefficients and a hierarchical combination framework that leverages high-quality matchers first and then predicts matches.
A more principled approach like LSD [12] requires a mediated schema and human users who manually construct the
semantic mappings in the training examples used to learn these
weights. By using linguistic and domain similarity to validate correlations, PruSM does not need to define a similarity
function and associated weights. One drawback of some of
the clustering-based approaches is that they use WordNet
[21] or leverage domain values [32] to find synonyms. WordNet is not sufficient to find domain-specific synonyms,
and by leveraging only domain values, they cannot find synonyms for attributes that have sparse domain values or no
associated domain values at all. Furthermore, these
approaches [38, 21] work with (and have only been tested on)
a very small number of sources, require data pre-processing,
and rely on high-quality (noise-free) data as input. Except
for [38], all of these approaches are limited to 1:1 mappings.
Wu et al. [38] support complex matching by modeling form
interfaces as trees. These trees are first used to identify complex mappings and isolate possible composite attributes, and
then HAC is applied to cluster the remaining attributes and
find all 1:1 mappings, which are combined with the initial
complex mappings to obtain additional complex mappings (using the “bridging effect”). Using an ordered tree to identify
1:m mappings for each form interface is expensive and does
not scale to a large number of sources. The effectiveness
of this approach is highly dependent on the quality of this
structure, which, for the experiments discussed in the paper,
was manually constructed. Besides, users are required to
reconcile a potentially large number of uncertain mappings
so that similarity thresholds can be learned.
Pei et al. [32] exploited two kinds of attribute clusters,
SC and DC, and aim to optimize SC using DC (certain
attributes in DC are used to resolve uncertain SC attributes). To
resolve the uncertainties when merging SC and DC, they used
a criterion function that combines syntactic similarity and
domain similarity, with coefficients automatically determined from the distribution of SC and DC: the more
elements in a DC, the higher the coefficient of linguistic similarity in that cluster. However, the distribution of SC and
DC clusters varies across domains, and this
approach is less effective when the domain values are
scarce or unavailable. To minimize the effect of noise, they
re-sample clusters multiple times to filter unstable
attributes (outliers), whereas we attend to the frequent
attributes first to avoid propagating errors. We note that
they only handle simple 1:1 matchings. The two-step clustering approach employed by the WISE-integrator [21] shares
a basic idea with PruSM, since it tries to derive confident matches first. However, WISE relies on the quality
of the input data to linearly combine different component similarities, and its experimental dataset is small and manually
pre-processed. Besides, the WISE-integrator only handles
1:1 matchings and uses WordNet to find synonyms.
Holistic approaches benefit from considering a large number of schemas simultaneously [19, 20, 35, 25]. A limitation
shared by these approaches is that they require clean data,
and their performance decreases significantly when the input data is noisy. He and Chang [19] proposed MGS (Model,
Generation, Selection), which hypothesizes that labels are generated by a hidden generative model containing
a few schema concepts, each composed of
synonym attributes with different probabilities. MGS exhaustively generates all possible models and uses statistical
hypothesis testing to select a good one. MGS evaluates and
chooses the best global schema model, whereas we explore one
match at a time.
Among the holistic approaches, the most closely related to
PruSM are DCM (Dual Correlation Mining) [20] and HSM
(Holistic Schema Matching) [35], which use attribute occurrence patterns to find complex matches by mining positive and negative correlations. DCM proposes the H-measure
and exploits the “apriori” property (i.e., downward closure)
to first discover all possible positively correlated groups, then
adds these groups to the original schema set and
mines all possible negatively correlated groups. DCM has a
huge search space because there are many possible combinations of positively and negatively correlated groups. Finally,
it selects the matches with the highest negative correlation score and removes matches that are inconsistent with
the chosen ones. Su et al. [35] proposed a slightly different correlation measure (matching and grouping scores) and
a greedy algorithm that discovers synonym matchings between
pairs of attributes by iteratively choosing the highest-scoring matching pair and using the grouping score to decide whether two
attributes that match the same set of other attributes
belong to the same group. Although HSM performs better
than DCM, it still suffers from the same
limitation: it requires clean data. Besides, its score-centric
greedy algorithm can produce incorrect matches involving rare
attributes, with consequent errors. Instead of choosing the pairs
with the highest matching score, we modify the HSM
greedy framework and exploit additional evidence to prudently choose the most confident matches first and integrate them with the existing matches. This helps minimize
irreversible incorrect matches and the errors they would cause.
Moreover, PruSM can tolerate a low matching score (and still
find accurate matches) by using validation, and it overcomes
the shortcoming of a low grouping score by propagating the
matching score, strengthening the grouping chances of two attributes even when their grouping score alone is not high enough.
We note that DCM [20] also attempts to deal with noise
stemming from incorrect extraction. It does so by collecting multiple sample sets of forms and applying the DCM matchers
several times over each sample. The intuition is that when
there are few mistakes in the data, only a minority of trials
will be affected; thus, the matches shared by the
majority of trials are considered correct. However, sampling
can exacerbate the effect of rare attributes and degrade
DCM's performance. Moreover, DCM has a huge search space,
and it is expensive to do the sampling and run DCM multiple times. By leveraging the frequent attributes and
their significant data, PruSM first finds certain matches and
uses them to obtain additional matches for rare attributes.
The intuition is that frequent attributes are authentic: they can
dominate noise, rare attributes, and infrequent mistakes, and
they provide strong clues for finding confident matches. Starting with
good seeds also helps minimize error propagation: we
can leverage these high-quality matches to extend later
results. Another issue with DCM and HSM is that their target is to find synonym attributes, which is a sub-problem of
finding all correspondences among elements, whether
or not they are synonyms. For example, besides the synonyms
among make, brand, and manufacture, PruSM also finds
many variations of auto make, such as car make, vehicle make,
select make, choose a make, etc. The differences between
DCM, HSM, and PruSM are summarized in Table 1.
Wang et al. [36] proposed an instance-based approach that
consists of an ensemble of three layers of schemas: a manually defined global schema, form-interface schemas, and result
schemas (extracted by a wrapper). They rely on probing to exhaustively submit all attribute values of the sample records
to each input element of the search interface and count the
re-appearance of each query value in the result pages. This
approach is costly and requires substantial network bandwidth. Besides, the global schema and the Web instances are
expensive to create. WebIQ [37] retrieves instances from
the Web by using hyponymy patterns, e.g., “... such as
NP1, NP2”, and by exploiting the sentence-completion feature
of search engines. These instances are used to support the
schema-matching task. A potential problem of WebIQ is
that those instances may be biased toward popular instances
or noisy Web data. In our approach, besides available data instances of domain values, we also exploit
contextual information to support matching.
6. CONCLUSIONS AND FUTURE WORK
In this paper, we propose PruSM, a scalable approach to
form-schema matching which prudently combines similarity
information from multiple sources to accurately determine
matches among elements from different forms. We have conducted a detailed experimental evaluation which shows that
PruSM is effective and robust to noise in the data, and it also
outperforms previous form-schema matching approaches.
There are several directions we intend to pursue in future work. By providing PruSM with a better starting
set of labels, its effectiveness can be improved. We would
therefore like to experiment with alternative procedures to automatically simplify labels and perform syntactic merging, and
study their trade-offs. An important challenge in devising such procedures is the possibility of introducing errors.
Other potential sources of improvement for PruSM include
refining the term-weighting scheme; mapping and combining different scores into a unique score [13]; and exploiting additional information, such as element proximity, DOM structure, name, and element type, to find the most confident match.
Last, but not least, using the match information derived by
PruSM, we plan to investigate techniques for automatically
filling out forms and retrieving the contents hidden behind
them.
7.
REFERENCES
[1] M. Abrol, N. Latarche, U. Mahadevan, J. Mao,
R. Mukherjee, P. Raghavan, M. Tourn, J. Wang, and
G. Zhang. Navigating large-scale semi-structured data
in business portals. In VLDB ’01: Proceedings of the
27th International Conference on Very Large Data
Bases, pages 663–666, San Francisco, CA, USA, 2001.
Morgan Kaufmann Publishers Inc.
[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval. ACM Press/Addison-Wesley,
1999.
[3] N. Bansal, A. Blum, and S. Chawla. Correlation
clustering. In Annual IEEE Symposium on Foundations
of Computer Science (FOCS), 2002.
[4] L. Barbosa and J. Freire. Siphoning hidden-web data
through keyword-based interfaces. In In SBBD, pages
309–321, 2004.
[5] L. Barbosa and J. Freire. Searching for hidden-web
databases. In WebDB, pages 1–6, 2005.
[6] L. Barbosa and J. Freire. An adaptive crawler for
locating hidden-web entry points. In In Proc. of
WWW, pages 441–450, 2007.
[7] L. Barbosa and J. Freire. Combining classifiers to
identify online databases. In In Proc. of WWW, pages
431–440, 2007.
[8] L. Barbosa, J. Freire, and A. Silva. Organizing
hidden-web databases by clustering visible web
documents. In ICDE, pages 621–633, 2007.
[9] H. D. Brunk. An Introduction to Mathematical
Statistics. Blaisdell Pub. Co., New York, 1965.
[10] K. C.-C. Chang, B. He, and Z. Zhang. Toward
Large-Scale Integration: Building a MetaQuerier over
Databases on the Web. In CIDR, pages 44–55, 2005.
[11] R. Dhamankar, Y. Lee, and A. Doan. iMAP:
Discovering complex semantic matches between
database schemas. In ACM SIGMOD, pages 383–394.
ACM Press, 2004.
[12] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling
schemas of disparate data sources: a machine-learning
approach. SIGMOD Rec., 30(2):509–520, 2001.
[13] C. F. Dorneles, C. A. Heuser, V. M. Orengo, A. S.
da Silva, and E. S. de Moura. A strategy for allowing
meaningful and comparable scores in approximate
matching. In Proceedings of CIKM, pages 303–312,
New York, NY, USA, 2007. ACM.
[14] R. Feldman and J. Sanger. The Text Mining
Handbook: Advanced Approaches in Analyzing
Unstructured Data. Cambridge University Press,
December 2006.
[15] A. Gal. Why is schema matching tough and what can
we do about it? SIGMOD Rec., 35(4):2–5, 2006.
[16] Google Base. http://base.google.com/.
[17] L. Gravano, H. García-Molina, and A. Tomasic. GlOSS:
text-source discovery over the internet. ACM Trans.
Database Syst., 24(2):229–264, 1999.
[18] H.-H. Do and E. Rahm. COMA - a system for flexible
combination of schema matching approaches. In
VLDB, pages 610–621, 2002.
[19] B. He and K. C.-C. Chang. Statistical schema
matching across web query interfaces. In SIGMOD
’03: Proceedings of the 2003 ACM SIGMOD
international conference on Management of data,
pages 217–228, New York, NY, USA, 2003. ACM.
[20] B. He and K. C.-C. Chang. Automatic complex
schema matching across web query interfaces: A
correlation mining approach. ACM Trans. Database
Syst., 31(1):346–395, 2006.
[21] H. He and W. Meng. Wise-integrator: An automatic
integrator of web search interfaces for e-commerce. In
VLDB, pages 357–368, 2003.
[22] J. Kogan. Introduction to Clustering Large and
High-Dimensional Data. Cambridge University Press,
New York, NY, USA, 2007.
[23] M. Lenzerini. Data integration: a theoretical
perspective. In PODS ’02: Proceedings of the
twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, pages
233–246, New York, NY, USA, 2002. ACM.
[24] B. Liu. Web Data Mining. Exploring Hyperlinks,
Contents, and Usage Data. Springer, 2007.
[25] J. Madhavan, P. Bernstein, K. Chen, A. Halevy, and
P. Shenoy. Corpus-based schema matching. In In
ICDE, pages 57–68, 2005.
[26] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic
schema matching with cupid. In In The VLDB
Journal, pages 49–58, 2001.
[27] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy,
S. R. Jeffery, D. Ko, and C. Yu. Web-scale data
integration: You can afford to pay as you go. In
CIDR, pages 342–350, 2007.
[28] J. Madhavan, D. Ko, L. Kot, V. Ganapathy,
A. Rasmussen, and A. Y. Halevy. Google’s deep web
crawl. PVLDB, 1(2):1241–1252, 2008.
[29] H. Nguyen, T. Nguyen, and J. Freire. Learning to
extract form labels. Proc. VLDB Endow.,
1(1):684–694, 2008.
[30] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the
right interestingness measure for association patterns.
In KDD, 2002.
[31] A. Ntoulas, P. Zerfos, and J. Cho. Downloading
textual hidden web content through keyword queries.
In JCDL, pages 100–109, 2005.
[32] J. Pei, J. Hong, and D. Bell. A robust approach to
schema matching over web query interfaces. In
Proceedings of the 22nd International Conference on
Data Engineering Workshops, ICDE 2006, pages
46–55. IEEE Press, 2006.
[33] S. Sarawagi and A. Bhamidipaty. Interactive
deduplication using active learning. In Knowledge
Discovery and Data Mining (KDD), pages 269–278,
2002.
[34] P. Shvaiko and J. Euzenat. A survey of schema-based
matching approaches. Journal on Data Semantics,
4:146–171, 2005.
[35] W. Su, J. Wang, and F. Lochovsky. Holistic query
interface matching using parallel schema matching. In
Advances in Database Technology - EDBT, pages
77–94. Springer Berlin / Heidelberg, 2006.
[36] J. Wang, J.-R. Wen, F. Lochovsky, and W.-Y. Ma.
Instance-based schema matching for web databases by
domain-specific query probing. In Proceedings of
VLDB, pages 408–419. VLDB Endowment, 2004.
[37] W. Wu and A. Doan. WebIQ: Learning from the web
to match deep-web query interfaces. In Proc. of the
22nd Intl. Conf. on Data Engineering (ICDE), pages
44–54, 2006.
[38] W. Wu, C. Yu, A. Doan, and W. Meng. An interactive
clustering-based approach to integrating source query
interfaces on the deep web. In Proceedings of
SIGMOD, pages 95–106, New York, NY, USA, 2004.
ACM.
[39] J. Xu and J. Callan. Effective retrieval with
distributed collections. In Proceedings of the 21st
International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages
112–120, 1998.
[40] C. Yu, K.-L. Liu, W. Meng, Z. Wu, and N. Rishe. A
methodology to retrieve text documents from multiple
databases. IEEE Trans. on Knowl. and Data Eng.,
14(6):1347–1361, 2002.
[41] Z. Zhang, B. He, and K. Chang. Understanding web
query interfaces: best-effort parsing with hidden
syntax. In ACM SIGMOD, pages 107–118, 2004.