PruSM: A Prudent Schema Matching Strategy for Web
Forms
Thanh Nguyen and Juliana Freire
ABSTRACT
There is an increasing number of data sources on the Web whose
contents are hidden and can only be accessed through form interfaces. Several applications have emerged that aim to automate
and simplify access to such content, from hidden-Web crawlers
and meta-searchers to Web information integration systems. In
this paper, we focus on the problem of matching elements in form
schemata. Although this problem has been studied in the literature,
existing approaches have important limitations. Notably, they assume that the number of forms is small and that the form elements
are clean and normalized, often through manual pre-processing.
Matching form schemata at a large scale presents new challenges:
data heterogeneity is compounded by Web scale. As a result, when matching is applied over large and heterogeneous data,
accuracy is greatly reduced. In this paper, we propose PruSM,
an approach that addresses these limitations through a prudent matcher and a prudent matching strategy that determine
high-quality matches for form elements. PruSM starts by selecting
the matches with the highest confidence, combining super-features computed
over sets of elements that incorporate both visible and
latent information, so that these features can reinforce each other.
These confident matches are then used to resolve uncertain matches and
incrementally grow the set of (certain) matches. We have evaluated
PruSM using a large number of forms in multiple domains from the TEL8
and FFC datasets. Our results show that PruSM is effective: it
accurately matches a large number of form schemata without
requiring manual pre-processing.
1. INTRODUCTION
It is estimated that there are millions of databases on the Web [15]
whose contents are hidden and are only exposed on demand, as
users fill out and submit Web forms. Several applications have
emerged that attempt to uncover hidden-Web information and
make it more easily accessible, including metasearchers [10, 11,
28, 29], hidden-Web crawlers [1, 3, 16, 19], and Web information
integration systems [8, 14, 27]. Many metasearchers and Web information integration systems aim to provide unified access to multiple sites in the same domain through a mediated search form,
because this would allow users to search and compare similar items
from multiple sources with ease. Their success relies heavily on
building the mediated Web form and maintaining the mappings
from that form to the underlying data sources. To build such a mediated search
form, we have to solve the problem of form schema matching:
Given a set of Web forms, identify the semantic correspondences
among elements in the search interfaces of different sources. Although
this is similar to the schema matching problem in the database literature [21,
23], form schema matching faces new challenges compared to traditional approaches: the schemas are not explicitly defined and
need to be extracted from the HTML sources [17]. Existing approaches to form schema matching [27, 5, 24, 15, 25, 26] address
only parts of the problem: they assume clean, pre-existing, or
manually pruned schemas and usually require substantial manual
effort, both in creating the local schemas [5, 24, 27, 14] and in building
the mediated schema [25, 26]. A major challenge is the heterogeneity
among local schemas. Figure 1 shows examples of correspondences among different form elements and also illustrates some of
the variability in how forms are designed: several variations of labels, or even elements with no apparent similarity (i.e., synonyms), are used to represent the same concept, while elements with
similar labels or values might be different, which makes it difficult
to identify the correspondences.
Example 1. In the Auto domain, mileage and price sometimes
have a similar value range, so using domain values can result in an
incorrect match between them. On the other hand, if only label similarity is considered, an incorrect match can be derived for model
and model year, because they share a term that is important in the
collection. Co-occurrence statistics can also be useful to identify
mappings [5, 24]. For example, by observing that manufacturer
and brand co-occur with a similar set of attributes but rarely co-occur together, it is possible to infer that they are synonyms. However, when used in isolation, attribute correlation can lead to incorrect matches. In particular, correlation matching scores (Section
3.3) can be artificially high for rare attributes, since rare attributes
seldom co-occur with (all) other attributes.
The Web-form schema matching problem is compounded by the
scale and heterogeneity of Web data. Approaches [14,
9, 6, 27, 15, 25, 26] that consider forms in isolation and assume
certain matchers are strong may fail in this scenario, because the
importance of different matchers varies not only across domains but
also across forms within a domain. The resulting incorrect matches lead to many
propagated errors. Therefore, there is a need for a comprehensive
way to combine and reinforce different matchers, and to consider
not only visible features at the level of individual elements but also
super (meta) features and latent information at the level of sets
of elements. Holistic approaches [24, 12] can benefit from
having a large number of forms, but their performance depends on
manually pre-processed data, and they do not work in the presence of rare
attributes, i.e., labels whose frequency is lower than 20%. Given
the long tail of the form-label histogram (see Figure 2), in the Auto
domain, for example, the total frequency of attributes whose frequency
is less than 20% is 83%, and the total frequency of attributes whose
frequency is less than 5% is 57%. This large number of rare attributes notably affects matching coverage. Given the scale
and heterogeneity of Web data, manual cleaning is infeasible.

Figure 1: Matching form schemata in the Auto domain. Different labels are used to represent the same concept.
Example 2. Labels representing the concept Model can vary widely.
Consider the following labels in the Auto domain: Choose vehicle
model example mustang, If so what is your model, Select
a range of model years, Please choose a vehicle, etc.
While Model is usually an important term, it is not important in
Model Year. While vehicle is usually not important, it is important in Please choose a vehicle, where it denotes make/model.
Reliance on label simplification is problematic: not only is
manual simplification expensive for a large heterogeneous data set,
but automated simplification is non-trivial and error-prone. Consequently, the applicability of these
techniques [24, 12] is greatly reduced. For instance, the matching accuracy can be reduced by as much as 44% in the absence of
manual pre-processing, or by 47% when attributes that appear in
less than 20% of the forms in a collection are considered [24, 12].
Given the scale and heterogeneity of data on the Web, there is
a need for an automatic approach that can accurately find matches
for large numbers of forms and obtain high coverage. In this paper,
we propose PruSM (Prudent Schema Matching), an effective and
automated approach that works for large and heterogeneous form
collections with minimal human effort.
Contributions
- A prudent matcher that comprehensively incorporates both visible and latent information, together with a prudent matching strategy that derives confident matches first and minimizes error propagation.
- PruSM is able to find matches for infrequent attributes, which
are commonplace in form collections; it is robust to rare attributes and
does not require manual pre-processing.
- Comprehensive experiments on two form collections, TEL8 and
FFC, show that PruSM obtains high precision and recall without any
manual pre-processing and has higher accuracy (between 10% and
68%) than existing holistic form schema matching approaches.
Outline The rest of the paper is organized as follows. In Section 2, we
briefly define the basic concepts used in the paper, followed
by an overview of the PruSM framework. In Section 3,
we go through the different components of PruSM in detail. Experiments and related work are presented in Sections 4 and 5, respectively, and we conclude in Section 6.
2. MATCHING WEB FORMS
We first introduce basic concepts and the form model used by
PruSM and then describe the Prudent Matching framework.
Figure 2: Label histogram in the TEL8 and FFC datasets
Figure 3: Components of Web forms
2.1 Problem Definitions
Definition 1 (Web Form) A Web form f contains a set of elements
E = {e1, e2, ..., en}. Each element ei is represented by a tuple
(li, vi), where li is the label of ei (i.e., a textual description) and vi
is a vector that contains the possible domain values for ei. An element
label l consists of a bag of terms {t1, ..., tm}, where each term ti
is associated with a weight wi.
Consider, for example, the form in Figure 3. The element labels
of this form are Make, Model, Maximum price, Search within
and Your ZIP. The domain values for element Model are {All,
MDX, RDX, RL, TL, TSX}.
Definition 2 (Attribute) An attribute A is an aggregation of a set
of elements with the same label across forms. Each attribute A is
represented by a label, and its domain is the set of value vectors of
its constituent elements.
The form schema matching problem is stated as follows.
Definition 3 (Form Schema Matching) Given a set of Web forms
{F} in a domain, the schema matching process first identifies all
the matches [13, 24, 22] among elements across forms: M =
{Mi}, where each Mi = {Gi1 ∼ ... ∼ Giw} and each group G = {Ak}, k ≥ 1,
is a set of attributes. When k = 1 we have 1:1 matches; when
k > 1, we have complex matches. According to M, we then create
a set of clusters C = {C1, ..., Cm}, where each cluster contains
only elements that share the same meaning.
Consider, for example, the forms in Fig. 1. One of the clusters
derived for these forms, corresponding to the attribute auto make,
would include the following matches: {by Brand ∼ manufacturer
∼ choose make}.
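To make Definitions 1-3 concrete, the sketch below shows one possible encoding of elements, forms and aggregated attributes; the Python class and field names are illustrative, not part of PruSM.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Element:
    """A form element: a label plus optional domain values (Definition 1)."""
    label: str                                    # e.g., "Model"
    values: list = field(default_factory=list)    # e.g., ["All", "MDX", "RDX"]

@dataclass
class WebForm:
    """A Web form is a set of elements (Definition 1)."""
    url: str
    elements: list = field(default_factory=list)  # list of Element

@dataclass
class Attribute:
    """An aggregation of same-label elements across forms (Definition 2)."""
    label: str
    elements: list = field(default_factory=list)

    @property
    def frequency(self) -> int:
        # Number of element occurrences (one per form) with this label.
        return len(self.elements)

    @property
    def domain(self) -> Counter:
        # Union of the elements' value vectors, with occurrence counts.
        d = Counter()
        for e in self.elements:
            d.update(e.values)
        return d
```

For the form in Figure 3, for instance, Element("Model", ["All", "MDX", "RDX", "RL", "TL", "TSX"]) would be one of the five elements.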
2.2 Prudent Schema Matching Framework
The high-level architecture of PruSM is depicted in Fig. 4. Given
a set of form schemata, the Aggregation module groups together
similar attributes and splits them into two sets: the frequent attributes
(S1), sent to the Matching Discovery module, and the infrequent attributes (S2), sent to the Matching Growth module. An attribute is considered frequent if its frequency is above the frequency threshold
Tc. Matching Discovery finds matches (both 1:1 and 1:n) among
frequent attributes. These matches M, together with the infrequent
attributes S2, are used by Matching Growth to obtain additional
matches that include infrequent attributes.
The matching process is outlined in Algorithm 1, which includes
Aggregation, Matching Discovery and Matching Growth.
Algorithm 1 Prudent Schema Matching
1: Input: Set of attributes A present in a collection F of forms in domain D, configuration Conf, grouping threshold Tg
2: Output: Set of attribute clusters {C1, ..., Cn} corresponding to the identified matches
3: begin
4: /* 1. Aggregation */
5: Stem terms in labels and remove stop words
6: Aggregate elements that have the same label
7: Create set S1 of frequent and S2 of infrequent attributes
8: /* 2. Matching Discovery for frequent attributes */
9: For each pair of attributes (Ap, Aq) ∈ S1, compute their label similarity lsim_pq, value similarity dsim_pq and correlation scores X_pq, Y_pq, and add the tuple <(Ap, Aq), lsim_pq, dsim_pq, X_pq, Y_pq> to set X
10: M ← ∅
11: while X ≠ ∅ do
12:   Choose the attribute pair (Ap, Aq) with the highest X_pq
13:   if prudent_check(Conf, Ap, Aq) then
14:     M ← ConstructMatch((Ap, Aq), M)
15:   else
16:     B ← B ∪ {(Ap, Aq)}  /* buffer uncertain matches */
17:   end if
18:   Remove (Ap, Aq) from X
19: end while
20: /* Resolve uncertain matches in buffer B using ConstructMatch */
21: M ← ConstructMatch(B, M)
22: /* 3. Matching Growth for rare attributes */
23: Create a set of clusters {C} according to {Mi}
24: Update STF and compute new term weights
25: Use 1NN and the co-location constraint to assign rare attributes to the cluster of their closest match
26: Cluster unmatched attributes by HAC and add them to {C}
27: end
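For concreteness, a minimal Python rendering of the Matching Discovery loop (lines 9-21 of Algorithm 1) is sketched below; prudent_check and construct_match stand for the procedures of Sections 3.3.1 and 3.3.2, and the tuple layout is an assumption for illustration.

```python
def matching_discovery(candidates, prudent_check, construct_match):
    """candidates: tuples ((Ap, Aq), lsim, dsim, x_score) over pairs of
    frequent attributes. Returns the set of confident matches M."""
    # Consider pairs in decreasing order of the matching score X (line 12).
    candidates = sorted(candidates, key=lambda t: t[3], reverse=True)
    matches, buffered = [], []
    for (ap, aq), ls, ds, x in candidates:
        if prudent_check(x, ls, ds):
            matches = construct_match((ap, aq), matches)  # confident: integrate now
        else:
            buffered.append((ap, aq))                     # uncertain: defer (line 16)
    # Resolve buffered pairs, now constrained by the confident matches (line 21).
    for pair in buffered:
        matches = construct_match(pair, matches)
    return matches
```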
3. PRUDENT SCHEMA MATCHING–PRUSM

3.1 Aggregation
Recall that one challenge in form schema matching is value
sparseness, i.e., many elements do not have any associated domain values, or their values cannot be extracted correctly because of JavaScript. Furthermore, the value distributions of similar elements might differ, for example, different
price ranges for price, or different car models corresponding
to a specific car make. For this reason, PruSM starts by combining
similar elements to improve the value distributions and reduce domain-value sparseness, taking advantage of the most common and available
domain values. Aggregation is also fundamental for the subsequent steps of
PruSM, whose measurements are based on sets of elements (attribute
correlations and domain-value similarity). The Aggregation component is based on the element labels, that is, it handles the cases where elements use exactly the same label for the same concept. In particular, Aggregation creates new attributes by grouping elements
whose labels are the same after automatic stemming and stop-word removal (we use the English stop-word list at http://members.unine.ch/jacques.savoy/clef/englishST.txt). For instance, Select a make and Select
makes are aggregated into a new attribute with label select make.
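A sketch of this step is shown below, reusing the Element/WebForm encoding from Section 2.1 and assuming a Porter-style stemmer (here, NLTK's); the paper does not prescribe a particular stemmer.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer  # assumed dependency; any stemmer works

stemmer = PorterStemmer()
STOPWORDS = {"a", "an", "the", "of", "please"}  # abbreviated; use the full list

def normalize_label(label: str) -> str:
    """Lower-case, stem, and drop stop words, e.g. 'Select makes' -> 'select make'."""
    terms = (stemmer.stem(t) for t in label.lower().split())
    return " ".join(t for t in terms if t not in STOPWORDS)

def aggregate(forms):
    """Group elements whose labels coincide after normalization (one attribute
    per normalized label)."""
    attributes = defaultdict(list)
    for form in forms:
        for element in form.elements:
            attributes[normalize_label(element.label)].append(element)
    return attributes
```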
Note that current holistic approaches [5, 12, 24] all require
manual pre-processing to remove irrelevant terms and simplify the
labels. For instance, Search for book titles is simplified to
title. However, as mentioned in Example 2, automatically simplifying labels is not trivial, and simplification errors can lead to incorrect results. In contrast, PruSM does not need to detect and remove
domain-specific words (e.g., car, vehicle, auto, books),
generic search terms (e.g., select, choose, find, enter, please),
or modification terms (e.g., a range of, if so, what is your,
...).
3.2 Matching Discovery (MD)
The Matching Discovery component leverages the fact that matched
attributes rarely co-occur in the same Web form and tend to
have similar labels or value distributions. It uses a combination
of a correlation measure and label and domain-value similarities to
extract matches. As part of the prudent strategy, Matching Discovery aims to discover only confident matches; therefore, only
frequent attributes participate in this component, because they carry significant evidence, help derive strong matches and, consequently, avoid error propagation.
In this section, we first present the computation of label, domain-value and correlation similarities; we then present the Prudent
Matcher, a novel approach to combining those scores to find matches.
3.3 Computing similarities
Given two attributes (Ai, Aj), we quantify the similarity between them using three measures: label similarity, domain-value
similarity, and correlation. Below, we describe these measures and their
benefits and limitations, and then present the matching algorithm
that prudently combines them.
Label Similarity. Because forms are designed for human consumption, labels are descriptive and are often an important source
for identifying similar elements. We define the label similarity between two attributes (Ai , Aj ) as the cosine distance [2] between
the term vectors for their labels:
$$lsim(A_i, A_j) = \cos(l_i, l_j) \qquad (1)$$

where

$$\cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \qquad (2)$$
Each term has a different weight. To determine term weights,
we use the TF-IDF measure [2]. To capture the importance of
terms, in addition to term frequency (TF), we also use the singular
token frequency (STF), computed when we find the mapping between labels
and form elements [17]. The intuition behind STF comes from the
and form elements [17]. The intuition behind STF comes from the
fact that generic terms such as "Please", "Select", "of",
"available" usually have high frequency, since they appear in
many composite labels (e.g., "select" appears in "Select a
car make", "Select State", "Select a model") but rarely
appear alone. We use STF to distinguish between terms that frequently appear alone (and are thus likely to be important) and terms
that appear only together with other terms—the latter do not have a
complete meaning by themselves and are thus unlikely to represent
an important concept.
The term weight w_i is computed using the equations below:

$$w(t_i) = TF(t_i) \cdot \sqrt{STF(t_i)} \qquad (3)$$

$$TF(t_i) = \frac{n(t_i)}{\sum_t n(t)} \qquad (4)$$

$$STF(t_i) = \frac{n(t_i \text{ appears alone})}{n(t_i)} \qquad (5)$$
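Under Equations 3-5, term weights can be computed directly from the collection of element labels, as in the sketch below (labels are assumed to be already stemmed and lower-cased):

```python
import math
from collections import Counter

def term_weights(labels):
    """labels: list of label strings, one per element occurrence.
    Returns {term: w(t)} with w(t) = TF(t) * sqrt(STF(t))."""
    n = Counter()        # n(t): occurrences of term t across all labels
    n_alone = Counter()  # n(t appears alone): labels consisting of exactly t
    for label in labels:
        terms = label.split()
        n.update(terms)
        if len(terms) == 1:
            n_alone[terms[0]] += 1
    total = sum(n.values())
    return {t: (c / total) * math.sqrt(n_alone[t] / c)  # Eq. 3 = Eq. 4 * sqrt(Eq. 5)
            for t, c in n.items()}

# "select" never appears alone, so its STF (and weight) is 0, while "make"
# frequently appears alone and receives a high weight.
print(term_weights(["select make", "make", "make", "select model", "model"]))
```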
Note that using STF early on might not be very effective because, initially, element labels are not normalized, i.e., some important attributes rarely appear in short form; for example, Select make
appears more frequently than Make. To address this problem, as we
discuss in Section 3.4, after identifying a set of initial matches, we
normalize labels from long form into short form to boost the STF
scores and the weights of important terms. This boosting is very important for obtaining additional matches for rare labels.
Example 3. In the MD step, we find the match year(44) ∼
select year(15) ∼ year rang(16), which contains two stop
words, select and range (the number next to each label corresponds to the frequency of the label in the form collection). From this match, we can increase the
term weight of year and decrease the term weights of select and
range.
Domain-Value Similarity. To compute the domain similarity between two attributes, we first aggregate all the domain values for
each attribute. Given an attribute A_k, we build a vector D_k that contains all occurrences of values associated with the label l of A_k, together with their frequencies: $D_k = \bigcup_{i=1..n} (v_i, frequency)$. Given two attributes A_i and A_j, the cosine distance (Equation 2) is then used to
measure the similarity between their corresponding value vectors:

$$dsim(A_i, A_j) = \cos(D_i, D_j) \qquad (6)$$
Example 4. Consider the following scenario: elements e1 and
e2 have the same label, and so do e3 and e4. The values associated with these elements are, respectively: v1={a,b,d}, v2={a,d},
v3={a,b,c}, v4={a,b}. These four elements are aggregated into two
attributes A1 = {e1, e2}, A2 = {e3, e4} with D1={a:2,b:1,d:2},
D2={a:2,b:2,c:1}. The similarity between A1 and A2 is:
dsim(A1, A2) = cos(D1, D2) = 0.67.
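The computation in Example 4 can be reproduced with a few lines of Python (cosine similarity over sparse value-frequency vectors, Equation 2):

```python
import math

def cosine(d1: dict, d2: dict) -> float:
    """Cosine similarity between sparse frequency vectors."""
    dot = sum(f * d2.get(v, 0) for v, f in d1.items())
    n1 = math.sqrt(sum(f * f for f in d1.values()))
    n2 = math.sqrt(sum(f * f for f in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

D1 = {"a": 2, "b": 1, "d": 2}
D2 = {"a": 2, "b": 2, "c": 1}
print(round(cosine(D1, D2), 2))  # 0.67, as in Example 4
```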
Domain values are a good source of similarity information. However, they are not always available (see e.g., the elements Your ZIP
or Title in Figure 3). Therefore, we consider them as supporting
information to validate a match.
Correlation. By holistically and simultaneously analyzing a set of
forms, we can leverage an implicit source of similarity information:
attribute correlation. Correlation is a statistical measure which indicates the strength and direction of a linear relationship between
two variables. For Web forms, we exploit the fact that synonym
attributes are semantic alternatives and rarely co-occur in the same
form interface (push away)—they are negatively correlated (e.g.,
Make and Brand). On the other hand, grouping attributes are semantic complements and often co-occur in the same form interfaces
(pull together)—they are positively correlated (e.g., First name and
Last name, Departure date and Return date).

Figure 4: Prudent schema matching framework
Although there are several correlation measures (e.g., Jaccard,
Gini index, Laplace, Kappa), none is universally good [18]. In the
context of Web forms, two correlation measures have been proposed: the H-measure [13] and the X/Y measure [24]. For PruSM, we use
the latter, which was shown to be better and is defined as follows:
$$X(A_p, A_q) = \begin{cases} 0 & \text{if } A_p, A_q \text{ co-occur in some form } f \\ \frac{(C_p - C_{pq})(C_q - C_{pq})}{C_p + C_q} & \text{otherwise} \end{cases} \qquad (7)$$

$$Y(A_p, A_q) = \frac{C_{pq}}{\min(C_p, C_q)} \qquad (8)$$

where C_p (resp. C_q) is the number of forms that contain A_p (resp. A_q), and C_pq is the number of forms that contain both.
The matching score X captures negative correlation, while the
grouping score Y captures positive correlation. To understand these
scores, consider the contingency table for attributes Ap and Aq,
which encodes the co-location patterns of these attributes [7].
Also note that correlation is not always an accurate measure, in
particular when insufficient instances are available. However, as
we discuss below, correlation is very powerful and effective when
combined with other measures.
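Equations 7 and 8 translate directly into code once the form-occurrence counts are available; the example counts below are illustrative, not taken from the data sets.

```python
def matching_score_x(cp: int, cq: int, cpq: int) -> float:
    """Negative-correlation (matching) score, Equation 7.
    cp, cq: #forms containing Ap, Aq; cpq: #forms containing both."""
    if cpq > 0:  # the attributes co-occur in some form
        return 0.0
    return (cp - cpq) * (cq - cpq) / (cp + cq)  # reduces to cp*cq/(cp+cq) here

def grouping_score_y(cp: int, cq: int, cpq: int) -> float:
    """Positive-correlation (grouping) score, Equation 8."""
    return cpq / min(cp, cq)

print(matching_score_x(63, 40, 0))   # synonyms never co-occur: high X
print(grouping_score_y(30, 25, 24))  # grouping attributes co-occur: Y near 1
```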
3.3.1 Prudent Matcher
The goal of the Prudent Matcher is to identify confident matches.
Given the heterogeneity of Web forms, it is challenging to determine
whether two attributes correspond to the same concept. We define
singular matchers [23, 22] that correspond to the features above:
label similarity, domain-value similarity and correlation. Although
these features are super-features, i.e., features obtained at the level
of sets of elements, using them in isolation is still insufficient and
leads to error propagation. However, we observe that correlation can
be effective if prudently combined and reinforced with additional
evidence, such as strong label or value similarity. The combination ensures that even if two attributes Ap and Aq have a high correlation score, they are not considered a good match unless additional
evidence is available.
Definition 4. A prudent matcher is a configuration Conf that
consists of a set of constraints over the correlation score, label similarity and value similarity, which are used to validate a match.
A prudent match is valid if X(Ap, Aq) > T_matching_score AND
[dsim(Ap, Aq) > T_dsim OR lsim(Ap, Aq) > T_lsim].
The prudent matcher is a composite matcher. It is simple yet
powerful and comprehensive, because it can incorporate both visible and latent information at the level of sets of elements. We
illustrate its efficiency and robustness in the experiments below.
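Definition 4 translates into a small validation function. In the sketch below, the label and value thresholds (0.75 and 0.5) are the values reported in Section 4.2; the default correlation threshold of 0.2 is only an assumption, mirroring the candidate cutoff of Table 2 (Section 4.2 sweeps this threshold).

```python
def prudent_check(x: float, lsim: float, dsim: float,
                  t_match: float = 0.2,   # assumed default; swept in Section 4.2
                  t_lsim: float = 0.75,
                  t_dsim: float = 0.5) -> bool:
    """A match is prudent iff the correlation score is high AND it is
    reinforced by strong label OR value similarity (Definition 4)."""
    return x > t_match and (dsim > t_dsim or lsim > t_lsim)

# Pair 5 of Table 2 (select make vs. price) has a high correlation score but
# neither label nor value support, so it fails the prudent test.
```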
3.3.2 Construct Matches
Only prudent matches are considered by ConstructMatch (Algorithm 2), which constructs a set of confident matches M by iteratively choosing the most negatively correlated attribute pairs and
Algorithm 2 ConstructMatch
1: Input: a candidate pair or list of candidate pairs (Ap, Aq), current matches M, grouping threshold Tg
2: Output: set of matches M
3: begin
4: if neither Ap nor Aq appears in M then
5:   if neither Ap nor Aq is close to an Mj ∈ M then
6:     M ← M + {{Ap} ∼ {Aq}}
7:   else
8:     Mj ← Mj + ({Ap} ∼ {Aq})
9:   end if
10: else if only one of Ap and Aq appears in M then
11:   /* suppose Ap appears in Mj and Aq does not */
12:   if X_qi > 0 for each Ai ∈ Mj then
13:     Mj ← Mj + (∼ {Aq})
14:   else if ∃ Gjk ∈ Mj s.t. X_qx > 0 ∀ Ax ∈ Gjx, x ≠ k, and Y_qk > Tg ∀ Ak ∈ Gjk then
15:     Gjk ← Gjk + {Aq}
16:   end if
17: end if
18: end
Figure 5: Cascading errors of imprudent matching
Table 1: Example of input schemata

#  Schema                                                        F
1  make/model(mm); zip; distance(dis); price(pr)                 15
2  make(mk); model(md); price                                    20
3  select make(smk); model; distance; zip; vehicle price(vpr)    6
4  select make; select model(smd); distance; zip; vehicle price  4
5  make; within; zip code(zc); vehicle price                     3
6  make; select model; zip                                       2
Table 2: Attribute pairs with normalized matching score > 0.2

#   X      Pair        Note      #   X      Pair        Note
1   12.5   mk_dis      F         11  4.28   mm_smd      P
2   9.51   mm_md       P         12  2.76   zc_pr       B
3   9.47   pr_vpr      P         13  2.76   within_pr   F
4   9.37   mm_mk       P         14  2.7    zip_zc      P
5   7.77   smk_pr      F         15  2.7    zip_within  B
6   7.14   smk_mk      P         16  2.68   zc_md       B
7   6.96   mm_vpr      F         17  2.68   within_md   F
8   6      mm_smk      P         18  2.67   zc_dis      B
9   5.12   pr_smd      F         19  2.67   within_dis  P
10  4.87   md_smd      P         20  2.5    zc_mm       B
decides whether two attributes that match the same set of other
attributes should become a new matching component (line 13) or
a grouping component (line 15). The process is illustrated in the
example below:
Example 5. Consider the set of input schemata and their associated frequencies F in Table 1. Table 2 shows the attribute pairs with
high matching scores (normalized matching score greater than
0.2). The column Note shows the outcome of the prudent matcher:
F stands for Fail (failing the prudent test), P for Pass, and B for Buffering (uncertain match). Initially, M0 = {make/model ∼ make}.
By iteratively integrating the confident matches, i.e., the P pairs (pairs 2, 3, 4, 6, 8, 10, 11, 14 and 19), the final matching result includes M0 = {make/model ∼ (make, model) ∼ (select
make, select model)}, M1 = {price ∼ vehicle price},
M2 = {zip ∼ zip code}, M3 = {within ∼ distance}.
Although ConstructMatch is inspired by [24], there are important distinctions: it considers only confident matches, to avoid cascading errors, and it takes advantage of confident matches
as additional constraints to reconcile less certain matches. In particular, using pairs that fail validation can lead to incorrect
decisions: if a bad decision is made in an early step, it is never
corrected and negatively affects the following steps. By identifying highly confident matches first, PruSM avoids a potentially large
number of incorrect matches. The last step in MD is to resolve
uncertain but potential matches. By buffering and revisiting uncertain matches, we can take advantage of the extra constraints
provided by the confident matches. This relaxation is worthwhile
because the domain values are sometimes too coarse, missing, or
incorrectly extracted, leading to low similarity between attributes
that should match.
Example 6. As depicted in Figure 5 and Table 2, the incorrect match
between select make and price (pair 5) would eliminate the correct match between select make and make (pair 6): make and
price often co-occur in the same interfaces, so they cannot
be synonyms. Pair 12 is uncertain because zip code does not have
domain values. However, having pair 3 as a certain match helps
resolve the uncertain match in pair 12 between zip code and price.
3.4 Matching Growth (MG)
After the Matching Discovery step, where initial high-confidence
matches are derived, PruSM performs Matching Growth (MG),
which finds additional matches for rare attributes (Algorithm 1,
lines 22-26).
First, based on the certain matches found in the MD phase,
the algorithm updates the STF values to obtain more accurate weights for important terms (see Example 3). Identifying
anchor terms that are representative of a domain (e.g., "year",
"make", "model" for the Auto domain) is very helpful for the later phases of MG, where greater variability is present in the attribute
labels.
As mentioned, attribute fragmentation can affect the quality of
the correlation and lead to incorrect orderings. For instance, we encountered the following scenario in the Airfare domain: the correlation
score X(departure date, return on) is greater than X(departure
date, leave on). In this case, domain values do not help because they are similar—they all contain values corresponding to
months and days. To address this problem, we use attribute proximity information to break ties, find a finer resolution for complex matches, and then create a set of corresponding clusters (Algorithm 1, line 23). In this case, the match {departure date,
return date} ∼ {return, depart} ∼ {return on, leave
on} can be re-ordered as {departure date, return date} ∼
{depart, return} ∼ {leave on, return on} and then be
broken into the clusters {departure date, depart, leave on}
and {return date, return, return on}. An alternative solution to improve the quality of the correlation is to use a "look-ahead buffer". The idea is to consider the top-k correlated pairs and
combine this with other information to augment the correlation and
improve the imperfect local ordering. In the example above, by
buffering the top-k correlated pairs and using label similarity as the
reward, (departure date, depart) would be ranked higher
than (departure date, return) and hence be considered first.
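The look-ahead buffer can be realized as a re-ranking of the top-k candidate pairs, as sketched below; the additive reward and its weight alpha are assumptions for illustration.

```python
def rerank_top_k(pairs, k: int, lsim, alpha: float = 0.5):
    """pairs: [(Ap, Aq, x_score)] sorted by x_score descending.
    Re-rank the top-k pairs using label similarity as a reward, so that,
    e.g., (departure date, depart) beats (departure date, return)."""
    head = sorted(pairs[:k],
                  key=lambda p: p[2] + alpha * lsim(p[0], p[1]),
                  reverse=True)
    return head + pairs[k:]
```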
To assign rare attributes to the clusters of the most similar frequent attributes discovered so far, we use 1-Nearest-Neighbor Clustering (1NN).
Table 3: Experimental hidden-Web domains

(a) TEL8
Domain     Forms  Elements
Auto       98     533
Airfare    49     334
Book       68     403
Hotel      46     281
CarRental  25     191

(b) FFC
Domain   Total forms  Sampled forms  Sampled elements
Auto     1254         136            811
Book     747          165            1285
Airfare  678          70             766

Table 4: Some of the basic seeds in FFC Auto
select make model(9) ∼ make(63) model(59) ∼ select model(8) select make(7)
year(44) ∼ select year(15) ∼ year rang(16) ∼ select rang of model year(6)
price(14) ∼ price rang(23) ∼ price rang is(6)
zip(13) ∼ zip code(7)
valid email(6) ∼ email(8)
type(5) ∼ body style(6)
search within(7) ∼ distance(4)
mile(12) ∼ mileage(4)
By exploiting the form context and checking the
lists of resolved (matched) and unresolved (unmatched) elements of
each form, we ensure that two elements from the same form cannot be
placed in the same cluster (the co-location constraint). Finally, for
the remaining unmatched attributes, we run the HAC algorithm (Hierarchical Agglomerative Clustering) to group similar attributes into
new clusters and add them to the set of matches. Examples of additional matches derived by 1NN in the Auto domain include: price up
to ∼ price, price range in euro ∼ price range, model
example mustang ∼ model, approximate mileage ∼ mileage,
color of corvett ∼ color, if so, what's your trade year
∼ year.
HAC derives and adds the following new clusters: {within
one month, within one week, within hour}, {dealer list,
dealer name, omit dealer list}, etc. The set of clusters is then used to update the set of discovered matches.
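A simplified sketch of this step is given below; sim (the combined similarity to an already-clustered attribute) and forms_of (the set of forms an attribute's elements come from) are assumed helpers, and clusters is a list of sets of attributes.

```python
def assign_rare_attributes(rare, clusters, sim, forms_of, t_assign=0.0):
    """Assign each rare attribute to the cluster of its nearest (1NN) frequent
    attribute, subject to the co-location constraint."""
    for a in rare:
        # 1NN: the most similar attribute that is already clustered.
        best = max(((sim(a, b), c) for c in clusters for b in c),
                   key=lambda t: t[0], default=None)
        if best is None or best[0] <= t_assign:
            continue  # left unmatched; handled later by HAC
        _, cluster = best
        # Co-location: two elements of the same form cannot share a cluster.
        if forms_of(a) & set().union(*(forms_of(b) for b in cluster)):
            continue
        cluster.add(a)
```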
As we show later in the experimental evaluation, the contribution
of each component differs across domains, and even across forms
within a domain. For example, 1NN is effective only
if MD can discover enough good seeds, and 1NN alone might
not be effective unless it is combined with the form context or datatype information. Therefore, it is crucial that the different
components be combined prudently.
4. EXPERIMENTAL EVALUATION

4.1 Experimental Setup
We empirically evaluate PruSM's performance over two public data sets that consist of Web forms in multiple domains. One
of the data sets is small and clean (TEL8, available at http://metaquerier.cs.uiuc.edu/repository), while the other is a
large, heterogeneous collection of forms automatically gathered by
a focused crawler [4] and automatically classified into different domains (FFC, available at http://www.deeppeep.org). As shown in Fig. 2, there is wide variability in the
frequency distribution of these labels. In particular, there is a large
number of rare attributes, especially in the longer and lower tail of
FFC. We compare PruSM against previous holistic form matching
approaches [12, 24]. Since HSM [24] outperformed DCM [12], we
compare PruSM against HSM over the TEL8 and FFC data sets, on
657 Web forms with 4604 form elements. We re-implemented HSM,
used a low frequency threshold Tc = 5%, and used the labels as
they are—no syntactic merging was applied.
Effectiveness measures. To evaluate the effectiveness of PruSM,
we use precision, recall and F-measure. Precision can be seen as
a measure of fidelity, whereas recall is a measure of completeness.
F-measure is the harmonic mean between precision and recall—
a perfect F-measure has value 1.
Given a match Mj, which corresponds to an attribute cluster j, and a class i, precision and recall are defined as:

$$P(M_j, i) = \frac{n_{ij}}{n_j} \qquad (9)$$

$$R(M_j, i) = \frac{n_{ij}}{n_i} \qquad (10)$$
where nij is the number of members of class i in match Mj , nj
is the number of members in Mj , ni is the number of members
of class i (which corresponds to a concept). As there are many
matches, we measure the average precision, recall and F-measure,
weighting by the size of each match.
$$P_{avg} = \sum_{M_i} \frac{|M_i|}{|M|} \cdot P_{M_i} \qquad (11)$$

$$R_{avg} = \sum_{M_i} \frac{|M_i|}{|M|} \cdot R_{M_i} \qquad (12)$$

$$F_{avg} = \frac{2 \cdot P_{avg} \cdot R_{avg}}{P_{avg} + R_{avg}} \qquad (13)$$
where |M | is the total number of elements present in all matches. In
addition to F-measure, for each match Mj , we compute the cluster
entropy according to the probability pij that a member of cluster j
belongs to class i.
$$Entropy(M_j) = -\sum_i p_{ij} \log p_{ij} \qquad (14)$$
The total entropy is the sum of the entropy values for all clusters, weighted by the size of each cluster. Intuitively, the more
homogeneous the clusters, the lower the entropy.
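The measures of Equations 9-14 can be computed as follows, assuming each cluster is scored against its majority class; the encoding of a match as the list of its members' true class labels is illustrative.

```python
import math
from collections import Counter

def entropy(match):
    """Equation 14: match is a list of class labels, one per member."""
    n = len(match)
    return -sum((c / n) * math.log(c / n) for c in Counter(match).values())

def weighted_f_measure(matches, class_sizes):
    """Equations 9-13. matches: list of matches (lists of class labels);
    class_sizes: {class: total number of elements of that class}."""
    total = sum(len(m) for m in matches)
    p_avg = r_avg = 0.0
    for m in matches:
        cls, nij = Counter(m).most_common(1)[0]               # majority class
        p_avg += (len(m) / total) * (nij / len(m))            # Eq. 9, 11
        r_avg += (len(m) / total) * (nij / class_sizes[cls])  # Eq. 10, 12
    return 2 * p_avg * r_avg / (p_avg + r_avg)                # Eq. 13
```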
4.2 Prudent Matcher
We evaluate the effectiveness of the prudent matcher by comparing it against other matchers, including single matchers (correlation, lsim, dsim) and combinations of matchers: the linear combination of lsim and dsim (Avg2), of lsim, dsim and correlation (Avg3), and the maximum value among them (Max2, Max3).
Fig. 6 shows that the prudent matcher obtains substantially higher
precision than the others, which is the most important goal in MD.
Fig. 7 shows the sensitivity of Matching Discovery to different correlation thresholds, obtained by using the correlation matcher
(Figure 7(a)) and by using the prudent matcher (Figure 7(b)). Note
that Matching Discovery with the correlation matcher alone is similar to HSM [24].
As we can see, when the (normalized) correlation threshold increases, precision increases but recall decreases. When the correlation threshold is low (from 0.05 to 0.35), precision is very
low (less than 60%). To obtain acceptable precision, the correlation threshold must be very high (greater than 0.5), but this also
leads to a substantial reduction in recall. On the other hand, when
the prudent matcher is used (Figure 7(b)), the precision is high even
with very low correlation thresholds and across a wide range of thresholds (from
0.05 to 0.55); therefore, we can obtain decent recall values.
In the PruSM experiments, we set the label similarity threshold to 0.75 and the value similarity threshold to 0.5. Note that with reasonable
values for these two thresholds, the prudent matcher is very effective and the effectiveness of
PruSM is robust to a wide range of correlation thresholds.

Figure 6: Different combination strategies in MD

Figure 7: The effectiveness of Matching Discovery with different correlation thresholds: (a) without validation; (b) with validation

Figure 9: PruSM result entropy values
4.3 Effectiveness of PruSM
Fig. 8 shows the accuracy values of HSM and PruSM,
where the PruSM matching process is split into MD and MG to
show the accuracy obtained in the different phases. HSM has lower
accuracy because we apply a low frequency threshold (Tc = 5%) and
use the labels as they are. In particular, HSM has low accuracy
in heterogeneous domains like FFC-Book and higher accuracy in
'clean' domains like TEL8-Book. For both data sets, and in all
domains, the precision, recall and F-measure values of PruSM are
higher than those for HSM. The gains in F-measure of PruSM compared to HSM vary between 10% and 39% in TEL8, and between
38% and 68% in FFC. Smaller improvements are observed in
clean domains, where the data is cleaner and HSM is expected to perform well.
Table 5: Contribution of PruSM components to F-measure (%)

                     MD                               MG
Domain   Prudent   Precedence   Reinforce   Update   1NN    HAC
         matcher   matcher      &Buff       STF
Car      41.8      0            0.8         3.7      10.9   1.3
Airfare  33.3      20.0         2.4         3.3      5.8    0.9
Book     18.3      0            2.1         3.1      6.3    9.4

While the gains in precision can be attributed to the
prudent matching process, the gains in recall come mostly from the
Matching Growth phase, which finds additional matches for rare
attributes.
To evaluate the homogeneity of the clusters corresponding to the
final matches, we compute the cluster entropy using Eq. 14. Fig. 9
shows entropy values obtained for FFC and TEL8, which indicate
that good clusters are derived for both data sets. The entropy values
obtained for TEL8 are usually slightly lower than those for FFC,
which is expected, since TEL8 is cleaner and less heterogeneous than FFC.
To show the effectiveness of the different PruSM components, Table 5 summarizes the contribution of each component in Matching Discovery (MD) and Matching Growth (MG). For example,
without the prudent matcher, the F-measure decreases by 41.8%. The table shows that prudent matching leads to the most significant gains
in F-measure. Also, the different components have different impact
depending on the domain. For example, 1NN plays an important
role in Auto while HAC does so in Book.
We followed the strategy adopted in [13] and manually created
a list with generic search-related terms that are present in labels
(e.g., “search”, “enter”, “page”, etc.), as well as domain-specific
ones (e.g., “book”, “movie”, “vehicle”, etc.). We then ran PruSM
and HSM on two variations of the data sets: the original version
(raw), and a version where the terms in the list were removed from
the element labels (clean). While there was no significant change
in F-measure and entropy for PruSM in the presence or absence of
these words, there was an increase in F-measure and decrease in
entropy for HSM on the clean data (see Figure 11).
Overall, the high accuracy obtained by PruSM
on the FFC data set shows that it can effectively handle large,
heterogeneous form collections, that it is robust to rare attributes, and that it
obtains high-quality matches without the need to manually clean
labels or perform syntactic merging.
5. RELATED WORK
Even though form-schema matching is related to the problem of
database schema matching [21, 23], there are fundamental differences; notably, the underlying database schemata are unknown and
the local form schemas are more heterogeneous. Recognizing these
differences, a number of approaches have been proposed for matching forms [25, 26, 27, 14, 20, 5, 12, 24]. Instance-based approaches
[25, 26] perform probing to infer matches. Clustering-based approaches [27, 14, 20, 6] usually combine visible information to
define a similarity function between two elements across forms.

Figure 8: Effectiveness of PruSM versus HSM on TEL8 and FFC

Figure 10: PruSM performance with raw and clean data: (a) F-measure; (b) Entropy

Figure 11: HSM performance with raw and clean data: (a) F-measure; (b) Entropy
Besides requiring clean data as input, most of these approaches are
only effective for a small number of forms. To identify synonyms,
they either use WordNet [14], which cannot identify domain-specific
synonyms (like Vehicle and Make), or leverage the domain values [20], which is not sufficient for attributes that range over a sparse
domain or attributes that are not associated with any values. Only
[27] can derive 1:m mappings, but it requires a tree structure to be
manually derived for each form.
Most related to PruSM are the holistic approaches, which benefit from considering a large number of schemata simultaneously
[5, 12, 24]. DCM exploits the "apriori" property to discover all
possible positively and negatively correlated groups. HSM uses
a greedy algorithm based on negative correlation to discover element synonyms. However, a critical limitation shared by these approaches
is that they require manually cleaned and normalized data, which limits
their scalability, and they ignore rare attributes, which are commonplace
for Web data; otherwise, their performance decreases significantly
[12, 24]. As a result, they cannot be directly applied to large, heterogeneous form collections, such as forms obtained by a focused
crawler. Besides, due to its correlation-centric nature, HSM can produce many incorrect matches and propagated errors in the presence of rare attributes. In
contrast, PruSM exploits additional evidence in a prudent matching
process, which helps minimize 'irreversible' incorrect matches. The
time complexity of DCM is exponential in the number of attributes,
while the complexity of HSM and PruSM is polynomial.
Although two-phase matching has been used for matching ontology and database schemata [14, 9, 6], these approaches assume certain matchers
are strong and combine them in a fixed manner. Given the heterogeneity of Web data, a fixed combination of simple matchers is not
sufficient. PruSM not only comprehensively combines
meta-features in a prudent matcher but also utilizes strong matches
as extra constraints to verify weaker and uncertain matches.
6. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed PruSM, an efficient approach and strategy for Web form-schema matching that prudently combines a
rich set of features to accurately determine confident matches among
form elements. Our experiments show that PruSM outperforms
previous form-schema matching approaches. We have also shown
that it is effective, robust to rare attributes, and does not require pre-processing. We believe PruSM provides a practical and efficient solution to the form matching problem. In future work,
we plan to test the sensitivity of PruSM to noisy data generated by
the automatic gathering process; using the match information derived
by PruSM, we also plan to investigate techniques for automatically filling out forms and retrieving the contents hidden behind them.
7. REFERENCES
[1] M. Abrol, N. Latarche, U. Mahadevan, J. Mao,
R. Mukherjee, P. Raghavan, M. Tourn, J. Wang, and
G. Zhang. Navigating large-scale semi-structured data in
business portals. In Proceedings of VLDB 01, pages
663–666, San Francisco, CA, USA, 2001.
[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval. ACM Press/Addison-Wesley, 1999.
[3] L. Barbosa and J. Freire. Siphoning hidden-web data through
keyword-based interfaces. In SBBD, pages 309–321, 2004.
[4] L. Barbosa and J. Freire. An adaptive crawler for locating
hidden-web entry points. In WWW, pages 441–450, 2007.
[5] B. He and K. C.-C. Chang. Statistical schema matching
across web query interfaces. In SIGMOD, pages 217–228.
ACM, 2003.
[6] N. Bozovic and V. Vassalos. Two-phase schema matching in
real world relational databases. In ICDE Workshops, pages
290–296, 2008.
[7] H. D. Brunk. An Introduction to Mathematical Statistics.
Blaisdell Pub. Co, New York, 1965.
[8] K. C.-C. Chang, B. He, and Z. Zhang. Toward Large-Scale
Integration: Building a MetaQuerier over Databases on the
Web. In CIDR, pages 44–55, 2005.
[9] A. Gal. Managing uncertainty in schema matching with
top-k schema mappings. Journal on Data Semantics VI, pages 90–114, 2006.
[10] Google Base. http://base.google.com/.
[11] L. Gravano, H. García-Molina, and A. Tomasic. Gloss:
text-source discovery over the internet. ACM Trans.
Database Syst., 24(2):229–264, 1999.
[12] B. He and K. C.-C. Chang. Automatic complex schema
matching across web query interfaces: A correlation mining
approach. ACM Trans. Database Syst., 31(1):346–395, 2006.
[13] B. He, K. C.-C. Chang, and J. Han. Discovering complex
matchings across web query interfaces: a correlation mining
approach. In KDD ’04, pages 148–157, 2004.
[14] H. He and W. Meng. Wise-integrator: An automatic
integrator of web search interfaces for e-commerce. In
VLDB, pages 357–368, 2003.
[15] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy, S. R.
Jeffery, D. Ko, and C. Yu. Web-scale data integration: You
can afford to pay as you go. In CIDR, pages 342–350, 2007.
[16] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen,
and A. Y. Halevy. Google’s deep web crawl. PVLDB,
1(2):1241–1252, 2008.
[17] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract
form labels. Proc. VLDB Endow., 1(1):684–694, 2008.
[18] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right
interestingness measure for association patterns. In KDD, 2002.
[19] A. Ntoulas, P. Zerfos, and J. Cho. Downloading textual
hidden web content through keyword queries. In JCDL,
pages 100–109, 2005.
[20] J. Pei, J. Hong, and D. Bell. A robust approach to schema
matching over web query interfaces. In ICDE, pages 46–55.
IEEE Press, 2006.
[21] E. Rahm and P. A. Bernstein. A survey of approaches to
automatic schema matching. VLDB Journal, 10(4):334–350,
2001.
[22] A. D. Sarma, X. Dong, and A. Halevy. Bootstrapping
pay-as-you-go data integration systems. In SIGMOD, 2008.
[23] P. Shvaiko and J. Euzenat. A survey of schema-based
matching approaches. Journal on Data Semantics,
4:146–171, 2005.
[24] W. Su, J. Wang, and F. Lochovsky. Holistic query interface
matching using parallel schema matching. In Advances in
Database Technology - EDBT, pages 77–94, 2006.
[25] J. Wang, J.-R. Wen, F. Lochovsky, and W.-Y. Ma. Instance-based
schema matching for web databases by domain-specific
query probing. In VLDB, pages 408–419, 2004.
[26] W. Wu and A. Doan. Webiq: Learning from the web to match
deep-web query interfaces. In ICDE, pages 44–54, 2006.
[27] W. Wu, C. Yu, A. Doan, and W. Meng. An interactive
clustering-based approach to integrating source query
interfaces on the deep web. In SIGMOD, pages 95–106,
2004.
[28] J. Xu and J. Callan. Effective retrieval with distributed
collections. In Proceedings ACM SIGIR, pages 112–120,
1998.
[29] C. Yu, K.-L. Liu, W. Meng, Z. Wu, and N. Rishe. A
methodology to retrieve text documents from multiple
databases. IEEE Trans. on Knowl. and Data Eng.,
14(6):1347–1361, 2002.