Holistic Query Interface Matching using Parallel Schema Matching

Holistic Query Interface Matching using Parallel Schema Matching
Weifeng Su
Uni. of Sci. & Tech.
Kowloon, Hong Kong
[email protected]
Jiying Wang
City University
Kowloon, Hong Kong
[email protected]
Abstract
Using query interfaces of different Web databases, we
propose a new complex schema matching approach, Parallel Schema Matching (PSM). A parallel schema is formed
by comparing two individual schemas and deleting common
attributes. The attribute matching can be discovered from
the attribute-occurrence patterns if many parallel schemas
are available. A count-based greedy algorithm identifies
which attributes are more likely to be matched. Experiments
show that PSM can identify both simple matching and complex matching accurately and efficiently.
1 INTRODUCTION
Considering that there are thousands of Web databases
available today, it is very time consuming for an ordinary
user to query and retrieve information from all the relevant
databases. Since most Web databases are only accessible
through a query interface, a system/tool that helps a user locate information in numerous Web databases must be able
to understand the query interfaces and help dispatch user
queries to suitable fields of those interfaces. The main challenge of such a system is that different databases may use
different fields or terms to represent the same concept. For
example, to describe the genre of a CD in the MusicRecords
domain, Category is used in some databases while Style is
used in others. In the Books domain, First Name and Last
Name are used in some databases while Author is used in
others to denote the writer of a book.
This paper specifically focuses on matching attributes
across query interfaces of structured Web databases. We define an entry or field in a query interface as an attribute, and
all attributes in the query interface as a schema of the interface. When matching the attributes, we call a 1:1 matching,
such as Category with Style, a simple matching and a 1:n
or m:n matching, such as First Name, Last Name with
Author, a complex matching. In the latter case, attributes
First Name and Last Name form a concept group before
Frederick Lochovsky
Uni. of Sci. & Tech.
Kowloon, Hong Kong
[email protected]
they are matched to attribute Author. We call attributes that
are in the same concept group grouping attributes and attributes that are semantically identical or similar to each
other synonym attributes.
We propose a new interface schema matching approach,
Parallel Schema Matching (PSM), that matches all input
schemas holistically instead of matching two schemas at a
time to take advantage of the occurrence patterns of the attributes. To better discover the frequent/rare co-presence of
the attributes, we form parallel schemas by comparing two
schemas and deleting their common attributes. Both time
complexity analysis and experimental results show that the
PSM approach can efficiently discover simple and complex
matchings at the same time with high accuracy while requiring no domain knowledge or user interaction.
2 RELATED WORK
Current solutions for the schema matching problem [1,
3, 4, 6, 7, 8, 9, 10, 12] suffer from the following limitations:
1. simple matching: most schema matching methods can
only discover simple matchings between schemas.
2. low accuracy: the accuracy of methods that can identify complex matchings is generally unsatisfactory.
3. time consuming: some schema matching methods have
time complexity exponential in the number of attributes.
4. domain knowledge required: some schema matching
methods require domain knowledge or user interaction
before or during the matching process.
DCM [5], which discovers complex matchings holistically using data mining techniques, is based on similar
observations as PSM, but has the following disadvantages
compared to PSM:
01 f10
1. DCM uses H = ff+1
f1+ to measure the negative correlation between two attributes, by which the synonym
attributes are discovered. Such a measure may give a
high score for rare attributes, while PSM’s matching
score measure will not.
2. The time complexity of DCM is exponential with respect to the number of attributes n, while PSM’s time
complexity is polynomial with respect to n.
3 PARALLEL SCHEMA MATCHING
We observe that Web databases in the same domain usually share the following characteristics:
1. They are usually semantically similar to each other.
2. An attribute is usually unambiguous in a domain although it may have more than one meaning in an ordinary comprehensive dictionary.
3. Synonym attributes are rarely co-present in the same
interface schema.
4. Grouping attributes are usually co-present in the same
interface schema to form a “larger” concept.
Figure 1: Parallel Schema Matching Workflow.
We formalize the schema matching problem as the same
problem described in [5]. The input is a set of schemas
S = {S1 , . . . , Su }, in which each schema Si (1 ≤ i ≤ u)
contains a set of attributes extracted from a query interface
and the set of attributes A = ∪ui=1 Si = {A1 , . . . , An } includes all attributes in S. We assume that these schemas
come from the same domain. The schema matching problem is to find all matchings M = {M1 , . . . , Mv } including both simple and complex matchings. A matching Mj
(1 ≤ j ≤ v) is represented as Gj1 = Gj2 = . . . = Gjw ,
where Gjk (1 ≤ k ≤ w) is a group of attributes and Gjk
is a subset of A, i.e., Gjk ⊂ A. Each matching Mj should
represent the semantic synonym relationship between two
attribute groups Gjk and Gjl (l 6= k), and each group Gjk
should represent the grouping relationship between the attributes within it1 . More specifically, based on observations
2 and 4 above, we restrict each attribute to appear no more
than one time in M.
The workflow of PSM is shown in Figure 1. Before the
matching discovery, two scores, matching score, which is
used to evaluate the possibility that two attributes are synonym attributes, and grouping score, which is used to evaluate the possibility that two attributes are in the same group
in a matching, are calculated between every two attributes.
Synonym Attribute Candidate Generation takes all
schemas as input and generates all synonym attribute candidates based on observation 3. If there are n attributes in
the input schemas, the maximum number of synonym attribute candidates is Cn2 = n(n−1)
. However, not any two
2
attributes from A can be actual candidates for synonym attributes. We assume that two attributes (Ap , Aq ) are synonym attribute candidates if Ap and Aq are co-present in
less than Tpq schemas where Tpq is defined as:
Tpq =
1 An
(Cp + Cq ) ln u
,
u
and Cp and Cq are the count of attribute Ap and Aq in S.
Parallel Schema Construction takes every pair of input
schemas and constructs a set of parallel schemas in three
steps. First, every two different schemas are paired to form
preliminary parallel schemas. If there are u input schemas,
Cu2 = u(u−1)
preliminary parallel schemas will be ob2
tained. Second, in every preliminary parallel schema, common attributes that are shared by its two attribute sets are
deleted based on observation 2. Finally, parallel schemas
that contain at least one empty attribute set are discarded.
After deleting the common attributes, the relationship
between attribute Ap ’s count in the input schema set Q with
Ap ’s count in the parallel schema set S is
Dp = Cp (u − Cp ),
(2)
where Cp and Dp are the count of Ap in S and Q, respectively.
Matching Score Calculation calculates matching scores
based on the co-presence count of the attributes in the parallel schemas.
Definition 1 Given a parallel schema Ql = (Sl1 , Sl2 ) and
two attributes Ap ∈ Ql and Aq ∈ Ql , if Ap ∈ Sl1 and
Aq ∈ Sl2 or vice versa, Ap and Aq are defined as being
˙ l.
cross co-present in Ql , denoted as (Ap , Aq )∈Q
The cross co-presence count of two attributes in the parallel
schema set is used to calculate the matching score between
them. During the calculation, we assume that each parallel schema provides the same amount of information about
matching, thus each cross co-present attribute pair in it will
gain equal cross co-presence weight.
(1)
Definition 2 Given a set of parallel schemas Q = {Ql =
(Sl1 , Sl2 ), l = 1 . . . m} and two attributes Ap and Aq , a
attribute group can have just one attribute.
2
Domain
Airfares
CarRentals
Discovered Matching
{departure date (datetime), return date (datetime)} = {depart (datetime), return (datetime)}
{adult (integer), children (integer), infant (integer), senior (integer)} ={passenger (integer)}
{destination (string)} = {from (string), to (string)} ={arrival city (string), departure city (string)}
{cabin (string)} = {class (string)}
{drop off city (string), pick up city (string)} ={drop off location (string), pick up location (string)}
{drop off (datetime), pick up (datetime)= { pick up date (datetime),
drop off date (datetime), pick up time (datetime), drop off time (datetime)}
Correct?
Y
Y
P
Y
Y
Y
Table 1: Discovered matchings for Airfares and CarRentals (T =10%).
weighted cross co-presence count of Ap and Aq in Q is defined as the sum of the cross co-presence weight that Ap
and Aq gains in each parallel schema of Q, denoted as
P
1
D̃pq = (Ap ,Aq )∈Q
˙ l |Sl1 ||Sl2 | .
benefit of filtering bad matchings in favor of good ones. Another interesting and beneficial characteristic of this algorithm is that it is matching score centric, i.e., the matching
score plays a much more important role than the grouping
score.
For any two attributes Ap and Aq , a matching score Xpq
measures the possibility that Ap and Aq are synonym attributes. The bigger the score, the more likely that the two
attributes are synonym attributes. The matching score calculation between Ap and Aq is similar to Dice coefficient:
(
0
if (Ap , Aq ) ∈
/L
Xpq =
(3)
2D̃pq
otherwise,
(Cp +Cq )
4 EXPERIMENTS
We use the TEL-8 and BAMM datasets from the UIUC
Web integration repository [2]. The TEL-8 dataset contains query interface schemas extracted from 447 deep Web
sources of eight representative domains where each domain
contains about 20-70 schemas. The BAMM dataset contains query interface schemas extracted from four domains
where each domain has about 50 schemas.
where L is the set of synonym attribute candidates and Cp
and Cq are the count of attributes Ap and Aq in S.
This matching score has the following properties:
1. null invariance [11]. For any two attributes, adding
more schemas that do not contain the attributes does
not affect their matching score.
2. rareness differentiation. The matching score between
rare and the other attributes is distinguishably low.
We evaluate the set of matchings automatically discovered by PSM, denoted by Mp , by comparing it with the set
of matchings manually collected by a domain expert, denoted by Mc . To facilitate comparison, we adopt the metric
in [5], target accuracy, which evaluates how similar Mp is
to Mc . The target accuracy metric includes target precision
and target recall. The target precision and target recall of
Mp with respect to Mc are the weighted average of all the
attributes’ target precision and target recall.
Grouping Score Calculation takes all schemas as input
and calculates the grouping score between every two attributes based on observation 4. We use the following
grouping score measure between two attributes Ap and Aq :
Ypq
Cpq
=
,
min(Cp , Cq )
Result on the TEL-8 dataset: Table 1 shows the matchings discovered by PSM in the Airfares and CarRentals domains, when T is set at 10%. We see that PSM can identify
very complex matchings among attributes. Table 2 presents
the performance of PSM on TEL-8 when Tg is set to 0.9.
As expected, the performance of PSM decreases for rare attributes because the occurrence pattern of the rare attributes
is not obvious with only a few occurrences. Nevertheless,
the performance of PSM is almost always better than the
performance2 of DCM in [5], shown in Table 3.
(4)
where Cpq is the co-presence count of attributes Ap and Aq
in S, and Cp and Cq are the count of attributes Ap and Aq
in S.
We also set a grouping score threshold Tg such that attributes Ap and Aq will be considered grouping attributes
only when Ypq > Tg . Practically, Tg should be close to 1 as
the grouping attributes are expected to co-occur often. Tg is
an empirical parameter and the experimental results show
that it has similar performance in a wide range.
Schema Matching Discovery uses an iterative algorithm
where in each iteration, a greedy selection strategy is used
to choose the synonym attribute candidates with the highest matching score, until there is no synonym attribute candidate available. The greediness of this algorithm has the
Result on the BAMM dataset: The performance of PSM
on BAMM is shown in Table 4, when Tg is set to be 0.9. We
can see that PSM performs well on BAMM too, except that
the target precision in the Automobiles domain is low when
T =10%. The reason is similar to that for TEL-8.
2 The
3
precision of DCM when T =5% is not available in [5].
T =20%
PT
RT
Domain
T =10%
PT
RT
T =5%
PT
RT
Airfares
Automobiles
Books
CarRentals
Hotels
Jobs
Movies
MusicRecords
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
.89
.72
1
1
.74
.94
1
1
.91
1
1
1
1
.90
.76
.67
.64
.60
.70
.72
.62
.86
.88
1
.78
.88
.72
1
.88
Average
1
1
.92
.98
.70
.88
Domain
T =20%
PT
RT
T =10%
PT
RT
Airfares
Automobiles
Books
CarRentals
Hotels
Jobs
Movies
MusicRecords
1
1
1
.72
.86
1
1
1
1
1
1
1
1
.86
1
1
1
.93
1
.72
.86
.78
1
.76
.71
1
1
.60
.87
.87
1
1
Average
.95
.98
.88
.88
T =10%
PT
RT
T =5%
PT
RT
Automobiles
Books
Movies
MusicRecords
1
1
1
1
1
1
1
1
.55
.86
1
.81
1
1
1
1
.75
.82
.90
.72
1
1
.86
1
Average
1
1
.81
1
.80
.97
Table 4: Target accuracy of PSM on BAMM dataset (Tg =0.9).
Table 2: Target accuracy of PSM on TEL-8 dataset (Tg =0.9).
Domain
T =20%
PT
RT
[5]
[6]
[7]
Table 3: Target accuracy of DCM on TEL-8 dataset.
[8]
5 SUMMARY
[9]
We present a parallel schema matching approach to
holistically discover attribute matchings across Web query
interfaces. PSM is purely based on the occurrence patterns
of attributes and requires neither domain-knowledge nor
user interaction. Experimental results show that PSM discovers both simple and complex matchings with very high
accuracy in time polynomial to the number of attributes
and the number of schemas.
[10]
[11]
[12]
Acknowledgment: This research was supported by the
Research Grants Council of Hong Kong under grant
HKUST6172/04E.
References
[1] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni,
R. Guidetti, G. Malvezzi, M. Melchiori, and M. Vincini.
Information integration: The momis project demonstration.
In 26th Int. Conf. Very Large Data Bases, pages 611–614,
2000.
[2] K. C.-C. Chang, B. He, C. Li, and Z. Zhang. The
UIUC Web integration repository.
Computer Science
Department, University of Illinois at Urbana-Champaign.
http://metaquerier.cs.uiuc.edu/repository, 2003.
[3] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. imap: Discovering complex semantic matches between
database schemas. In ACM SIGMOD Conference, pages 383
– 394, 2004.
[4] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling
schemas of disparate data sources: A machine-learning ap-
4
proach. In ACM SIGMOD Conference, pages 509 – 520,
2001.
B. He and K. C.-C. Chang. Discovering complex matchings across Web query interfaces: A correlation mining approach. In ACM SIGKDD Conference, pages 147 – 158,
2004.
B. He, K. C.-C. Chang, and J. Han. Statistical schema
matching across Web query interfaces. In ACM SIGMOD
Conference, pages 217 – 228, 2003.
W. Li, C. Clifton, and S. Liu. Database Integration using Neural Network: Implementation and Experience. In
Knowledge and Information Systems,2(1), pages 73–96,
2000.
S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity
flooding: A versatile graph matching algorithm. In 18th Int.
Conf. on Data Engineering, pages 117–128, 2002.
T. Milo and S. Zohar. Using schema matching to simplify
heterogeneous data translation. In 24th Int. Conf. Very Large
Data Bases, pages 122–133, 1998.
E. Rahm and P. A. Bernstein. A survey of approaches to
automatic schema matching. The VLDB Journal, 10:334–
350, 2001.
P. Tan, V. Kumar, and J. Srivastava. Selecting the right
interestingness measure for association patterns. In ACM
SIGKDD Conference, pages 32 – 41, 2002.
W. Wu, C. Yu, A. Doan, and W. Meng. An interactive
clustering-based approach to integrating source query interfaces on the deep Web. In ACM SIGMOD Conference, pages
95–106, 2004.