COMA – A system for flexible combination of schema matching approaches
Hong-Hai Do, Erhard Rahm
University of Leipzig, Germany
dbs.uni-leipzig.de
Content
Motivation
The COMA approach
  Comprehensive matcher library
  Flexible combination scheme
  Novel reuse-oriented match approach
Evaluation setup and results
Conclusions and future work
Motivation
Schema matching: Finding semantic correspondences
between two schemas
Crucial step in many applications

Data integration: mediators, data warehouses
E-Business: XML message mapping
...
Currently manual, time-consuming, tedious

Need for approaches to automate the task as much as possible
[Figure: two purchase-order schemas.
  PO1: Customer (custName, custCity, custStreet, custZip); ShipTo (shipToStreet, shipToCity, shipToZip)
  PO2: DeliverTo and BillTo, both containing Address (City, Street, Zip)]
PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City
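To make the correspondence above concrete, here is a minimal sketch that models the two purchase-order schemas as nested dictionaries and lists their element paths. The nesting is reconstructed from the slide's figure and should be read as an assumption; the helper name `paths` is hypothetical and not part of COMA.

```python
# Schema trees reconstructed from the slide's figure (an assumption where the
# figure is ambiguous). Each key is an element name, each value its children.
ADDRESS = {"City": {}, "Street": {}, "Zip": {}}

PO1 = {"PO1": {
    "Customer": {"custName": {}, "custCity": {}, "custStreet": {}, "custZip": {}},
    "ShipTo": {"shipToStreet": {}, "shipToCity": {}, "shipToZip": {}},
}}

# Address is a shared element: it occurs under both DeliverTo and BillTo.
PO2 = {"PO2": {
    "DeliverTo": {"Address": ADDRESS},
    "BillTo": {"Address": ADDRESS},
}}


def paths(tree, prefix=()):
    """Yield the dotted path of every element in a nested-dict schema tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield ".".join(path)
        yield from paths(children, path)


po1_paths = list(paths(PO1))
po2_paths = list(paths(PO2))

# The shared Address element gives City two distinct paths in PO2.
city_paths = [p for p in po2_paths if p.endswith(".City")]
print(city_paths)
```

Because `City` is reachable through two parents, a matcher that only compares element names cannot tell the two contexts apart, while the full path can.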
Individual Match Approaches
Schema-based
  Element
    Linguistic: • Names • Descriptions
    Constraint-based: • Types • Keys
  Structure
    Constraint-based: • Parents • Children • Leaves
Instance-based
  Element
    Linguistic: • IR (word frequencies, key terms)
    Constraint-based: • Value pattern and ranges
Reuse-oriented
  Element: • Dictionaries • Thesauri
  Structure: • Previous match results
Survey paper [Rahm, Bernstein - VLDB Journal’01]
Combining Match Approaches
Combination of match algorithms
  Hybrid: fixed combination, difficult to extend and improve
    currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm
  Composite: combination of the results of independently executed matchers
    currently only for machine learning-based techniques: LSD, GLUE
COMA: Framework for flexible COmbination of MAtch algorithms
  Extensible matcher library
  Combination scheme with various combination strategies
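System Architecture
[Figure: COMA system architecture. Schema import loads the input schemas S1 and S2. Each match iteration executes matchers from the matcher library (Matcher 1, Matcher 2, Matcher 3, UserFeedback) and stores their results in a similarity cube. The combination scheme then combines the match results into a mapping for the directions S1→S2 and S2→S1. User interaction is optional.]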
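The matcher execution step of the architecture can be sketched as follows: every matcher is run independently on every pair of elements, and the results are collected per matcher in a similarity cube. The matcher signature, the function name `execute_matchers` and the two toy matchers are assumptions made for illustration; they are not the matchers of the COMA library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical matcher signature: (S1 element, S2 element) -> similarity in [0, 1].
Matcher = Callable[[str, str], float]
# Similarity cube: matcher name -> {(S1 element, S2 element): similarity}.
Cube = Dict[str, Dict[Tuple[str, str], float]]


def execute_matchers(matchers: Dict[str, Matcher],
                     s1: List[str], s2: List[str]) -> Cube:
    """Run each matcher independently on all element pairs of S1 x S2."""
    cube: Cube = {}
    for name, matcher in matchers.items():
        cube[name] = {(a, b): matcher(a, b) for a in s1 for b in s2}
    return cube


def exact_name(a: str, b: str) -> float:
    """Toy matcher: 1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def suffix_name(a: str, b: str) -> float:
    """Toy matcher: 1.0 if one name is a suffix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.endswith(b) or b.endswith(a) else 0.0


cube = execute_matchers(
    {"Exact": exact_name, "Suffix": suffix_name},
    s1=["shipToCity", "shipToStreet"],
    s2=["City"],
)
for matcher_name, layer in cube.items():
    print(matcher_name, layer)
```

Keeping one layer per matcher is what lets the later combination step choose freely among aggregation strategies without re-running any matcher.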
Combination Scheme
1. Aggregation of matcher-specific results: Average, Max, Min, Weighted
   (similarity cube → similarity matrix)
2. Match direction: SmallLarge, LargeSmall, Both
   (candidate rankings S1→S2 and S2→S1, e.g. s1, s2, 0.8)
3. Selection of match candidates: MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
   (candidate rankings → match results)
4. Computation of combined similarity: Dice, Average
   (match results → combined similarity, e.g. [S1, S2, 0.7])
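Step 1 of the scheme, the aggregation of matcher-specific results, can be sketched as a function that collapses the similarity cube into a single similarity matrix. The function name `aggregate` and the dictionary layout are assumptions for illustration, and the Weighted strategy is assumed here to be a weighted mean normalised by the sum of the weights.

```python
def aggregate(cube, strategy="Average", weights=None):
    """Collapse a similarity cube {matcher: {pair: sim}} into a matrix {pair: sim}."""
    matchers = list(cube)
    pairs = list(cube[matchers[0]])
    matrix = {}
    for pair in pairs:
        values = [cube[m][pair] for m in matchers]
        if strategy == "Average":
            matrix[pair] = sum(values) / len(values)
        elif strategy == "Max":
            matrix[pair] = max(values)
        elif strategy == "Min":
            matrix[pair] = min(values)
        elif strategy == "Weighted":
            if weights is None:
                raise ValueError("Weighted aggregation needs a weight per matcher")
            # Assumption: weighted mean, normalised by the total weight.
            total = sum(weights[m] for m in matchers)
            matrix[pair] = sum(weights[m] * cube[m][pair] for m in matchers) / total
        else:
            raise ValueError(f"unknown aggregation strategy: {strategy}")
    return matrix


# Matcher-specific similarities taken from the slide's match processing example.
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}

average = aggregate(cube, "Average")
for pair, sim in average.items():
    print(pair, round(sim, 2))
```

Average lets a high value from one matcher compensate for a low value from another, whereas Max and Min follow the most optimistic or most pessimistic matcher.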
Match Processing: Example
S1 elements: shipToCity, shipToStreet
S2 elements: City
Matcher-specific similarities:
  shipToCity ↔ City: Matcher1: 0.6, Matcher2: 0.8
  shipToStreet ↔ City: Matcher1: 0.8, Matcher2: 0.4
1. Aggregation (Average)
  shipToCity ↔ City: Average: 0.7
  shipToStreet ↔ City: Average: 0.6
2. Direction: LargeSmall, since |S1|>|S2| (match candidates for smaller schema S2)
  S2 elements   S1 elements    Sim
  City          shipToCity     0.7
  City          shipToStreet   0.6
3. Selection
  Max1:
  S2 elements   S1 elements    Sim
  City          shipToCity     0.7
  Threshold(0.5):
  S2 elements   S1 elements    Sim
  City          shipToCity     0.7
  City          shipToStreet   0.6
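The direction and selection steps of this example can be reproduced with a short sketch. It starts from the aggregated similarities 0.7 and 0.6 and ranks S1 candidates for each S2 element (LargeSmall). All function names are hypothetical; Threshold is assumed to keep candidates at or above the threshold, and MaxDelta is assumed to use an absolute tolerance below the best similarity.

```python
from collections import defaultdict


def rank_candidates(matrix, for_side):
    """Rank match candidates per element.

    matrix keys are (S1 element, S2 element) pairs.
    for_side=0 ranks S2 candidates for each S1 element;
    for_side=1 ranks S1 candidates for each S2 element.
    """
    ranked = defaultdict(list)
    for pair, sim in matrix.items():
        element, candidate = pair[for_side], pair[1 - for_side]
        ranked[element].append((candidate, sim))
    for candidates in ranked.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return dict(ranked)


def max_n(candidates, n):
    """MaxN: keep the n candidates with the highest similarity."""
    return candidates[:n]


def threshold(candidates, t):
    """Threshold: keep candidates with similarity >= t (assumed inclusive)."""
    return [c for c in candidates if c[1] >= t]


def max_delta(candidates, delta):
    """MaxDelta: keep the best candidate and those within delta of it (assumed absolute)."""
    if not candidates:
        return []
    best = candidates[0][1]
    return [c for c in candidates if c[1] >= best - delta]


# Aggregated similarity matrix from the example.
matrix = {("shipToCity", "City"): 0.7, ("shipToStreet", "City"): 0.6}

# LargeSmall: S1 is larger, so candidates are ranked for the elements of S2.
ranked = rank_candidates(matrix, for_side=1)

max1 = {e: max_n(c, 1) for e, c in ranked.items()}
thr = {e: threshold(c, 0.5) for e, c in ranked.items()}
delta = {e: max_delta(c, 0.02) for e, c in ranked.items()}
print("Max1:", max1)
print("Threshold(0.5):", thr)
print("MaxDelta(0.02):", delta)
```

Under these assumptions Max1 keeps only `shipToCity` for `City`, while Threshold(0.5) keeps both candidates, which agrees with the two selection tables on the slide.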
Matcher Library
Type            Matcher        Schema Info         Auxiliary Info                        Constituent Matchers
Simple          Affix          Element names       –                                     –
                n-gram         Element names       –                                     –
                Soundex        Element names       –                                     –
                EditDistance   Element names       –                                     –
                Synonym        Element names       External dictionaries                 –
                DataType       Data types          Data type compatibility table         –
                UserFeedback   –                   User-specified (mis-) matches         –
Hybrid          Name           Element names       –                                     Affix, 3-Gram, Synonym
                TypeName       Data Types+Names    –                                     DataType, Name
                NamePath       Names+Paths         –                                     Name
                Children       Child elements      –                                     TypeName
                Leaves         Leaf elements       –                                     TypeName
Reuse-oriented  Schema         –                   Existing schema-level match results   –
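As an illustration of a simple matcher that works on element names only, the sketch below computes an n-gram similarity with n = 3. The slides do not say how COMA's n-gram matcher scores a pair, so the Dice coefficient over the two trigram sets used here is an assumption, and the function names are hypothetical.

```python
def ngrams(name, n=3):
    """Return the set of character n-grams of a lower-cased element name."""
    s = name.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Assumed scoring: Dice coefficient of the two n-gram sets, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


for s1_name in ["shipToCity", "shipToStreet"]:
    print(s1_name, "City", round(ngram_similarity(s1_name, "City"), 2))
```

A hybrid matcher such as Name would combine several simple matchers of this kind on the same pair of names, using the same combination scheme that is applied at the top level.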
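Reuse-oriented Matching
[Figure: MatchCompose example. m1 relates S1 (firstName, lastName) to an intermediate schema S (FName, LName), and m2 relates S to S2 (Name). The input similarities are 0.8 and 0.6 on the firstName, FName, Name path and 0.6 and 0.7 on the lastName, LName, Name path. m = MatchCompose(m1, m2) relates S1 to S2: firstName ↔ Name 0.7, lastName ↔ Name 0.65.]
The MatchCompose operation: Transitivity of element similarity
  Composition of similarity relationships
Reuse of multiple match correspondences
  vs. reuse of single element-level correspondences from synonym tables, thesauri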
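The numbers in the MatchCompose figure are consistent with averaging the two similarities along each path: (0.8 + 0.6) / 2 = 0.7 and (0.6 + 0.7) / 2 = 0.65. The sketch below implements that reading. Which of the two input values belongs to m1 and which to m2 is an assumption, as is the rule of keeping the best value when several intermediate elements connect the same pair.

```python
def match_compose(m1, m2):
    """Compose m1: {(a, b): sim} with m2: {(b, c): sim} into {(a, c): sim}.

    Assumption: the composed similarity is the average of the two input
    similarities; if several intermediate elements b connect a and c,
    the highest composed value is kept.
    """
    composed = {}
    for (a, b1), sim1 in m1.items():
        for (b2, c), sim2 in m2.items():
            if b1 == b2:
                sim = (sim1 + sim2) / 2
                composed[(a, c)] = max(sim, composed.get((a, c), 0.0))
    return composed


# Values from the figure; their assignment to m1 and m2 is assumed.
m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.6}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.7}

m = match_compose(m1, m2)
for pair, sim in m.items():
    print(pair, round(sim, 2))
```

Averaging dampens the composed similarity whenever either input correspondence is weak, which is one simple way to keep transitivity from producing overconfident matches.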
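Schema-level Reuse
The Schema matcher:
[Figure: Schema matcher workflow. For the match problem S1 ↔ S2, the repository of existing match results is searched for pairs of results that share a third schema (S1 ↔ Si and S2 ↔ Si; S1 ↔ Sj and Sj ↔ S2; Sk ↔ S1 and S2 ↔ Sk). MatchCompose derives an S1 ↔ S2 result from each pair. These results are stored in a similarity cube and combined by Aggregation, Direction and Selection into the match result.]
Reuse complete match results at the schema level
Exploit all possible reuse opportunities
Limit negative effects of transitivity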
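The repository search of the Schema matcher can be sketched as a lookup for every third schema that has a stored match result with both S1 and S2, whatever the orientation in which each result was stored. The repository layout (a dictionary keyed by schema-name pairs) and the function name `find_reusable` are assumptions for illustration only.

```python
def find_reusable(repository, s1, s2):
    """Find stored match results that can be composed into an s1 <-> s2 result.

    repository: {(schema_x, schema_y): {(x_element, y_element): sim}}
    Returns a list of (pivot, m1, m2) with m1 oriented s1 -> pivot and
    m2 oriented pivot -> s2, so that each pair is ready for composition.
    """
    def oriented(x, y):
        # Return the stored result in x -> y orientation, inverting it if needed.
        if (x, y) in repository:
            return dict(repository[(x, y)])
        if (y, x) in repository:
            return {(b, a): sim for (a, b), sim in repository[(y, x)].items()}
        return None

    pivots = {s for pair in repository for s in pair} - {s1, s2}
    reusable = []
    for pivot in sorted(pivots):
        m1, m2 = oriented(s1, pivot), oriented(pivot, s2)
        if m1 is not None and m2 is not None:
            reusable.append((pivot, m1, m2))
    return reusable


# Hypothetical repository contents.
repository = {
    ("S1", "Si"): {("firstName", "FName"): 0.8},
    ("S2", "Si"): {("Name", "FName"): 0.6},   # stored in the opposite orientation
    ("Sk", "S1"): {("first", "firstName"): 0.9},  # Sk has no result with S2
}

for pivot, m1, m2 in find_reusable(repository, "S1", "S2"):
    print(pivot, m1, m2)
```

Each returned pair would then be passed to a compose step, and the resulting S1 ↔ S2 candidates form the layers of the similarity cube that Aggregation, Direction and Selection reduce to one match result.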
Real-world Evaluation

5 real-world schemas (XML – Purchase order), 10 match tasks

CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
40-145 elements
Systematic evaluation (automatic mode)

1 Series = 10 Experiments: Test of 1 configuration of (Matcher, Aggregation,
Direction, Selection, Combined similarity) with 10 match tasks
12,312 series = 123,120 experiments
Parameter       Values                                                                                     #
Matchers        No reuse: 5 single, 11 combinations; Reuse: 2 single, 12 combinations                      Σ = 16 + 14
Aggregation     Max, Average, Min                                                                          3
Direction       LargeSmall, SmallLarge, Both                                                               3
Selection       MaxN(1-4), Delta(0.01-0.1), Threshold(0.3-1.0),
                Threshold(0.5)+MaxN(1-4), Threshold(0.5)+Delta(0.01-0.1)                                   36
Combined Sim    Average, Dice                                                                              2
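The evaluation figures can be cross-checked with a few lines of arithmetic. Five schemas give ten pairwise match tasks, and ten experiments per series turn 12,312 series into 123,120 experiments. The count of 36 selection strategies is reproduced below under the assumption that Delta is varied in steps of 0.01 and Threshold in steps of 0.1; the slides give only the ranges, not the step sizes.

```python
from itertools import combinations

schemas = ["CIDX", "Excel", "Noris", "Paragon", "Apertum"]

# Every unordered pair of schemas is one match task.
match_tasks = list(combinations(schemas, 2))

series = 12_312
experiments = series * len(match_tasks)

# Selection strategies, assuming step 0.01 for Delta and step 0.1 for Threshold.
selection_counts = {
    "MaxN(1-4)": 4,
    "Delta(0.01-0.1)": 10,
    "Threshold(0.3-1.0)": 8,
    "Threshold(0.5)+MaxN(1-4)": 4,
    "Threshold(0.5)+Delta(0.01-0.1)": 10,
}
selection_total = sum(selection_counts.values())

print(len(match_tasks), "match tasks")
print(experiments, "experiments")
print(selection_total, "selection strategies")
```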
Match Quality Measures
Comparison of automatically with manually (i.e. real) derived match correspondences
  Real matches = A ∪ B, Suggested matches = B ∪ C
  A: Missed matches
  B: Correct matches
  C: False matches
Quality measures:
  \[ \mathrm{Precision} = \frac{B}{B+C} \qquad \mathrm{Recall} = \frac{B}{A+B} \]
  SimilarityFlooding [ICDE02]:
  \[ \mathrm{Overall} = 1 - \frac{A+C}{A+B} = \frac{B-C}{A+B} = \mathrm{Recall} \cdot \left(2 - \frac{1}{\mathrm{Precision}}\right) \]
Overall: post-match effort to add missed and to remove false matches; negative Overall → no gain
Computed for single experiments and averaged over 10 experiments for each series (average Overall, etc.)
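The three measures follow directly from the sets of real and suggested matches, as the sketch below shows. The function name `match_quality` and the sample correspondences are hypothetical; the zero-division guards are an added convention, not something stated on the slide.

```python
def match_quality(real, suggested):
    """Return (precision, recall, overall) for sets of match correspondences."""
    real, suggested = set(real), set(suggested)
    a = len(real - suggested)   # A: missed matches
    b = len(real & suggested)   # B: correct matches
    c = len(suggested - real)   # C: false matches
    precision = b / (b + c) if b + c else 0.0
    recall = b / (a + b) if a + b else 0.0
    overall = (b - c) / (a + b) if a + b else 0.0
    return precision, recall, overall


# Hypothetical correspondences: 4 real matches, 3 found, 1 false suggestion.
real = {("shipToCity", "City"), ("shipToStreet", "Street"),
        ("shipToZip", "Zip"), ("custName", "Name")}
suggested = {("shipToCity", "City"), ("shipToStreet", "Street"),
             ("shipToZip", "Zip"), ("custCity", "Street")}

precision, recall, overall = match_quality(real, suggested)
print(precision, recall, overall)

# More false than correct matches gives a negative Overall (no gain).
noisy = {("shipToCity", "City"), ("custCity", "Zip"),
         ("custZip", "City"), ("custName", "Street")}
print(match_quality(real, noisy))
```

Overall penalises false matches as heavily as missed ones, so it drops below zero as soon as the suggested set contains more false than correct matches.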
Results: Combination Strategies (1)
[Figure: histogram of the number of series (#Series) per average Overall range, from Min to 0.8; #All Series = 8208. Bar labels: 7077, 270, 207, 179, 136, 160, 114, 62, 3.]
Most no-reuse series have negative average Overall
“Good” matcher/strategy:
  Positive average Overall
  High presence in higher Overall ranges
[Figure: Aggregation (2376 series/strategy). Series share (0% to 100%) of Max, Min and Average per average Overall range, from Min to 0.8.]
Aggregation: Average (compensating)
  Average is used by all series with average Overall > 0.6
Results: Combination Strategies (2)
Direction: Both (considering both directions)
Selection: Threshold+Delta (above threshold + within tolerance)
Combined similarity: Average (pessimistic)
Matcher: All (combination of all hybrid matchers)
[Figure: three panels showing series share (0% to 100%) per average Overall range, from Min to 0.8.
  Direction (2736 series/strategy): LargeSmall, SmallLarge, Both
  Best selection (228 series/strategy): Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02), Thr(0.5)+Delta(0.02)
  Computation of combined similarity (4104 series/strategy): Dice, Average]
Results: Single Matchers
[Figure: a) Single matchers. avg Precision, avg Recall and avg Overall (axis from -0.3 to 1) for the no-reuse matchers Name, NamePath, TypeName, Children and Leaves and for the reuse matchers SchemaM and SchemaA.]
SchemaM: Schema with manually derived (real) match results
SchemaA: Schema with match results automatically derived using the default match operation
Instability of some single (hybrid) matchers (negative Overall) because of shared elements
  E.g. DeliverTo.Address and BillTo.Address
  Considering hierarchical names (NamePath) more accurate
Schema-level reuse very effective:
  Essential improvement over no-reuse hybrid matchers
  Reusing approved match results better than automatically derived match results
Results: Combined Match Approaches
[Figure: b) Matcher combinations. avg Precision, avg Recall and avg Overall (axis from 0 to 1) for no-reuse combinations (including NamePath+Leaves and All) and for reuse combinations with SchemaM (including All+SchemaM).]
All: Combination of all no-reuse hybrid matchers
All+SchemaM: Combination of all no-reuse hybrid matchers and SchemaM
Reuse matchers outperform no-reuse matchers
  Best no-reuse All: 0.73 average Overall (Precision 0.95, Recall 0.78)
  Best reuse All+SchemaM: 0.82 average Overall (Precision 0.93, Recall 0.89)
Combinations outperform single hybrid matchers
  Combined matchers, e.g. All, consider many aspects at the same time
  NamePath+Leaves: effective scheme, considering paths to identify context of shared elements, and leaves to cope with structural conflicts
Results: Match Sensitivity
[Figure: Overall(No Reuse) and Overall(Manual Reuse) (left axis, Overall from 0 to 1) and #All Elements (right axis, # Elements from 0 to 240) for the match tasks 1<->2, 1<->3, 2<->3, 1<->4, 2<->4, 3<->4, 1<->5, 2<->5, 3<->5, 4<->5.]
Impact of schema characteristics:
  Degrading match quality with increase of schema size
Best combinations: no-reuse All and reuse-oriented All+Schema
  High stability across different match tasks
  Little tuning effort for the default match operation
Conclusions and Future Work
The COMA framework
  Extensible matcher library, including novel reuse approach
  Powerful combination scheme for both specifying match operations and constructing new matchers from existing ones
Comprehensive evaluation on real-world schemas
  High effectiveness on large schemas
  Reuse: essential improvement over no-reuse
  Composite approach as THE solution for matcher combination
Future work
  Matchers: more powerful reuse strategies, instance-based matchers
  More intelligent combination strategies
  Application to more real-world scenarios, esp. in bioinformatics