COMA – A system for flexible combination of schema matching approaches
Hong-Hai Do, Erhard Rahm
University of Leipzig, Germany
dbs.uni-leipzig.de

Content
• Motivation
• The COMA approach
• Comprehensive matcher library
• Flexible combination scheme
• Novel reuse-oriented match approach
• Evaluation setup and results
• Conclusions and future work

Motivation
Schema matching: finding semantic correspondences between two schemas
• Crucial step in many applications
  - Data integration: mediators, data warehouses
  - E-Business: XML message mapping
  - ...
• Currently manual, time-consuming, tedious
  - Need for approaches to automate the task as much as possible
[Figure: two purchase-order schemas. PO1 contains Customer (custName, custCity, custStreet, custZip) and ShipTo (shipToStreet, shipToCity, shipToZip). PO2 contains DeliverTo and BillTo, which share Address (City, Street, Zip).]
Example correspondence: PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City

Individual Match Approaches
• Schema-based
  - Element
    - Linguistic: names, descriptions
    - Constraint-based: types, keys
  - Structure
    - Constraint-based: parents, children, leaves
• Instance-based
  - Element
    - Linguistic: IR (word frequencies, key terms)
    - Constraint-based: value pattern and ranges
• Reuse-oriented
  - Element: dictionaries, thesauri
  - Structure: previous match results
Survey paper [Rahm, Bernstein - VLDB Journal’01]

Combining Match Approaches
• Combination of match algorithms
  - Hybrid: fixed combination, difficult to extend and improve
    - currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm
  - Composite: combination of the results of independently executed matchers
    - currently only for machine learning-based techniques: LSD, GLUE
• COMA: Framework for flexible COmbination of MAtch algorithms
  - Extensible matcher library
  - Combination scheme with various combination strategies
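
System Architecture
[Figure: COMA system architecture. Schema import loads the input schemas S1 and S2. Each match iteration runs matcher execution (Matcher 1, Matcher 2, Matcher 3, UserFeedback, taken from the matcher library) to fill a similarity cube, followed by the combination of match results (governed by the combination scheme) into the mappings S1→S2 and S2→S1. User interaction is optional.]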

Combination Scheme
The matchers fill a similarity cube (matchers × S1 elements × S2 elements), which is combined in four steps:
1. Aggregation of matcher-specific results (similarity cube → similarity matrix): Average, Max, Min, Weighted
2. Match direction: SmallLarge, LargeSmall, Both
3. Selection of match candidates (similarity matrix → match results for S1→S2 and S2→S1): MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
4. Computation of combined similarity (match results → one value for the schema pair, e.g. [S1, S2, 0.7]): Dice, Average
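
The selection strategies named in step 3 of the combination scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not COMA's implementation: the function names are hypothetical, Threshold is assumed to be strict (similarity must exceed the threshold), MaxDelta is assumed to use an absolute tolerance below the best similarity, and the similarity values are made up.

```python
from typing import Dict, List, Tuple

# Candidate element -> aggregated similarity for one element of the other schema.
Candidates = Dict[str, float]
Ranked = List[Tuple[str, float]]


def _ranked(cands: Candidates) -> Ranked:
    """Sort candidates by descending similarity."""
    return sorted(cands.items(), key=lambda kv: kv[1], reverse=True)


def select_max_n(cands: Candidates, n: int = 1) -> Ranked:
    """MaxN: keep the n candidates with the highest similarity."""
    return _ranked(cands)[:n]


def select_threshold(cands: Candidates, t: float) -> Ranked:
    """Threshold: keep candidates whose similarity exceeds t (strictness is an assumption)."""
    return [(e, s) for e, s in _ranked(cands) if s > t]


def select_max_delta(cands: Candidates, delta: float) -> Ranked:
    """MaxDelta: keep the best candidate and all within delta of it (absolute delta is an assumption)."""
    if not cands:
        return []
    best = max(cands.values())
    return [(e, s) for e, s in _ranked(cands) if s >= best - delta]


def select_threshold_max_delta(cands: Candidates, t: float, delta: float) -> Ranked:
    """Threshold+MaxDelta: apply the threshold first, then the tolerance."""
    return select_max_delta(dict(select_threshold(cands, t)), delta)


# Hypothetical candidates for one target element.
cands = {"shipToCity": 0.7, "shipToStreet": 0.6, "custCity": 0.4}
print(select_max_n(cands, 1))
print(select_threshold(cands, 0.5))
print(select_threshold_max_delta(cands, 0.5, 0.05))
```

Combining a threshold with a tolerance keeps only candidates that are both good in absolute terms and close to the best one, which is the behavior the results slides describe as "above threshold + within tolerance".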

Match Processing: Example
Input: S1 = {shipToCity, shipToStreet}, S2 = {City}, two matchers.

S1 element    S2 element  Matcher1  Matcher2
shipToCity    City        0.6       0.8
shipToStreet  City        0.8       0.4

1. Aggregation (Average):
   shipToCity ↔ City: 0.7
   shipToStreet ↔ City: 0.6

2. Direction: |S1| > |S2|, so LargeSmall (match candidates for the smaller schema S2)

S2 element  S1 element    Sim
City        shipToCity    0.7
City        shipToStreet  0.6

3. Selection:
   Max1:           City ↔ shipToCity (0.7)
   Threshold(0.5): City ↔ shipToCity (0.7), City ↔ shipToStreet (0.6)
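
The example can be replayed in Python to check the arithmetic. The sketch below is illustrative only: the data layout and function names are assumptions, the per-matcher values are assigned to element pairs in the way that is consistent with the averages 0.7 and 0.6 on the slide, and aggregated values are rounded to two decimals to avoid floating-point noise.

```python
# Similarity cube of the example: matcher -> (S1 element, S2 element) -> similarity.
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}


def aggregate_average(cube):
    """Step 1: average the matcher-specific similarities per element pair."""
    pairs = next(iter(cube.values())).keys()
    return {
        pair: round(sum(m[pair] for m in cube.values()) / len(cube), 2)
        for pair in pairs
    }


def candidates_large_small(matrix):
    """Step 2 (LargeSmall): rank S1 candidates for each element of the smaller schema S2."""
    by_s2 = {}
    for (s1, s2), sim in matrix.items():
        by_s2.setdefault(s2, []).append((s1, sim))
    for cands in by_s2.values():
        cands.sort(key=lambda c: c[1], reverse=True)
    return by_s2


matrix = aggregate_average(cube)
ranked = candidates_large_small(matrix)

# Step 3: two alternative selections over the ranked candidates.
max1 = {s2: cands[:1] for s2, cands in ranked.items()}
threshold_05 = {s2: [c for c in cands if c[1] > 0.5] for s2, cands in ranked.items()}

print(matrix)
print(max1)
print(threshold_05)
```

Max1 returns only the best candidate for City, while Threshold(0.5) also keeps shipToStreet because its aggregated similarity of 0.6 lies above the threshold.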

Matcher Library
Type            Matcher       Schema info         Auxiliary info                       Constituent matchers
Simple          Affix         Element names       –                                    –
                n-gram        Element names       –                                    –
                Soundex       Element names       –                                    –
                EditDistance  Element names       –                                    –
                Synonym       Element names       External dictionaries                –
                DataType      Data types          Data type compatibility table        –
                UserFeedback  –                   User-specified (mis-)matches         –
Hybrid          Name          Element names       –                                    Affix, 3-Gram, Synonym
                TypeName      Data types + names  –                                    DataType, Name
                NamePath      Names + paths       –                                    Name
                Children      Child elements      –                                    TypeName
                Leaves        Leaf elements       –                                    TypeName
Reuse-oriented  Schema        –                   Existing schema-level match results  –
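
To make the notion of a simple name matcher concrete, here is a hedged Python sketch of an n-gram matcher (3-Gram by default). The slides do not say how COMA scores n-gram overlap, so the Dice coefficient over the sets of n-grams is an assumption, as are the function names and the lower-casing of names.

```python
def ngrams(name: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lower-cased element name."""
    s = name.lower()
    if len(s) < n:
        # Names shorter than n are treated as a single gram (assumption).
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the n-gram sets of two names, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(ngram_similarity("shipToCity", "City"))
print(ngram_similarity("City", "City"))
print(ngram_similarity("Zip", "Street"))
```

A hybrid matcher such as Name would then combine this score with those of its other constituent matchers (Affix, Synonym) using the aggregation step of the combination scheme.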

Reuse-oriented Matching
[Figure: MatchCompose example. m1 relates S1 (firstName, lastName) to an intermediate schema S (FName, LName), and m2 relates S to S2 (Name); the edges of m1 and m2 carry the similarities 0.8, 0.6, 0.6 and 0.7. The composed result m = MatchCompose(m1, m2) relates firstName ↔ Name with 0.7 and lastName ↔ Name with 0.65.]
• The MatchCompose operation: transitivity of element similarity
  - Composition of similarity relationships
  - Reuse of multiple match correspondences
    - vs. reuse of single element-level correspondences from synonym tables, thesauri
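
A minimal Python sketch of MatchCompose follows. It assumes that the similarity of a composed correspondence is the average of the two similarities along the path, which reproduces the values 0.7 and 0.65 in the figure; the assignment of the four edge similarities to individual edges is likewise an assumption, and keeping the maximum when several paths reach the same pair is a choice made here for illustration.

```python
from typing import Dict, Tuple

# (source element, target element) -> similarity
Mapping = Dict[Tuple[str, str], float]


def match_compose(m1: Mapping, m2: Mapping) -> Mapping:
    """Compose m1: S1<->S and m2: S<->S2 into a mapping S1<->S2."""
    composed: Mapping = {}
    for (a, x), sim1 in m1.items():
        for (y, b), sim2 in m2.items():
            if x != y:
                continue  # no shared intermediate element
            sim = (sim1 + sim2) / 2  # average along the path (assumption)
            # If several paths connect a and b, keep the strongest one (assumption).
            composed[(a, b)] = max(sim, composed.get((a, b), 0.0))
    return composed


# Edge similarities assigned so that the composed values match the figure.
m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.6}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.7}

for pair, sim in match_compose(m1, m2).items():
    print(pair, round(sim, 2))
```

Averaging makes the composed similarity lower than the stronger of the two edges, which is one simple way to reflect that transitivity of similarity does not always hold.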

Schema-level Reuse
• The Schema matcher:
[Figure: for a match problem S1 ↔ S2, the Schema matcher searches the repository of existing match results for pairs that share a third schema (S1 ↔ Si and S2 ↔ Si; S1 ↔ Sj and Sj ↔ S2; Sk ↔ S1 and S2 ↔ Sk). MatchCompose turns each pair into an S1 ↔ S2 result, the results are collected in a similarity cube, and Aggregation, Direction and Selection produce the match result.]
• Reuse complete match results at the schema level
• Exploit all possible reuse opportunities
• Limit negative effects of transitivity
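
The repository search of the Schema matcher can be illustrated with a short Python sketch. It only covers finding the intermediate schemas through which two stored match results can be composed; the repository layout (a list of schema-name pairs), the function name and the stored pairs are hypothetical.

```python
from typing import Iterable, List, Tuple


def intermediate_schemas(stored: Iterable[Tuple[str, str]], s1: str, s2: str) -> List[str]:
    """Return schemas Si for which match results exist for both S1<->Si and Si<->S2."""
    neighbours = {s1: set(), s2: set()}
    for a, b in stored:
        # A stored match result can be used in either direction.
        for x, y in ((a, b), (b, a)):
            if x in neighbours and y not in (s1, s2):
                neighbours[x].add(y)
    return sorted(neighbours[s1] & neighbours[s2])


# Hypothetical repository content, using the schema names of the evaluation.
stored = [
    ("CIDX", "Excel"),
    ("Noris", "Excel"),
    ("CIDX", "Paragon"),
    ("Apertum", "Noris"),
    ("Paragon", "Noris"),
]
print(intermediate_schemas(stored, "CIDX", "Noris"))
```

Each intermediate schema found this way yields one composed S1 ↔ S2 result, so several reuse opportunities produce several layers of the similarity cube that are then aggregated like ordinary matcher results.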

Real-world Evaluation
• 5 real-world schemas (XML purchase orders), 10 match tasks
  - CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
  - 40-145 elements
• Systematic evaluation (automatic mode)
  - 1 series = 10 experiments: test of 1 configuration of (Matcher, Aggregation, Direction, Selection, Combined similarity) with 10 match tasks
  - 12,312 series = 123,120 experiments
• Tested alternatives per step:
  - Matchers (Σ = 16 + 14): no reuse: 5 single, 11 combinations; reuse: 2 single, 12 combinations
  - Aggregation (3): Max, Average, Min
  - Direction (3): LargeSmall, SmallLarge, Both
  - Selection (36): MaxN(1-4), Delta(0.01-0.1), Threshold(0.3-1.0), Threshold(0.5)+MaxN(1-4), Threshold(0.5)+Delta(0.01-0.1)
  - Combined similarity (2): Average, Dice
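
The size of the configuration space can be checked by enumeration. The Python sketch below rests on assumptions that are not stated on the slide: step sizes of 1 for MaxN, 0.01 for Delta and 0.1 for Threshold, and no aggregation step for single matchers. Under these assumptions it reproduces the 36 selection strategies and the 8208 no-reuse series reported with the results; it does not attempt to reproduce the count for the reuse matchers.

```python
from itertools import product

aggregation = ["Max", "Average", "Min"]
direction = ["LargeSmall", "SmallLarge", "Both"]
combined = ["Average", "Dice"]

# Selection strategies; the step sizes are assumptions that yield 36 strategies.
selection = (
    [f"MaxN({n})" for n in range(1, 5)]
    + [f"Delta({d / 100:.2f})" for d in range(1, 11)]
    + [f"Threshold({t / 10:.1f})" for t in range(3, 11)]
    + [f"Threshold(0.5)+MaxN({n})" for n in range(1, 5)]
    + [f"Threshold(0.5)+Delta({d / 100:.2f})" for d in range(1, 11)]
)


def count_series(n_single: int, n_combinations: int) -> int:
    """Count series, assuming single matchers need no aggregation step."""
    per_single = len(list(product(direction, selection, combined)))
    per_combination = len(list(product(aggregation, direction, selection, combined)))
    return n_single * per_single + n_combinations * per_combination


print(len(selection))        # number of selection strategies
print(count_series(5, 11))   # no-reuse series: 5 single matchers, 11 combinations
```

With 10 match tasks per series, the 12,312 series of the full evaluation correspond to the 123,120 experiments stated above.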

Match Quality Measures
• Comparison of automatically with manually (i.e. real) derived match correspondences
[Figure: overlapping sets of real matches and suggested matches. A: missed matches (real only), B: correct matches (in both), C: false matches (suggested only).]
• Quality measures:
\[ \mathit{Precision} = \frac{B}{B + C} \qquad \mathit{Recall} = \frac{B}{A + B} \]
  SimilarityFlooding [ICDE02]:
\[ \mathit{Overall} = 1 - \frac{A + C}{A + B} = \frac{B - C}{A + B} = \mathit{Recall} \cdot \left( 2 - \frac{1}{\mathit{Precision}} \right) \]
• Overall: post-match effort to add missed and to remove false matches; negative Overall → no gain
• Computed for single experiments and averaged over 10 experiments for each series (average Overall, etc.)
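
The three measures follow directly from the set sizes A, B and C. The Python sketch below computes them from a set of real and a set of suggested correspondences; the function name and the example correspondences are hypothetical, and returning 0.0 for empty sets is a convention chosen here.

```python
from typing import Set, Tuple

Correspondence = Tuple[str, str]


def match_quality(real: Set[Correspondence], suggested: Set[Correspondence]):
    """Return (precision, recall, overall) for one experiment."""
    a = len(real - suggested)   # A: missed matches
    b = len(real & suggested)   # B: correct matches
    c = len(suggested - real)   # C: false matches
    precision = b / (b + c) if b + c else 0.0
    recall = b / (a + b) if a + b else 0.0
    overall = (b - c) / (a + b) if a + b else 0.0
    return precision, recall, overall


# Hypothetical correspondences: 3 correct, 1 missed, 1 false.
real = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
suggested = {("a", "x"), ("b", "y"), ("c", "z"), ("e", "v")}
print(match_quality(real, suggested))

# More false than correct matches gives a negative Overall.
print(match_quality(real, {("a", "x"), ("e", "v"), ("f", "u")}))
```

Overall turns negative as soon as Precision drops below 0.5, since removing the false matches then costs more than the correct matches save.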

Results: Combination Strategies (1)
[Figure: number of no-reuse series per average Overall range (x-axis: Overall, from Min and 0.0 to 0.8; y-axis: #Series; #All Series = 8208). Bar labels: 7077 in the Min (negative) range, then 270, 207, 179, 136, 160, 114, 62, 3, 0.]
• Most no-reuse series have negative average Overall
• “Good” matcher/strategy:
  - Positive average Overall
  - High presence in higher Overall ranges
[Figure: Aggregation (2376 series/strategy). Series share (0% to 100%) of Max, Min and Average per Overall range (Min, 0.0 to 0.8). Annotation: Average is used by all series with average Overall > 0.6.]
• Aggregation: Average (compensating)

Results: Combination Strategies (2)
• Direction: Both (considering both directions)
• Selection: Threshold+Delta (above threshold + within tolerance)
• Combined similarity: Average (pessimistic)
• Matcher: All (combination of all hybrid matchers)
[Figure: Direction (2736 series/strategy). Series share of LargeSmall, SmallLarge and Both per Overall range (Min, 0.0 to 0.8).]
[Figure: Best selection (228 series/strategy). Series share of Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02) and Thr(0.5)+Delta(0.02) per Overall range (Min, 0.0 to 0.8).]
[Figure: Computation of combined similarity (4104 series/strategy). Series share of Dice and Average per Overall range (Min, 0.0 to 0.8).]

Results: Single Matchers
[Figure: a) Single matchers. Average Precision, average Recall and average Overall (y-axis from -0.3 to 1) for the no-reuse matchers Name, NamePath, TypeName, Children, Leaves and the reuse matchers SchemaM, SchemaA.]
• SchemaM: Schema with manually derived (real) match results
• SchemaA: Schema with match results automatically derived using the default match operation
• Instability of some single (hybrid) matchers (negative Overall) because of shared elements
  - E.g. DeliverTo.Address and BillTo.Address
• Considering hierarchical names (NamePath) more accurate
• Schema-level reuse very effective:
  - Essential improvement over no-reuse hybrid matchers
  - Reusing approved match results better than automatically derived match results

Results: Combined Match Approaches
[Figure: b) Matcher combinations. Average Precision, average Recall and average Overall (y-axis from 0 to 1) for no-reuse combinations (e.g. NamePath+Leaves, All) and reuse combinations with SchemaM (e.g. All+SchemaM).]
• All: Combination of all no-reuse hybrid matchers
• All+SchemaM: Combination of all no-reuse hybrid matchers and SchemaM
• Reuse matchers outperform no-reuse matchers
  - Best no-reuse All: 0.73 average Overall (Precision 0.95, Recall 0.78)
  - Best reuse All+SchemaM: 0.82 average Overall (Precision 0.93, Recall 0.89)
• Combinations outperform single hybrid matchers
  - Combined matchers, e.g. All, consider many aspects at the same time
  - NamePath+Leaves: effective scheme, considering paths to identify context of shared elements, and leaves to cope with structural conflicts

Results: Match Sensitivity
[Figure: Overall (No Reuse), Overall (Manual Reuse) and #All Elements per match task (1<->2, 1<->3, 2<->3, 1<->4, 2<->4, 3<->4, 1<->5, 2<->5, 3<->5, 4<->5). Axes: Overall from 0 to 1, # Elements from 0 to 240.]
• Impact of schema characteristics:
  - Degrading match quality with increase of schema size
  - Best combinations: no-reuse All and reuse-oriented All+Schema
    - High stability across different match tasks
    - Little tuning effort for the default match operation

Conclusions and Future Work
• The COMA framework
  - Extensible matcher library, including novel reuse approach
  - Powerful combination scheme for both specifying match operations and constructing new matchers from existing ones
• Comprehensive evaluation on real-world schemas
  - High effectiveness on large schemas
  - Reuse: essential improvement over no-reuse
  - Composite approach as THE solution for matcher combination
• Future work
  - Matchers: more powerful reuse strategies, instance-based matchers
  - More intelligent combination strategies
  - Application to more real-world scenarios, esp. in bioinformatics