COMA – A system for flexible combination of schema matching approaches

Hong-Hai Do, Erhard Rahm
University of Leipzig, Germany
dbs.uni-leipzig.de

Content
• Motivation
• The COMA approach
  • Comprehensive matcher library
  • Flexible combination scheme
  • Novel reuse-oriented match approach
• Evaluation setup and results
• Conclusions and future work

Motivation
• Schema matching: finding semantic correspondences between two schemas
• Crucial step in many applications
  • Data integration: mediators, data warehouses
  • E-Business: XML message mapping
  • ...
• Currently manual, time-consuming, tedious
• Need for approaches to automate the task as much as possible

[Figure: two purchase order schemas. PO1: ShipTo (shipToStreet, shipToCity, shipToZip) and Customer (custName, custStreet, custCity, custZip). PO2: DeliverTo and BillTo, both containing Address (Street, City, Zip).]
Example correspondence: PO1.ShipTo.shipToCity ↔ PO2.DeliverTo.Address.City

Individual Match Approaches
• Schema-based
  • Element
    • Linguistic: Names, Descriptions
    • Constraint-based: Types, Keys
  • Structure
    • Constraint-based: Parents, Children, Leaves
• Instance-based
  • Element
    • Linguistic: IR (word frequencies, key terms)
    • Constraint-based: Value pattern and ranges
• Reuse-oriented
  • Element: Dictionaries, Thesauri
  • Structure: Previous match results
Survey paper [Rahm, Bernstein - VLDB Journal'01]

Combining Match Approaches
• Combination of match algorithms
  • Hybrid: fixed combination, difficult to extend and improve
    • currently most common: Cupid, SemInt, SimilarityFlooding, DIKE, MOMIS, TranScm
  • Composite: combination of the results of independently executed matchers
    • currently only for machine learning-based techniques: LSD, GLUE
• COMA: Framework for flexible COmbination of MAtch algorithms
  • Extensible matcher library
  • Combination scheme with various combination strategies

System Architecture
[Figure: schemas S1 and S2 enter through Schema Import. In each match iteration, matchers from the Matcher Library (Matcher 1, Matcher 2, Matcher 3, UserFeedback) are executed and fill a similarity cube. The Combination Scheme combines the match results (S1→S2, S2→S1) into a Mapping. User interaction is optional.]

Combination Scheme
1. Aggregation of matcher-specific results (similarity cube → similarity matrix): Average, Max, Min, Weighted
2. Match direction: SmallLarge, LargeSmall, Both
3. Selection of match candidates: MaxN (Max1), Threshold, MaxDelta, Threshold+MaxN, Threshold+MaxDelta
4. Computation of combined similarity: Dice, Average

[Figure: the matchers fill a similarity cube over S1 and S2. Steps 1 to 3 produce match results for S1→S2 and S2→S1 (e.g. s1, s2, 0.8). Step 4 produces the combined similarity of the two schemas (e.g. [S1, S2, 0.7]).]

Match Processing: Example
Input: S1 = {shipToCity, shipToStreet}, S2 = {City}

  S1 element     S2 element   Matcher1   Matcher2
  shipToCity     City         0.6        0.8
  shipToStreet   City         0.8        0.4

1. Aggregation (Average)
  shipToCity ↔ City: 0.7
  shipToStreet ↔ City: 0.6
2. Direction: |S1| > |S2|, LargeSmall (match candidates for the smaller schema S2)
  S2 element   S1 element     Sim
  City         shipToCity     0.7
  City         shipToStreet   0.6
3. Selection
  Max1:
  S2 element   S1 element     Sim
  City         shipToCity     0.7
  Threshold(0.5):
  S2 element   S1 element     Sim
  City         shipToCity     0.7
  City         shipToStreet   0.6

Matcher Library
  Type             Matcher        Schema Info        Auxiliary Info                         Constituent Matchers
  Simple           Affix          Element names      –                                      –
                   n-gram         Element names      –                                      –
                   Soundex        Element names      –                                      –
                   EditDistance   Element names      –                                      –
                   Synonym        Element names      External dictionaries                  –
                   DataType       Data types         Data type compatibility table          –
                   UserFeedback   –                  User-specified (mis-)matches           –
  Hybrid           Name           Element names      –                                      Affix, 3-Gram, Synonym
                   TypeName       Data types+Names   –                                      DataType, Name
                   NamePath       Names+Paths        –                                      Name
                   Children       Child elements     –                                      TypeName
                   Leaves         Leaf elements      –                                      TypeName
  Reuse-oriented   Schema         –                  Existing schema-level match results    –

Reuse-oriented Matching
[Figure: MatchCompose example with schemas S1 (firstName, lastName), S (FName, LName), and S2 (Name).
  m1 (S1 ↔ S): firstName ↔ FName 0.8, lastName ↔ LName 0.7
  m2 (S ↔ S2): FName ↔ Name 0.6, LName ↔ Name 0.6
  m = MatchCompose(m1, m2) (S1 ↔ S2): firstName ↔ Name 0.7, lastName ↔ Name 0.65]
• The MatchCompose operation
  • Transitivity of element similarity
  • Composition of similarity relationships
• Reuse of multiple match correspondences vs.
  reuse of single element-level correspondences from synonym tables, thesauri

Schema-level Reuse
[Figure: the Schema matcher. For the match problem S1 ↔ S2, the repository of existing match results is searched for pairs that share a third schema (S1 ↔ Si, S2 ↔ Si; S1 ↔ Sj, Sj ↔ S2; Sk ↔ S1, S2 ↔ Sk). MatchCompose derives an S1 ↔ S2 result from each pair. The results form a similarity cube, which Aggregation, Direction, and Selection turn into the match result S1 ↔ S2.]
• Reuse complete match results at the schema level
• Exploit all possible reuse opportunities
• Limit negative effects of transitivity

Real-world Evaluation
• 5 real-world schemas (XML, purchase order), 10 match tasks
  • CIDX, Excel, Noris, Paragon, Apertum from biztalk.org
  • 40-145 elements
• Systematic evaluation (automatic mode)
  • 1 series = 10 experiments: test of 1 configuration of (Matcher, Aggregation, Direction, Selection, Combined similarity) with 10 match tasks
  • 12,312 series = 123,120 experiments

  Dimension            Alternatives                                                         Count
  Matchers             No reuse: 5 single, 11 combinations                                  16
                       Reuse: 2 single, 12 combinations                                     14
                                                                                            Σ = 16 + 14
  Aggregation          Max, Average, Min                                                    3
  Direction            LargeSmall, SmallLarge, Both                                         3
  Selection            MaxN(1-4), Delta(0.01-0.1), Threshold(0.3-1.0),                      36
                       Threshold(0.5)+MaxN(1-4), Threshold(0.5)+Delta(0.01-0.1)
  Combined similarity  Average, Dice                                                        2

Match Quality Measures
• Comparison of automatically derived with manually derived (i.e. real) match correspondences
[Figure: the set of real matches overlaps the set of suggested matches. A: missed matches, B: correct matches, C: false matches.]
• Quality measures:
  $\mathrm{Precision} = \frac{B}{B+C}$
  $\mathrm{Recall} = \frac{B}{A+B}$
• SimilarityFlooding [ICDE02]:
  $\mathrm{Overall} = 1 - \frac{A+C}{A+B} = \frac{B-C}{A+B} = \mathrm{Recall} \cdot \left(2 - \frac{1}{\mathrm{Precision}}\right)$
• Overall: post-match effort to add missed matches and to remove false matches; negative Overall → no gain
• Computed for single experiments and averaged over 10 experiments for each series (average Overall, etc.)
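
The quality measures above can be sketched in a few lines of Python. This is only an illustration, not code from COMA: the function name `match_quality`, the `MatchQuality` record, and the sample correspondences are hypothetical, and returning a Precision of 0.0 when nothing is suggested is an assumption (the slides do not define that case).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchQuality:
    precision: float
    recall: float
    overall: float


def match_quality(real: set, suggested: set) -> MatchQuality:
    """Compare suggested correspondences with the real (manually derived) ones."""
    a = len(real - suggested)  # A: missed matches
    b = len(real & suggested)  # B: correct matches
    c = len(suggested - real)  # C: false matches
    if a + b == 0:
        raise ValueError("the set of real matches must not be empty")
    # Assumption: Precision is reported as 0.0 when nothing was suggested.
    precision = b / (b + c) if (b + c) else 0.0
    recall = b / (a + b)
    overall = (b - c) / (a + b)  # same value as 1 - (A + C) / (A + B)
    return MatchQuality(precision, recall, overall)


# Hypothetical correspondences: one missed, two correct, one false.
real = {("shipToCity", "City"), ("shipToStreet", "Street"), ("shipToZip", "Zip")}
suggested = {("shipToCity", "City"), ("shipToStreet", "Street"), ("custCity", "City")}
quality = match_quality(real, suggested)
print(quality)

# More false than correct matches: Overall becomes negative (no gain).
noisy = match_quality({("a", "x")}, {("a", "x"), ("a", "y"), ("b", "x")})
print(noisy.overall)
```

With A = 1, B = 2, C = 1, Precision and Recall are both 2/3 and Overall is 1/3, which agrees with Recall · (2 − 1/Precision). The second call shows how Overall drops below zero once false matches outnumber correct ones.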
Results: Combination Strategies (1)
[Figure: histogram of #Series per average Overall range (Min, 0.0 to 0.8), #All Series = 8208. Bar labels: 7077, 270, 207, 179, 136, 160, 114, 62, 3, 0.]
• Most no-reuse series have negative average Overall
• "Good" matcher/strategy:
  • Positive average Overall
  • High presence in higher Overall ranges
[Figure: Aggregation (2376 series/strategy). Series share (0% to 100%) of Max, Min, and Average per average Overall range (Min, 0.0 to 0.8).]
• Average is used by all series with average Overall > 0.6
• Aggregation: Average (compensating)

Results: Combination Strategies (2)
[Figure: Direction (2736 series/strategy). Series share of LargeSmall, SmallLarge, and Both per average Overall range.]
[Figure: Best selection (228 series/strategy). Series share of Thr(0.8), MaxN(1), Thr(0.5)+MaxN(1), Delta(0.02), and Thr(0.5)+Delta(0.02) per average Overall range.]
[Figure: Computation of combined similarity (4104 series/strategy). Series share of Dice and Average per average Overall range.]
• Direction: Both (considering both directions)
• Selection: Threshold+Delta (above threshold + within tolerance)
• Combined similarity: Average (pessimistic)
• Matcher: All (combination of all hybrid matchers)

Results: Single Matchers
[Figure a) Single matchers: avg Precision, avg Recall, and avg Overall (scale -0.3 to 1) for the no-reuse matchers Name, NamePath, TypeName, Children, Leaves and the reuse matchers SchemaM, SchemaA.]
SchemaM: Schema with manually derived (real) match results
SchemaA: Schema with match results automatically derived using the default match operation
• Instability of some single (hybrid) matchers (negative Overall) because of shared elements, e.g.
  DeliverTo.Address and BillTo.Address
• Considering hierarchical names (NamePath) more accurate
• Schema-level reuse very effective:
  • Essential improvement over no-reuse hybrid matchers
  • Reusing approved match results better than automatically derived match results

Results: Combined Match Approaches
[Figure b) Matcher combinations: avg Precision, avg Recall, and avg Overall (scale 0 to 1) for no-reuse combinations (e.g. NamePath+Leaves, All) and reuse combinations with SchemaM (e.g. All+SchemaM).]
All: Combination of all no-reuse hybrid matchers
All+SchemaM: Combination of all no-reuse hybrid matchers and SchemaM
• Reuse matchers outperform no-reuse matchers
  • Best no-reuse, All: 0.73 average Overall (Precision 0.95, Recall 0.78)
  • Best reuse, All+SchemaM: 0.82 average Overall (Precision 0.93, Recall 0.89)
• Combinations outperform single hybrid matchers
  • Combined matchers, e.g.
    All, consider many aspects at the same time
  • NamePath+Leaves: effective scheme, considering paths to identify context of shared elements, and leaves to cope with structural conflicts

Results: Match Sensitivity
[Figure: Overall (No Reuse), Overall (Manual Reuse), and #All Elements for each of the 10 match tasks 1<->2, 1<->3, 2<->3, 1<->4, 2<->4, 3<->4, 1<->5, 2<->5, 3<->5, 4<->5. Left axis: Overall (0 to 1). Right axis: # Elements (0 to 240).]
• Impact of schema characteristics: degrading match quality with increase of schema size
• Best combinations: no-reuse All and reuse-oriented All+Schema
  • High stability across different match tasks
  • Little tuning effort for the default match operation

Conclusions and Future Work
• The COMA framework
  • Extensible matcher library, including novel reuse approach
  • Powerful combination scheme for both specifying match operations and constructing new matchers from existing ones
• Comprehensive evaluation on real-world schemas
  • High effectiveness on large schemas
  • Reuse: essential improvement over no-reuse
  • Composite approach as THE solution for matcher combination
• Future work
  • Matchers: more powerful reuse strategies, instance-based matchers
  • More intelligent combination strategies
  • Application to more real-world scenarios, esp. in bioinformatics
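
As a recap of the mechanics summarized above, the first three steps of the combination scheme and the MatchCompose operation can be sketched in Python using the numbers from the Match Processing and Reuse-oriented Matching slides. This is a minimal sketch under stated assumptions, not COMA's implementation: all function names and data layouts are hypothetical, Threshold is read as "strictly above", and MatchCompose is assumed to average the two similarities along a path (which reproduces the 0.7 and 0.65 shown on the slide) and to keep the best value when several paths reach the same pair.

```python
from statistics import mean

# Similarity cube from the example slide: matcher -> {(S1 element, S2 element): sim}
cube = {
    "Matcher1": {("shipToCity", "City"): 0.6, ("shipToStreet", "City"): 0.8},
    "Matcher2": {("shipToCity", "City"): 0.8, ("shipToStreet", "City"): 0.4},
}


def aggregate(cube, combine=mean):
    """Step 1: reduce the cube to a matrix (assumes every matcher scores every pair)."""
    pairs = set().union(*cube.values())
    return {p: combine([m[p] for m in cube.values()]) for p in sorted(pairs)}


def rank_large_small(matrix):
    """Step 2 (LargeSmall): rank S1 candidates for each element of the smaller schema S2."""
    ranked = {}
    for (s1, s2), sim in matrix.items():
        ranked.setdefault(s2, []).append((s1, sim))
    for candidates in ranked.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return ranked


def select_max_n(ranked, n=1):
    """Step 3, MaxN: keep the n best candidates per element."""
    return {e: cands[:n] for e, cands in ranked.items()}


def select_threshold(ranked, t):
    """Step 3, Threshold: keep candidates above t (strict comparison is an assumption)."""
    return {e: [c for c in cands if c[1] > t] for e, cands in ranked.items()}


def match_compose(m1, m2):
    """Compose S1<->S and S<->S2 into S1<->S2.

    Assumptions: similarities along a path are averaged, and the best value wins
    when several paths lead to the same pair.
    """
    composed = {}
    for (a, b), sim1 in m1.items():
        for (b2, c), sim2 in m2.items():
            if b == b2:
                sim = (sim1 + sim2) / 2
                composed[(a, c)] = max(sim, composed.get((a, c), 0.0))
    return composed


matrix = aggregate(cube)
ranked = rank_large_small(matrix)
max1 = select_max_n(ranked, 1)
above_half = select_threshold(ranked, 0.5)

m1 = {("firstName", "FName"): 0.8, ("lastName", "LName"): 0.7}
m2 = {("FName", "Name"): 0.6, ("LName", "Name"): 0.6}
composed = match_compose(m1, m2)

print(matrix, max1, above_half, composed, sep="\n")
```

Passing `combine=max` or `combine=min` to `aggregate` switches the aggregation strategy, which mirrors how the scheme keeps each step independently configurable.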