Holistic Query Interface Matching using Parallel Schema Matching Weifeng Su Uni. of Sci. & Tech. Kowloon, Hong Kong [email protected] Jiying Wang City University Kowloon, Hong Kong [email protected] Abstract Using query interfaces of different Web databases, we propose a new complex schema matching approach, Parallel Schema Matching (PSM). A parallel schema is formed by comparing two individual schemas and deleting common attributes. The attribute matching can be discovered from the attribute-occurrence patterns if many parallel schemas are available. A count-based greedy algorithm identifies which attributes are more likely to be matched. Experiments show that PSM can identify both simple matching and complex matching accurately and efficiently. 1 INTRODUCTION Considering that there are thousands of Web databases available today, it is very time consuming for an ordinary user to query and retrieve information from all the relevant databases. Since most Web databases are only accessible through a query interface, a system/tool that helps a user locate information in numerous Web databases must be able to understand the query interfaces and help dispatch user queries to suitable fields of those interfaces. The main challenge of such a system is that different databases may use different fields or terms to represent the same concept. For example, to describe the genre of a CD in the MusicRecords domain, Category is used in some databases while Style is used in others. In the Books domain, First Name and Last Name are used in some databases while Author is used in others to denote the writer of a book. This paper specifically focuses on matching attributes across query interfaces of structured Web databases. We define an entry or field in a query interface as an attribute, and all attributes in the query interface as a schema of the interface. When matching the attributes, we call a 1:1 matching, such as Category with Style, a simple matching and a 1:n or m:n matching, such as First Name, Last Name with Author, a complex matching. In the latter case, attributes First Name and Last Name form a concept group before Frederick Lochovsky Uni. of Sci. & Tech. Kowloon, Hong Kong [email protected] they are matched to attribute Author. We call attributes that are in the same concept group grouping attributes and attributes that are semantically identical or similar to each other synonym attributes. We propose a new interface schema matching approach, Parallel Schema Matching (PSM), that matches all input schemas holistically instead of matching two schemas at a time to take advantage of the occurrence patterns of the attributes. To better discover the frequent/rare co-presence of the attributes, we form parallel schemas by comparing two schemas and deleting their common attributes. Both time complexity analysis and experimental results show that the PSM approach can efficiently discover simple and complex matchings at the same time with high accuracy while requiring no domain knowledge or user interaction. 2 RELATED WORK Current solutions for the schema matching problem [1, 3, 4, 6, 7, 8, 9, 10, 12] suffer from the following limitations: 1. simple matching: most schema matching methods can only discover simple matchings between schemas. 2. low accuracy: the accuracy of methods that can identify complex matchings is generally unsatisfactory. 3. time consuming: some schema matching methods have time complexity exponential in the number of attributes. 4. domain knowledge required: some schema matching methods require domain knowledge or user interaction before or during the matching process. DCM [5], which discovers complex matchings holistically using data mining techniques, is based on similar observations as PSM, but has the following disadvantages compared to PSM: 01 f10 1. DCM uses H = ff+1 f1+ to measure the negative correlation between two attributes, by which the synonym attributes are discovered. Such a measure may give a high score for rare attributes, while PSM’s matching score measure will not. 2. The time complexity of DCM is exponential with respect to the number of attributes n, while PSM’s time complexity is polynomial with respect to n. 3 PARALLEL SCHEMA MATCHING We observe that Web databases in the same domain usually share the following characteristics: 1. They are usually semantically similar to each other. 2. An attribute is usually unambiguous in a domain although it may have more than one meaning in an ordinary comprehensive dictionary. 3. Synonym attributes are rarely co-present in the same interface schema. 4. Grouping attributes are usually co-present in the same interface schema to form a “larger” concept. Figure 1: Parallel Schema Matching Workflow. We formalize the schema matching problem as the same problem described in [5]. The input is a set of schemas S = {S1 , . . . , Su }, in which each schema Si (1 ≤ i ≤ u) contains a set of attributes extracted from a query interface and the set of attributes A = ∪ui=1 Si = {A1 , . . . , An } includes all attributes in S. We assume that these schemas come from the same domain. The schema matching problem is to find all matchings M = {M1 , . . . , Mv } including both simple and complex matchings. A matching Mj (1 ≤ j ≤ v) is represented as Gj1 = Gj2 = . . . = Gjw , where Gjk (1 ≤ k ≤ w) is a group of attributes and Gjk is a subset of A, i.e., Gjk ⊂ A. Each matching Mj should represent the semantic synonym relationship between two attribute groups Gjk and Gjl (l 6= k), and each group Gjk should represent the grouping relationship between the attributes within it1 . More specifically, based on observations 2 and 4 above, we restrict each attribute to appear no more than one time in M. The workflow of PSM is shown in Figure 1. Before the matching discovery, two scores, matching score, which is used to evaluate the possibility that two attributes are synonym attributes, and grouping score, which is used to evaluate the possibility that two attributes are in the same group in a matching, are calculated between every two attributes. Synonym Attribute Candidate Generation takes all schemas as input and generates all synonym attribute candidates based on observation 3. If there are n attributes in the input schemas, the maximum number of synonym attribute candidates is Cn2 = n(n−1) . However, not any two 2 attributes from A can be actual candidates for synonym attributes. We assume that two attributes (Ap , Aq ) are synonym attribute candidates if Ap and Aq are co-present in less than Tpq schemas where Tpq is defined as: Tpq = 1 An (Cp + Cq ) ln u , u and Cp and Cq are the count of attribute Ap and Aq in S. Parallel Schema Construction takes every pair of input schemas and constructs a set of parallel schemas in three steps. First, every two different schemas are paired to form preliminary parallel schemas. If there are u input schemas, Cu2 = u(u−1) preliminary parallel schemas will be ob2 tained. Second, in every preliminary parallel schema, common attributes that are shared by its two attribute sets are deleted based on observation 2. Finally, parallel schemas that contain at least one empty attribute set are discarded. After deleting the common attributes, the relationship between attribute Ap ’s count in the input schema set Q with Ap ’s count in the parallel schema set S is Dp = Cp (u − Cp ), (2) where Cp and Dp are the count of Ap in S and Q, respectively. Matching Score Calculation calculates matching scores based on the co-presence count of the attributes in the parallel schemas. Definition 1 Given a parallel schema Ql = (Sl1 , Sl2 ) and two attributes Ap ∈ Ql and Aq ∈ Ql , if Ap ∈ Sl1 and Aq ∈ Sl2 or vice versa, Ap and Aq are defined as being ˙ l. cross co-present in Ql , denoted as (Ap , Aq )∈Q The cross co-presence count of two attributes in the parallel schema set is used to calculate the matching score between them. During the calculation, we assume that each parallel schema provides the same amount of information about matching, thus each cross co-present attribute pair in it will gain equal cross co-presence weight. (1) Definition 2 Given a set of parallel schemas Q = {Ql = (Sl1 , Sl2 ), l = 1 . . . m} and two attributes Ap and Aq , a attribute group can have just one attribute. 2 Domain Airfares CarRentals Discovered Matching {departure date (datetime), return date (datetime)} = {depart (datetime), return (datetime)} {adult (integer), children (integer), infant (integer), senior (integer)} ={passenger (integer)} {destination (string)} = {from (string), to (string)} ={arrival city (string), departure city (string)} {cabin (string)} = {class (string)} {drop off city (string), pick up city (string)} ={drop off location (string), pick up location (string)} {drop off (datetime), pick up (datetime)= { pick up date (datetime), drop off date (datetime), pick up time (datetime), drop off time (datetime)} Correct? Y Y P Y Y Y Table 1: Discovered matchings for Airfares and CarRentals (T =10%). weighted cross co-presence count of Ap and Aq in Q is defined as the sum of the cross co-presence weight that Ap and Aq gains in each parallel schema of Q, denoted as P 1 D̃pq = (Ap ,Aq )∈Q ˙ l |Sl1 ||Sl2 | . benefit of filtering bad matchings in favor of good ones. Another interesting and beneficial characteristic of this algorithm is that it is matching score centric, i.e., the matching score plays a much more important role than the grouping score. For any two attributes Ap and Aq , a matching score Xpq measures the possibility that Ap and Aq are synonym attributes. The bigger the score, the more likely that the two attributes are synonym attributes. The matching score calculation between Ap and Aq is similar to Dice coefficient: ( 0 if (Ap , Aq ) ∈ /L Xpq = (3) 2D̃pq otherwise, (Cp +Cq ) 4 EXPERIMENTS We use the TEL-8 and BAMM datasets from the UIUC Web integration repository [2]. The TEL-8 dataset contains query interface schemas extracted from 447 deep Web sources of eight representative domains where each domain contains about 20-70 schemas. The BAMM dataset contains query interface schemas extracted from four domains where each domain has about 50 schemas. where L is the set of synonym attribute candidates and Cp and Cq are the count of attributes Ap and Aq in S. This matching score has the following properties: 1. null invariance [11]. For any two attributes, adding more schemas that do not contain the attributes does not affect their matching score. 2. rareness differentiation. The matching score between rare and the other attributes is distinguishably low. We evaluate the set of matchings automatically discovered by PSM, denoted by Mp , by comparing it with the set of matchings manually collected by a domain expert, denoted by Mc . To facilitate comparison, we adopt the metric in [5], target accuracy, which evaluates how similar Mp is to Mc . The target accuracy metric includes target precision and target recall. The target precision and target recall of Mp with respect to Mc are the weighted average of all the attributes’ target precision and target recall. Grouping Score Calculation takes all schemas as input and calculates the grouping score between every two attributes based on observation 4. We use the following grouping score measure between two attributes Ap and Aq : Ypq Cpq = , min(Cp , Cq ) Result on the TEL-8 dataset: Table 1 shows the matchings discovered by PSM in the Airfares and CarRentals domains, when T is set at 10%. We see that PSM can identify very complex matchings among attributes. Table 2 presents the performance of PSM on TEL-8 when Tg is set to 0.9. As expected, the performance of PSM decreases for rare attributes because the occurrence pattern of the rare attributes is not obvious with only a few occurrences. Nevertheless, the performance of PSM is almost always better than the performance2 of DCM in [5], shown in Table 3. (4) where Cpq is the co-presence count of attributes Ap and Aq in S, and Cp and Cq are the count of attributes Ap and Aq in S. We also set a grouping score threshold Tg such that attributes Ap and Aq will be considered grouping attributes only when Ypq > Tg . Practically, Tg should be close to 1 as the grouping attributes are expected to co-occur often. Tg is an empirical parameter and the experimental results show that it has similar performance in a wide range. Schema Matching Discovery uses an iterative algorithm where in each iteration, a greedy selection strategy is used to choose the synonym attribute candidates with the highest matching score, until there is no synonym attribute candidate available. The greediness of this algorithm has the Result on the BAMM dataset: The performance of PSM on BAMM is shown in Table 4, when Tg is set to be 0.9. We can see that PSM performs well on BAMM too, except that the target precision in the Automobiles domain is low when T =10%. The reason is similar to that for TEL-8. 2 The 3 precision of DCM when T =5% is not available in [5]. T =20% PT RT Domain T =10% PT RT T =5% PT RT Airfares Automobiles Books CarRentals Hotels Jobs Movies MusicRecords 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 .89 .72 1 1 .74 .94 1 1 .91 1 1 1 1 .90 .76 .67 .64 .60 .70 .72 .62 .86 .88 1 .78 .88 .72 1 .88 Average 1 1 .92 .98 .70 .88 Domain T =20% PT RT T =10% PT RT Airfares Automobiles Books CarRentals Hotels Jobs Movies MusicRecords 1 1 1 .72 .86 1 1 1 1 1 1 1 1 .86 1 1 1 .93 1 .72 .86 .78 1 .76 .71 1 1 .60 .87 .87 1 1 Average .95 .98 .88 .88 T =10% PT RT T =5% PT RT Automobiles Books Movies MusicRecords 1 1 1 1 1 1 1 1 .55 .86 1 .81 1 1 1 1 .75 .82 .90 .72 1 1 .86 1 Average 1 1 .81 1 .80 .97 Table 4: Target accuracy of PSM on BAMM dataset (Tg =0.9). Table 2: Target accuracy of PSM on TEL-8 dataset (Tg =0.9). Domain T =20% PT RT [5] [6] [7] Table 3: Target accuracy of DCM on TEL-8 dataset. [8] 5 SUMMARY [9] We present a parallel schema matching approach to holistically discover attribute matchings across Web query interfaces. PSM is purely based on the occurrence patterns of attributes and requires neither domain-knowledge nor user interaction. Experimental results show that PSM discovers both simple and complex matchings with very high accuracy in time polynomial to the number of attributes and the number of schemas. [10] [11] [12] Acknowledgment: This research was supported by the Research Grants Council of Hong Kong under grant HKUST6172/04E. References [1] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni, R. Guidetti, G. Malvezzi, M. Melchiori, and M. Vincini. Information integration: The momis project demonstration. In 26th Int. Conf. Very Large Data Bases, pages 611–614, 2000. [2] K. C.-C. Chang, B. He, C. Li, and Z. Zhang. The UIUC Web integration repository. Computer Science Department, University of Illinois at Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository, 2003. [3] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. imap: Discovering complex semantic matches between database schemas. In ACM SIGMOD Conference, pages 383 – 394, 2004. [4] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning ap- 4 proach. In ACM SIGMOD Conference, pages 509 – 520, 2001. B. He and K. C.-C. Chang. Discovering complex matchings across Web query interfaces: A correlation mining approach. In ACM SIGKDD Conference, pages 147 – 158, 2004. B. He, K. C.-C. Chang, and J. Han. Statistical schema matching across Web query interfaces. In ACM SIGMOD Conference, pages 217 – 228, 2003. W. Li, C. Clifton, and S. Liu. Database Integration using Neural Network: Implementation and Experience. In Knowledge and Information Systems,2(1), pages 73–96, 2000. S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In 18th Int. Conf. on Data Engineering, pages 117–128, 2002. T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In 24th Int. Conf. Very Large Data Bases, pages 122–133, 1998. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334– 350, 2001. P. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In ACM SIGKDD Conference, pages 32 – 41, 2002. W. Wu, C. Yu, A. Doan, and W. Meng. An interactive clustering-based approach to integrating source query interfaces on the deep Web. In ACM SIGMOD Conference, pages 95–106, 2004.
© Copyright 2026 Paperzz