1 A multi-criteria Fusion Approach for Geographical Data Matching Ana-Maria Olteanu IGN, COGIT Laboratory 2/4 Avenue Pasteur, 94165 Saint-Mandé cedex, France, [email protected] Abstract A spatial data matching based on the Dempster-Shafer theory is presented in this paper. Spatial data are imperfect and this imperfection should be taken into account in the matching process. Data are matched by combining three criteria based on location, names of places and nature of the objects, i.e. semantic properties. Experiments are performed on relief points belonging to five different datasets. Precision and recall equal to almost 100% are achieved. Keywords: data matching, Dempster-Shafer theory, ontology, fusion, semantics 1. Introduction The integration of geographical databases (GDB) describing the same real world from different sources becomes a growing issue. At present, many geographical databases exist to jointly describe the same reality at different spatial and semantic resolutions. Many applications, such as quality evaluations, incoherence detection, propagation of updates, and study of several adjacent zones, require the integration of geographical databases. The integration process (Devogele et al., 1998; Sheeren et al., 2004) is usually carried out in three stages. The first stage is pre-integration, consisting of enrichment of the schemas sources and standardisation of the models. The second stage consists of schema matching (Gesbert, 2005) and data matching (Mustière, 2006; Olteanu et al., 2006) which are not completely independent. The last stage is defining of an integrated schema and specifications, and populating the integrated GDB. 2 The aim of this paper is to structure the data matching process. We use the Dempster-Shafer theory (Shafer, 1976) to fuse different knowledge components provided by different criteria like geometry, name, semantic of the spatial feature, in order to take a decision in the matching process. 2. Related work Data matching is a geoinformation process to find homologous features in different databases (Walter and Fritsch, 1999). It is useful in many fields handling spatial information such as integration of spatial data (Voltz, 2006; Mustière, 2006), automated updating, quality analysis (Bel Hadj Ali, 2001), or inconsistency detection between databases (Sheeren et al., 2004). Matching methods developed in the literature have revealed good results and efficiency on certain types of data in selected test areas. Matching algorithms depend on the geometry of the features to match, the spatial relations, and last but not least on semantic attributes. (Bel Hadj Ali, 2001) proposed an approach for polygons, which mainly relies on distances between geometries. Beeri et al. (2004) proposed four methods for matching while comparing locations of points and relying on probabilistic considerations. Feature names can also be compared using a string distance (Levenshtein, 1965; Cohen et al., 2003). In respect of linear networks, Walter and Fritsch (1999) developed a statistical approach based on geometric and topologic criteria to match features in two different databases at similar scales. Matching algorithms for two networks at similar (Voltz, 2006) or different scales (Devogele, 1997; Mustière, 2006) were proposed. Matching is based on different criteria, including an analysis of topological relations, and it is carried out in several steps. Comparing the semantics of classes is necessary to match schemas and is also useful to match data, even if this approach is rarely used (Gesbert, 2005). Matching methods generally use one or several of the geometrical, semantic and topological criteria. 3 All criteria are usually applied successively, and, more importantly, most approaches do not explicitly model data imperfection such as uncertainty, incompleteness and imprecision. Therefore, we aim to develop a matching approach taking all criteria jointly into account and to model imperfection. We consider the Dempster-Shafer theory that is relevant to achieve these objectives because, on the one hand, it allows to model imprecision, uncertainty and incompleteness and to combine criteria and hypotheses, and on the other hand, a decision is made based on beliefs. 3. The Frame of the Dempster-Shafer Theory The Dempster-Shafer theory was introduced by Dempster in 1967. Based on Dempster’s work, Shafer (Shafer, 1976) introduced the Dempster-Shafer model, which is based on belief functions. 3.1. The Frame of Discernment Let Θ be a finite set of N hypotheses, Hi, i = 1..N, called the frame of discernment that corresponds to the potential solution of a decision problem, Θ = {H1, H2,…, HN}. From the frame of discernment, let 2Θ denote the set of all subsets of Θ defined by: 2Θ = { φ , { H 1}, { H 2 }, { H 1 , H 2 }...{ H 1 ...H N },Θ } (1) where {Hi, Hj} represents the hypothesis that the solution of a problem is one of them, i.e. either Hi or Hj. The Dempster-Shafer theory is based on the basic belief assignment, i.e. a function that assigns to a proposition, A ∈ 2Θ, and a value named the basic belief mass, noted m(A), that represents the extent to which a criterion believes in it. As an example, we consider a matching that is based on the geometry of features. The closer two features are, the stronger the criterion believes that the features are homologous, and thus the value of the basic belief mass is important. 4 The basic belief assumption satisfies equation (2): m : 2Θ → [0,1] ∑Θ m( A) = 1 (2) A⊆ Every proposition A∈2Θ with m(A) > 0 is called a focal proposition of m. Hence, from now on, we consider only the focal propositions to combine knowledge and to make a decision. 3.2. Dempster’s rule of combination The Dempster-Shafer theory allows to fuse several criteria using the Dempster’s rule (Shafer, 1976). Let us consider two sources S1 and S2. Each source supports a proposition with a certain basic belief mass, called m1(A) and m2(A), respectively. We denote by m12 the basic belief mass resulting from the combination of two sources by the Dempster’s rule and that supports the same proposition A. If, for example, we wish to find out if two features belonging to two different databases should be matched, two criteria are used, e.g. geometry and comparison of toponyms. Assuming that the two features are homologous, the first criterion believes that they are homologous because the features are close and assigns to it an important basic belief mass, whereas the second criterion is not sure because the two toponyms are not similar and thus it assigns to this assumption a less important basic belief mass. To take a decision, different criteria have to be combined, potentially leading to a conflict situation. If this is the case, the conflict is assigned to the empty set, and it is used by Dempster to normalise the basic belief mass resulting from the combination of two sources, m12. Thus, the basic belief mass assigned to the conflict is redistributed according to the focal elements (Shafer, 1976; Smets and Kennes, 1994). 4. The Dempster-Shafer theory in a data matching context Matching generally consists in searching for potential candidates for each feature belonging to 5 the reference database and then analysing them in order to determine the final results. Spatial data have imperfections: the location could be imprecise, or toponyms might have different versions due to omission, abbreviation and substitution. Using several criteria successively during matching, errors could propagate leading to an erroneous matching. Therefore, imperfection should be taken into account and criteria should be applied in the same time to obtain more relevant information. The Dempster-Shafer theory offers tools for modelling imperfection through the belief functions and to combine different knowledge through the Dempster’s rule of combination. Data matching based on the Dempster-Shafer theory consisting of three steps is now described. Firstly, computation of the basic belief mass for each candidate and each criterion is performed. Secondly, criteria are combined per candidate, i.e. for each candidate the basic belief mass are combined to provide a combined basic belief mass synthesizing the knowledge of the different criteria. Finally, the results of the second step are combined, i.e. candidates are combined to provide an overall view. 4.1. Local approach: definition of the frame of discernment We compare two datasets made for different purposes and with different scales. We consider the more detailed dataset as comparison and the less detailed as reference dataset. Therefore, for each reference feature, we look into the comparison dataset for geometrically close features that are candidates for matching. The distance that determines how far we look for candidates is an empirical one and it depends on the nature of the reference feature. In our case, for each candidate Ci, the assumption "Ci is the homologous feature" is added to the local frame of discernment of the reference feature. Due to the fact that a feature may have no homologue, a new hypothesis, NM, standing for: "the feature is not matched at all ", is defined. A 6 similar approach was presented in Royère et al. (2000). For each feature belonging to the reference database, named a reference feature, a global frame of discernment is now defined as follows: Θ = {C1 , C2 ,...Ci ,...C N , NM } (3) In equation 3, N defines the number of candidates, Ci represents the assumption that the candidate Ci is the homologous of the reference feature and NM is the assumption that the reference feature is not matched. To compute the basic belief assignments, we use a local approach by analyzing each candidate separately. To do so, we use a particular case of the Dempster-Shafer theory, named the specialised sources (Appriou, 1991). Each source specialises on a candidate and assigns a belief to it. In our case, a source coincides to a criterion of data matching (geometric, toponym and semantic). We consider Si, a subset of 2Θ, defined as: Si = {Ci, ¬Ci, Θ}, where: • Ci is the hypothesis that the homologous of the reference feature is Ci. • ¬Ci = {C1, C2,… Ci-1,… CN, NM} is the hypothesis that Ci is not the homologous of the reference feature. The solution could be another candidate, or the hypothesis NM. • Θ = {C1, … CN, NM} is hypothesis that the criterion does not know if Ci is the right candidate or not, i.e. it models ignorance. 4.2. Computation of basic belief masses We propose three criteria to match data and in our case those criteria represent the sources in the frame of the Dempster-Shafer Theory. There are described below. 7 Figure 1. Modelling of the basic belief assumption for geometrical a), toponym b) and semantic c) criteria. The geometrical criterion The geometrical criterion is based on the Euclidean distance, dE, between the location of the reference feature and the location of the candidate. We suppose that the closer the candidate is to the reference feature, the more probably this one is the homologous of the reference feature, as defined in figure 1a). In figure 1a), T2 represents the selection threshold, i.e. the distance determining how far we look for candidates, while T1 defines a confidence threshold that gives less weight for candidates that are fairly far. Due to the imprecision of data location, some homologous features can be relatively far. To avoid to definitively eliminate a candidate, which is too far as compared to the reference feature, we consider that the basic belief mass allocated to the hypothesis "the candidate Ci is the homologue feature" is never 0, but it ranges from 0.1 to 1. The toponym criterion The second criterion is based on the comparison of toponyms. A distance, dT, between two toponyms, is computed using the Levenshtein distance (Levenshtein, 1965): 8 dT = d L ( toponym1 ,toponyme2 ) (7) max( L1 , L2 ) where: dL= Levenshtein distance, L1= toponym1’s length and L2= toponym2’s length. Figure 1b) shows the basic belief assignment for the toponym criterion. Curves are different from the geometrical criterion, in order to express that we are less confident in this source. Thus, we manage the case of ambiguity, when two toponyms indicating the same features are compared, one having the official name whereas the other has a non-official name. It consists in decreasing the mass of belief associated with the assumption "Ci is not the homologue of the feature" and increasing the basic belief mass associated with ignorance. Thus, if the distance dT is greater than the threshold T, (e.g., 30% of letters are different), the basic belief mass assigned to both hypotheses "Ci is not the right candidate" and ignorance are equal to 0.5. The semantic criterion Features may have the same toponyms, i.e. being close to each other, but present different natures and thus are not homologous, like for example a summit and a mountain pass. The semantic criterion uses semantic properties to be able to do so. A semantic distance, dS, based on an ontology, which is obtained by automatic extraction from text specifications of the two databases, is computed (Abadie et al. 2007). In figure 1c), the basic believe assignment for the semantic criterion is modelled. Thus if the semantic distance between the reference feature and the candidate Ci is equal to 0 (e.g. the features are similar, semantically speaking), a basic belief mass equal to 0.5 is assigned to the assumption "Ci is the homologous feature", so that this criterion does not allocate a strong belief to this candidate. On the contrary, if the semantic distance is greater than a threshold T, a basic belief mass equal to 1 is assigned to the hypothesis "Ci is not the homologous of the reference feature". 9 4.3. Combination of the criteria and candidates Once masses of belief have been initialised, a local approach could be carried out. It merges the criteria for each candidate individually using Dempster’s rule, without taking the other candidates into account. In the third step, called the global approach, the results of the local approach are combined, again using Dempster’s rule. Thus, the results for two candidates are combined, and these results are combined with the results for the third candidate and so on and so forth. In our application, a decision has generally to be made in favour of a simple hypothesis. Within the context of the Transferable Belief Model, Smets defines and justifies the use of the "pignistic" probability decision rule, which transforms a belief function into a probability one (Smets and Kennes, 1994). 5. Analysis of relief data and results We now illustrate and discuss some matching results. Our experiments concern two different spatial databases containing punctual spatial data representing relief. The approach is illustrated with an example of relief points of interest of the relief taken from two databases from IGN, the French mapping agency: BD CARTO ®, considered as the reference database, and BD TOPO ®, considered as the comparison database. Each database has a specific representation of the real world according to its applications, perception and purposes. Five datasets representing five "Départements" in France (approximately 100 x 100 km) were used for validation. Different types of imperfection are encountered within the data. First, a point location is imprecise and different points may have a different precision. For example, the location of peaks is more precise then the location of valleys. Differences are due to differences in size of valleys and peaks. Second, it is useful to take the toponym into account during matching. The toponym, however, as well as location, has imperfection. For example, used names and official names may 10 differ, the same toponym may be used at several places, or there can be various interpretations of the pronunciation, word and character omission. The toponym, may also be uncertain (e.g. "peyrelue or sallent ") and incomplete (e.g "col de louesque" or only "louesque"). Third, the attribute "nature" does not present the same level of details. For example in BD CARTO ®, there is a grouping of the concepts represented with the same value of the attribute nature "summit, ridge, hill", whereas in the BD TOPO ®, these concepts are dissociated. Usage of a single geometrical criterion based on the comparison of the locations does not lead to satisfactory results. The homologous feature is not always the closest one. In the same way, using only a toponym or a semantic criterion gives inconsistencies. Therefore, combination of this information is necessary to make a sound decision. Evaluation of the results is an important step for data matching and the two indicators Precision and Recall are used for this purpose. Precision is the percentage of the matching links discovered by the process that are correct, compared to an interactive matching. (Beeri et al., 2004). Recall is the percentage of the correct matching links (defined by an interactive matching) that are actually discovered by the automatic process. Evaluation is done in terms of the number of features. Table 1. Qualitative evaluation of the matching process Dataset1 Two Three Dataset 2 Dataset3 Two Two Three Three Dataset 4 Dataset 5 Two Two Three Three Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria Precision 100% 100% 96.4% 99.1% 97% 98.3% 100% 100% 97% 99.7% Recall 99% 95.9% 99.1% 97% 97.5% 97.7% 97.7% 96.7% 98.7% 99% 11 The first and the second line of the table 1 show Precision and Recall values for each dataset when two (geometric and toponym) and three criteria (geometric, toponym and semantic) are used, respectively. Both Precision and Recall improve when three criteria are used for all datasets. These results are satisfactory and show the importance of the semantic properties during matching. In both Precision and Recall, conflict (0.7% from all matching results) is not taken into account, leading to less elevated Recall values. Figure 2 illustrates two matching results obtained using this approach. We observe from figure 2a) that two homologous features are matched even if the two toponyms are written in two different languages. The figure 2b) shows that matching does not necessarily match the closest candidate. Figure 2. Results of the data matching process 6. Conclusion In this paper, we presented data matching based on the Dempster-Shafer Theory. We tested it on five different datasets representing relief data. A high Precision and a Recall larger than 95% are obtained and they increase when three criteria are combined. 12 Acknowledgment The author would like to thank Sébastien Mustière for the reviews of this paper and the quality of his advices. References Abadie, N., Olteanu, A-M., and S. Mustière. 2007. Comparaison de la nature d’objets géographiques. Paper presented at the meeting of the Ingénierie des Connaissances, Grenoble. Appriou, A. 1991. Probabilités et incertitudes en fusion de données multi-senseurs. Revue Scientifique et Technique de la Défense 11:27-40. Bel Hadj Ali, A. 2001. Qualité géométrique des entités géographiques surfacique–Application à l’appariement et définition d’une typologie des écarts géométriques. PhD diss., Marne la Vallée. Beeri, C., Kanza, Y., Safra, E., and Y. Sagiv. 2004. Object Fusion in Geographic Information Systems. In Proceedings of the 30th VLDB Conference, Toronto, Canada, 816-827. Cohen, W.W., Ravikumar, P., and S.E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of the International Joint Conference on Artificial Intelligence, Washington, DC, USA, pp. 73-78. Devogele, T. 1997. Processus d’intégration et d’appariement de bases de données Géographiques -Applications à une base de données routières multi-échelles. PhD diss., Université de Versailles. Devogele, T., Parent, C., and S. Spaccapietra. 1998. On spatial database integration. International Journal of Geographical Information Science 12:335-352. Gesbert, N. 2005, Formalisation des spécifications de bases de données géographiques en vue de 13 leur intégration, PhD diss., Marne la Vallée. Levenshtein, V. 1965. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics-Doklady 10-8:707-710. Mustière, S. 2006. Results of Experiments on Automated Matching of Networks at Different Scales. In Proceedings of the ISPRS Workshop on Multiply Representation and Interoperability of Spatial Data, Hanover, Germany, pp. 92-100. Olteanu, A-M., Mustière, S., and A. Ruas. 2006. Matching Imperfect Data. In Proceedings of International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, Lisbon, Portugal, pp. 694-704. Royère, C., Gruyer, D., and V. Cherfaoui. 2000. Data Association with Belief Theory. In Proceedings of International Conference of Information Fusion, Paris, France, pp. 23-29. Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton University Press, Princeton. Sherren, D., Mustiere, S., and J-D.Zucker. 2004. How to Integrate Heterogeneous Spatial Databases in a Consistent Way? In Proceedings of Advanced Databases and Information Systems, Budapest, Hungary, 364-378. Smets, P., and R. Kennes. 1994. The Transferable Belief Model. Artificial Intelligence, 66:191234. Voltz, S.2006. An Iterative Approach for Matching Multiple Representations of Street Data. In Proceedings of the ISPRS Workshop on Multiply Representation and Interoperability of Spatial Data, Hanover, Germany, 101-110. Walter, V. and D. Fritsch. 1999. Matching Spatial Data Sets: Statistical Approach. International Journal of GIS 13:445-473.
© Copyright 2026 Paperzz