1 A multi-criteria Fusion Approach for Geographical Data Matching

1
A multi-criteria Fusion Approach for Geographical Data Matching
Ana-Maria Olteanu
IGN, COGIT Laboratory
2/4 Avenue Pasteur, 94165 Saint-Mandé cedex, France, [email protected]
Abstract
A spatial data matching based on the Dempster-Shafer theory is presented in this paper. Spatial
data are imperfect and this imperfection should be taken into account in the matching process.
Data are matched by combining three criteria based on location, names of places and nature of
the objects, i.e. semantic properties. Experiments are performed on relief points belonging to five
different datasets. Precision and recall equal to almost 100% are achieved.
Keywords: data matching, Dempster-Shafer theory, ontology, fusion, semantics
1. Introduction
The integration of geographical databases (GDB) describing the same real world from different
sources becomes a growing issue. At present, many geographical databases exist to jointly
describe the same reality at different spatial and semantic resolutions. Many applications, such as
quality evaluations, incoherence detection, propagation of updates, and study of several adjacent
zones, require the integration of geographical databases. The integration process (Devogele et al.,
1998; Sheeren et al., 2004) is usually carried out in three stages. The first stage is pre-integration,
consisting of enrichment of the schemas sources and standardisation of the models. The second
stage consists of schema matching (Gesbert, 2005) and data matching (Mustière, 2006; Olteanu
et al., 2006) which are not completely independent. The last stage is defining of an integrated
schema and specifications, and populating the integrated GDB.
2
The aim of this paper is to structure the data matching process. We use the Dempster-Shafer
theory (Shafer, 1976) to fuse different knowledge components provided by different criteria like
geometry, name, semantic of the spatial feature, in order to take a decision in the matching
process.
2. Related work
Data matching is a geoinformation process to find homologous features in different databases
(Walter and Fritsch, 1999). It is useful in many fields handling spatial information such as
integration of spatial data (Voltz, 2006; Mustière, 2006), automated updating, quality analysis
(Bel Hadj Ali, 2001), or inconsistency detection between databases (Sheeren et al., 2004).
Matching methods developed in the literature have revealed good results and efficiency on
certain types of data in selected test areas. Matching algorithms depend on the geometry of the
features to match, the spatial relations, and last but not least on semantic attributes. (Bel Hadj
Ali, 2001) proposed an approach for polygons, which mainly relies on distances between
geometries. Beeri et al. (2004) proposed four methods for matching while comparing locations of
points and relying on probabilistic considerations. Feature names can also be compared using a
string distance (Levenshtein, 1965; Cohen et al., 2003). In respect of linear networks, Walter and
Fritsch (1999) developed a statistical approach based on geometric and topologic criteria to
match features in two different databases at similar scales. Matching algorithms for two networks
at similar (Voltz, 2006) or different scales (Devogele, 1997; Mustière, 2006) were proposed.
Matching is based on different criteria, including an analysis of topological relations, and it is
carried out in several steps. Comparing the semantics of classes is necessary to match schemas
and is also useful to match data, even if this approach is rarely used (Gesbert, 2005). Matching
methods generally use one or several of the geometrical, semantic and topological criteria.
3
All criteria are usually applied successively, and, more importantly, most approaches do not
explicitly model data imperfection such as uncertainty, incompleteness and imprecision.
Therefore, we aim to develop a matching approach taking all criteria jointly into account and to
model imperfection. We consider the Dempster-Shafer theory that is relevant to achieve these
objectives because, on the one hand, it allows to model imprecision, uncertainty and
incompleteness and to combine criteria and hypotheses, and on the other hand, a decision is
made based on beliefs.
3. The Frame of the Dempster-Shafer Theory
The Dempster-Shafer theory was introduced by Dempster in 1967. Based on Dempster’s work,
Shafer (Shafer, 1976) introduced the Dempster-Shafer model, which is based on belief functions.
3.1. The Frame of Discernment
Let Θ be a finite set of N hypotheses, Hi, i = 1..N, called the frame of discernment that
corresponds to the potential solution of a decision problem, Θ = {H1, H2,…, HN}. From the frame
of discernment, let 2Θ denote the set of all subsets of Θ defined by:
2Θ = { φ , { H 1}, { H 2 }, { H 1 , H 2 }...{ H 1 ...H N },Θ } (1)
where {Hi, Hj} represents the hypothesis that the solution of a problem is one of them, i.e. either
Hi or Hj. The Dempster-Shafer theory is based on the basic belief assignment, i.e. a function that
assigns to a proposition, A ∈ 2Θ, and a value named the basic belief mass, noted m(A), that
represents the extent to which a criterion believes in it. As an example, we consider a matching
that is based on the geometry of features. The closer two features are, the stronger the criterion
believes that the features are homologous, and thus the value of the basic belief mass is
important.
4
The basic belief assumption satisfies equation (2):
m : 2Θ → [0,1]
∑Θ m( A) = 1
(2)
A⊆
Every proposition A∈2Θ with m(A) > 0 is called a focal proposition of m. Hence, from now on,
we consider only the focal propositions to combine knowledge and to make a decision.
3.2. Dempster’s rule of combination
The Dempster-Shafer theory allows to fuse several criteria using the Dempster’s rule (Shafer,
1976). Let us consider two sources S1 and S2. Each source supports a proposition with a certain
basic belief mass, called m1(A) and m2(A), respectively. We denote by m12 the basic belief mass
resulting from the combination of two sources by the Dempster’s rule and that supports the same
proposition A. If, for example, we wish to find out if two features belonging to two different
databases should be matched, two criteria are used, e.g. geometry and comparison of toponyms.
Assuming that the two features are homologous, the first criterion believes that they are
homologous because the features are close and assigns to it an important basic belief mass,
whereas the second criterion is not sure because the two toponyms are not similar and thus it
assigns to this assumption a less important basic belief mass. To take a decision, different criteria
have to be combined, potentially leading to a conflict situation. If this is the case, the conflict is
assigned to the empty set, and it is used by Dempster to normalise the basic belief mass resulting
from the combination of two sources, m12. Thus, the basic belief mass assigned to the conflict is
redistributed according to the focal elements (Shafer, 1976; Smets and Kennes, 1994).
4. The Dempster-Shafer theory in a data matching context
Matching generally consists in searching for potential candidates for each feature belonging to
5
the reference database and then analysing them in order to determine the final results. Spatial
data have imperfections: the location could be imprecise, or toponyms might have different
versions due to omission, abbreviation and substitution. Using several criteria successively
during matching, errors could propagate leading to an erroneous matching. Therefore,
imperfection should be taken into account and criteria should be applied in the same time to
obtain more relevant information. The Dempster-Shafer theory offers tools for modelling
imperfection through the belief functions and to combine different knowledge through the
Dempster’s rule of combination.
Data matching based on the Dempster-Shafer theory consisting of three steps is now described.
Firstly, computation of the basic belief mass for each candidate and each criterion is performed.
Secondly, criteria are combined per candidate, i.e. for each candidate the basic belief mass are
combined to provide a combined basic belief mass synthesizing the knowledge of the different
criteria. Finally, the results of the second step are combined, i.e. candidates are combined to
provide an overall view.
4.1. Local approach: definition of the frame of discernment
We compare two datasets made for different purposes and with different scales. We consider the
more detailed dataset as comparison and the less detailed as reference dataset. Therefore, for
each reference feature, we look into the comparison dataset for geometrically close features that
are candidates for matching. The distance that determines how far we look for candidates is an
empirical one and it depends on the nature of the reference feature.
In our case, for each candidate Ci, the assumption "Ci is the homologous feature" is added to the
local frame of discernment of the reference feature. Due to the fact that a feature may have no
homologue, a new hypothesis, NM, standing for: "the feature is not matched at all ", is defined. A
6
similar approach was presented in Royère et al. (2000).
For each feature belonging to the reference database, named a reference feature, a global frame
of discernment is now defined as follows:
Θ = {C1 , C2 ,...Ci ,...C N , NM } (3)
In equation 3, N defines the number of candidates, Ci represents the assumption that the
candidate Ci is the homologous of the reference feature and NM is the assumption that the
reference feature is not matched. To compute the basic belief assignments, we use a local
approach by analyzing each candidate separately. To do so, we use a particular case of the
Dempster-Shafer theory, named the specialised sources (Appriou, 1991). Each source specialises
on a candidate and assigns a belief to it. In our case, a source coincides to a criterion of data
matching (geometric, toponym and semantic).
We consider Si, a subset of 2Θ, defined as: Si = {Ci, ¬Ci, Θ}, where:
•
Ci is the hypothesis that the homologous of the reference feature is Ci.
•
¬Ci = {C1, C2,… Ci-1,… CN, NM} is the hypothesis that Ci is not the homologous of the
reference feature. The solution could be another candidate, or the hypothesis NM.
•
Θ = {C1, … CN, NM} is hypothesis that the criterion does not know if Ci is the right
candidate or not, i.e. it models ignorance.
4.2. Computation of basic belief masses
We propose three criteria to match data and in our case those criteria represent the sources in the
frame of the Dempster-Shafer Theory. There are described below.
7
Figure 1. Modelling of the basic belief assumption for geometrical a), toponym b) and semantic
c) criteria.
The geometrical criterion
The geometrical criterion is based on the Euclidean distance, dE, between the location of the
reference feature and the location of the candidate. We suppose that the closer the candidate is to
the reference feature, the more probably this one is the homologous of the reference feature, as
defined in figure 1a). In figure 1a), T2 represents the selection threshold, i.e. the distance
determining how far we look for candidates, while T1 defines a confidence threshold that gives
less weight for candidates that are fairly far. Due to the imprecision of data location, some
homologous features can be relatively far. To avoid to definitively eliminate a candidate, which
is too far as compared to the reference feature, we consider that the basic belief mass allocated to
the hypothesis "the candidate Ci is the homologue feature" is never 0, but it ranges from 0.1 to 1.
The toponym criterion
The second criterion is based on the comparison of toponyms. A distance, dT, between two
toponyms, is computed using the Levenshtein distance (Levenshtein, 1965):
8
dT =
d L ( toponym1 ,toponyme2 )
(7)
max( L1 , L2 )
where: dL= Levenshtein distance, L1= toponym1’s length and L2= toponym2’s length.
Figure 1b) shows the basic belief assignment for the toponym criterion. Curves are different
from the geometrical criterion, in order to express that we are less confident in this source. Thus,
we manage the case of ambiguity, when two toponyms indicating the same features are
compared, one having the official name whereas the other has a non-official name. It consists in
decreasing the mass of belief associated with the assumption "Ci is not the homologue of the
feature" and increasing the basic belief mass associated with ignorance. Thus, if the distance dT
is greater than the threshold T, (e.g., 30% of letters are different), the basic belief mass assigned
to both hypotheses "Ci is not the right candidate" and ignorance are equal to 0.5.
The semantic criterion
Features may have the same toponyms, i.e. being close to each other, but present different
natures and thus are not homologous, like for example a summit and a mountain pass. The
semantic criterion uses semantic properties to be able to do so. A semantic distance, dS, based on
an ontology, which is obtained by automatic extraction from text specifications of the two
databases, is computed (Abadie et al. 2007). In figure 1c), the basic believe assignment for the
semantic criterion is modelled. Thus if the semantic distance between the reference feature and
the candidate Ci is equal to 0 (e.g. the features are similar, semantically speaking), a basic belief
mass equal to 0.5 is assigned to the assumption "Ci is the homologous feature", so that this
criterion does not allocate a strong belief to this candidate. On the contrary, if the semantic
distance is greater than a threshold T, a basic belief mass equal to 1 is assigned to the hypothesis
"Ci is not the homologous of the reference feature".
9
4.3. Combination of the criteria and candidates
Once masses of belief have been initialised, a local approach could be carried out. It merges the
criteria for each candidate individually using Dempster’s rule, without taking the other
candidates into account. In the third step, called the global approach, the results of the local
approach are combined, again using Dempster’s rule. Thus, the results for two candidates are
combined, and these results are combined with the results for the third candidate and so on and
so forth. In our application, a decision has generally to be made in favour of a simple hypothesis.
Within the context of the Transferable Belief Model, Smets defines and justifies the use of the
"pignistic" probability decision rule, which transforms a belief function into a probability one
(Smets and Kennes, 1994).
5. Analysis of relief data and results
We now illustrate and discuss some matching results. Our experiments concern two different
spatial databases containing punctual spatial data representing relief. The approach is illustrated
with an example of relief points of interest of the relief taken from two databases from IGN, the
French mapping agency: BD CARTO ®, considered as the reference database, and BD TOPO ®,
considered as the comparison database. Each database has a specific representation of the real
world according to its applications, perception and purposes. Five datasets representing five
"Départements" in France (approximately 100 x 100 km) were used for validation. Different
types of imperfection are encountered within the data. First, a point location is imprecise and
different points may have a different precision. For example, the location of peaks is more
precise then the location of valleys. Differences are due to differences in size of valleys and
peaks. Second, it is useful to take the toponym into account during matching. The toponym,
however, as well as location, has imperfection. For example, used names and official names may
10
differ, the same toponym may be used at several places, or there can be various interpretations of
the pronunciation, word and character omission. The toponym, may also be uncertain (e.g.
"peyrelue or sallent ") and incomplete (e.g "col de louesque" or only "louesque"). Third, the
attribute "nature" does not present the same level of details. For example in BD CARTO ®, there
is a grouping of the concepts represented with the same value of the attribute nature "summit,
ridge, hill", whereas in the BD TOPO ®, these concepts are dissociated.
Usage of a single geometrical criterion based on the comparison of the locations does not lead to
satisfactory results. The homologous feature is not always the closest one. In the same way,
using only a toponym or a semantic criterion gives inconsistencies. Therefore, combination of
this information is necessary to make a sound decision.
Evaluation of the results is an important step for data matching and the two indicators Precision
and Recall are used for this purpose. Precision is the percentage of the matching links discovered
by the process that are correct, compared to an interactive matching. (Beeri et al., 2004). Recall
is the percentage of the correct matching links (defined by an interactive matching) that are
actually discovered by the automatic process. Evaluation is done in terms of the number of
features.
Table 1. Qualitative evaluation of the matching process
Dataset1
Two
Three
Dataset 2
Dataset3
Two
Two
Three
Three
Dataset 4
Dataset 5
Two
Two
Three
Three
Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria Criteria
Precision 100%
100%
96.4%
99.1%
97%
98.3%
100%
100%
97%
99.7%
Recall
99%
95.9%
99.1%
97%
97.5%
97.7%
97.7%
96.7%
98.7%
99%
11
The first and the second line of the table 1 show Precision and Recall values for each dataset
when two (geometric and toponym) and three criteria (geometric, toponym and semantic) are
used, respectively. Both Precision and Recall improve when three criteria are used for all
datasets. These results are satisfactory and show the importance of the semantic properties during
matching. In both Precision and Recall, conflict (0.7% from all matching results) is not taken
into account, leading to less elevated Recall values.
Figure 2 illustrates two matching results obtained using this approach. We observe from figure
2a) that two homologous features are matched even if the two toponyms are written in two
different languages. The figure 2b) shows that matching does not necessarily match the closest
candidate.
Figure 2. Results of the data matching process
6. Conclusion
In this paper, we presented data matching based on the Dempster-Shafer Theory. We tested it on
five different datasets representing relief data. A high Precision and a Recall larger than 95% are
obtained and they increase when three criteria are combined.
12
Acknowledgment
The author would like to thank Sébastien Mustière for the reviews of this paper and the quality
of his advices.
References
Abadie, N., Olteanu, A-M., and S. Mustière. 2007. Comparaison de la nature d’objets
géographiques. Paper presented at the meeting of the Ingénierie des Connaissances, Grenoble.
Appriou, A. 1991. Probabilités et incertitudes en fusion de données multi-senseurs. Revue
Scientifique et Technique de la Défense 11:27-40.
Bel Hadj Ali, A. 2001. Qualité géométrique des entités géographiques surfacique–Application à
l’appariement et définition d’une typologie des écarts géométriques. PhD diss., Marne la
Vallée.
Beeri, C., Kanza, Y., Safra, E., and Y. Sagiv. 2004. Object Fusion in Geographic Information
Systems. In Proceedings of the 30th VLDB Conference, Toronto, Canada, 816-827.
Cohen, W.W., Ravikumar, P., and S.E. Fienberg. 2003. A Comparison of String Distance
Metrics for Name-Matching Tasks. In Proceedings of the International Joint Conference on
Artificial Intelligence, Washington, DC, USA, pp. 73-78.
Devogele, T. 1997. Processus d’intégration et d’appariement de bases de données Géographiques
-Applications à une base de données routières multi-échelles. PhD diss., Université de
Versailles.
Devogele, T., Parent, C., and S. Spaccapietra. 1998. On spatial database integration.
International Journal of Geographical Information Science 12:335-352.
Gesbert, N. 2005, Formalisation des spécifications de bases de données géographiques en vue de
13
leur intégration, PhD diss., Marne la Vallée.
Levenshtein, V. 1965. Binary Codes Capable of Correcting Deletions, Insertions and Reversals.
Soviet Physics-Doklady 10-8:707-710.
Mustière, S. 2006. Results of Experiments on Automated Matching of Networks at Different
Scales. In Proceedings of the ISPRS Workshop on Multiply Representation and
Interoperability of Spatial Data, Hanover, Germany, pp. 92-100.
Olteanu, A-M., Mustière, S., and A. Ruas. 2006. Matching Imperfect Data. In Proceedings of
International Symposium on Spatial Accuracy Assessment in Natural Resources and
Environmental Sciences, Lisbon, Portugal, pp. 694-704.
Royère, C., Gruyer, D., and V. Cherfaoui. 2000. Data Association with Belief Theory. In
Proceedings of International Conference of Information Fusion, Paris, France, pp. 23-29.
Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton University Press, Princeton.
Sherren, D., Mustiere, S., and J-D.Zucker. 2004. How to Integrate Heterogeneous Spatial
Databases in a Consistent Way? In Proceedings of Advanced Databases and Information
Systems, Budapest, Hungary, 364-378.
Smets, P., and R. Kennes. 1994. The Transferable Belief Model. Artificial Intelligence, 66:191234.
Voltz, S.2006. An Iterative Approach for Matching Multiple Representations of Street Data. In
Proceedings of the ISPRS Workshop on Multiply Representation and Interoperability of
Spatial Data, Hanover, Germany, 101-110.
Walter, V. and D. Fritsch. 1999. Matching Spatial Data Sets: Statistical Approach. International
Journal of GIS 13:445-473.