A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining
Roberto Espinosa¹, José Zubcoff², and Jose-Norberto Mazón³

¹ University of Matanzas, Cuba
[email protected]
² Dept. of Sea Sciences and Applied Biology, University of Alicante, Spain
[email protected]
³ Dept. of Software and Computing Systems, University of Alicante, Spain
[email protected]
Abstract. A successful data mining process depends on the data quality
of the sources in order to obtain reliable knowledge. Therefore, preprocessing data is required for dealing with data quality criteria. However,
preprocessing data has been traditionally seen as a time-consuming and
non-trivial task since data quality criteria have to be considered without
any guide about how they affect the data mining process. To overcome
this situation, in this paper, we propose to analyze the data mining
techniques to know the behavior of different data quality criteria on the
sources and how they affect the results of the algorithms. To this aim, we
have conducted a set of experiments to assess three data quality criteria:
completeness, correlation and balance of data. This work is a first step
towards considering, in a systematic and structured manner, data quality criteria for supporting and guiding data miners in obtaining reliable
knowledge.
1 Introduction
Data mining is the process of applying data analysis and discovery algorithms to
find knowledge patterns over a collection of data [6]. This process necessarily implies a deep understanding of the available sources and the application domain
in order to determine the best algorithm to be applied. Due to this fact, the
development of data mining applications has been traditionally considered as a
truly hand-crafted process in which success is highly dependent on the analyst's knowledge, experience and skill. Unfortunately, the larger the amount of data available (either intensionally or extensionally), the more complex and unattainable the process becomes.
Moreover, data mining is only a step of an overall process named knowledge
discovery in databases (KDD). KDD consists of using databases in order to apply
data mining to a set of already preprocessed data and also to evaluate the resulting patterns to extract knowledge. Indeed, the importance of the preprocessing
task should be emphasized due to the fact that (i) it has a significant impact on
the quality of the results of the applied data mining algorithms [10], and (ii) it
requires significantly more effort than the data mining task itself [8]. Although the preprocessing task has traditionally been related to cleaning the data, from a data mining perspective there exist other data quality criteria that must be considered, due to the fact that they also affect the KDD process. Therefore, the purpose of
this paper is to analyze the data mining techniques to understand the behavior
of different data quality criteria on the sources and how they affect the results
of the data mining algorithms. It is worth noting that this work is not intended
to improve the data quality of the sources, but to deal with it. In this way, the
data quality of the sources would help to determine the best possible algorithm
to be applied to the sources to obtain coherent results. Our motivation is that
the selection of a suitable algorithm to boost the data mining results should be
based on the analysis of existing data quality of the sources instead of wasting
resources in trying to improve data quality. Furthermore, our approach would
guide users through a proper algorithm selection in order to obtain reliable results, ensuring that the patterns discovered in the data sources result in well-informed decisions regardless of the data quality of the sources. Finally, due to the
complexity of the different data mining techniques, this work only focuses on
classification techniques.
The remainder of this paper is structured as follows. Section 2 briefly describes
the related work. Section 3 describes the experiments carried out in order to
consider data quality of sources in classification techniques. A discussion of the results is presented in Sect. 3.3. Finally, Sect. 4 presents our conclusions and
sketches out future work.
2 Related Work
The KDD process (Fig. 1) is summarized in three phases: (i) data integration in a
repository, also known as preprocessing data or ETL (Extract/Transform/Load)
phase in the data warehouse area, (ii) algorithms and attributes selection phase
for data mining (i.e., the core of KDD), and (iii) the analysis and evaluation of
the resulting patterns in the final phase.
Each phase of this process is highly dependent on the previous one. In this way, the success of the analysis phase depends on the selection of adequate attributes and algorithms. Also, this selection phase depends on the data preprocessing phase, which must eliminate any problem that affects the quality of the data.
For the first phase of the KDD process there are some proposals that address the problem of data quality from the point of view of cleaning data: (i) duplicate detection and elimination [5,1], entity identification under different labels [13], etc.; (ii) resolution of conflicts in instances [14] by using specific cleaning techniques [15]; and (iii) uncommon, missing or incomplete values that damage the data [12], among others. Several cleaning techniques have been used to solve problems such as heterogeneous structure of the data: an example is the standardization of data representation, such as dates [3].

Fig. 1. The KDD process: from the data sources to the knowledge
There are other approaches that consider data quality during the second phase of the KDD process. In [4], the authors propose a variant that provides users with all the details needed to make correct decisions. They point out that, besides the traditional reports, it is also essential to give users information about quality, for example, the quality of the metadata. In [9], the authors use metadata as a resource to store the results of measuring data quality. One of the most interesting proposals is presented in [2], where a set of measures for data quality is defined to be applied in the preprocessing stages of mining models. This work is intended to be integrated with the CWM metamodel [11]. Unfortunately, data quality criteria are not studied with the aim of guiding the selection of the appropriate data mining algorithm.
Conversely, our work considers data quality from a broader perspective, thus assuming that not only data cleaning is important for data mining to obtain reliable knowledge, but also other data quality criteria, such as completeness, correlation or balance. Therefore, in this work, we propose to conduct a set of experiments to analyze the behavior of different data quality criteria on the sources and how they affect the results of data mining algorithms. For the sake of understandability, this paper is only concerned with classification techniques and three data quality criteria: completeness, correlation and balance of data.
3 Determining Data Quality Measures for Classification Techniques
Traditionally, several data quality criteria related to data cleaning have been
considered in the preprocessing phase of KDD [7]. However, other data quality
A Set of Experiments to Consider Data Quality Criteria
683
criteria should be addressed in data mining according to some hints given in [6]. On the one hand, missing or noisy data should be considered (which is related to the completeness data quality criterion) and, on the other hand, complex relationships among data should be detected (which is related to correlation and balance).
Bearing these issues in mind, three quality criteria have been studied in this paper: completeness, correlation, and balance. These criteria should be taken into account when selecting attributes to apply classification mining techniques in order to obtain better knowledge patterns. As the resulting patterns are not always useful, since their reliability or consistency may not be assured, considering these criteria helps to find useful knowledge.
Therefore, we have conducted some experiments using a real case study, in order to check how these three novel data quality criteria may be related to the discovery of non-useful, non-reliable or even inconsistent knowledge when classification mining techniques are applied to an inadequate data set. An overview of how we have conducted the experiments is given in Fig. 2.
Our case study contains data about players' behavior in games from several basketball tournaments of the Cuban Basketball League, including data from the "III Juegos Deportivos del ALBA" (http://www.juegosalba.cu), an international tournament (male and female categories) that took place in Ciego de Ávila (Cuba) in April 2009. Data from our case study are stored in a database according to the schema shown in Fig. 3. In those tournaments, the data for each game and player contain a set of indicators (see Table 1). Besides, they take into account basic player data such as: sex, height, weight, and so on. For each player, their position (Table 2) and the global nominal evaluation (Weak Integral, Integral and Very Integral) are known. This evaluation is established from the performance obtained in each game, taking into account both the position of each player and the offensive and defensive indicators.
Fig. 2. Overview of our experiments for considering data quality criteria in classification techniques
684
R. Espinosa, J. Zubcoff, and J.-N. Mazón
Fig. 3. Excerpt of our database schema
Although our database is rather small, it is useful for generating several classification models, thus showing how a selection of non-suitable attributes leads to useless, non-precise or non-valid patterns.
3.1 Data Quality Issues in Classification Techniques
According to the hints given in [6], some of the data quality criteria that may affect the results of classification techniques are: (i) correlation between selected attributes (input and predict), (ii) (un)balanced data, and (iii) (in)complete data (presence of null values). We describe these three data quality criteria next.
Correlation between selected attributes. If an input attribute depends on or is related to the attribute to predict, and this relationship is known a priori, then the resulting pattern will be useless, since it is previously known. The main reason is that the resulting patterns of a classification technique try to group input attributes depending on their relation with the attribute selected as predict (the dependent variable in statistics).
Although it is relatively easy to identify the correlation between attributes in low-dimensional sets of data, it is a complex problem when dealing with high-dimensional data, and it should be solved in a formal manner.
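To illustrate how such a screening could be automated, the following is a minimal sketch (not part of the original experiments) that flags strongly correlated attribute pairs; the DataFrame, column names and threshold are assumptions for illustration.

import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    # Pairwise Pearson correlations over the numeric attributes.
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:  # threshold chosen for illustration
                pairs.append((a, b, r))
    return pairs

# Hypothetical usage with the indicators of Table 1:
# df = pd.read_csv("players.csv")  # columns TLE, PA1, PorCTLibre, ...
# for a, b, r in correlated_pairs(df):
#     print(f"{a} and {b} are correlated (r = {r:.4f}); avoid using both.")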
Table 1. Set of statistical indicators

Group     | Indicator Name                                                          | Abbreviation
Defensive | Steals Balls                                                            | BG
Defensive | Defensive Rebounds                                                      | RD
Defensive | Failing in the confrontation an adversary that penetrates toward the basket | FE
Defensive | No recovering rebounds after the team's launching rival                 | NR
Defensive | Defensive assistances                                                   | AD
Offensive | Assistances                                                             | A
Offensive | Balls' Turnovers                                                        | PB
Offensive | Wrong Free Throws                                                       | TLE
Offensive | 2 Points Wrong Throws                                                   | TE2
Offensive | 3 Points Wrong Throws                                                   | TE3
Offensive | Offensive Rebounds                                                      | RO
Offensive | Free Throws Points                                                      | PA1
Offensive | 2 Points Throws Made                                                    | PA2
Offensive | 3 Points Throws Made                                                    | PA3
Offensive | Percentage of effectiveness in free throws                              | PorCTLibre
Offensive | Percentage of effectiveness in 2 points throws                          | PorCT2
Offensive | Percentage of effectiveness in 3 points throws                          | PorCT3
Offensive | Percentage of effectiveness in Field Goal Throws                        | PTCampo
Offensive | Permanence at the playing field                                         | PC
Table 2. The players' positions

Position | Abbreviation | Function
Guard    | G            | Organizing the game and helping his associates
Forward  | F            | Taking the team's offensive weight
Pivot    | P            | Guaranteeing the low game basket
Balanced data. Data are unbalanced when the different classes of an attribute appear with very different quantities of instances. It would be desirable to have the same distribution of the different labels for the attributes (balanced data) in order to obtain successful classification mining techniques. Again, regardless of how balanced the data are, results can be obtained by applying classification mining. However, the results are likely to be biased due to over-fitting on the unbalanced data. Therefore, the ideal situation is to have a similar number of instances for each label in the dataset (balanced data). As unbalanced data affect the reliability of the resulting patterns, our objective is to consider this data quality criterion by taking into account the attributes selected as input for an algorithm and the available data in the sources.
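As a simple illustration of how (un)balance can be quantified before mining, the following sketch computes the class distribution and an imbalance ratio; the ratio is an illustrative measure, not one defined in this paper.

from collections import Counter

def class_distribution(labels):
    # Instances per class and the ratio of the largest to the smallest class.
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Illustrative check with the position counts of Table 4 (136/152/132):
labels = ["Guard"] * 136 + ["Forward"] * 152 + ["Pivot"] * 132
counts, ratio = class_distribution(labels)
print(counts, f"imbalance ratio = {ratio:.2f}")  # close to 1 => balanced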
Completeness. Completeness refers to the presence of null values in the data sources, which obviously affects the resulting patterns. In this case, our objective is to check how the presence of null values influences the results of classification techniques.
3.2 Experiments
This section describes the experiments we conducted in order to check whether useless, unreliable or even inconsistent results arising from the application of classification algorithms are related to the data quality of the sources.
Experiments about correlation between selected attributes. Table 3 shows the results of applying the linear regression algorithm between the PA1, TLE and PorCTLibre indicators, with a correlation coefficient of 0.7283. This value indicates that the attributes are correlated. In a similar way, we applied linear regression between TE2, PA2 and PorCT2; TE3, PA3 and PorCT3; and TE2, TE3, PA2, PA3 and PTCampo. These results lead us to discard the attributes PorCTLibre, PorCT2, PorCT3 and PTCampo, since they are useless in classification techniques due to the fact that all the other attributes are used for calculating them. Therefore, attributes computed from other attributes must not be used in classification mining techniques. Consequently, our initial hypothesis was corroborated for this case: correlated attributes with a previously known relationship lead to the discovery of useless knowledge. As a result, domain knowledge is an important factor for being successful in classification techniques.
Table 3. Obtained results of applying the Linear Regression algorithm between TLE, PA1 and PorCTLibre

PorCTLibre = -6.1835 * TLE + 16.7723 * PA1 + 14.049

Time taken to build model: 0.19 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient          0.7283
Mean absolute error             18.886
Root mean squared error         24.3515
Relative absolute error         59.2649 %
Root relative squared error     68.4085 %
Total Number of Instances      420
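The check reported in Table 3 was run in Weka; the following sketch approximates it with scikit-learn (an assumption, since the paper used Weka), fitting the same linear model and computing the correlation coefficient between cross-validated predictions and the actual values.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def correlation_coefficient(X, y, folds=10):
    # Pearson correlation between cross-validated predictions and the true
    # values, analogous to Weka's 'Correlation coefficient' in Table 3.
    pred = cross_val_predict(LinearRegression(), X, y, cv=folds)
    return np.corrcoef(pred, y)[0, 1]

# Hypothetical usage: X holds the TLE and PA1 indicators, y the derived
# PorCTLibre attribute (the dataset itself is not publicly available).
# r = correlation_coefficient(X, y)  # a value near 0.73, as in Table 3,
# print(f"r = {r:.4f}")              # means PorCTLibre should be discarded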
Experiments about completeness. A set of data containing all the players' heights and their respective positions was selected (see Table 4). Our experiments only use the male players' records in order to avoid the influence of the sex attribute on the classifier. The J48 classification algorithm was applied (see Table 5).
Our first case (Table 5) is concerned with the original data, and the classifier correctly classified 84% of the instances.
Table 4. Male players' statistical data in the data source

Position | Instances
Guard    | 136
Forward  | 152
Pivot    | 132
Total    | 420
For the second and third cases, the set of data contains an equivalent quantity of null values for each class: 10% and 30%, respectively. The results obtained in these cases suggest the same behavior: as the presence of null values increases, the percentages of correct classification decrease. Bearing in mind the total amount of null values added to the original dataset, we can conclude that the presence of null values homogeneously added to a dataset does not significantly affect the classifier as a whole.
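As an illustration of how such degraded datasets could be produced, the sketch below injects the same proportion of null values into every class, mirroring the homogeneous 10% and 30% cases; the function and column names are assumptions.

import numpy as np
import pandas as pd

def add_nulls_per_class(df, class_col, target_col, fraction, seed=0):
    # Set `fraction` of `target_col` to null within every class, mirroring
    # the homogeneous 10% and 30% cases of the experiment.
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, group in df.groupby(class_col):
        n = int(len(group) * fraction)
        idx = rng.choice(group.index, size=n, replace=False)
        out.loc[idx, target_col] = np.nan
    return out

# e.g. degraded = add_nulls_per_class(players, "Position", "height", 0.10)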
The aim of the fourth case was to classify one class with a greater quantity of null values. We applied the J48 algorithm to a set of data where the quantity of null values is 50% of the records, and all null records correspond to only one class, in this case Guard. Intuitively, as the quantity of null values becomes larger, the percentage of correct classification diminishes. In this case, 62, 134 and 97 instances of Guard, Forward and Pivot were correctly classified, representing 45%, 88% and 73%, respectively. The lowest percentage was the one belonging to the class with the greatest quantity of null instances for the attribute height, as we can deduce from Table 5. The overall percentages of instances correctly classified hardly vary, but if the number of instances correctly classified belonging to the affected class is checked, then the impact becomes evident. As can be appreciated in Table 5, there exists a class Goalkeeper with zero impact on the results because there are no instances of this class in the dataset. Therefore, classes with no presence in the data source must be eliminated from the input.
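The per-class effect described above can be observed through per-class recall rather than global accuracy. A sketch follows, with scikit-learn's CART decision tree standing in for Weka's J48 (both build C4.5-style trees, but they are not identical).

from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def per_class_report(X, y, folds=10):
    # Cross-validated per-class recall, which exposes the drop for the
    # class whose instances carry the injected null values.
    pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=folds)
    return classification_report(y, pred)

# With 50% of the Guard heights set to null (imputed or dropped beforehand
# when the tree implementation does not accept NaN inputs), the Guard recall
# falls far below Forward and Pivot, as in column 4 of Table 5.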
Experiments about balanced data. A set of unbalanced data was selected in such a way that the class Guard has more instances. When the J48 algorithm is applied for classification, the results (see Fig. 4) may be considered good: 73.4375% of instances were correctly classified. However, the Confusion Matrix shows that the correctly classified instances concentrate on the class with the highest presence in the dataset. Therefore, the influence of the prevailing class over the rest of the dataset can be recognized in the result.
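The following is a sketch of how this bias can be made visible through the confusion matrix, analogous to the inspection performed on Fig. 4 (again with a scikit-learn classifier standing in for J48).

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def inspect_bias(X, y, classes=("Guard", "Forward", "Pivot")):
    # Rows are true classes, columns are predictions; a dominant column
    # for the majority class reveals over-prediction of that class.
    pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=10)
    return confusion_matrix(y, pred, labels=list(classes))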
Results obtained by using different algorithms of classification. In this section, several experiments applying different algorithms of various classification techniques are described, with the aim of determining to what extent the data quality criteria affect the results.
Table 5. Results obtained by applying the J48 classification algorithm as the quantity of null values in the data set varies

                                 | Instances | 1-Correctly classified instances | 2-Correctly classified with 10% null values for each class | 3-Correctly classified with 30% null values for each class | 4-Correctly classified with 50% null values for Guard class
Guard                            | 136       | 122 (89.7 %)  | 111 (81.6 %)  | 86 (63.2 %)   | 62 (45.6 %)
Forward                          | 152       | 134 (88.2 %)  | 112 (73.7 %)  | 83 (54.6 %)   | 134 (88.2 %)
Pivot                            | 132       | 97 (73.5 %)   | 87 (65.9 %)   | 68 (51.5 %)   | 97 (73.5 %)
Goalkeeper                       | 0         | -             | -             | -             | -
Total instances                  | 420       |               |               |               |
Correctly classified instances   |           | 353 (84 %)    | 310 (73.8 %)  | 237 (80.9 %)  | 293 (83.2 %)
Incorrectly classified instances |           | 67 (16 %)     | 68 (16.2 %)   | 56 (19.1 %)   | 59 (16.8 %)
Ignored class unknown instances  |           | 0             | 42            | 127           | 68
The tool used for obtaining these results was Weka (http://www.waikato.ac.nz/ml/weka/). The objective was to show whether the behavior of different classification algorithms depends on the data quality criteria.
In this experiment, we first (i) selected an ideal dataset taking into account the experts' knowledge, that is, without any of the problems previously found, in such a way that the best possible classification is obtained. Next, (ii) the algorithms listed in Table 6 were applied to the Original column (without any of the quality issues previously described) and to a correlated model. A consecutive number was added to each algorithm to indicate its ranking of correct classifications compared to the Original one. The row containing the number 1 points to the algorithm with the best result in relation to the results obtained with the original file.
The next step was to set up different levels of error for each one of the data quality criteria in the original file. The levels and the criteria were as follows.
1. Completeness:
(a) 10% of null data for each existent class of the attribute to predict
(b) 30% of null data for each existent class of the attribute to predict
(c) 50% of null data for the class Guard (Table 5)
2. Correlation:
(a) Inclusion of input attributes that are known in advance to be correlated (Table 6)
Fig. 4. Results obtained by applying the classifier to the unbalanced dataset
3. Balance and unbalance: we performed the experiments with three different datasets, considering that in our case study there exist three classes of the attribute to predict:
(a) Dataset with the players denominated Guard unbalanced in relation to Forward and Pivot
(b) Dataset with the Forward players unbalanced with respect to Guard and Pivot
(c) Dataset with the Pivot players unbalanced with respect to Guard and Forward
For each one, the quantity of elements of the unbalanced class is modified, distributing the quantity of records of every class according to the following percentages of the total: 40, 30 and 30; 50, 25 and 25; 60, 20 and 20 (see Table 7 and the sketch below).
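The following is a sketch of how such distributions could be produced by resampling; the helper and its parameters are illustrative assumptions.

import pandas as pd

def resample_to_shares(df, class_col, shares, total, seed=0):
    # Build a dataset whose classes follow the given shares, e.g.
    # {"Guard": 0.50, "Forward": 0.25, "Pivot": 0.25}.
    parts = []
    for label, share in shares.items():
        group = df[df[class_col] == label]
        n = int(total * share)
        # Sample with replacement when the requested size exceeds the class.
        parts.append(group.sample(n=n, replace=n > len(group),
                                  random_state=seed))
    return pd.concat(parts).reset_index(drop=True)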
4. Unbalance and null values:
(a) Unbalanced with 10% of null data for each existent class of the attribute to predict
(b) Unbalanced with 30% of null data for each existent class of the attribute to predict
(c) Unbalanced with 50% of null data for each existent class of the attribute to predict
5. Unbalance and correlation: we used the same files as for the unbalance variant, but now introducing the correlated attributes.
6. Correlation and null values: we used the same files as for the null values variant, but now introducing the correlated attributes.
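Overall, the experimental grid amounts to a loop over degraded datasets and algorithms, ranking each algorithm by accuracy as in the consecutive numbers of Tables 6-8. The following sketch shows such a driver with scikit-learn analogues standing in for the Weka algorithms (the mapping is an assumption; results will not reproduce the tables exactly).

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Rough scikit-learn analogues of some of the Weka algorithms in Table 6.
ALGORITHMS = {
    "J48": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "Logistic": LogisticRegression(max_iter=1000),
}

def rank_on_dataset(X, y):
    # Mean 10-fold cross-validated accuracy per algorithm, ranked best-first.
    scores = {name: cross_val_score(clf, X, y, cv=10).mean()
              for name, clf in ALGORITHMS.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# for name, (X, y) in degraded_datasets.items():  # hypothetical collection
#     print(name, rank_on_dataset(X, y))          # of the files listed above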
Table 6. Obtained results when classifying the original dataset and when adding correlated attributes

Technique | Algorithm            | Original (%) | Correlated Dataset (%)
Rules     | ConjunctiveRule      | 61.19        | 7-62.50
Rules     | Decision Table       | 80.00        | 3-94.79
Rules     | DTNB                 | 80.00        | 5-91.67
Rules     | JRip                 | 80.00        | 4-95.83
Rules     | NNge                 | 73.57        | 6-79.17
Rules     | OneR                 | 73.33        | 8-70.83
Rules     | Part                 | 80.48        | 1-96.88
Rules     | Ridor                | 78.81        | 2-93.75
Rules     | ZeroR                | 36.19        | 9-31.25
Functions | Logistic             | 76.67        | 5-76.04
Functions | MultilayerPerceptron | 78.81        | 4-82.29
Functions | RBFNetwork           | 76.67        | 3-80.21
Functions | SimpleLogistic       | 76.19        | 2-82.29
Functions | SMO                  | 76.67        | 1-83.33
Trees     | BFTree               | 69.00        | 4-90.00
Trees     | FT                   | 79.00        | 5-91.00
Trees     | J48                  | 73.00        | 2-97.00
Trees     | J48graft             | 71.00        | 1-97.00
Trees     | Random Forest        | 75.00        | 3-98.00
Table 7. Results of the classifier for Trees with unbalanced data

Class   | Algorithm     | Original (%) | 40, 30 and 30% | 50, 25 and 25% | 60, 20 and 20%
Guard   | BFTree        | 80.70        | 2-80.50        | 5-69.80        | 5-67.90
Guard   | FT            | 77.10        | 1-80.70        | 1-69.30        | 1-69.80
Guard   | J48           | 81.40        | 5-80.00        | 2-71.90        | 2-71.70
Guard   | J48graft      | 81.40        | 5-80.00        | 2-71.90        | 2-71.70
Guard   | Random Forest | 80.70        | 4-80.20        | 4-70.00        | 4-69.00
Forward | BFTree        | 80.70        | 4-78.30        | 4-73.57        | 1-82.38
Forward | FT            | 77.10        | 1-78.10        | 1-73.10        | 2-78.57
Forward | J48           | 81.40        | 2-80.71        | 2-74.52        | 4-81.19
Forward | J48graft      | 81.40        | 2-80.71        | 2-74.52        | 5-81.19
Forward | Random Forest | 80.70        | 4-78.33        | 5-71.90        | 3-81.67
Pivot   | BFTree        | 80.70        | 3-80.95        | 5-76.43        | 2-74.76
Pivot   | FT            | 77.10        | 1-80.00        | 1-76.43        | 1-77.14
Pivot   | J48           | 81.40        | 4-78.33        | 2-78.33        | 3-75.00
Pivot   | J48graft      | 81.40        | 4-78.33        | 2-78.33        | 3-75.00
Pivot   | Random Forest | 80.70        | 2-82.14        | 4-77.14        | 5-73.81
Table 8. Results of the classifier for Functions with unbalanced data and null values

Class   | Algorithm             | Original (%) | Unbalance and 10% null values | Unbalance and 30% null values | Unbalance and 50% null values
Guard   | Logistic              | 76.67        | 3-74.42 | 1-76.92 | 2-74.24
Guard   | Multilayer Perceptron | 78.81        | 5-66.28 | 2-76.92 | 4-72.73
Guard   | RBFNetwork            | 76.67        | 4-70.93 | 4-69.23 | 5-65.15
Guard   | SimpleLogistic        | 76.19        | 2-74.42 | 5-66.67 | 1-74.24
Guard   | SMO                   | 76.67        | 1-75.58 | 3-71.79 | 3-72.73
Forward | Logistic              | 70.49        | 1-83.64 | 4-78.82 | 5-78.69
Forward | Multilayer Perceptron | 70.49        | 1-83.64 | 1-82.35 | 2-83.61
Forward | RBFNetwork            | 63.93        | 5-70.91 | 5-70.59 | 4-75.41
Forward | SimpleLogistic        | 75.41        | 3-85.45 | 3-85.88 | 3-88.52
Forward | SMO                   | 72.13        | 4-81.82 | 2-83.53 | 1-91.80
Pivot   | Logistic              | 71.05        | 1-73.91 | 4-60.78 | 4-55.26
Pivot   | Multilayer Perceptron | 84.21        | 4-78.26 | 2-76.47 | 3-71.05
Pivot   | RBFNetwork            | 81.58        | 5-63.77 | 5-66.67 | 5-63.16
Pivot   | SimpleLogistic        | 82.89        | 3-81.16 | 1-80.39 | 1-73.68
Pivot   | SMO                   | 84.21        | 2-82.61 | 2-76.47 | 2-73.68

3.3 Results
As can be derived from the experiments on null values (Table 5), the per-class percentages of correct classification diminish as null values are added to the dataset; nevertheless, the presence of null values homogeneously added to a set of data does not significantly affect the overall results of the classifier. The overall percentages of instances correctly classified hardly vary, but when the null values are concentrated in one class, the percentage of correctly classified instances of that class decreases. This analysis was possible because we knew the original quantity of instances of each class in the data sources. Such knowledge is not available when facing a real source of data: when the quantity of instances of each class is unknown, only the percentage of null values over the total number of elements is known, and multiple distributions of null values with the same overall percentage could appear. That is, when we add 30% of null values for a single class (Guard), according to our data sources, it represents almost the same amount as when we add 10% of null values for each class (Guard, Forward and Pivot), yet the results obtained were very different. Consequently, the quantity of correctly classified elements of each class in the data sources has to be considered when checking the effect of null values. It can also be appreciated that, as the quantity of null values grows in a non-homogeneous manner, another problem is introduced: the unbalance between classes. Finally, the results show a notable impact due to the presence of the unbalanced class in relation to the rest of the classes (see Table 5).
For the case of the experiments with unbalanced data (Table 7), the situation depends on the quantity of instances of each class that are correctly classified in the data sources. Results are thus correct if the distribution of the data for each class is as uniform as possible. Taking the observed results into account, along with the quantity of combinations in the data sources, a possible solution is to determine whether data are unbalanced or not according to the quantity of different existent classes. Unbalanced data have an impact at a global level. In some cases, the changes added to the dataset imply an overfitting in the obtained classification model. A deeper analysis confirms that the FT algorithm always obtains the best results with regard to the other algorithms. This allows us to state that, for this case study, the use of the FT algorithm can be suggested when the data sources are unbalanced.
For the case of correlation (Table 6), the percentages of classification were ostensibly better, given that the attribute added as input was the most influential one. The solution would be to create a mechanism that alerts the user when strongly correlated attributes are introduced as input/output.
When applying the classification to unbalanced and incomplete data (Table 8), it can be noticed that the classification percentages diminish, in a general way, as larger quantities of null values are used. Classification results improve only in those cases where the null values were homogeneously added, confirming that the quantity of data of the unbalanced class has an influence.
We have detected three aspects related to the quality of data for classification techniques that should be discussed in the early stages of KDD. In short, the data selected for the application of a classification mining technique should (i) avoid correlated data, (ii) avoid highly unbalanced data, and finally, (iii) be selected according to the context of the problem. Correlated data increase the complexity of the resulting pattern without providing useful information. Unbalanced data provide unreliable, overfitted patterns. Finally, input data not selected according to domain criteria can result in more complex patterns or even in non-useful or non-novel information. The experiments accomplished allowed us to reach our objective: to know the behavior of the different subsets of classification techniques and algorithms on different specific datasets, allowing us to assert that it is essential to consider data quality in a systematic manner.
4 Conclusions and Future Work
A successful data mining process depends on the data quality of the sources in
order to obtain reliable knowledge. Traditionally, a preprocessing step has been
done for preparing data for data mining. This step has been widely based on data
quality criteria related to data cleaning (such as duplication detection). However,
from a broader perspective, there exist other data quality criteria that must be
considered due to the fact that they affect the knowledge discovery. In this paper,
we have conducted a set of experiments to show that completeness, correlation
and balance of data also affect the data mining results. The results of these
experiments would be useful for guiding users in selecting a proper algorithm in order to obtain reliable results regardless of the data quality of the sources. Therefore, our point of view of data quality for data mining is not to ameliorate the data quality of the sources, but to deal with the quality that is present in the data. In this way, the selection of a suitable algorithm to boost the data mining results should be based on the analysis of the existing data quality of the sources instead of wasting resources in trying to improve data quality. This work is a first step towards considering, in a systematic and structured manner, data quality criteria for supporting and guiding data miners in obtaining reliable knowledge. Therefore, our immediate future work consists of conducting more experiments, focusing on the feasibility of our approach when dealing with complex data: not only because of the different kinds of data types (XML, multimedia, and so on), but also because of the high dimensionality of complex data.
Acknowledgments. This work has been partially supported by the MANTRA project from the Valencia Ministry of Education and Science (GV/2011/035) and from the University of Alicante (GRE09-17), the SERENIDAD project (PEII-110327-7035) from Junta de Comunidades de Castilla-La Mancha, and the MESOLAP project (TIN2010-14860) from the Spanish Ministry of Education and Science. The authors would like to express their gratitude to Emilio Soler for his helpful comments and discussion.
References
1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data
warehouses. In: VLDB, pp. 586–597. Morgan Kaufmann, San Francisco (2002)
2. Berti-Equille, L.: Measuring and modelling data quality for quality-awareness in
data mining. In: Guillet, F., Hamilton, H.J. (eds.) Quality Measures in Data Mining. Studies in Computational Intelligence, vol. 43, pp. 101–126. Springer, Heidelberg (2007)
3. Bharat, K., Broder, A., Dean, J., Henzinger, M.R., Borkar, V., Deshmukh, K., Sarawagi, S., Knoblock, C., Lerman, K., Minton, S., Muslea, I., Vassiliadis, P., Vagena, Z., Skiadopoulos, S., Karayannidis, N., Sellis, T., Lomet, D.B., Gravano, L., Levy, A., Weikum, G.: Special Issue on Data Cleaning. IEEE Data Eng. Bull. 23(4) (2000)
4. Chiang, R.H.L., Barron, T.M., Storey, V.C.: Reverse engineering of relational
databases: Extraction of an eer model from a relational database. Data Knowl.
Eng. 12(2), 107–142 (1994)
5. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A
survey. IEEE Transactions on Knowledge and Data Engineering 19, 1–16 (2007)
6. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data
mining: Towards a unifying framework. In: KDD, pp. 82–88 (1996)
7. González-Aranda, P., Ruiz, E.M., Millán, S., Ruiz, C., Segovia, J.: Towards a
methodology for data mining project development: The importance of abstraction.
In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, C.J. (eds.) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol. 118, pp. 165–178.
Springer, Heidelberg (2008)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann,
San Francisco (2000)
9. Jarke, M., Vassiliou, Y.: Data warehouse quality: A review of the DWQ project. In: Strong, D.M., Kahn, B.K. (eds.) IQ, pp. 299–313. MIT, Cambridge (1997)
10. Kriegel, H.P., Borgwardt, K.M., Kröger, P., Pryakhin, A., Schubert, M., Zimek,
A.: Future trends in data mining. Data Min. Knowl. Discov. 15(1), 87–97 (2007)
11. Object Management Group: Common Warehouse Metamodel Specification 1.1,
http://www.omg.org/cgi-bin/doc?formal/03-03-02
12. Rahm, E., Do, H.: Data cleaning: Problems and current approaches. IEEE Data
Eng. Bull. 23(4), 3–13 (2000)
13. Strong, D.M., Lee, Y.W., Wang, R.Y.: 10 potholes in the road to information
quality. IEEE Computer 30(8), 38–46 (1997)
14. Strong, D.M., Lee, Y.W., Wang, R.Y.: Data quality in context. Commun.
ACM 40(5), 103–110 (1997)
15. Troyanskaya, O.G., Cantor, M., Sherlock, G., Brown, P.O., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)