A Set of Experiments to Consider Data Quality Criteria in Classification Techniques for Data Mining

Roberto Espinosa¹, José Zubcoff², and Jose-Norberto Mazón³

¹ University of Matanzas, Cuba
[email protected]
² Dept. of Sea Sciences and Applied Biology, University of Alicante, Spain
[email protected]
³ Dept. of Software and Computing Systems, University of Alicante, Spain
[email protected]

Abstract. A successful data mining process depends on the data quality of the sources in order to obtain reliable knowledge. Therefore, preprocessing data is required for dealing with data quality criteria. However, preprocessing data has traditionally been seen as a time-consuming and non-trivial task, since data quality criteria have to be considered without any guide about how they affect the data mining process. To overcome this situation, in this paper we propose to analyze data mining techniques in order to know the behavior of different data quality criteria on the sources and how they affect the results of the algorithms. To this aim, we have conducted a set of experiments to assess three data quality criteria: completeness, correlation and balance of data. This work is a first step towards considering, in a systematic and structured manner, data quality criteria for supporting and guiding data miners in obtaining reliable knowledge.

1 Introduction

Data mining is the process of applying data analysis and discovery algorithms to find knowledge patterns over a collection of data [6]. This process necessarily implies a deep understanding of the available sources and the application domain in order to determine the best algorithm to be applied. Due to this fact, the development of data mining applications has traditionally been considered a true hand-crafted process whose success is highly dependent on the analyst's knowledge, experience and skill. Unfortunately, the higher the amount of data available (either intensionally or extensionally), the more complex and unattainable the process.

Moreover, data mining is only a step of an overall process named knowledge discovery in databases (KDD). KDD consists of using databases in order to apply data mining to a set of already preprocessed data, and also to evaluate the resulting patterns to extract knowledge. Indeed, the importance of the preprocessing task should be emphasized, due to the fact that (i) it has a significant impact on the quality of the results of the applied data mining algorithms [10], and (ii) it requires significantly more effort than the data mining task itself [8]. Although the preprocessing task has traditionally been related to cleaning the data, from a data mining perspective there exist other data quality criteria that must be considered, due to the fact that they also affect the KDD process. Therefore, the purpose of this paper is to analyze data mining techniques to understand the behavior of different data quality criteria on the sources and how they affect the results of the data mining algorithms. It is worth noting that this work is not intended to improve the data quality of the sources, but to deal with it. In this way, the data quality of the sources would help to determine the best possible algorithm to be applied to the sources to obtain coherent results.
Our motivation is that the selection of a suitable algorithm to boost the data mining results should be based on the analysis of the existing data quality of the sources, instead of wasting resources in trying to improve data quality. Furthermore, our approach would guide users through a proper algorithm selection in order to obtain reliable results, ensuring that the patterns discovered in the data sources lead to well-informed decisions regardless of the data quality of the sources. Finally, due to the complexity of the different data mining techniques, this work only focuses on classification techniques.

The remainder of this paper is structured as follows. Section 2 briefly describes the related work. Section 3 describes the experiments carried out in order to consider data quality of sources in classification techniques. Discussion about the results is presented in Sect. 3.3. Finally, Sect. 4 presents our conclusions and sketches out future work.

2 Related Work

The KDD process (Fig. 1) is summarized in three phases: (i) data integration in a repository, also known as the preprocessing or ETL (Extract/Transform/Load) phase in the data warehouse area; (ii) the algorithm and attribute selection phase for data mining (i.e., the core of KDD); and (iii) the analysis and evaluation of the resulting patterns in the final phase. Every phase of this process is highly dependent on the previous one. This way, the success of the analysis phase depends on the selection of adequate attributes and algorithms. Also, this selection phase depends on the data preprocessing phase in order to eliminate any problem that affects the quality of the data.

Fig. 1. The KDD process: from the data sources to the knowledge

For the first phase of the KDD process there are some proposals that address the problem of data quality from the point of view of cleaning data: (i) duplicate detection and elimination [5,1], entity identification under different labels [13], etc.; (ii) resolution of conflicts in instances [14] by using specific cleaning techniques [15]; (iii) uncommon, lost or incomplete values damaging data [12], among others. Several cleaning techniques have been used to solve problems such as the heterogeneous structure of the data: an example is the standardization of data representation, such as dates [3].

There are other approaches that consider data quality during the second phase of the KDD process. In [4], the authors propose a variant to provide users with all the details needed to make correct decisions. They outline that, besides the traditional reports, it is also essential to give users information about quality, for example, the quality of the metadata. In [9], the authors use metadata as a resource to store the results of measuring data quality. One of the most interesting proposals is presented in [2], where a set of measures for data quality is defined to be applied in the preprocessing stages of mining models. The aim of that work is to be integrated with the CWM metamodel [11]. Unfortunately, data quality criteria are not studied with the aim of guiding the selection of the appropriate data mining algorithm. Conversely, our work considers data quality from a broader perspective, thus assuming that not only data cleaning is important for data mining to obtain reliable knowledge, but also other data quality criteria, such as completeness, correlation or balance.
Therefore, in this work we propose to conduct a set of experiments to analyze the behavior of different data quality criteria on the sources and how they affect the results of data mining algorithms. For the sake of understandability, this paper is only concerned with classification techniques and three data quality criteria: completeness, correlation and balance of data.

3 Determining Data Quality Measures for Classification Techniques

Traditionally, several data quality criteria related to data cleaning have been considered in the preprocessing phase of KDD [7]. However, other data quality criteria should be addressed in data mining according to some hints given in [6]. On the one hand, missing or noisy data should be considered (which is related to the completeness data quality criterion) and, on the other hand, complex relationships among data should be detected (which is related to correlation and balance). Bearing these issues in mind, three quality criteria have been studied in this paper: completeness, correlation, and balance. These criteria should be taken into account when selecting attributes to apply classification mining techniques in order to obtain better knowledge patterns. As resulting patterns are not always useful, since their reliability or consistency may not be assured, considering these criteria helps to find useful knowledge. Therefore, we have conducted some experiments by using a real case study, in order to check how these three data quality criteria may be related to the discovery of non-useful, non-reliable or even inconsistent knowledge when classification mining techniques are applied on an inadequate data set. An overview of how we have conducted the experiments is given in Fig. 2.

Fig. 2. Overview of our experiments for considering data quality criteria in classification techniques

Our case study contains data about players' behavior in games from several basketball tournaments of the Cuba Basketball League, including data from the "III Juegos Deportivos del ALBA" (http://www.juegosalba.cu), an international tournament (male and female categories) that took place in Ciego de Ávila (Cuba) in April 2009. Data from our case study are stored in a database according to the schema shown in Fig. 3. In those tournaments, data for each game and player contain a set of indicators (see Table 1). Besides, basic player data such as sex, height, weight, and so on are taken into account. For each player, the position (Table 2) and the global nominal evaluation (Weak Integral, Integral and Very Integral) are known. This evaluation is established from the performance obtained in each game, taking into account both the position of each player and the offensive and defensive indicators.

Fig. 3. Excerpt of our database schema

Although our database is rather small, it is useful for generating several classification models, thus showing how a selection of non-suitable attributes leads to useless, non-precise or non-valid patterns.

3.1 Data Quality Issues in Classification Techniques

According to the hints given by [6], some of the data quality criteria that may affect the results of classification techniques are: (i) correlation between selected attributes (input and predict), (ii) (un)balanced data, and (iii) (in)complete data (presence of null values). We will describe next these three data quality criteria.
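As an illustration of how these three criteria can be inspected on a source table before mining, consider the following minimal sketch. It is only an assumption-laden illustration, not the authors' tooling: it uses pandas, a hypothetical DataFrame `players` holding the indicators of Table 1, a hypothetical target column name `evaluation`, and an assumed correlation threshold of 0.7.

```python
# Minimal profiling sketch: completeness, pairwise correlation and class
# balance of a source table, computed before any mining is attempted.
import pandas as pd

def profile_quality(df: pd.DataFrame, target: str) -> dict:
    features = df.drop(columns=[target])
    numeric = features.select_dtypes("number")
    corr = numeric.corr().abs()  # absolute Pearson correlation matrix
    return {
        # Completeness: fraction of non-null cells per attribute.
        "completeness": (1 - features.isna().mean()).round(3).to_dict(),
        # Correlation: attribute pairs above an (assumed) 0.7 threshold,
        # candidates for exclusion as input attributes.
        "correlated_pairs": [
            (a, b, round(corr.loc[a, b], 3))
            for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:]
            if corr.loc[a, b] > 0.7
        ],
        # Balance: relative frequency of each class of the attribute to
        # predict; a skewed distribution signals unbalanced data.
        "class_distribution": df[target].value_counts(normalize=True).to_dict(),
    }

# Example (hypothetical data): profile_quality(players, "evaluation")
```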
Correlation between selected attributes. If an input attribute depends on or is related to the attribute to predict, and this relationship is known a priori, then the resulting pattern will be useless, since it is previously known. The main reason is that the resulting patterns of a classification technique try to group input attributes depending on their relation with the attribute selected as predict (the dependent variable in statistics). Although it is relatively easy to identify the correlation between attributes in low-dimensional sets of data, it is a complex problem when dealing with high-dimensional data, and it should be solved in a formal manner.

Table 1. Set of statistical indicators

Group      Name                                                                          Abbreviation
Defensive  Steals Balls                                                                  BG
           Defensive Rebounds                                                            RD
           Failing in the confrontation an adversary that penetrates toward the basket   FE
           No recovering rebounds after the rival team's launching                       NR
           Defensive assistances                                                         AD
Offensive  Assistances                                                                   A
           Balls' Turnovers                                                              PB
           Wrong Free Throws                                                             TLE
           2 Points Wrong Throws                                                         TE2
           3 Points Wrong Throws                                                         TE3
           Offensive Rebounds                                                            RO
           Free Throws Points                                                            PA1
           2 Points Throws Made                                                          PA2
           3 Points Throws Made                                                          PA3
           Percentage of effectiveness in free throws                                    PorCTLibre
           Percentage of effectiveness in 2 points throws                                PorCT2
           Percentage of effectiveness in 3 points throws                                PorCT3
           Percentage of effectiveness in Field Goal Throws                              PorCTCampo
           Permanence at the playing field                                               PC

Table 2. The players' positions

Position  Abbreviation  Function
Guard     G             Organizing the game and helping his associates
Forward   F             Taking the team's offensive weight
Pivot     P             Guaranteeing the low game basket

Balanced data. Data are unbalanced when the different classes of an attribute have very different quantities of instances. It would be desirable to have similar distributions of the different labels of the attributes (balanced data) in order to obtain successful classification mining techniques. Again, regardless of how balanced the data are, results can be obtained by applying classification mining; however, the results are likely to be biased due to over-fitting of the unbalanced data. Therefore, the ideal situation is to have a similar number of instances for each label in the dataset (balanced data). As unbalanced data affect the reliability of the resulting patterns, our objective is to consider this data quality criterion by taking into account the attributes selected as input for an algorithm and the available data in the sources.

Completeness. Completeness refers to the presence of null values in the data sources, which obviously affects the resulting patterns. In this case, our objective is to check how the presence of null values influences the results of classification techniques.

3.2 Experiments

This section describes the experiments we conducted in order to check whether the useless, unreliable or even inconsistent results that may arise as a consequence of applying classification algorithms are related to the data quality of the sources.

Experiments about correlation between selected attributes. Table 3 shows the results of applying the linear regression algorithm between the PA1, TLE and PorCTLibre indicators, with a correlation coefficient of 0.7283. This value indicates that the attributes are correlated. In a similar way, we applied linear regression between TE2, PA2 and PorCT2; TE3, PA3 and PorCT3; and TE2, TE3, PA2, PA3 and PorCTCampo.
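The same correlation check can be reproduced outside Weka. The following sketch is an illustration under assumptions, not the authors' setup: it regresses PorCTLibre on TLE and PA1 with scikit-learn and computes the cross-validated correlation coefficient that Weka reports; `games` is a hypothetical DataFrame of per-player, per-game records using the Table 1 column names.

```python
# Sketch of the Table 3 check: a 10-fold cross-validated linear regression
# and the Pearson correlation between predictions and observed values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X = games[["TLE", "PA1"]].to_numpy()   # input indicators
y = games["PorCTLibre"].to_numpy()     # attribute suspected to be derived

pred = cross_val_predict(LinearRegression(), X, y, cv=10)
corr = np.corrcoef(y, pred)[0, 1]      # Weka-style "correlation coefficient"
print(f"Correlation coefficient: {corr:.4f}")
# A value close to 1 flags PorCTLibre as computable from the other
# attributes, so it should be discarded as an input for classification.
```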
These results led us to discard the attributes PorCTLibre, PorCT2, PorCT3 and PorCTCampo, since they are useless in classification techniques due to the fact that the other attributes are used for calculating them. Therefore, attributes computed from other attributes must be kept out of classification mining techniques. Consequently, our initial hypothesis was corroborated for this case: correlated attributes with a previously known relationship lead to discovering useless knowledge. As a result, domain knowledge is an important factor for being successful with classification techniques.

Table 3. Obtained results of applying the Linear Regression algorithm between TLE, PA1 and PorCTLibre

PorCTLibre = -6.1835 * TLE + 16.7723 * PA1 + 14.049
Time taken to build model: 0.19 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient        0.7283
Mean absolute error            18.886
Root mean squared error        24.3515
Relative absolute error        59.2649 %
Root relative squared error    68.4085 %
Total Number of Instances      420

Experiments about completeness. A set of data containing all the players' heights and their respective positions was selected (see Table 4). Our experiments only use the male players' records in order to avoid the influence of the sex attribute on the classifier. The J48 classification algorithm was applied (see Table 5). Our first case (Table 5) concerns the original data, and the classifier obtained 84% correct classification.

Table 4. Male players' statistical data in the data source

Position  Instances
Guard     136
Forward   152
Pivot     132
Total     420

For the second and third cases, the dataset contains an equivalent quantity of null values for each class: 10% and 30%, respectively. The results obtained in these cases suggest the same behavior: as the presence of null values increases, the percentage of correct classification decreases. Bearing in mind the total amount of null values added to the original dataset, we can conclude that a presence of null values homogeneously added to a dataset does not significantly affect the classifier.

The aim of the fourth case was to classify one class with a greater quantity of null values. We applied the J48 algorithm to a dataset where the null values amount to 50% of the records of a single class, in this case Guard. Intuitively, as the quantity of null values becomes larger, the percentage of correct classification diminishes. In this case, 62, 134 and 97 instances of Guard, Forward and Pivot were correctly classified, representing 45%, 88% and 73%, respectively. The lowest percentage belongs to the class with the greatest quantity of null instances for the attribute height, as can be deduced from Table 5. The global percentages of correctly classified instances hardly vary, but if the number of correctly classified instances belonging to the affected class is checked, then the impact becomes evident. As can be appreciated in Table 5, there exists a class Goalkeeper with zero impact on the results, since there are no instances of this class in the dataset. Therefore, classes with no presence in the data source must be eliminated from the input.

Table 5. Obtained results by applying the J48 classification algorithm as the quantity of null values in the dataset varies

Class                              Instances  1-Original    2-10% nulls   3-30% nulls   4-50% nulls
                                              data          per class     per class     for Guard
Guard                              136        122 (89.7 %)  111 (81.6 %)  86 (63.2 %)   62 (45.6 %)
Forward                            152        134 (88.2 %)  112 (73.7 %)  83 (54.6 %)   134 (88.2 %)
Pivot                              132        97 (73.5 %)   87 (65.9 %)   68 (51.5 %)   97 (73.5 %)
Goalkeeper                         0          -             -             -             -
Total instances                    420
Correctly classified instances                353 (84 %)    310 (73.8 %)  237 (80.9 %)  293 (83.2 %)
Incorrectly classified instances              67 (16 %)     68 (16.2 %)   56 (19.1 %)   59 (16.8 %)
Ignored (unknown class) instances             0             42            127           68
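The fourth case can be emulated with the sketch below. It is only an approximation under assumptions: scikit-learn's DecisionTreeClassifier stands in for Weka's J48 (both are C4.5-style decision trees, although their handling of missing values differs), rows whose injected value is null are simply dropped, roughly playing the role of the ignored instances in Table 5, and `players`, `height` and `position` are assumed table and column names.

```python
# Sketch of the completeness experiment: inject nulls into one class and
# inspect per-class accuracy afterwards.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def inject_nulls(df, attribute, cls, target, fraction):
    """Set `attribute` to NaN for `fraction` of the rows of class `cls`."""
    out = df.copy()
    rows = out.index[out[target] == cls].to_numpy()
    hit = rng.choice(rows, size=int(fraction * len(rows)), replace=False)
    out.loc[hit, attribute] = np.nan
    return out

degraded = inject_nulls(players, "height", "Guard", "position", 0.5)
kept = degraded.dropna(subset=["height"])  # incomplete rows are ignored
pred = cross_val_predict(DecisionTreeClassifier(), kept[["height"]],
                         kept["position"], cv=10)
for cls in sorted(kept["position"].unique()):
    mask = (kept["position"] == cls).to_numpy()
    acc = (pred[mask] == kept["position"].to_numpy()[mask]).mean()
    print(f"{cls}: {acc:.1%} correctly classified")
```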
Experiments about balanced data. A set of unbalanced data was selected in such a way that the class Guard has more instances. When the J48 algorithm is applied for classification, the results (see Fig. 4) may be considered good: 73.4375% of instances were correctly classified. However, the confusion matrix shows that the correctly classified instances concentrate on the class with the highest presence in the dataset. Therefore, the influence of the prevailing class over the rest of the dataset can be recognized in the result.

Fig. 4. Obtained results of applying the classifier to the unbalanced dataset

Results obtained by using different classification algorithms. In this section, several experiments applying different algorithms of various classification techniques are described, with the aim of determining to what extent the data quality criteria affect the results. The tool used for obtaining these results was Weka (http://www.waikato.ac.nz/ml/weka/). The objective was to show whether the behavior of different classification algorithms depends on the data quality criteria.

Through the development of this experiment, we first (i) selected an ideal dataset taking into account the experts' knowledge, that is, without any of the problems previously found, in such a way that the best possible classification is obtained. Next, (ii) the algorithms listed in Table 6 were applied to the original dataset (the Original column, without any of the quality issues previously described) and to a correlated model. A consecutive number was added to indicate the rank of each algorithm's correct classifications compared to the Original one: the number 1 marks the algorithm with the best result in relation to the results obtained with the original file.

The next step was to set up different levels of error for each of the data quality criteria in the original file. The levels and criteria were as follows.

1. Completeness:
(a) 10% of null data for each existing class of the attribute to predict
(b) 30% of null data for each existing class of the attribute to predict
(c) 50% of null data for the class Guard (Table 5)

2. Correlation:
(a) Inclusion of input attributes that are known in advance to be correlated (Table 6)
3. Balance and unbalance: we ran the experiments with 3 different datasets, considering that in our case study there exist 3 classes of the attribute to predict:
(a) Dataset with the players denominated Guard unbalanced in relation to Forward and Pivot
(b) Dataset with the Forward players unbalanced with respect to Guard and Pivot
(c) Dataset with the Pivot players unbalanced with respect to Guard and Forward
For each one, the quantity of elements of the unbalanced class is modified, distributing the quantity of records of every class according to the following percentages of the total: 40, 30 and 30; 50, 25 and 25; 60, 20 and 20 (see Table 7; a sketch of this resampling is given below, after the tables).

4. Unbalance and null values:
(a) Unbalanced with 10% of null data for each existing class of the attribute to predict
(b) Unbalanced with 30% of null data for each existing class of the attribute to predict
(c) Unbalanced with 50% of null data for each existing class of the attribute to predict

5. Unbalance and correlation: we used the same files as for the unbalance variant, but now introducing the correlated attributes.

6. Correlation and null values: we used the same files as for the null values variant, but now introducing the correlated attributes.

Table 6. Obtained results when classifying the original dataset and when adding correlated attributes

Technique  Algorithm             Original (%)  Correlated Dataset (%)
Rules      ConjunctiveRule       61.19         7-62.50
           Decision Table        80.00         3-94.79
           DTNB                  80.00         5-91.67
           JRip                  80.00         4-95.83
           NNge                  73.57         6-79.17
           OneR                  73.33         8-70.83
           PART                  80.48         1-96.88
           Ridor                 78.81         2-93.75
           ZeroR                 36.19         9-31.25
Functions  Logistic              76.67         5-76.04
           MultilayerPerceptron  78.81         4-82.29
           RBFNetwork            76.67         3-80.21
           SimpleLogistic        76.19         2-82.29
           SMO                   76.67         1-83.33
Trees      BFTree                69.00         4-90.00
           FT                    79.00         5-91.00
           J48                   73.00         2-97.00
           J48graft              71.00         1-97.00
           Random Forest         75.00         3-98.00

Table 7. Results of the classifiers for Trees with unbalanced data

Class    Algorithm      Original (%)  40, 30 and 30%  50, 25 and 25%  60, 20 and 20%
Guard    BFTree         80.70         2-80.50         5-69.80         5-67.90
         FT             77.10         1-80.70         1-69.30         1-69.80
         J48            81.40         5-80.00         2-71.90         2-71.70
         J48graft       81.40         5-80.00         2-71.90         2-71.70
         Random Forest  80.70         4-80.20         4-70.00         4-69.00
Forward  BFTree         80.70         4-78.30         4-73.57         1-82.38
         FT             77.10         1-78.10         1-73.10         2-78.57
         J48            81.40         2-80.71         2-74.52         4-81.19
         J48graft       81.40         2-80.71         2-74.52         5-81.19
         Random Forest  80.70         4-78.33         5-71.90         3-81.67
Pivot    BFTree         80.70         3-80.95         5-76.43         2-74.76
         FT             77.10         1-80.00         1-76.43         1-77.14
         J48            81.40         4-78.33         2-78.33         3-75.00
         J48graft       81.40         4-78.33         2-78.33         3-75.00
         Random Forest  80.70         2-82.14         4-77.14         5-73.81
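How such unbalanced variants could be produced is illustrated by the following sketch. It is an assumption-based illustration rather than the authors' procedure: `players` and the column `position` are hypothetical names, and each class is resampled (with replacement where needed) to match a target proportion such as 60, 20 and 20.

```python
# Sketch: build an unbalanced variant of a dataset by resampling each class
# of the attribute to predict to a chosen share of the total.
import pandas as pd

def rebalance(df: pd.DataFrame, target: str, shares: dict, total: int,
              seed: int = 0) -> pd.DataFrame:
    """Return `total` rows where class `cls` occupies `shares[cls]` of them."""
    parts = [
        df[df[target] == cls].sample(n=int(share * total), replace=True,
                                     random_state=seed)
        for cls, share in shares.items()
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# Example: Guard prevails with the 60, 20 and 20 distribution.
unbalanced = rebalance(players, "position",
                       {"Guard": 0.60, "Forward": 0.20, "Pivot": 0.20},
                       total=420)
```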
Table 8. Results of the classifiers for Functions with unbalanced data and null values

Class    Algorithm             Original (%)  Unbalance and    Unbalance and    Unbalance and
                                             10% null values  30% null values  50% null values
Guard    Logistic              76.67         3-74.42          1-76.92          2-74.24
         MultilayerPerceptron  78.81         5-66.28          2-76.92          4-72.73
         RBFNetwork            76.67         4-70.93          4-69.23          5-65.15
         SimpleLogistic        76.19         2-74.42          5-66.67          1-74.24
         SMO                   76.67         1-75.58          3-71.79          3-72.73
Forward  Logistic              70.49         1-83.64          4-78.82          5-78.69
         MultilayerPerceptron  70.49         1-83.64          1-82.35          2-83.61
         RBFNetwork            63.93         5-70.91          5-70.59          4-75.41
         SimpleLogistic        75.41         3-85.45          3-85.88          3-88.52
         SMO                   72.13         4-81.82          2-83.53          1-91.80
Pivot    Logistic              71.05         1-73.91          4-60.78          4-55.26
         MultilayerPerceptron  84.21         4-78.26          2-76.47          3-71.05
         RBFNetwork            81.58         5-63.77          5-66.67          5-63.16
         SimpleLogistic        82.89         3-81.16          1-80.39          1-73.68
         SMO                   84.21         2-82.61          2-76.47          2-73.68

3.3 Results

As can be derived from the experiments with null values (Table 5), the global percentage of correct classification diminishes as null values are added to the dataset. However, when null values are homogeneously added to all classes, the results of the classifier are not significantly affected: the global percentages of correctly classified instances hardly vary. When the null values are concentrated in one class, instead, the percentage of correctly classified instances of the affected class clearly decreases.

This analysis was possible because we knew the original quantity of instances of each class in the data sources. This knowledge is not available when facing a real data source: when the quantity of instances of each class is unknown, only the percentage of null values over the total number of elements is known, and multiple combinations of null values can yield the same overall percentage. That is, adding 30% of null values to a single class (Guard) represents, according to our data sources, almost the same overall amount as adding 10% of null values to each class (Guard, Forward and Pivot), yet the results obtained were very different. Therefore, the quantity of correctly classified elements of each class has to be considered when checking the effect of null values. It can also be appreciated that as the quantity of null values grows in a non-homogeneous manner, another problem is introduced: unbalance between classes. Finally, the results show a notable impact of the unbalanced class in relation to the rest of the classes (see Table 5).

For the case of the experiments with unbalanced data (Table 7), the situation depends on the quantity of instances of each class that were correctly classified in the data sources. Results are correct if the distribution of the data among classes is as uniform as possible. Taking the observed results into account, as well as the quantity of combinations in the data sources, a possible solution is to determine whether data are unbalanced or not according to the quantity of the different existing classes. Unbalanced data have an impact at a global level. In some cases, the changes added to the dataset imply an over-fitting of the obtained classification model. A deeper analysis confirms that the FT algorithm always obtains the best results with regard to the other algorithms. This allows us to suggest, for this case study, the use of the FT algorithm when the data sources are unbalanced.
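Because global accuracy can mask a badly classified minority class, per-class figures such as those in Tables 5 and 7 are more revealing than a single percentage. A minimal sketch of this kind of diagnosis, again using DecisionTreeClassifier as an assumed stand-in for J48 and hypothetical `X`, `y` arrays holding the unbalanced dataset:

```python
# Sketch: report the confusion matrix and per-class recall, which expose the
# degradation of the under-represented class hidden by global accuracy.
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=10)
print(confusion_matrix(y, pred))        # rows: true class, columns: predicted
print(classification_report(y, pred))   # per-class recall flags the weak class
```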
For the case of correlation (Table 6), the percentages of classification were ostensibly better, since the attribute added as input is the most influential one in the data sources. A solution would be to create a mechanism that alerts the user when strongly correlated attributes are introduced as input/output. When applying the classification to unbalanced and incomplete data (Table 8), it can be noticed that the percentages of correct classification diminish, in general, as larger quantities of null values are introduced. Classification improves its results only in those cases in which the values were homogeneously added, confirming that the quantity of data of the unbalanced class has an influence.

We have detected three aspects related to the quality of data for classification techniques that should be discussed in the early stages of KDD. In short, the data selected for the application of a classification mining technique should (i) avoid correlated data, (ii) avoid highly unbalanced data and, finally, (iii) be selected according to the context of the problem. Correlated data increase the complexity of the resulting pattern without providing useful information. Unbalanced data provide non-reliable, over-fitted patterns. Finally, input data not selected by following a domain criterion can result in a more complex pattern, or even one with non-useful or non-novel information. The experiments accomplished allowed us to reach our objective: to know the behavior of the different subsets of classification techniques and algorithms on different specific datasets, allowing us to assert that it is essential to consider data quality in a systematic manner.

4 Conclusions and Future Work

A successful data mining process depends on the data quality of the sources in order to obtain reliable knowledge. Traditionally, a preprocessing step has been carried out to prepare data for data mining. This step has been widely based on data quality criteria related to data cleaning (such as duplicate detection). However, from a broader perspective, there exist other data quality criteria that must be considered, due to the fact that they affect the knowledge discovery. In this paper, we have conducted a set of experiments to show that completeness, correlation and balance of data also affect the data mining results. The results of these experiments would be useful for guiding users in selecting a proper algorithm in order to obtain reliable results regardless of the data quality of the sources. Therefore, our point of view on data quality for data mining is not to ameliorate the data quality of the sources, but to deal with the quality that is present in the data. In this way, the selection of a suitable algorithm to boost the data mining results should be based on the analysis of the existing data quality of the sources, instead of wasting resources in trying to improve data quality. This work is a first step towards considering, in a systematic and structured manner, data quality criteria for supporting and guiding data miners in obtaining reliable knowledge. Therefore, our immediate future work consists of conducting more experiments, focusing on the feasibility of our approach when dealing with complex data: not only because of the different kinds of data types (XML, multimedia, and so on), but also because of the high dimensionality of complex data.
Acknowledgments. This work has been partially supported by the MANTRA project from the Valencia Ministry of Education and Science (GV/2011/035) and from the University of Alicante (GRE09-17), the SERENIDAD project (PEII-110327-7035) from the Junta de Comunidades de Castilla-La Mancha, and the MESOLAP project (TIN2010-14860) from the Spanish Ministry of Education and Science. The authors would like to express their gratitude to Emilio Soler for his helpful comments and discussion.

References

1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB, pp. 586–597. Morgan Kaufmann, San Francisco (2002)
2. Berti-Equille, L.: Measuring and modelling data quality for quality-awareness in data mining. In: Guillet, F., Hamilton, H.J. (eds.) Quality Measures in Data Mining. Studies in Computational Intelligence, vol. 43, pp. 101–126. Springer, Heidelberg (2007)
3. Borkar, V., Deshmukh, K., Sarawagi, S.: Automatically extracting structure from free text addresses. IEEE Data Eng. Bull. 23(4), Special Issue on Data Cleaning (2000)
4. Chiang, R.H.L., Barron, T.M., Storey, V.C.: Reverse engineering of relational databases: Extraction of an EER model from a relational database. Data Knowl. Eng. 12(2), 107–142 (1994)
5. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1–16 (2007)
6. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data mining: Towards a unifying framework. In: KDD, pp. 82–88 (1996)
7. González-Aranda, P., Ruiz, E.M., Millán, S., Ruiz, C., Segovia, J.: Towards a methodology for data mining project development: The importance of abstraction. In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, C.J. (eds.) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol. 118, pp. 165–178. Springer, Heidelberg (2008)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2000)
9. Jarke, M., Vassiliou, Y.: Data warehouse quality: A review of the DWQ project. In: Strong, D.M., Kahn, B.K. (eds.) IQ, pp. 299–313. MIT, Cambridge (1997)
10. Kriegel, H.P., Borgwardt, K.M., Kröger, P., Pryakhin, A., Schubert, M., Zimek, A.: Future trends in data mining. Data Min. Knowl. Discov. 15(1), 87–97 (2007)
11. Object Management Group: Common Warehouse Metamodel Specification 1.1, http://www.omg.org/cgi-bin/doc?formal/03-03-02
12. Rahm, E., Do, H.: Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
13. Strong, D.M., Lee, Y.W., Wang, R.Y.: 10 potholes in the road to information quality. IEEE Computer 30(8), 38–46 (1997)
14. Strong, D.M., Lee, Y.W., Wang, R.Y.: Data quality in context. Commun. ACM 40(5), 103–110 (1997)
15. Troyanskaya, O.G., Cantor, M., Sherlock, G., Brown, P.O., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)