How Clean is Clean Enough? Determining the Most Effective Use of Resources in the Data Cleansing Process

Research-in-Progress

Jeffery Lucas, The University of Alabama, Tuscaloosa, AL 35487, [email protected]
Uzma Raja, The University of Alabama, Tuscaloosa, AL 35487, [email protected]
Rafay Ishfaq, Auburn University, Auburn, AL 36849, [email protected]

Abstract

Poor data quality can have a significant impact on system and organizational performance. With the substantial increase in data gathering and storage, the number of data sources that must be merged in data warehouse and Enterprise Resource Planning (ERP) implementations has grown significantly, making data cleansing during the implementation conversion increasingly difficult. In this research we expand the traditional Extract-Transform-Load (ETL) process to identify subprocesses between the main stages. We then identify the decisions and tradeoffs related to the allocation of time, resources, and accuracy constraints in the data cleansing process. We develop a mathematical model of the process to identify the optimal configuration of these factors in the data cleansing process, and we use empirical data to test the feasibility of the proposed model. Multiple domain experts validate the range of constraints used for model testing. Three different levels of cleansing complexity are tested in the preliminary analysis to demonstrate the use and validity of the modeling process.

Keywords: Data Cleansing, Integer Programming, ETL, Optimization

Introduction

Data Warehouse and ERP implementations involve large data integration projects. As the amount of data and the number of data sources increase, this integration process becomes increasingly complex. By recent estimates, 88% of all data integration projects either fail completely or significantly overrun their budgets (Marsh 2005). Even in projects that are completed, the post-conversion data quality is often less than desirable (Watts et al. 2009). Poor quality data within ERP and Data Warehouse systems can have a severe impact on organizational performance, including customer dissatisfaction, increased operational cost, less effective decision-making, and a reduced ability to make and execute strategy (Redman 1998). As data sets become larger and more complex, the risks associated with inaccurate data will continue to increase (Watts et al. 2009). While the costs of poor data quality can be difficult to measure, many organizations have determined they are significant enough to warrant data cleansing projects and the institution of Master Data Management roles and techniques. Gartner estimates that revenue for data quality tools will reach $2 billion by the end of 2017 (Friedman and Bitterer 2006). As sources of data grow and data sets become larger, the impact of poor data quality will continue to grow. ETL is a process in which multiple software tools are used to extract data from several sources, cleanse and customize it, and insert it into a data warehouse. To address the need for quality data, organizations often take on data cleansing activities as part of the conversion ETL process.
Given that high quality data is associated with perceived system value (Wixom and Watson 2001), it is important to fully understand the cleansing process adopted by organizations during the ETL process. During the cleansing process, the expert conducts data analysis, defines mapping rules, completes data verification, and manually cleans the data when necessary. The quality of these tasks depends upon the expertise of the individuals performing the cleansing (Galhardas et al. 2011). This presents the conversion expert with the question of determining the best use of experts' time. Depending on the risks and costs associated with dirty data compared with the costs of additional cleansing effort, a conversion expert could choose to build additional and potentially more complex data cleansing rules, to manually clean the data, or to channel the inaccurate data into the production environment. The costs and risks associated with these choices must be considered to determine a proper course of action and the best use of resources during the cleansing process.

In this paper, we investigate the interplay of these resource assignment and data quality issues. We draw from existing literature on data quality, the ETL process, and costs of quality. To the best of our knowledge, this is the first study to model the iterative nature of the data cleansing process within the ETL process and to use optimization techniques to identify the most suitable strategy for data cleansing. We present a detailed view of the data cleansing process that forms the framework for this study. We then discuss the methodology used in the study, followed by an integer programming mathematical model for the analysis. We conclude by discussing our plan for conducting the analysis.

Literature Review

Data cleansing, also called data cleaning or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality (Rahm and Do 2000). In general, the cleansing expert looks to determine the potential error types, search the data to identify instances of those errors, and correct the errors once found (Maletic and Marcus 2000). From a process standpoint, the cleansing expert conducts data analysis, defines the workflow and mapping rules, verifies the results, and conducts the transformation (Rahm and Do 2000). Thus the details of the process through which data cleansing is performed play an important role in the overall quality of the data. In this study we develop a granular view of the ETL process and its related stages.

Data cleansing, especially for large datasets, has significant costs associated with it. Poor quality of input data leads to increased time and effort expended on the cleansing process without any guarantee of a high quality outcome. Haug et al. (2011) identify four types of costs associated with poor data quality: direct and hidden operational costs, such as manufacturing errors, payment errors, long lead times, and employee dissatisfaction, as well as direct and hidden costs of poor strategic decisions, including poor planning, poor pricing policies, and decreased efficiency. When examining decisions made in ETL processes, cost is an important factor to consider, since these direct and indirect costs can undermine organizational policies on data accuracy and availability.
In this research we consider cost as a major factor in decisions regarding cleansing cycles and accuracy requirements. When assessing the state of data quality within an organization, both subjective and objective measures must be considered (Pipino et al. 2002). Subjective measures deal primarily with the data consumer's perception of data quality and can be captured with the use of a questionnaire. As mentioned earlier, these subjective measures are important because they drive user behavior. The most straightforward objective measure is a simple ratio, which measures the ratio of desired outcomes to total outcomes (Pipino et al. 2002). This study focuses on this objective measure of accuracy. The user's impact on the data cleansing process is determined using the ratio of inaccurate records to total records before and after the user's involvement. In this research we use accuracy requirements as one of the main constraints when making decisions regarding cleansing cycles.

Research Framework

Data conversions for ERP and data warehouse systems are extremely complex and multi-faceted. Current trends in data collection result in multiple sources of data that are available for use in enterprise systems.

Figure 1. Expanded ETL Process Outline

This data is merged and mapped into the proper format for loading into the target system, then loaded and verified: the classic ETL process. Most literature has focused either on the ETL process, including mapping rules and solving the merge/purge problem, or on data cleansing in existing systems. As mentioned in the introduction, these two processes are tightly bound during the conversion process and both are typically required for success. Much of the research on ETL takes a simplistic view of the flow of data from extract to transform to load. While this holds true at a high level, in reality there is a complex mesh of decisions to be made within these processes that affect the cost, time, and resource utilization of the data cleansing process in order to obtain a target accuracy (data quality). To understand the complex nature of the decisions within the ETL process, we take a closer look inside the black box. The actual process of data flow is iterative in nature and involves decisions regarding expert resource assignment to tasks, time allocation, costs, and accuracy compromises (Vassiliadis et al. 2009). A detailed view of the expanded ETL process is presented in Figure 1.

Process Outline

Prior to beginning the ETL process, the expert must first define the initial conversion requirements. This includes identifying sources of data and understanding the layout and schema of the extracted data. The expert will also define mapping and transformation rules and any data correction rules that are known at the start of the project. Prior to the initial extraction of data, the expert will also work to understand the users' trust level in each source of data (Vassiliadis et al. 2002). After developing the initial mapping and data cleansing rules, the data is ready for extraction. Following extraction, the data is loaded into the ETL tool. At this point the data is co-located, simplifying the data analysis and cleansing process. Data analysis and the transformation step form an iterative process.
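Each iteration can be evaluated with the simple-ratio accuracy measure described earlier: the ratio of inaccurate records to total records before and after the expert's involvement. The following is a minimal sketch of that measurement; the record counts used here are hypothetical and only illustrate the calculation.

```python
# Minimal sketch of the simple-ratio accuracy measure (Pipino et al. 2002)
# as applied in this study: accuracy = 1 - (inaccurate records / total records).
# The record counts below are hypothetical illustrations.

def accuracy(inaccurate_records: int, total_records: int) -> float:
    """Fraction of records that meet the desired (accurate) outcome."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1.0 - inaccurate_records / total_records

# Accuracy before and after the expert's involvement in one cleansing cycle.
before = accuracy(inaccurate_records=4_000, total_records=20_000)   # 0.80
after = accuracy(inaccurate_records=1_000, total_records=20_000)    # 0.95

print(f"before={before:.2f}, after={after:.2f}, improvement={after - before:.2f}")
```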
Initial data review and profiling produces simple frequency counts, population levels, simple numerical analysis, etc. At this point, experts can begin data editing, including checking for missing data, incorrect formats, and values outside the valid range. Simple cross-field editing can also be used to ensure effective data synthesis. After completing the initial data analysis, the data is prepared for more detailed analysis. The data is consolidated using the transformation and cleansing rules identified to this point. Once the data is consolidated, advanced techniques can be used to analyze it, including association discovery, clustering, and sequence discovery (Maletic and Marcus 2000). More detailed and specific business-level rules can also be applied at this point in the process. After several rounds of refining the extraction, defaulting, mapping, and consolidation rules through data analysis, the expert is ready to transform and load the data. While these transformations and mappings may be complex, they should be thoroughly defined and validated through the data analysis and cleansing process. Transforming the data into a format usable by the target system and loading it should then be a relatively straightforward process of applying the rules defined and tested in the previous steps.

Data Cleansing

As data errors are identified during data analysis, there are three choices to consider. First, the mapping, defaulting, and transformation rules can be updated to clean the data in an automated fashion; this could even include identifying new sources of conversion data. Second, the expert can choose to manually clean records. There are costs associated with cleansing data, including time, labor, and computing resources, and at some point cleansing data becomes more costly than the ongoing costs of dirty data (Haug et al. 2011). This leaves the expert with the third option of leaving the data as is and accepting the ongoing costs and risks associated with the erroneous records.

Several factors drive the decision to automate, manually clean, or leave the data as is, including population size, complexity of the cleansing logic, and the criticality of the erroneous data elements. The expert must first determine whether an automated approach is even feasible. In some cases the data needed to correct the erroneous records is not available and defaulting logic is not appropriate for the data set. If a suitable mapping rule or defaulting logic can be identified, the expert must weigh the cost of building the correction logic against the cost of cleansing manually. The cost of cleansing a dataset manually tends to be linear in nature: the larger the data set, the higher the cost. This may not be the case with automated cleansing, where the majority of the cost lies in developing and testing the cleansing logic. Once developed, the cleansing logic can be run against varying population sizes with little impact on the overall cost of the project. The expert must also consider the criticality of the erroneous records. How often will the data be used? How will the data be used? What risks are introduced to ongoing processing and decision making if the data is left in its current state? The expert must balance these ongoing risks and costs against the costs of cleansing the data and the resources available for the effort.
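The trade-off between the roughly linear cost of manual cleansing and the largely fixed development cost of automated cleansing can be sketched as a simple break-even comparison. The cost figures below are hypothetical illustrations, not values from the study, and the linear/fixed cost model is a simplifying assumption of this sketch.

```python
# Hedged sketch of the manual-vs-automated cleansing trade-off described above.
# Manual cost is modeled as roughly linear in the number of erroneous records;
# automated cost is dominated by a fixed effort to build and test the rule.
# All cost figures are hypothetical illustrations.

def manual_cost(error_records: int, cost_per_record: float = 2.0) -> float:
    """Cost of cleaning each erroneous record by hand."""
    return error_records * cost_per_record

def automated_cost(error_records: int,
                   rule_development_cost: float = 4_000.0,
                   run_cost_per_record: float = 0.01) -> float:
    """Fixed cost to develop/test a correction rule plus a small per-record run cost."""
    return rule_development_cost + error_records * run_cost_per_record

for n in (500, 2_000, 10_000):
    m, a = manual_cost(n), automated_cost(n)
    choice = "automate" if a < m else "clean manually"
    print(f"{n:>6} erroneous records: manual ${m:,.0f} vs automated ${a:,.0f} -> {choice}")
```

The third option, leaving the data dirty, would be preferred when both of these costs exceed the expected ongoing cost of the errors; the model in the next section handles that consideration indirectly, through the accuracy constraint rather than an explicit error-cost term.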
Methodology

The framework in the previous section presented the process view and the decisions involved in the ETL data cleansing process. The challenge is to balance the requirements for data accuracy against the resource limitations in terms of time, cost, and expertise. Timely delivery of high quality data while staying within budget is the ideal goal for the process. The effectiveness of the process depends upon decisions about the level of accuracy, the allocation of experts to cleansing tasks, and the time allotted to each task. To evaluate the decisions and the tradeoffs involved in the data cleansing process, we use a binary integer programming methodology. This methodology allows us to formulate the system flows, develop a mathematical model, and then experimentally evaluate various decisions and their interactions. Such a technique is useful for planning decisions in practical situations. The goal is to identify the optimal configuration of the ETL process.

The processes, stages, tasks, and decisions in the granular ETL process discussed in the previous section can be represented as a network graph consisting of sets of nodes and links, as shown in Figure 2. The set of nodes represents the different stages of the ETL process, whereas the set of links represents the process flow. The stages in the ETL process fall into two groups: main stages (extract, transform, and load) and intermediate stages (evaluate, re-work). The sequence of stages through which data flows in the ETL process constitutes a path. There are multiple paths in the graph, each organized as a sequence of stages through which data is processed. All paths begin at the Start node and terminate at the End node of the graph. Each stage processes the data using certain resources, which vary in their expertise and capabilities.

Figure 2. Network Graph Representation of ETL Process

Each stage is represented by a node in the graph together with its attributes. Each node j has the following attributes: unit processing time $t_j$, unit cost $c_j$, and the ability to improve data accuracy $\pi_j$. The expertise of a resource r processing data at stage j is given by $e_{rj}$. The objective of this model is to identify the stages through which data will be processed in order to achieve system-level goals related to data quality, processing time, and project cost. These system-level goals are represented by: $\sigma$ (target overall data quality), $\tau$ (available total processing time), and $\mu$ (available monetary resources). A sequence of stages through which data is processed is referred to as a feasible path of the ETL process. The sequence of stages in a feasible path is represented by the variables $x_{ij}$. Each feasible path has a corresponding total processing time and cost, incurred in achieving the target system-level goals. Note that all feasible paths of the ETL process must contain the main stages (extract, transform, and load). Each intermediate stage (evaluate, re-work) of data processing added to a path improves data accuracy, although it also increases the total processing time and cost. The optimal path of the ETL process is one that achieves the targeted accuracy level using the fewest intermediate data processing stages.
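Before presenting the formulation, the sketch below illustrates how a candidate path through such a graph could be represented and scored against the system-level targets $\sigma$, $\tau$, and $\mu$. The stage names and attribute values are hypothetical placeholders, and treating per-stage accuracy improvements as additive is a modeling assumption of this sketch, consistent with the accuracy constraint in the formulation that follows.

```python
# Minimal sketch of the network-graph view: each stage (node) j carries a
# processing time t_j, a cost c_j, and an accuracy improvement pi_j, and a
# candidate path is scored against sigma (accuracy), tau (time), mu (budget).
# Stage names and attribute values are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Stage:
    t: float    # unit processing time t_j
    c: float    # unit cost c_j
    pi: float   # improvement in data accuracy pi_j

stages = {
    "extract":          Stage(t=12, c=2.6,  pi=0.04),
    "evaluate_extract": Stage(t=26, c=6.7,  pi=0.33),   # intermediate stage
    "rework_extract":   Stage(t=48, c=17.5, pi=0.52),   # intermediate stage
    "transform":        Stage(t=80, c=17.2, pi=0.43),
    "load":             Stage(t=56, c=14.1, pi=0.08),
}

def path_metrics(path):
    """Total processing time, cost, and accuracy improvement along a path."""
    time = sum(stages[s].t for s in path)
    cost = sum(stages[s].c for s in path)
    accuracy = sum(stages[s].pi for s in path)
    return time, cost, accuracy

def feasible(path, sigma, tau, mu):
    """A path is feasible if it meets the accuracy target within time and budget."""
    time, cost, accuracy = path_metrics(path)
    return accuracy >= sigma and time <= tau and cost <= mu

print(feasible(["extract", "rework_extract", "transform", "load"],
               sigma=0.95, tau=528, mu=60.0))
```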
Next we develop a mathematical formulation of the network graph to represent the ETL process. The network graph representation serves as the foundation for building the mathematical representation of the system. This formulation captures the relationships in the ETL process in the form of system parameters and decision variables. Solving the formulation yields the values of the decision variables, which are interpreted as decisions that identify the configuration of the ETL process, i.e., the main and intermediate stages through which data will be processed. Given system-level targets for data accuracy, there will be tradeoffs between allocating experts to cleansing tasks, the time available to perform a process, and the nature of the analysis performed in the sub-processes. The mathematical representation of the ETL process and the different tradeoffs involved in the selection of data processing stages is developed using the following notation:

Sets:
S = stages
A = links between stages
R = resources

Parameters:
$\pi_j$ = improvement in data accuracy after completing stage j
$c_j$ = data processing cost at stage j
$t_j$ = processing time at stage j
$e_{rj}$ = expertise of resource r to process data in stage j
$\tau$ = available ETL completion time
$\mu$ = available ETL monetary budget
$\sigma$ = data accuracy requirement at the end of the ETL process
$b_i$ = constant; $b_i = 1$ if i is the start node, $-1$ if i is the end node, 0 otherwise

Decision Variables:
$x_{ij}$ = 1 if data is transferred to stage j after stage i, 0 otherwise

Model:

Minimize
\[ \sum_{(i,j)\in A} x_{ij} \tag{1} \]

Subject to
\[ \sum_{j:(i,j)\in A} x_{ij} - \sum_{j:(j,i)\in A} x_{ji} = b_i \quad \forall i \in S \tag{2} \]
\[ \sum_{j:(i,j)\in A} x_{ij} \le 1 \quad \forall i \in S \tag{3} \]
\[ \sum_{(i,j)\in A}\sum_{r\in R} e_{rj}\, t_j\, x_{ij} \le \tau \tag{4} \]
\[ \sum_{(i,j)\in A}\sum_{r\in R} e_{rj}\, c_j\, x_{ij} \le \mu \tag{5} \]
\[ \sum_{(i,j)\in A}\sum_{r\in R} e_{rj}\, \pi_j\, x_{ij} \ge \sigma \tag{6} \]
\[ x_{ij} \in \{0,1\} \quad \forall (i,j)\in A \]

The objective function of the mathematical formulation is given by (eq-1). The output of this model yields an optimal path that identifies the intermediate stages used in the ETL process. The constraints in (eq-2) implement data flow requirements for feasible paths from the Start node to the End node of the ETL network graph. The constraints in (eq-3) require that when there are multiple choices for the next stage of the ETL process, only one stage is selected. The limitation on total available time for the ETL process is implemented by constraint (eq-4). Constraint (eq-5) limits the total cost of the ETL process to be within budget. Constraint (eq-6) implements the requirement that a feasible path consist of a sufficient number of intermediate (and main) stages to ensure that the ETL process achieves the target data accuracy level.

Preliminary Analysis

The optimization model developed in the previous section served as the basis of our preliminary experimental study. The formulation was coded in AMPL, using its interactive command environment for setting up optimization models, and solved using the built-in library of solution algorithms available in IBM CPLEX (Fourer et al. 2002). A full factorial experimental study was then set up to examine the interplay of time, budget, expertise, accuracy, and data size. The output for each combination was the resulting process configuration. An instance of the formulation represents a given scenario solved by the solver, and each instance is determined by the size of the data.

Data Description

To test the feasibility of the model, data was generated for a model run. At least one instance of data is required for every combination of the factorial design. Because the model is deterministic, the output changes only when parameter values change; by examining boundary values we identify where the effects change.
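To make the use of such an instance concrete, the sketch below sets up and solves a small, entirely hypothetical instance of the formulation in (eq-1) through (eq-6), using the open-source PuLP modeler as a stand-in for the AMPL/CPLEX toolchain reported in the study. Resource expertise $e_{rj}$ and the single-successor constraint (eq-3) are omitted for brevity, and all stage names, links, and parameter values are illustrative assumptions rather than data from the study.

```python
# Hedged sketch of the path-selection formulation (eq-1)-(eq-6) on a small,
# hypothetical instance. Stage attributes are (t_j, c_j, pi_j).

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

stages = {
    "start":     (0,  0.0,  0.00),
    "extract":   (12, 2.6,  0.04),
    "evaluate":  (26, 6.7,  0.33),   # optional intermediate stage
    "rework":    (48, 17.5, 0.52),   # optional intermediate stage
    "transform": (80, 17.2, 0.43),
    "load":      (56, 14.1, 0.08),
    "end":       (0,  0.0,  0.00),
}
links = [("start", "extract"), ("extract", "transform"),
         ("extract", "evaluate"), ("evaluate", "transform"),
         ("evaluate", "rework"), ("rework", "transform"),
         ("transform", "load"), ("load", "end")]

sigma, tau, mu = 0.95, 528, 60.0          # accuracy, time, and budget targets
b = {node: 0 for node in stages}
b["start"], b["end"] = 1, -1              # flow source and sink

prob = LpProblem("etl_path", LpMinimize)
x = LpVariable.dicts("x", links, cat=LpBinary)        # x_ij = 1 if link is used

prob += lpSum(x[l] for l in links)                                      # (eq-1)
for i in stages:                                                        # (eq-2)
    outflow = lpSum(x[(u, v)] for (u, v) in links if u == i)
    inflow = lpSum(x[(u, v)] for (u, v) in links if v == i)
    prob += outflow - inflow == b[i]
prob += lpSum(stages[j][0] * x[(i, j)] for (i, j) in links) <= tau      # (eq-4)
prob += lpSum(stages[j][1] * x[(i, j)] for (i, j) in links) <= mu       # (eq-5)
prob += lpSum(stages[j][2] * x[(i, j)] for (i, j) in links) >= sigma    # (eq-6)

prob.solve()
print([l for l in links if value(x[l]) > 0.5])   # links on the selected path
```

On this toy instance the solver selects the path that passes through both intermediate stages, since skipping either one leaves the accuracy target unmet while the time and budget limits still hold.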
The factors and their interaction effects determine the outcome, i.e., the optimal configuration of the ETL process and its stages. Experts in the field were recruited to provide insights into realistic levels for the parameters listed in the previous section. The first author, who has over 20 years of experience in the data conversion domain, proposed the initial accuracy rates. Experts with at least 15 years of experience were then contacted and sent a description of the problem. The data was based on real problem sets from the insurance and benefits industries. Real data samples were used to identify realistic data sizes and the inaccuracies in the input population. The experts validated the realistic accuracy expectations, the competencies of experts for tasks, and the time and budget constraints. The error rates and accuracy estimates were also based on actual numbers from the cleaning process. Three experts independently verified the numbers and proposed modifications based on their experience; the suggested modifications were within the margin of error. All three experts verified the final numbers to ensure there was no bias.

To narrow in on sample data, we focused on a single domain for the conversion and a single use for the data post conversion. After narrowing our focus, we were able to work with industry experts to define "typical" values for a conversion. The focus of our analysis is a Defined Benefit (DB) conversion. A DB plan is more commonly referred to as a pension plan. DB conversions tend to be very complex for several reasons. The plan administrator must track data on plan participants for 30 or more years for use in a calculation at retirement, so there is a vast amount of data to track over a very long period of time. The problem is compounded by the fact that very few participants review their data until retirement, so issues are not found and corrected in a timely manner. Plan sizes can also vary greatly, from only a few hundred participants at small organizations to hundreds of thousands of participants at Fortune 500 companies or government entities.

For the purpose of this analysis, we focused on plans with roughly 20,000 participants. The 20k participants represent the total number of participants (current employees, former employees, and beneficiaries) in the conversion; it does not represent the true volume of data. For each participant the plan administrator must convert several types of data (tables in the database), and each type of data has several characteristics (fields on the table). Consider an example: for each person we would need personal indicative data and pay data. Personal indicative data likely contains fields such as first name, last name, and birth date. Pay data would include the type of pay, frequency of pay, the date it was paid, and the amount. A conversion of 20k participants can therefore involve a very large amount of data: a plan that pays its employees twice a month, for example, and has 10k active employees will generate 240k rows of pay data per year. The complexity of a DB conversion can be driven by several other factors as well: the number of unions and the complexity of their rules, the number of mergers and acquisitions the organization has been through, and the number of past system conversions, to name a few. For the purpose of this effort we defined three broad complexity levels: Simple, Average, and Complex.
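The record-volume arithmetic in the paragraph above (10k active employees paid twice a month yields 240k pay rows per year) can be sketched as follows; the pay frequency and years of history used here are illustrative assumptions, not figures from the study.

```python
# Hedged sketch of the record-volume arithmetic: participant counts translate
# into far more physical rows once per-period history tables are included.
# Pay frequency and years of history below are illustrative assumptions.

def pay_rows(active_employees: int, pay_periods_per_year: int, years_of_history: int) -> int:
    """Rows of pay history generated for the conversion."""
    return active_employees * pay_periods_per_year * years_of_history

# Example from the text: 10,000 active employees paid twice a month, one year.
print(pay_rows(10_000, 24, 1))    # 240,000 rows of pay data per year

# Scaled (hypothetically) to 30 years of tracked history.
print(pay_rows(10_000, 24, 30))   # 7,200,000 rows for pay history alone
```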
The post-conversion use of the data in our analysis is real-time online retirement processing. This is important because it drives the requirement for data quality at the end of the ETL process; if the data were used for reporting, or even for offline retirement processing, the required quality could be significantly lower. Since transactions are taken online, the user expects the system to work without interruption. In this environment, invalid data is tracked at the participant level. We may convert thousands of individual fields for a participant (the entire pay history, hours history, etc.), but if anything is wrong in any of the fields, we cannot process the transaction for that participant. The larger the participant count, the higher the data accuracy must be. If the plan has 5,000 participants and a 5% error rate, there are 250 participants that will need to be handled offline when they try to retire. If the plan has 500,000 participants and a 5% error rate, there are 25,000 participants to handle in a manual fashion.

The type of error dictates the experience level of the ETL team member needed to resolve it. The four types of errors considered in the analysis are:

Type 1 - Incorrect data that requires very little effort to clean up. An example is a population that is missing a birth date where the value is available in other data we have, so the correction rule might be as simple as using the normal mapping if the data is there, else using an alternate mapping.

Type 2 - Incorrect data that requires effort to correct, but does not require specialized knowledge or skills and does not require overly complicated code. Using the missing birth date example, if we had the data to correct the birth date but had to evaluate several pieces of data to determine which source to use, it would be a Type 2 error.

Type 3 - Incorrect data that requires significant effort to correct, or a specialized skill set or knowledge, e.g., an incorrect service amount that must be calculated at conversion.

Type 4 - Incorrect data that simply cannot be cleaned in an automated fashion. The data does not exist or can only be found manually with extensive research.

Table 1. Percentage of error types in each of the three conversions

Conversion   Total Errors   Type 1   Type 2   Type 3   Type 4
Simple            20%         10%       7%       2%       1%
Average           40%         20%      10%       6%       4%
Complex           60%         25%      18%      10%       7%

The three conversion types are expressed in terms of the percentages of the four error types, as described in Table 1. The first column defines the conversion type: Simple, Average, or Complex. Next is the percentage of total errors. These figures were derived from actual empirical data and then validated by the three experts to ensure that the case was not atypical. The remaining columns give the percentages of the individual error types listed above.
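Applied to the 20,000-participant plan profile used in the study, the Table 1 percentages size the cleansing workload per error type as sketched below. Treating errors as participant-level counts is a simplifying assumption of this illustration.

```python
# Sketch applying the Table 1 error-type percentages to the 20,000-participant
# plan profile to size the cleansing workload per error type. Treating errors
# as participant-level counts is a simplifying assumption of this illustration.

PARTICIPANTS = 20_000

error_rates = {                      # percentages from Table 1
    "Simple":  {"Type 1": 0.10, "Type 2": 0.07, "Type 3": 0.02, "Type 4": 0.01},
    "Average": {"Type 1": 0.20, "Type 2": 0.10, "Type 3": 0.06, "Type 4": 0.04},
    "Complex": {"Type 1": 0.25, "Type 2": 0.18, "Type 3": 0.10, "Type 4": 0.07},
}

for conversion, rates in error_rates.items():
    counts = {t: int(PARTICIPANTS * r) for t, r in rates.items()}
    total = sum(counts.values())
    print(f"{conversion:>8}: total erroneous participants = {total:>6,}  {counts}")
```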
Table 2. Parameters for each conversion type and the estimated constraints

                              Simple Conversion        Average Conversion       Complex Conversion
Process  Node  Resource       t_j    c_j    pi_j       t_j    c_j    pi_j       t_j    c_j    pi_j
E        3     Programmer     12     2.58   0.038      20     4.30   0.035      24     5.16   0.035
E        4     Analyst        6      1.50   0.013      12     3.01   0.024      16     4.02   0.020
E        6     Analyst        24     6.02   0.047      8      2.01   0.044      8      2.01   0.044
E        7     Lead           56     20.44  0.042      72     26.28  0.035      8      2.92   0.044
T        10    Analyst        26     6.68   0.333      40     10.04  0.522      53     13.39  0.436
T        11    Lead           48     17.52  0.524      120    43.80  0.522      171    62.49  0.532
T        13    Programmer     80     17.20  0.429      160    34.40  0.191      256    55.04  0.197
T        14    Analyst        120    30.12  0.405      168    42.17  0.119      240    60.24  0.117
L        17    Analyst        128    32.12  0.202      56     14.06  0.080      80     20.08  0.088
L        18    Analyst        80     20.08  0.081      56     14.06  0.080      80     20.08  0.088
L        20    Lead           240    87.60  0.073      56     20.44  0.054      80     29.20  0.066
L        21    Lead           240    87.60  0.069      240    87.60  0.080      80     29.20  0.066

Constraints: Simple: tau = 528 hrs, sigma = 0.95, mu = $9.22; Average: tau = 880 hrs, sigma = 0.95, mu = $20.39; Complex: tau = 1584 hrs, sigma = 0.95, mu = $59.68

Parameter Description

The parameters identified in the model in the previous section were defined based on the empirical data for each of the three conversion types. The target accuracy, budget (cost), time permitted for the conversion process, and the types of resources available were also derived from the sample data. Since the three experts were from different firms, they were able to independently validate the estimates for the sample data to ensure that they were reasonable. The estimates of the accuracy of the cleansing process and the time taken for manual and automated cleaning were also extracted from the sample datasets used for the study. It should be noted that while our network implementation of the nodes in Figure 2 shows only two stages for each of the ETL processes, in reality there can be multiple iterations within each stage before the data is passed on to the next process. The parameters for each of the three conversion types, along with the percentage improvement at each node, are presented in Table 2.

Results and Discussion

The data described in the previous section was used in the optimization model, and the output of the model was used to record the optimal path. The optimal path identifies the stages of the ETL process needed to achieve the target accuracy level while conforming to the budgetary and processing time constraints.

The optimal path for the Simple conversion is: Start-1-2-8-9-11-12-13-15-16-End
The optimal path for the Average conversion is: Start-1-2-3-5-6-8-9-10-12-13-15-16-17-19-21-End
The optimal path for the Complex conversion is: Start-1-2-3-5-6-8-9-11-12-13-15-16-17-19-20-End

The numbers in the optimal paths represent the nodes in Figure 2, and the order of the nodes represents the sequence of stages in the optimal ETL process. For the simple conversion, the optimal path includes no data cleaning in the extract phase, while resolving most of the errors in the transform phase through a combination of manual and automated cleansing processes. This could be indicative of the nature of the conversion task, which makes cleansing improvement during the extract and load phases marginal. For conversions of average complexity, the optimal path indicates that cleansing at each node improves accuracy significantly. This result also shows that manual processing is preferable to automated processing for data errors of higher complexity.
For the most complex conversion, the optimal path is similar to that of the average conversion, with the difference that an additional cleansing step is used in the transform stage. In this step, automated cleansing provided a better data cleaning option than the manual process. Hence, the nature and type of errors in a complex conversion make it different from the average and simple cases. Furthermore, analysis of the results showed that the budget and available time for the three types of conversions also play key roles in the composition of the optimal paths that satisfy the data cleansing requirements.

Conclusion

In this research in progress, we develop an optimization model of the ETL process. We identify the major stages and the sub-processes within the ETL process that impact the quality of data cleansing. We use the parameters of time, expert resources, budget, and target data accuracy requirements to formulate a mathematical model of the system. We then use empirical data for conversions of three different complexity levels from one domain. The estimates of the model parameters are derived from the sample data and validated independently by three industry experts. Preliminary results show how the complexity of the task, accuracy requirements, budget, and available time can impact the optimal path for data cleansing. Sometimes cleansing upfront in the extract process is not the best solution, and vice versa. This preliminary analysis establishes the feasibility of the approach. Future studies will use additional data sources and include additional sub-stages within the ETL process. This research will help us understand the various facets of the data cleansing process in ETL and provide optimal combinations of decisions regarding resource allocation, budget, and data accuracy.

References

Fourer, R., Gay, D. M., and Kernighan, B. W. 2002. AMPL: A Modeling Language for Mathematical Programming. Duxbury Press / Brooks/Cole Publishing Company.
Friedman, T., and Bitterer, A. 2006. Magic Quadrant for Data Quality Tools. Gartner Group, 184.
Galhardas, H., Lopes, A., and Santos, E. 2011. "Support for User Involvement in Data Cleaning," in Proceedings of the 2011 International Conference on Data Warehousing and Knowledge Discovery, pp. 136-151.
Haug, A., Zachariassen, F., and Liempd, D. 2011. "The Costs of Poor Data Quality," Journal of Industrial Engineering and Management (4:2), pp. 168-193.
Lee, Y., Strong, D., Kahn, B., and Wang, R. 2002. "AIMQ: A Methodology for Information Quality Assessment," Information & Management (40:2), pp. 133-146.
Maletic, J., and Marcus, A. 2000. "Data Cleansing: Beyond Integrity Analysis," in Proceedings of the 2000 Conference on Information Quality, pp. 200-209.
Marsh, R. 2005. "Drowning in Dirty Data? It's Time to Sink or Swim: A Four-Stage Methodology for Total Data Quality Management," Journal of Database Marketing & Customer Strategy Management (12:2), pp. 105-112.
Pipino, L., Lee, Y., and Wang, R. 2002. "Data Quality Assessment," Communications of the ACM (45:4), pp. 211-218.
Rahm, E., and Do, H. H. 2000. "Data Cleaning: Problems and Current Approaches," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (23:4), pp. 3-13.
Redman, T. 1998. "The Impact of Poor Data Quality on the Typical Enterprise," Communications of the ACM (41:2), pp. 79-82.
Vassiliadis, P., Simitsis, A., and Baikousi, E. 2009. "A Taxonomy of ETL Activities," in Proceedings of the 12th ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 25-32.
Watts, S., Shankaranarayanan, G., and Even, A. 2009. "Data Quality Assessment in Context: A Cognitive Perspective," Decision Support Systems (48:1), pp. 202-211.
Wixom, B., and Watson, H. 2001. "An Empirical Investigation of the Factors Affecting Data Warehousing Success," MIS Quarterly (25:1), pp. 17-41.