How Clean is Clean Enough? Determining the
Most Effective Use of Resources in the Data
Cleansing Process
Research-in-Progress
Jeffery Lucas
The University of Alabama,
Tuscaloosa, AL 35487
[email protected]
Uzma Raja
The University of Alabama,
Tuscaloosa, AL 35487
[email protected]
Rafay Ishfaq
Auburn University,
Auburn, AL 36849
[email protected]
Abstract
Poor data quality can have a significant impact on system and organizational
performance. With the rapid growth in data gathering and storage, the number of data
sources that must be merged in data warehouse and Enterprise Resource Planning (ERP)
implementations has increased significantly. This makes data cleansing, performed as
part of the implementation conversion, increasingly difficult. In this research we
expand the traditional Extract-Transform-Load (ETL) process to identify sub-processes
between the main stages. We then identify the decisions and tradeoffs related to the
allocation of time and resources and to accuracy constraints in the data cleansing
process. We develop a mathematical model of the process to identify the optimal
configuration of these factors in the data cleansing process. We use empirical data
to test the feasibility of the proposed model. Multiple domain experts validate the range of
constraints used for model testing. Three different levels of cleansing complexity are
tested in the preliminary analysis to demonstrate the use and validity of the modeling
process.
Keywords: Data Cleansing, Integer Programming, ETL, Optimization
Introduction
Data Warehouse and ERP implementations involve large data integration projects. As the amount of data
and the number of data sources increase, this integration process becomes increasingly complex. According
to recent estimates, 88 percent of all data integration projects either fail completely or significantly
overrun their budgets (Marsh 2005). Even in projects that are completed, the post-conversion data
quality is often less than desirable (Watts et al. 2009). Poor quality data within ERP and Data Warehouse
systems can have a severe impact on organizational performance, including customer dissatisfaction,
increased operational cost, less effective decision-making and a reduced ability to make and execute
strategy (Redman 1998). As data sets become larger and more complex, the risks associated with
inaccurate data will continue to increase (Watts et al. 2009).
While the costs of poor data quality can be difficult to measure, many organizations have determined that
they are significant enough to justify data cleansing projects and the institution of Master Data Management
roles and techniques. Gartner estimates that revenue for data quality tools will reach $2 billion
by the end of 2017 (Friedman and Bitterer 2006). As sources of data grow and data sets become larger,
the impacts of poor data quality will continue to grow.
ETL is a process in which multiple software tools are utilized for the extraction of data from several
sources, their cleansing, customization and insertion into a data warehouse. To address the need for
quality data, organizations will often take on data cleansing activities as part of the conversion ETL
process. Given that high quality data is associated with perceived system value (Wixom and Watson 2001), it is
important to fully understand the cleansing process adopted by organizations during the ETL process.
During the cleansing process, the expert conducts data analysis, defines mapping rules, completes data
verification and manually cleans the data when necessary. The quality of these tasks depends upon the
expertise of the individuals performing the cleansing (Galhardas et al. 2011). This presents the conversion
expert with the question of determining the best use of experts’ time.
Depending on the risks and costs associated with dirty data compared with the costs of additional
cleansing efforts, a conversion expert could choose to build additional and potentially more complex data
cleansing rules, to manually clean the data, or to allow the inaccurate data into the production
environment. The costs and risks associated with these choices must be considered to
determine a proper course of action and the best use of resources during the cleansing process.
In this paper, we investigate the interplay of these resource assignment and data quality issues. We draw
from existing literature on data quality, the ETL process and costs of quality. To the best of our knowledge,
this is the first study to model the iterative nature of the data cleansing process within the ETL process and
to use optimization techniques to identify the most suitable strategy for data cleansing. We present a detailed
view of the data cleansing process that forms the framework for this study. We then discuss the methodology
used in the study, followed by an integer programming mathematical model for the analysis.
We conclude by discussing our plan for conducting the analysis.
Literature Review
Data cleansing, also called data cleaning or scrubbing, deals with detecting and removing errors and
inconsistencies from data in order to improve the quality of data (Rahm and Do 2000). In general, the
cleansing expert looks to determine the potential error types, searches the data to identify instances of
those errors, and corrects the errors once found (Maletic and Marcus 2000). From a
process standpoint, the cleansing expert conducts data analysis, defines the workflow and mapping
rules, verifies the results and conducts the transformation (Rahm and Do 2000). Thus, the details of the
process through which data cleansing is performed play an important role in the overall quality of the
data. In this study we develop a granular view of the ETL process and the related stages.
Data cleansing, especially for large datasets, has significant costs associated with it. Poor quality of input
data leads to increased time and effort expended on the cleansing process without guarantees of a high
quality outcome. There are four types of costs associated with poor data quality (Haug et al. 2011):
direct and hidden operational costs, such as manufacturing errors, payment errors, long lead times and
employee dissatisfaction, and direct and hidden costs affecting strategic decisions,
including poor planning, poor price policies and decreased efficiency (Haug et al. 2011). When examining
decisions made in ETL processes, cost is an important factor to consider, since these direct and indirect
costs can undermine organizational policies on data accuracy and availability. In this research we
consider cost as a major factor in decisions regarding cleansing cycles and accuracy requirements.
When assessing the state of data quality within an organization, both subjective and objective measures
must be considered (Pipino et al. 2002). Subjective measures deal primarily with the data consumer's
perception of the data quality. These subjective measures can be captured with the use of a questionnaire.
As mentioned earlier, these subjective measures are important as they drive user behavior. The most
straightforward objective measure is a simple ratio, which measures the ratio of desired
outcomes to total outcomes (Pipino et al. 2002). This study focuses on this objective measure of
accuracy. The user’s impact on the data cleansing process will be determined using the ratio of
inaccurate records to total records before and after the user’s involvement. In this research we use
accuracy requirements as one of the main constraints when making decisions regarding cleansing cycles.
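As a concrete illustration of the simple-ratio measure (a minimal sketch, not part of the study; the sample records and the is_ok check are hypothetical), the accuracy before and after a cleansing pass could be computed as follows:

    def accuracy_ratio(records, is_accurate):
        # Simple ratio: desired (accurate) outcomes divided by total outcomes
        return sum(1 for r in records if is_accurate(r)) / len(records) if records else 1.0

    # Hypothetical before/after snapshot of a cleansing pass on two records
    raw = [{"id": 1, "birth_date": "1960-04-01"}, {"id": 2, "birth_date": None}]
    cleansed = [{"id": 1, "birth_date": "1960-04-01"}, {"id": 2, "birth_date": "1958-09-15"}]
    is_ok = lambda r: r["birth_date"] is not None

    print(accuracy_ratio(raw, is_ok), accuracy_ratio(cleansed, is_ok))  # 0.5 -> 1.0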
Research Framework
Data conversions for ERP and data warehouse systems are extremely complex and multi-faceted. Current
trends in data collection result in multiple sources of data that are available for use in enterprise systems.
This data is merged and mapped into the proper format for the target system, then loaded and verified:
the classic ETL process.

Figure 1. Expanded ETL Process Outline
Most literature has focused either on the ETL process, including mapping rules and solving the merge/purge
problem, or on data cleansing in existing systems. As mentioned in the introduction, these two processes are
tightly bound during the conversion process, and both are typically required for a successful conversion.
Much of the research on ETL considers a simplistic view of the flow of data from the extract to the transform
to the load process. While this holds true at a high level, in reality there is a complex mesh of decisions to be
made within these processes that impact the cost, time and resource utilization of the data cleansing
process required to obtain a target accuracy (data quality).
To understand the complex nature of the decisions within the ETL process, we take a closer look into the
black box. The actual process of data flow is iterative in nature and involves decisions regarding expert
resource assignment to tasks, time allocation, costs and accuracy compromises (Vassiliadis et al. 2009). A
detailed view of the expanded ETL process is presented in Figure 1.
Process Outline
Prior to beginning the ETL process, the expert must first define the initial conversion requirements. This
includes identifying sources of data and understanding the layout and schema of the extracted data. The
expert will also define mapping and transformation rules and any data correction rules that are known at
the start of the project. Prior to the initial extraction of data, the expert will also work to understand the
users’ trust level with each source of data (Vassiliadis et al. 2002).
After developing the initial mapping and data cleansing rules, the data is ready for extraction. Following
extraction, the data is loaded into the ETL tool. At this point the data is co-located, simplifying the data
analysis and cleansing process.
Data analysis and transformation form an iterative process. Initial data review and profiling
produces simple frequency counts, population levels, simple numerical analysis, etc. At this point, experts
can begin data editing including checking for missing data, incorrect formats, values outside of the valid
range, etc. Simple cross-field editing can also be used to ensure effective data synthesis.
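The data editing checks described above might look like the following minimal sketch (the field names, date format and valid pay range are hypothetical, not drawn from the study):

    import re

    def edit_checks(rec):
        # Return a list of issues found in one record (hypothetical fields and rules)
        issues = []
        if not rec.get("birth_date"):
            issues.append("missing birth_date")                   # missing data
        elif not re.fullmatch(r"\d{4}-\d{2}-\d{2}", rec["birth_date"]):
            issues.append("birth_date not in YYYY-MM-DD format")  # incorrect format
        if not 0 <= rec.get("annual_pay", 0) <= 10_000_000:
            issues.append("annual_pay outside valid range")       # value outside the valid range
        if rec.get("hire_date") and rec.get("birth_date") and rec["hire_date"] < rec["birth_date"]:
            issues.append("hired before birth")                   # simple cross-field edit
        return issues

    print(edit_checks({"birth_date": "04/01/1960", "annual_pay": -5, "hire_date": "1980-01-01"}))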
After completing the initial data analysis, the data is prepared for more detailed analysis. The data is
consolidated using the transformation and cleansing rules identified to this point. Once the data is
consolidated, advanced techniques can be used to analyze the data. Some of the techniques used during
this phase can include association discovery, clustering and sequence discovery (Maletic and Marcus
2000). More detailed and specific business level rules can also be applied at this point in the process.
After several rounds of refining the extraction, defaulting, mapping and consolidation rules through data
analysis the expert is ready to transform and load the data. While these transformations and mappings
may be complex, they should be thoroughly defined and validated through the data analysis and cleansing
process. Transforming the data into a format usable by the target system and loading into the target
system should be a relatively straightforward process of applying the rules defined and tested in the
previous steps.
Data Cleansing
As data errors are identified during data analysis, there are three choices to consider. The mapping,
defaulting and transformation rules can be updated to clean the data in an automated fashion. This could
even include identifying new sources of conversion data. The expert can choose to manually clean
records. There are costs associated with cleansing data including time, labor and computing resources. At
some point cleansing data becomes more costly than the ongoing costs of dirty data (Haug et al. 2011).
This leaves the expert with the third option of leaving the data as is and accepting the ongoing costs and
risks associated with the erroneous records.
There are several factors that drive the decision to automate, manually clean or leave the data as is
including population size, complexity of cleansing logic and the criticality of the erroneous data elements.
The expert must first determine if an automated approach is even feasible. In some cases the data needed
to correct the erroneous records is not available and the defaulting logic is not appropriate for the data
set. If a suitable mapping rule or defaulting logic can be identified, the expert must weigh the cost of
building the correction logic against the cost of manual cleansing. The cost of cleansing a dataset manually
tends to be linear in nature: the larger the data set, the higher the cost to cleanse it. This may
not be the case with automated cleansing, where the majority of the costs lie in developing and testing the
cleansing logic. Once developed, the cleansing logic can be run against varying population sizes with less
impact on the overall cost of the project.
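To make this tradeoff concrete, a small break-even sketch (all rates and effort figures below are hypothetical, not drawn from the study data):

    def manual_cost(n_records, minutes_per_record=3, hourly_rate=60):
        # Manual cleansing cost grows roughly linearly with the number of dirty records
        return n_records * minutes_per_record / 60 * hourly_rate

    def automated_cost(n_records, dev_hours=40, hourly_rate=90, run_cost_per_record=0.001):
        # Automated cleansing is dominated by a fixed development and testing cost
        return dev_hours * hourly_rate + n_records * run_cost_per_record

    for n in (500, 2_000, 10_000, 50_000):
        print(n, round(manual_cost(n)), round(automated_cost(n)))
    # Beyond some population size, building the correction logic is cheaper than manual cleanup.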
The expert must also consider the criticality of the erroneous records. How often will the data be used?
How will the data be used? What risks are introduced to ongoing processing and decision making if the
data is left in its current state? The expert must balance these ongoing risks and costs with the costs
associated with cleansing the data and available resources for the effort.
Methodology
The framework in the previous section presented the process view and the decisions involved in the ETL
data cleansing process. The challenge is to balance the requirements for data accuracy and the resource
limitations in terms of time, cost and expertise. Timely delivery of high quality data at the output while
staying within the budget is the ideal goal for the process. The effectiveness of the process depends upon
decisions about the target level of accuracy, the allocation of experts to cleansing tasks and the time
allotted to each task.
To evaluate the decisions and the tradeoffs involved in the data cleansing process, we use a binary integer
programming methodology. This methodology allows us to formulate the system flows, develop a
mathematical model and then experimentally evaluate various decisions and their interactions. Such a
technique is useful for planning decisions in practical situations. The goal is to identify the optimal
configuration of the ETL process.
The processes, stages, tasks and decisions in the granular ETL process discussed in the previous section
can be represented as a network graph. The process flow of ETL is represented by a network graph, which
consists of sets of nodes and links, as shown in Figure 2. The set of nodes represents different stages of the
ETL process, whereas the set of links represents the process flow. The stages in the ETL process consist of
two groups: main stages (extract, transform, and load) and intermediate stages (evaluate, re-work). The
sequence of stages through which data flows in the ETL process constitutes a path. There are multiple
paths in the graph that are organized as a sequence of stages through which data is processed. All paths
begin at the Start node and terminate at the End node, in the graph. Each stage processes the data using
certain resources, which vary in their expertise and capabilities.
Figure 2. Network Graph Representation of ETL process
Each stage in the graph is represented by a node with a set of attributes. Each node j has the following
attributes: unit processing time t_j, unit cost c_j and the ability to improve data accuracy π_j. The expertise
of a resource r processing data at stage j is given by e_rj. The objective of this model is to identify the
stages through which data will be processed in order to achieve system-level goals related to data quality,
processing time and project cost. These system-level goals are represented by: σ (target overall data quality),
τ (available total processing time) and µ (available monetary resources).
A sequence of stages through which data is processed is referred to as a feasible path of the ETL process. The
sequence of stages in a feasible path is represented by the variables x_ij. Each feasible path has a corresponding
total processing time and cost, incurred in achieving the target system-level goals. Note that all feasible
paths of the ETL process must contain the main stages (extract, transform, and load). Each intermediate
stage (evaluate, re-work) of data processing added to a sample path will improve data accuracy (although
it will also increase the total processing times and costs). The optimal path of the ETL process is one that
achieves the targeted accuracy level by using the least number of intermediate data processing stages.
Next we develop a mathematical formulation of the network graph to represent the ETL process. The
network graph representation serves as the foundation for building the mathematical representation of
the system. This formulation implements the relationships in the ETL process in the form of system
parameters and decision variables. Solving this formulation gives us the values of the decision variables
that are interpreted as decisions identifying the configuration of the ETL process, i.e., the main and
intermediate stages through which data will be processed. Given system-level targets of
data accuracy, there will be tradeoffs between allocating experts for cleansing tasks, time available to
perform a process and the nature of analysis performed in the sub-processes. The mathematical
representation of the ETL process and different tradeoffs involved in the selection of data processing
stages is developed using the following notations:
Sets:
S = set of stages
A = set of links between stages
R = set of resources

Parameters:
π_j  = improvement in data accuracy after completing stage j
c_j  = data processing cost at stage j
t_j  = processing time at stage j
e_rj = expertise of resource r to process data in stage j
τ    = available ETL completion time
µ    = available ETL monetary budget
σ    = data accuracy requirement at the end of the ETL process
b_i  = constant; equal to 1 if i is the start node, −1 if i is the end node, and 0 otherwise

Decision Variables:
x_ij = 1 if data is transferred to stage j after stage i, and 0 otherwise
Model:

Minimize:
    \sum_{(i,j) \in A} x_{ij}    (1)

Subject to:
    \sum_{(i,j) \in A} x_{ij} - \sum_{(j,i) \in A} x_{ji} = b_i,  \forall i \in S    (2)
    \sum_{(i,j) \in A} x_{ij} \le 1,  \forall i \in S    (3)
    \sum_{(i,j) \in A} \sum_{r \in R} e_{rj} t_j x_{ij} \le \tau    (4)
    \sum_{(i,j) \in A} \sum_{r \in R} e_{rj} c_j x_{ij} \le \mu    (5)
    \sum_{(i,j) \in A} \sum_{r \in R} e_{rj} \pi_j x_{ij} \ge \sigma    (6)
    x_{ij} \in \{0, 1\},  \forall (i,j) \in A
The objective function of the mathematical formulation is given by (eq-1). The output of this model yields
an optimal path that identifies the intermediate stages used in the ETL process. The constraints in (eq-2)
implement data flow requirements for feasible paths from Start node to End node of the ETL network
graph. The constraints in (eq-3) require that when there are multiple choices for selecting the next stage
of the ETL process, only one stage is selected. The limitation of total available time for the ETL process is
implemented by constraint (eq-4). Constraint (eq-5) limits total costs of the ETL process to be within
budget. Constraint (eq-6) implements the requirement that a feasible path consists of a sufficient number
of intermediate (and main) stages to ensure that the ETL process achieves the target data accuracy level.
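The study's formulation was implemented in AMPL and solved with CPLEX (see the next section). Purely as an illustrative sketch, and not the authors' code, the same kind of binary integer program can be expressed with the open-source PuLP library; the toy graph, parameter values and the omission of the resource-expertise term e_rj below are simplifying assumptions:

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    # Toy ETL graph (hypothetical; not the paper's Figure 2): nodes and directed links
    arcs = [("Start", "Extract"), ("Extract", "Evaluate"), ("Extract", "Transform"),
            ("Evaluate", "Rework"), ("Evaluate", "Transform"), ("Rework", "Transform"),
            ("Transform", "Load"), ("Load", "End")]
    stages = ["Start", "Extract", "Evaluate", "Rework", "Transform", "Load", "End"]
    t = {"Start": 0, "Extract": 40, "Evaluate": 16, "Rework": 24, "Transform": 60, "Load": 30, "End": 0}
    c = {"Start": 0, "Extract": 4.0, "Evaluate": 1.5, "Rework": 2.5, "Transform": 6.0, "Load": 3.0, "End": 0}
    pi = {"Start": 0, "Extract": 0.30, "Evaluate": 0.15, "Rework": 0.25, "Transform": 0.35, "Load": 0.10, "End": 0}
    tau, mu, sigma = 200, 20.0, 0.95                        # time, budget and accuracy targets
    b = {s: 0 for s in stages}
    b["Start"], b["End"] = 1, -1

    prob = LpProblem("etl_path", LpMinimize)
    x = LpVariable.dicts("x", arcs, cat=LpBinary)           # x[i,j] = 1 if stage j follows stage i
    prob += lpSum(x[a] for a in arcs)                       # (1) use the fewest stages
    for i in stages:
        outflow = lpSum(x[a] for a in arcs if a[0] == i)
        inflow = lpSum(x[a] for a in arcs if a[1] == i)
        prob += outflow - inflow == b[i]                    # (2) flow balance defines one Start-End path
        prob += outflow <= 1                                # (3) at most one successor stage
    prob += lpSum(t[a[1]] * x[a] for a in arcs) <= tau      # (4) total processing time within limit
    prob += lpSum(c[a[1]] * x[a] for a in arcs) <= mu       # (5) total cost within budget
    prob += lpSum(pi[a[1]] * x[a] for a in arcs) >= sigma   # (6) accumulated accuracy improvement
    prob.solve()
    print([a for a in arcs if x[a].value() > 0.5])          # the selected (optimal) path

In this toy instance the only feasible path is the one that passes through both intermediate stages (Evaluate and Rework), mirroring how the accuracy target forces additional cleansing stages into the path.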
Preliminary Analysis
The optimization model developed in the previous section served as the basis of our preliminary
experimental study. The formulation was coded in AMPL, using its interactive command environment for
setting up optimization models, and solved using the built-in library of solution algorithms available in
IBM CPLEX (Fourer et al. 2002). Next, a full factorial experimental study was set up to examine the
interplay of time, budget, expertise, accuracy and size; the output for each combination was the resulting
process configuration. An instance of the formulation represents a given scenario that was solved by the
solver, and an instance is determined by the size of the data.
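A sketch of how such a full factorial design could be enumerated (the factor levels below are placeholders, not the levels used in the study):

    from itertools import product

    # Hypothetical levels for the five experimental factors
    factors = {
        "time": ["tight", "normal", "generous"],
        "budget": ["low", "medium", "high"],
        "expertise": ["analyst", "lead", "programmer"],
        "accuracy": [0.90, 0.95, 0.99],
        "size": ["simple", "average", "complex"],
    }

    # One optimization-model instance per factor combination
    scenarios = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    print(len(scenarios))   # 3**5 = 243 scenarios in the full factorial design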
Data Description
To test the feasibility of the model, data was generated for a model run. At least one instance of data is
required for every combination of the factor levels. Because the model is deterministic, changes in output
follow directly from changes in parameter values, so boundary values can be used to find where the effects
change. We examine the factors and their interaction effects on the outcome, i.e., the optimal configuration
of the ETL process and its stages.
Experts in the field were recruited to provide insights into realistic levels of the parameters listed in
the previous section. The first author has over 20 years of experience in the data conversion domain and
proposed the initial accuracy rates. Experts with at least 15 years of experience were contacted and sent a
description of the problem. The data was based on real problem sets from the insurance and benefits
industries. Real data samples were used to identify realistic data sizes and the inaccuracies in the input
population. The experts validated the realistic accuracy expectations, the competencies of experts for tasks,
and the time and budget constraints. The error rates and accuracy estimates were also based on the actual
numbers from the cleansing process. Three experts then independently verified the numbers and proposed
modifications based on their experience. The suggested modifications were within the margin of error, and
all three experts verified the final numbers to ensure there was no bias.
In order to narrow in on sample data, we focused on a single domain for the conversion and a single use
for the data post conversion. After narrowing our focus we were able to work with industry experts to
define “typical” values for a conversion. The focus of our analysis is on a Defined Benefit (DB) conversion.
A DB plan is more commonly referred to as a pension plan. DB conversions tend to be very complex
for several reasons. The plan administrator must track data on the plan participants for 30 or more
years, to use in a calculation at retirement. There is a vast amount of data to track over a very long period
of time. The problem is exacerbated by the fact that very few participants review their data until retirement,
so issues are not found and corrected in a timely manner. Plan sizes can also vary greatly, from only a few
hundred participants at small organizations to hundreds of thousands of participants at Fortune 500
companies or government entities. For the purpose of this analysis, we focused on plans with roughly
20,000 participants.
The 20k participants represent the total number of participants (current employees, former employees
and beneficiaries) in the conversion. It does not represent the true volume of data. For each participant
the plan administrator must convert several types of data (tables in the database), and each type of data
has several characteristics (fields on the table). Consider an example: for each person we would
need personal indicative data and pay data. Personal indicative data likely contains fields such as first
name, last name, birth date, etc. Pay data would include the type of pay, the frequency of pay, the date it was
paid and the amount. It can be seen that a conversion of 20k participants can actually involve a very large
amount of data. A plan that, for example, pays its employees twice a month and has 10k active employees will
generate 240k rows of pay data per year.
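A quick back-of-the-envelope calculation of the conversion volume implied by these figures (the years of history is an assumed value; the pay-row arithmetic matches the example above):

    active_employees = 10_000
    pay_periods_per_year = 24           # paid twice a month
    years_of_history = 30               # DB plans can track decades of data (assumed)

    pay_rows_per_year = active_employees * pay_periods_per_year
    print(pay_rows_per_year)                        # 240,000 rows of pay data per year
    print(pay_rows_per_year * years_of_history)     # several million rows for pay history alone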
The complexity of a DB conversion can be driven by several other factors as well. The number of unions
and complexity of their rules, the number of mergers and acquisitions the organization has been through,
the number of system conversions in the past, just to name a few. For the purpose of this effort we tried
to define three broad complexity levels: Simple, Average and Complex.
The use of the data post conversion in our analysis was real-time online retirement processing. This is
important, as it drives the requirement for data quality at the end of the ETL process. If this data were used
for reporting, or even for offline retirement processing, the quality of the data could be significantly lower.
Since the transactions are taken online, the user expects the system to work without interruption. In this
environment, invalid data is tracked at the participant level. We may convert thousands of individual
fields for a participant (think of the entire pay history, hours history, etc.), but if anything is wrong in any
of the fields, we cannot process the transaction for that participant.
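A sketch of the participant-level error tracking described above (the field-check representation is hypothetical): one invalid field among thousands is enough to force offline handling.

    def participant_is_processable(field_is_valid):
        # field_is_valid: validity flag for every converted field of one participant
        return all(field_is_valid)

    # e.g., one bad value somewhere in a long pay history blocks online processing
    history_flags = [True] * 2_399 + [False]
    print(participant_is_processable(history_flags))   # False -> handle this participant manually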
The larger the participant count, the higher the data accuracy must be. If the plan has 5,000 participants
and a 5% error rate, there are 250 participants that we will need to handle offline when they try
to retire. If the plan has 500,000 participants and a 5% error rate, we have 25,000 participants to handle in a
manual fashion. The type of error dictates the experience level of the ETL team member needed to solve
the error. The four types of errors considered for the analysis are:
Type 1 - This is incorrect data that requires very little effort to clean up. An example may be a population
that is missing a birth date, where the value is available in other data we have, so the correction rule might be
as simple as: use the normal mapping if the data is there, else use the alternate mapping (a small sketch
follows these error type definitions).
Type 2 - This is incorrect data that requires effort to correct, but does not require specialized knowledge
or skill set and does not require overly complicated code. Using the missing birth date example: if we had
the data to correct the birth date but had to evaluate several pieces of data to determine which source to
use, it would be considered a Type 2 error.
Type 3 - This is incorrect data that requires a significant effort to correct or a specialized skill set or
knowledge e.g. an incorrect service amount that must be calculated at conversion.
Type 4 - This is incorrect data that simply cannot be cleaned in an automated fashion. The data does not
exist or can only be found manually with extensive research.
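The Type 1 correction rule referenced above could be as small as the following sketch (the source and field names are hypothetical):

    def resolve_birth_date(primary, alternate):
        # Type 1 rule: use the normal mapping when the value is present,
        # otherwise fall back to the alternate source identified during analysis
        return primary.get("birth_date") or alternate.get("birth_date")

    print(resolve_birth_date({"birth_date": None}, {"birth_date": "1961-07-04"}))  # falls back to alternate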
Table 1. Percentage of error types in each of the three conversions

Conversion   Total Errors   Type 1   Type 2   Type 3   Type 4
Simple            20%         10%       7%      2%       1%
Average           40%         20%      10%      6%       4%
Complex           60%         25%      18%     10%       7%
The three conversion types are expressed in terms of the percentages of the four error types, as
described in Table 1. The first column defines the conversion type: Simple, Average and Complex. Next is
the percentage of total errors. These figures were derived from actual empirical data and then validated by
the three experts to ensure that the case was not atypical. The remaining columns give the percentages of
the individual error types listed earlier.
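For the 20,000-participant plan used in the analysis, the Table 1 rates translate into participant-level error counts as follows (a sketch reading each percentage as a fraction of the full population, since the type percentages in a row sum to its total error rate):

    participants = 20_000
    error_rates = {   # from Table 1
        "Simple":  {"Total": 0.20, "Type 1": 0.10, "Type 2": 0.07, "Type 3": 0.02, "Type 4": 0.01},
        "Average": {"Total": 0.40, "Type 1": 0.20, "Type 2": 0.10, "Type 3": 0.06, "Type 4": 0.04},
        "Complex": {"Total": 0.60, "Type 1": 0.25, "Type 2": 0.18, "Type 3": 0.10, "Type 4": 0.07},
    }

    for conversion, rates in error_rates.items():
        counts = {name: round(rate * participants) for name, rate in rates.items()}
        print(conversion, counts)   # e.g., Average -> 8,000 erroneous participants, 4,000 of them Type 1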
Table 2. Parameters for each conversion type and the estimated constraints

                               Simple Conversion        Average Conversion       Complex Conversion
Process  Resource     Node    t_j    c_j     π_j       t_j    c_j     π_j       t_j    c_j     π_j
E        Programmer    3       12    2.58   0.038       20    4.30   0.035       24    5.16   0.035
E        Analyst       4        6    1.50   0.013       12    3.01   0.024       16    4.02   0.020
E        Analyst       6       24    6.02   0.047        8    2.01   0.044        8    2.01   0.044
E        Lead          7       56   20.44   0.042       72   26.28   0.035        8    2.92   0.044
T        Analyst      10       26    6.68   0.333       40   10.04   0.522       53   13.39   0.436
T        Lead         11       48   17.52   0.524      120   43.80   0.522      171   62.49   0.532
T        Programmer   13       80   17.20   0.429      160   34.40   0.191      256   55.04   0.197
T        Analyst      14      120   30.12   0.405      168   42.17   0.119      240   60.24   0.117
L        Analyst      17      128   32.12   0.202       56   14.06   0.080       80   20.08   0.088
L        Analyst      18       80   20.08   0.081       56   14.06   0.080       80   20.08   0.088
L        Lead         20      240   87.60   0.073       56   20.44   0.054       80   29.20   0.066
L        Lead         21      240   87.60   0.069      240   87.60   0.080       80   29.20   0.066

Constraints:  Simple:  τ = 528 hrs,  σ = 0.95, µ = $9.22K
              Average: τ = 880 hrs,  σ = 0.95, µ = $20.39K
              Complex: τ = 1584 hrs, σ = 0.95, µ = $59.68K
Parameter Description
The parameters identified in the model in the previous section were defined based on the empirical data
for each of the three conversion types. The target accuracy, budget (cost), time permitted for the
conversion process and the types of resources available were also derived from the sample data. Since the
three experts were from different firms, they were able to validate the estimates for the sample data
independently to ensure that the estimates were reasonable. The estimates of the accuracy of cleansing
process and time taken for manual and automated cleaning were also extracted from the sample datasets
used for the study. It should be noted that while our network representation in Figure 2 shows only two
intermediate stages for each of the ETL processes, in reality there can be multiple iterations in
each stage before the data is passed on to the next process. The parameters for each of the three
conversion types along with the percentage improvement at each node are presented in Table 2.
Results and Discussion
The data described in the previous section was used in the optimization model. The output of the model
was used to record the optimal path. The optimal path identified different stages of the ETL process to
achieve target accuracy level, while conforming to the budgetary and the processing time constraints.
The optimal path for Simple conversion is: Start-1-2-8-9-11-12-13-15-16-End
The optimal path for Average conversion is: Start-1-2-3-5-6-8-9-10-12-13-15-16-17-19-21-End
The optimal path for Complex conversion is: Start-1-2-3-5-6-8-9-11-12-13-15-16-17-19-20-End
The numbers in the optimal paths represent the nodes in Figure 2 and the order of nodes represents the
sequence of stages in the optimal ETL process.
For simple conversions, the optimal path included no data cleansing in the extract phase, while most of the
errors were resolved in the transform phase through a combination of manual and automated cleansing
processes. This could indicate that, for this type of conversion, cleansing during the extract and load phases
yields only marginal improvement. For conversions of average complexity, the optimal path indicated that
cleansing at each node improves the accuracy significantly; this result also shows that manual processing is
preferable to automated processing for data errors of higher complexity. For the most complex conversion,
the optimal path is similar to the average case, with the difference that an additional cleansing step was used
in the transform stage. In this step, automated cleansing provided a better data cleaning option than the
manual process. Hence, the nature and type of errors in a complex conversion distinguish it from the average
and simple cases. Furthermore, analysis of the results showed that the budget and available time for the
three conversion types also play key roles in the composition of the optimal paths that satisfy the data
cleansing requirements.
Conclusion
In this research in progress, we develop an optimization model of the ETL process. We identify the major
stages and the sub processes within the ETL process that impact the quality of data cleansing. We use the
parameters of time, expert resources, budget, and target data accuracy requirements to formulate a
mathematical model of the system. We then use empirical data for conversions of three different
complexity levels from one domain. The estimates of the parameters of the model are derived from the
sample data and validated independently by three industry experts. Preliminary results have been
presented that show how the complexity of the task, accuracy requirements, budget and available time
can impact the optimal path for data cleansing. In some cases, cleansing upfront in the extract process is
not the best solution, and in other cases it is. This preliminary analysis establishes the feasibility of the
approach. Future studies will use additional data sources and include additional sub-stages within the ETL
process. This research will help us understand the various facets of the data cleansing process within ETL
and will provide optimal combinations of decisions regarding resource allocation, budget and data accuracy.
References
Galhardas, H., Lopes, A., and Santos, E. 2011. “Support for User Involvement in Data Cleaning,” in
Proceedings of the 2011 International Conference on Data Warehousing and Knowledge Discovery (DaWaK),
pp. 136-151.
Fourer, R., Gay, D.M., and Kernighan, B.W. 2002. AMPL: A Modeling Language for Mathematical
Programming. Duxbury Press / Brooks/Cole Publishing Company.
Friedman, T., and Bitterer, A. 2006. Magic quadrant for data quality tools. Gartner Group, 184.
Haug, A., Zachariassen, F., and Liempd, D. 2011. “The Costs of Poor Data Quality,” Journal of Industrial
Engineering and Management (4:2), pp. 168-193.
Lee, Y., Strong, D., Kahn, B., and Wang, R. 2002. “AIMQ: a methodology for information quality
assessment,” Information & Management (40:2), pp. 133-146.
Maletic, J., and Marcus, A. 2000. “Data Cleansing: Beyond Integrity Analysis,” in Proceedings of the 2000
Conference on Information Quality, pp. 200-209.
Marsh, R. 2005. “Drowning in dirty data? It’s time to sink or swim: a four-stage methodology for total
data quality management,” The Journal of Database Marketing & Customer Strategy Management
(12:2), pp. 105-112.
Pipino, L., Lee, Y., and Wang, R. 2002. “Data Quality Assessment,” Communications of the ACM (45:4),
pp. 211-218.
Rahm, E., and Do, H. H. 2000. “Data Cleaning: Problems and Current Approaches,” Bulletin of the IEEE
Computer Society Technical Committee on Data Engineering (23:4), pp. 3-13.
Redman, T. 1998. “The Impact of Poor Data Quality on the Typical Enterprise,” Communications of the
ACM (41:2), pp. 79-82.
Vassiliadis, P., Simitsis, A., and Baikousi, E. 2009. “A Taxonomy of ETL Activities,” in Proceedings of the
12th ACM International Workshop on Data Warehousing and OLAP (DOLAP), pp. 25-32.
Watts, S., Shankaranarayanan, G., and Even A. 2009. “Data Quality Assessment in Context: A Cognitive
Perspective,” Decision Support Systems (48:1), pp. 202-211.
Wixom, B., and Watson, H. 2001. “An Empirical Investigation of the Factors Affecting Data Warehousing
Success,” MIS Quarterly (25:1), pp. 17-41.