Beyond ETL and Data Warehousing
InfoManagement Direct, February 19, 2009
Rick Sherman
Data integration suffers from an image problem. It has become synonymous with extract,
transform and load. Likewise, ETL has been regarded as a data warehousing
technology. Both of these viewpoints fail to reflect current capabilities, and they greatly
inhibit enterprises in their attempt to integrate data to provide the information their
business needs. Because of this short-sightedness, companies have lost opportunities to
harness information as a corporate asset. It increases the cost of integrating data,
encourages the creation of data silos and forces businesspeople to spend an inordinate
amount of time filling the information gaps themselves through data shadow systems or
reconciling data.
More than Simply ETL Tasks
The basic tasks required in data integration are to gather data, transform it and put it into
its target location. If that sounds like ETL, that’s because that’s exactly what it is (see
Figure 1). ETL tools have automated these tasks and empowered developers with a
toolkit beyond what they could have easily hand coded themselves. ETL tools, for
example, include prebuilt transformations from the elementary tasks of converting data
and performing lookups to the more complex processes of change data capture and
slowly changing dimensions. These prebuilt transformations greatly enhance developer
productivity and improve the consistency of results. Data integration tools offer many
significant processes and technologies that extend beyond the basic ETL tasks (see
Figure 2). These extensions are necessary to turn data into comprehensive, consistent,
clean and current information. The extended processes include data profiling, data
quality and operational processing. Together, these extensions make it possible to determine the state of the source systems, cleanse the data, ensure consistency and manage all the processing, including error handling and performance monitoring.
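To make the slowly changing dimensions example concrete, here is a minimal sketch of what a Type 2 SCD transformation does. The row layout, key names and change-detection rule are illustrative assumptions, not any particular vendor's implementation.

```python
from datetime import date

def scd_type2_update(dimension, incoming, today=None):
    """Expire changed rows and append new current versions.

    dimension: list of row dicts with keys natural_key, attributes,
               valid_from, valid_to, is_current
    incoming:  dict mapping natural_key -> latest attribute dict
    """
    today = today or date.today()
    new_rows = []
    for row in dimension:
        if not row["is_current"]:
            continue  # historical versions are never touched
        new_attrs = incoming.get(row["natural_key"])
        if new_attrs is not None and new_attrs != row["attributes"]:
            # Change detected: close out the current version...
            row["is_current"] = False
            row["valid_to"] = today
            # ...and append a new current version, preserving history.
            new_rows.append({
                "natural_key": row["natural_key"],
                "attributes": new_attrs,
                "valid_from": today,
                "valid_to": None,
                "is_current": True,
            })
    dimension.extend(new_rows)
    return dimension
```

Hand coding this pattern correctly for every dimension table is tedious and error prone, which is exactly why the prebuilt version in an ETL tool pays for itself.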
Data integration suites have expanded to incorporate enterprise application integration,
enterprise information integration and service-oriented architecture coupled with ETL to
offer data integration in batch, interoperating with applications, or in real time from
business intelligence applications.
More than Batch Processes
There are many integration initiatives in an enterprise. With data integration being
associated with batch-driven ETL processes that load a data warehouse (DW),
integration initiatives that did not involve a DW sought out other technologies. These
technologies included EAI, EII and SOA. Although each of these technologies has
fundamental applications, the reality is that organizations had to reinvent the wheel for
every data integration task. The results have been integration silos built with different
technologies, producing inconsistent business information and generally with data
integration built as an afterthought.
The good news is that data integration vendors that now combine all of the above
technologies into data integration suites have emerged from the ETL ranks. These suites
enable an enterprise to integrate data in one consistent manner, yet deploy using
whatever transport technology (i.e., ETL, EAI, EII or SOA) is appropriate.
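As a rough sketch of that "define once, deploy over any transport" idea, the following Python separates the integration rule from its deployment. The function names and toy transformation are assumptions for illustration, not a vendor API.

```python
def standardize_customer(record: dict) -> dict:
    """The shared integration rule, defined exactly once."""
    return {
        "customer_id": record["id"],
        "name": record["name"].strip().title(),
        "country": record.get("country", "US").upper(),
    }

def batch_load(source_rows, target_table):
    """ETL-style deployment: transform a whole extract, then load."""
    target_table.extend(standardize_customer(r) for r in source_rows)

def on_request(record):
    """SOA/EII-style deployment: the same rule applied per message."""
    return standardize_customer(record)

# Both paths produce identical results because they share one rule.
warehouse = []
batch_load([{"id": 7, "name": " ada lovelace "}], warehouse)
assert on_request({"id": 7, "name": " ada lovelace "}) == warehouse[0]
```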
More than Data Warehousing
With the emergence of the more powerful suites, data integration has moved beyond
data warehousing to include other integration initiatives in an enterprise, such as:
• Data migration,
• Application consolidation,
• Operational and real-time BI and
• Master data management, customer data integration and product information management.
Companies often undertake data migration or application consolidation projects because of mergers and acquisitions or because they need to streamline applications. In the past,
these projects were seen as one-offs and typically hand coded. As systems integrators
became proficient in ETL tools from DW projects, they realized that they would be much
more productive at data migrations and application consolidation projects if they used
these same data integration tools. Even though such projects are one-offs, the tools enabled the SIs to reuse code, leverage prebuilt transformations, better manage
processes and produce documentation without a laborious manual effort. In addition,
they did not have to deploy a cadre of coding gurus but could leverage the data
integration developers they already employed.
Several market forces have converged to produce the perfect storm, enabling
operational or real-time BI with the same data integration and BI tools as used in DW
projects. These forces include enterprise applications built on relational databases and
data integration tools no longer bound to batch ETL constraints. In addition, the major enterprise application vendors now offer data integration and BI tools of their own and bundle them with their applications. The result of this convergence is more consistent, comprehensive and current information (the business benefit) delivered with the same data integration and BI infrastructure (the IT benefit).
MDM, CDI and PIM all deal with conforming and maintaining master data or reference
data for data subjects, such as customers and products. The initial wave of technology
solutions bundled a set of tools and applications that were business process or industry
specific. What got lost in many of the initial implementations was that these applications
relied heavily on data integration and that it made sense to leverage a company’s
existing data integration platform to create MDM, CDI and PIM solutions.
Moving to Pervasive Data Integration
Data integration tools aren’t pervasive yet, despite the fact that their use is a best
practice. The primary inhibitors have been cost and resources, lack of understanding of tool capabilities and lack of awareness of what the market offers.
Although Fortune 1000-size corporations tend to use these tools to build their data
warehouses, they still hand code their data marts, online analytical processing cubes
and other reporting databases. The barriers to pervasive use in these enterprises
include:
• Expense. The licensing costs of these tools often inhibit more widespread use.
• Resources. Often, the data integration developers are tied up in DW development and are not available for other database loading.
• Fit. The data integration tool selected as the corporate standard may not be the best match for creating the data marts, cubes and reporting databases required by individual groups.
The issue of expense is best addressed when selecting a tool or negotiating pricing with
the tool vendor. Licensing cost should not be the barrier it once was, as many vendors now offer more scalable pricing options.
Companies can address the issue of resources by forming a data integration
competency center. Identifying and elevating the visibility of the extent of the data
integration tasks occurring throughout an enterprise will enable IT to justify either more
resources or a prioritization of all integration projects so that the enterprise DW does not
consume all available resources.
The issue of fit is political for large enterprises because the solution is to go against the
grain of selecting one corporate standard. The pragmatic solution would be to create two
corporate standards: one for enterprise-class data integration and the other for
“downstream” databases such as data marts or cubes. The downstream databases do
not require the many sources, data cleansing or data conforming that data warehousing
requires. Most of these downstream databases could be loaded by more cost- and
resource-effective ETL solutions than required for the enterprise.
Not only do corporate IT departments balk at this solution, but tool vendors often believe they should “own” the account and that their product can do everything every other tool can do. The
pragmatic answer I give both groups is that without the two-tier standard approach, the
downstream databases will continue to be hand coded with all the business and IT costs
and risks associated with those applications. In addition, more data shadow systems will
arise to plug the information gap because hand coded applications will take longer and
incur more maintenance costs as they age.
For firms smaller than the Fortune 1000, such as those with annual revenue between
$250 million and $1.5 billion, the inhibitors to using data integration tools are costs,
resources and being unaware of the breadth of the data integration market. These firms
do not have the IT budget or resources of the Fortune 1000 to dedicate to data
integration solutions. Although this means they would spend less on solutions, it should
not mean they are forced to hand code.
However, they might not be aware of the tools, at least those within their budget. If they are aware of any data integration tools at all, it is most likely the high-end tools, which are expensive and require highly skilled resources. They know of these tools because industry analysts and publications seem to mention only the expensive ones.
They also hear about them from employees who used to be employed by Fortune 1000
firms. From their perspective, you either have to pay for high-end tools or you hand code,
and hand coding usually wins out.
The data integration tool market includes products supporting a wide range of
capabilities, skills and costs. As discussed, there may be multiple products available that
would be an excellent fit for a midmarket firm’s needs, skills and budget.
Next Steps
If you work in a Fortune 1000 company, ask yourself if you’re stuck in the ETL rut and if
business groups throughout your enterprise are hand coding or, worse, building data
shadow systems. If your company is smaller than the Fortune 1000, determine if you are
hand coding your data integration processes and why.
After you get your answers, become an advocate to make data integration more
pervasive in your enterprise and unleash the information your business needs.
Data Profiling
Data profiling is the examination and assessment of your source systems’ data quality,
integrity and consistency. Sometimes called source systems analysis, it helps ensure
your data integration works properly so you can avoid being surprised by bad data.
Data profiling tools, which automate the process, assist your staff in truly understanding
not just the data definitions but, more importantly, the real content of the data. Without
these tools, data profiling requires writing hand-coded SQL statements and comparing
query results with expected results. Not only is this laborious and time-consuming, it is
likely to be incomplete because people generally will not hand code every permutation of
a source system’s tables, views and columns. Data warehousing and BI projects are
often late or encounter data surprises when the staff hand codes data profiling. It’s even
worse when, overwhelmed, people skip it completely.
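As a sketch of what these tools automate, the following Python computes a few basic profile metrics per column over rows held as dicts. The metrics and the threshold are illustrative assumptions; real profiling tools run such checks across every table, view and column of a source system.

```python
def profile_column(rows, column):
    """Basic profile of one column: row count, nulls, distincts, range."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    profile = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        profile["min"], profile["max"] = min(numeric), max(numeric)
    return profile

# Example: flag a column whose null rate exceeds an agreed threshold.
rows = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
stats = profile_column(rows, "age")
if stats["nulls"] / stats["rows"] > 0.25:
    print("age fails the completeness check:", stats)
```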
Data profiling should be established as a best practice for every data warehouse, BI, and
data migration project. In addition to meeting project requirements, data profiling should
be an ongoing activity to ensure that you maintain data quality levels.
Rick Sherman has more than 20 years of business intelligence and data warehousing
experience, having worked on more than 50 implementations as a director/practice
leader at PricewaterhouseCoopers and while managing his own firm. He is the founder
of Athena IT Solutions, a Boston-based consulting firm that provides data warehouse
and business intelligence consulting, training and vendor services. Sherman is a
published author of over 50 articles, an industry speaker, a DM Review World Class
Solution Awards judge, a data management expert at searchdatamanagement.com and
has been quoted in CFO and Business Week. Sherman can be found blogging on
performance management, data warehouse and business intelligence topics at The Data
Doghouse. You can reach him at [email protected] or (617) 835-0546.
In addition to teaching at industry conferences, Sherman offers on-site data
warehouse/business intelligence training, which can be customized, and teaches public courses in the Boston area. He also teaches data warehousing at Northeastern University's graduate school of engineering.