Recent Developments in Data Warehousing: A Tutorial
Hugh J. Watson
Terry College of Business
University of Georgia
[email protected]
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt
Tutorial Objectives
Provide an overview of data warehousing
Provide materials to support the teaching of data warehousing
Discuss recent developments in data warehousing

Topics Covered
Definitions and concepts
The data mart and enterprise-wide data warehouse strategies
Data extraction, cleansing, transformation, and loading
Meta data
Data stores
Online analytical processing (OLAP)
Warehouse users, tools, and applications
Case study: Harrah’s Entertainment
The Importance of Data Warehousing
Provide a “single version of the truth”
Improve decision making
Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management
Estimated to be a $113.5 billion market in 2002 for systems, software, services, and in-house expenditures (Palo Alto Management Group)
A Simple Definition
A data warehouse is a collection of data created to support decision-making applications.
Data Warehouse Characteristics
Subject oriented -- data are organized around sales, products, etc.
Integrated -- data are integrated to provide a comprehensive view
Time variant -- historical data are maintained
Nonvolatile -- data are not updated by users
Another Definition
Data warehousing is the entire
process of data extraction,
transformation, and loading of data to
the warehouse and the access of the
data by end users and applications.
Data Mart
A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.
An independent data mart is created directly from source systems.
A dependent data mart is populated from a data warehouse.
Operational Data Store
An operational data store consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data.
Its purpose is to provide integrated data for operational purposes. It has add, change, and delete functionality.
It may be created to avoid a full-blown ERP implementation.
[Figure: data warehousing architecture, in five columns from left to right]
Data Sources: transaction data (Prod, Mkt, HR, Fin, Acctg), other internal data (ERP), external data (demographic), web data (clickstream)
ETL Software: extract, clean/scrub, transform, load, with a data staging area and an operational data store
Data Stores: data warehouse, data marts (finance, marketing, sales), meta data
Data Analysis Tools and Applications: SQL; queries, reporting, DSS/EIS, data mining; web browser
Users: analysts, managers, executives, operational personnel, customers/suppliers
Vendor and product names shown in the figure: IBM, IMS, VSAM, Oracle, Sybase, SAP, Informix, SAS, Harte-Hanks, Ascential, Informatica, Sagent, Firstlogic, Teradata, Essbase, Cognos, MicroStrategy, Microsoft, Siebel, Business Objects
Two Data Warehousing Strategies
Enterprise-wide warehouse, top down, the Inmon methodology
Data mart, bottom up, the Kimball methodology
When properly executed, both result in an enterprise-wide data warehouse
The Data Mart Strategy
The most common approach
Begins with a single mart; architected marts are added over time to cover more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the “silos of information” problem
Can postpone difficult decisions and activities
Requires an overall integration plan
The Enterprise-wide Strategy
A comprehensive warehouse is built initially
An initial dependent data mart is built using a subset of the data in the warehouse
Additional data marts are built using subsets of the data in the warehouse
Like all complex projects, it is expensive, time consuming, and prone to failure
When successful, it results in an integrated, scalable warehouse
Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the present time
External data may be included, often purchased from third-party sources
Technology exists for storing unstructured data, and this is expected to become more important over time
Extraction, Transformation, and Loading (ETL) Processes
The “plumbing” work of data warehousing
Data are moved from source to target databases
A very costly, time-consuming part of data warehousing
Recent Development: More Frequent Updates
Updates can be done in bulk and trickle modes
Business requirements, such as trading partner access to a Web site, require current data
For international firms, there is no good time to load the warehouse
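The difference between bulk and trickle updates can be sketched with Python's standard `sqlite3` module. The `sales` table, its columns, and the function names `bulk_load` and `trickle_load` are assumptions made for illustration; they are not taken from any ETL product.

```python
import sqlite3

def bulk_load(conn, rows):
    """Load a whole batch in one transaction, e.g. during a nightly load."""
    with conn:  # commits once, after all rows are inserted
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def trickle_load(conn, row):
    """Load a single row as soon as it arrives and commit immediately."""
    with conn:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, units INTEGER)")

# Hypothetical rows: one nightly batch, then one row trickled in later.
bulk_load(conn, [("2001-01-14", "TELEVISION", 3), ("2001-01-14", "RADIO", 5)])
trickle_load(conn, ("2001-01-15", "TELEVISION", 1))

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)
```

Bulk mode amortizes transaction overhead over many rows, while trickle mode keeps the warehouse current at the cost of many small transactions.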

Recent Development: Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions. An ODS helps to custom-tailor the dialog
The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed
Software is available to analyze the clickstream data
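The filter-and-parse step for clickstream data might look like the following minimal sketch. The four-field log format (timestamp, session id, URL, status), the list of static-file suffixes, and the function names are all assumptions for illustration; real web server logs vary.

```python
from urllib.parse import urlparse

# Hypothetical log format: "timestamp session_id url status"
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def parse_click(line):
    """Split one raw log line into a click record, or return None if malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    timestamp, session_id, url, status = parts
    if not status.isdigit():
        return None
    return {
        "timestamp": timestamp,
        "session_id": session_id,
        "path": urlparse(url).path,  # drop the query string
        "status": int(status),
    }

def filter_clicks(lines):
    """Keep successful page views; drop static files and malformed lines."""
    clicks = []
    for line in lines:
        click = parse_click(line)
        if click is None or click["status"] != 200:
            continue
        if click["path"].lower().endswith(STATIC_SUFFIXES):
            continue
        clicks.append(click)
    return clicks

raw = [
    "2001-01-15T10:00:00 s1 /products/tv?id=7 200",
    "2001-01-15T10:00:01 s1 /images/logo.gif 200",
    "2001-01-15T10:00:05 s1 /checkout 404",
    "garbage line",
]
clicks = filter_clicks(raw)
for click in clicks:
    print(click)
```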
Data Extraction
Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)
Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of “dirty data” in the source systems)
Increasingly performed by specialized ETL software
Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
PowerMart/PowerCenter from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications
Reasons for “Dirty” Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys, Non-Unique Identifiers
Data Integration Problems
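A few of these problems (dummy values, absence of data, and non-unique identifiers) can be detected with simple profiling. The sketch below is only illustrative: the `DUMMY_VALUES` set, the field names, and the `profile` function are assumptions, and real dummy values are specific to each source system.

```python
from collections import Counter

# Hypothetical dummy values; real lists depend on the source system.
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "XXXXX"}

def profile(records, key_field, required_fields):
    """Return a list of (record_index, problem) tuples."""
    problems = []
    key_counts = Counter(rec.get(key_field) for rec in records)
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                problems.append((i, f"missing {field}"))
            elif str(value).strip().upper() in DUMMY_VALUES:
                problems.append((i, f"dummy value in {field}"))
        if key_counts[rec.get(key_field)] > 1:
            problems.append((i, f"non-unique {key_field}"))
    return problems

customers = [
    {"id": 1, "name": "Ann Lee", "ssn": "123-45-6789"},
    {"id": 2, "name": "", "ssn": "999-99-9999"},
    {"id": 1, "name": "Bob Ray", "ssn": "N/A"},
]
problems = profile(customers, "id", ["name", "ssn"])
for problem in problems:
    print(problem)
```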
Data Cleansing
Source systems contain “dirty data” that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used. It is important for performing name and address correction and householding functions
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
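The parsing examples above can be sketched as three small functions. This is a deliberately naive illustration, assuming well-formed input such as "John Q. Public", "123 Main Street", and "Athens, GA"; the function names are hypothetical, and commercial cleansing tools handle far more variation.

```python
import re

def parse_name(full_name):
    """Split a full name into first, middle, and last parts."""
    parts = full_name.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first": first, "middle": middle, "last": last}

def parse_street(line):
    """Split a street line into a leading number and the street name."""
    match = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
    if not match:
        return {"street_number": "", "street_name": line.strip()}
    return {"street_number": match.group(1), "street_name": match.group(2)}

def parse_city_state(line):
    """Split 'City, ST' at the first comma."""
    city, _, state = line.partition(",")
    return {"city": city.strip(), "state": state.strip()}

print(parse_name("John Q. Public"))
print(parse_street("123 Main Street"))
print(parse_city_state("Athens, GA"))
```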

Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
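A minimal sketch of correction against secondary data follows. The lookup tables stand in for a secondary data source such as a postal reference file; their contents (a made-up state "ZZ", city names, and ZIP code) and the function name `correct_address` are invented purely for illustration.

```python
# Hypothetical secondary data, invented for this example.
VANITY_CITIES = {("LAKEVIEW ESTATES", "ZZ"): "SPRINGFIELD"}  # vanity name -> postal city
ZIP_LOOKUP = {("SPRINGFIELD", "ZZ"): "00000"}                # (city, state) -> ZIP code

def correct_address(record, vanity_cities=VANITY_CITIES, zip_lookup=ZIP_LOOKUP):
    """Replace a vanity city name and fill in a missing ZIP code."""
    corrected = dict(record)
    key = (corrected["city"].upper(), corrected["state"].upper())
    if key in vanity_cities:
        corrected["city"] = vanity_cities[key].title()
        key = (corrected["city"].upper(), key[1])
    if not corrected.get("zip") and key in zip_lookup:
        corrected["zip"] = zip_lookup[key]
    return corrected

result = correct_address({"city": "Lakeview Estates", "state": "ZZ", "zip": ""})
print(result)
```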

Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples include adding a pre-name (a title such as Mr. or Ms.), replacing a nickname, and using a preferred street name.
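Two of these standardization rules, nickname replacement and a preferred street-name form, can be sketched as table lookups. The `NICKNAMES` and `STREET_SUFFIXES` tables are small hypothetical samples, and upper-casing is an assumed house convention rather than a requirement.

```python
# Hypothetical conversion tables; real tools ship much larger ones.
NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM", "PEGGY": "MARGARET"}
STREET_SUFFIXES = {"ST": "STREET", "ST.": "STREET", "AVE": "AVENUE", "AVE.": "AVENUE"}

def standardize(record):
    """Return a copy of the record in a consistent, preferred format."""
    out = dict(record)
    first = out["first"].strip().upper()
    out["first"] = NICKNAMES.get(first, first)       # replace a nickname
    out["last"] = out["last"].strip().upper()
    words = out["street_name"].upper().split()
    if words:                                        # preferred street-name form
        words[-1] = STREET_SUFFIXES.get(words[-1], words[-1])
    out["street_name"] = " ".join(words)
    return out

standardized = standardize({"first": "Bob", "last": "Smith", "street_name": "Main St."})
print(standardized)
```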

Matching
Matching searches for and matches records within and across the parsed, corrected, and standardized data based on predefined business rules to eliminate duplications.
Examples include identifying similar names and addresses.
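One simple way to identify similar names and addresses is approximate string comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the 0.85 threshold and the rule that both name and address must be similar are assumed business rules, not those of any commercial matching engine.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(records, threshold=0.85):
    """Return index pairs whose name AND address are both similar enough."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = min(similarity(a["name"], b["name"]),
                    similarity(a["address"], b["address"]))
        if score >= threshold:
            matches.append((i, j))
    return matches

records = [
    {"name": "Robert Smith", "address": "123 Main Street"},
    {"name": "Robert Smyth", "address": "123 Main Street"},
    {"name": "Alice Jones", "address": "9 Oak Avenue"},
]
matches = find_matches(records)
print(matches)
```

Comparing every pair is quadratic in the number of records, so this sketch only suits small data sets.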

Consolidating
Consolidating analyzes and identifies relationships between matched records and merges them into one representation.
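Merging matched records requires a rule for choosing among conflicting values. The sketch below assumes one possible rule, "the most recently updated non-empty value wins"; the `updated` field and the `consolidate` function are hypothetical.

```python
def consolidate(records):
    """Merge matched records into one; the newest non-empty value wins per field."""
    merged = {}
    # Process oldest first so that later (newer) values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in ("", None):
                merged[field] = value
    return merged

matched = [
    {"name": "Robert Smith", "phone": "", "updated": "2001-01-10"},
    {"name": "Robert Smyth", "phone": "555-0100", "updated": "2000-06-01"},
]
merged = consolidate(matched)
print(merged)
```

The newer record supplies the name, while the phone number survives from the older record because the newer one left it empty.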
Data Staging
Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
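The cutoff behavior of a staging area can be sketched as follows. The row layout, the source names, and the `split_at_cutoff` function are assumptions for illustration only.

```python
from datetime import datetime

def split_at_cutoff(staged_rows, cutoff):
    """Separate rows to transform and load now from rows that stay staged."""
    ready = [row for row in staged_rows if row["arrived"] <= cutoff]
    held = [row for row in staged_rows if row["arrived"] > cutoff]
    return ready, held

# Hypothetical rows accumulated from asynchronous sources.
staged = [
    {"source": "pos_flat_file", "arrived": datetime(2001, 1, 15, 21, 30), "units": 3},
    {"source": "ftp_feed", "arrived": datetime(2001, 1, 15, 23, 45), "units": 5},
    {"source": "pos_flat_file", "arrived": datetime(2001, 1, 16, 0, 10), "units": 2},
]
cutoff = datetime(2001, 1, 16, 0, 0)
ready, held = split_at_cutoff(staged, cutoff)
print(len(ready), len(held))
```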
Data Transformation
Transforms the data in accordance with the business rules and standards that have been established
Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
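Several of these transformations can be shown on a single row. In the sketch below, the source layout ("Last, First" names, MM/DD/YYYY dates, numeric gender codes) and the code table are hypothetical; deduplication is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime

GENDER_CODES = {"1": "F", "2": "M"}  # hypothetical source-system codes

def transform(row):
    """Apply field-level business rules to one source row."""
    last, _, first = row["name"].partition(",")  # splitting up a field
    sale_date = datetime.strptime(row["date"], "%m/%d/%Y").date()
    return {
        "first_name": first.strip(),
        "last_name": last.strip(),
        "sale_date": sale_date.isoformat(),               # format change
        "gender": GENDER_CODES.get(row["gender"], "U"),   # replacement of codes
        "units": row["units"],
        "revenue": row["units"] * row["unit_price"],      # derived value
    }

def revenue_by_date(rows):
    """Aggregate revenue per sale date."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sale_date"]] += row["revenue"]
    return dict(totals)

source_rows = [
    {"name": "Smith, Robert", "date": "01/15/2001", "gender": "2", "units": 2, "unit_price": 250.0},
    {"name": "Jones, Alice", "date": "01/15/2001", "gender": "1", "units": 1, "unit_price": 100.0},
]
rows = [transform(r) for r in source_rows]
print(revenue_by_date(rows))
```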

Data Loading
Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is to near real-time updates of the data warehouse as the warehouse is increasingly used for operational applications
Meta Data
Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database, table, and column names; refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information; etc.
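A meta data entry that serves both audiences might be modeled as a small record. The field names, the in-memory `catalog` dictionary, and the sample values below (including the example.com address) are assumptions for illustration and do not reflect any meta data standard.

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Technical meta data, mainly for IT personnel
    database: str
    table: str
    column: str
    source: str
    refresh_schedule: str
    # Business meta data, mainly for users
    definition: str
    help_desk: str

catalog = {}

def register(meta):
    """Store an entry under its (table, column) name."""
    catalog[(meta.table, meta.column)] = meta

def define(table, column):
    """Look up the business definition of a column."""
    return catalog[(table, column)].definition

register(ColumnMetadata(
    database="dw", table="sales", column="nosold",
    source="order entry system", refresh_schedule="nightly",
    definition="Number of units sold", help_desk="helpdesk@example.com",
))
print(define("sales", "nosold"))
```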
Recent Development: Meta Data Integration
A growing realization that meta data is critical to data warehousing success
Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings
Database Vendors
High-end (i.e., terabyte-plus) vendors include NCR-Teradata (Teradata) and IBM (DB2)
Oracle (8i) and Microsoft (SQL Server 7) are major players for smaller databases
Online Analytical Processing (OLAP)
A set of functionality that facilitates multidimensional analysis
Allows users to analyze data in ways that are natural to them
Comes in many varieties -- ROLAP, MOLAP, DOLAP, etc.
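Multidimensional analysis amounts to summarizing a measure along whichever dimensions the user chooses. The sketch below illustrates the idea in plain Python; the `roll_up` function and the sample facts are invented, and real OLAP servers precompute or optimize such aggregations.

```python
from collections import defaultdict

def roll_up(facts, dimensions, measure):
    """Sum `measure` grouped by the chosen dimension names."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dimensions)] += fact[measure]
    return dict(totals)

# Hypothetical facts with three dimensions and one measure.
facts = [
    {"product": "TELEVISION", "region": "EAST", "month": "2001-01", "units": 10},
    {"product": "TELEVISION", "region": "WEST", "month": "2001-01", "units": 4},
    {"product": "RADIO", "region": "EAST", "month": "2001-01", "units": 7},
    {"product": "TELEVISION", "region": "EAST", "month": "2001-02", "units": 6},
]
by_product = roll_up(facts, ["product"], "units")
by_product_region = roll_up(facts, ["product", "region"], "units")
print(by_product)
print(by_product_region)
```

Adding or removing names from `dimensions` corresponds to drilling down or rolling up.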

ROLAP
Relational OLAP
Uses an RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
OLAP tool manipulates RDBMS star schema data
Called “slowlap” by MOLAP vendors
MOLAP
Multidimensional OLAP
Uses an MDDBS (e.g., Essbase) to store and access data
Usually requires proprietary (non-SQL) data access tools
Provides exceptionally fast response times
Star Schema
Creates non-normalized data structures
Easier for users to understand
Optimized for OLAP
Uses fact tables (facts or measures in the business) and dimension tables (which establish the context of the facts)
OLAP Tools

Products come from vendors such as Brio, Cognos, Hyperion,
and BusinessObjects

Typically available as a fat or thin (i.e., browser) client

In a web environment, the browser communicates with a
web server, which talks to an application server, which
connects to backend databases

The application server provides query, reporting, and OLAP
analysis functionality over the web

Java applets or downloaded components augment the thin
client

A broadcast server may be used to schedule, run, publish,
and broadcast reports, alerts, and responses over the LAN,
email, or personal digital assistant.
Dimension Table Examples
Retail -- store name, zip code, product name, product category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch, account officer
Insurance -- policy type, insured party
Fact Table Examples
Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average number of calls
Banking -- average monthly balance
Insurance -- claims amount
The Fact Table Key Concatenates the Dimension Keys
Assume that you want to know the number of television sets sold to Best Buys on January 15, 2001. The query might be:
SELECT CLIENT.CUSNAME, SALES.NOSOLD
FROM CLIENT, PRODUCT, TIME, SALES
WHERE CLIENT.CUSNAME = SALES.CUSNAME
  AND PRODUCT.PRODNAME = SALES.PRODNAME
  AND TIME.DATE = SALES.DATE
  AND CLIENT.CUSNAME = 'BEST BUYS'
  AND PRODUCT.PRODNAME = 'TELEVISION'
  AND TIME.DATE = #01/15/2001#
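The same idea can be tried end to end with Python's standard `sqlite3` module. This is a sketch under assumptions: the TIME table and DATE column are renamed `time_dim` and `sale_date` to avoid clashing with SQL keywords in other databases, dates are stored as ISO text instead of the `#...#` literal, and the sample rows are invented. The fact table's primary key is the concatenation of the three dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client   (cusname   TEXT PRIMARY KEY);
CREATE TABLE product  (prodname  TEXT PRIMARY KEY);
CREATE TABLE time_dim (sale_date TEXT PRIMARY KEY);
CREATE TABLE sales (
    cusname   TEXT REFERENCES client(cusname),
    prodname  TEXT REFERENCES product(prodname),
    sale_date TEXT REFERENCES time_dim(sale_date),
    nosold    INTEGER,
    PRIMARY KEY (cusname, prodname, sale_date)  -- concatenated dimension keys
);
INSERT INTO client   VALUES ('BEST BUYS');
INSERT INTO client   VALUES ('OTHER STORE');
INSERT INTO product  VALUES ('TELEVISION');
INSERT INTO product  VALUES ('RADIO');
INSERT INTO time_dim VALUES ('2001-01-15');
INSERT INTO sales VALUES ('BEST BUYS',   'TELEVISION', '2001-01-15', 12);
INSERT INTO sales VALUES ('BEST BUYS',   'RADIO',      '2001-01-15', 30);
INSERT INTO sales VALUES ('OTHER STORE', 'TELEVISION', '2001-01-15', 5);
""")

# Join the fact table to each dimension and filter on dimension values.
rows = conn.execute("""
    SELECT client.cusname, sales.nosold
    FROM client
    JOIN sales    ON client.cusname = sales.cusname
    JOIN product  ON product.prodname = sales.prodname
    JOIN time_dim ON time_dim.sale_date = sales.sale_date
    WHERE client.cusname = ? AND product.prodname = ? AND time_dim.sale_date = ?
""", ("BEST BUYS", "TELEVISION", "2001-01-15")).fetchall()
print(rows)
```

Because the key is composite, a second row for the same client, product, and date is rejected, which keeps one fact row per combination of dimension values.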
Warehouse Users
Analysts
Managers
Executives
Operational personnel
Customers and suppliers
Warehouse Tools and Applications
SQL queries
Managed query environments
Structured and ad hoc reports
DSS/EIS
Portals
Data mining
Packaged applications
Custom-built applications
Recent Development: Enterprise Intelligence Portals
Offers users an effective way to access information scattered across networked enterprise systems through a simple and personalized Web interface
Provides access to structured and unstructured data
Potentially integrates data warehousing and knowledge management
Harrah’s Entertainment
Harrah’s Entertainment -- data warehousing supported a successful shift to a CRM-oriented corporate strategy. Winner of the 2000 TDWI Leadership Award
Operates 21 casinos across the country
In 1993, the gaming laws changed, which allowed Harrah’s to expand
Harrah’s decided to compete using a brand strategy supported by information technology
Needed to know their customers exceptionally well
Harrah’s Data Warehousing Architecture
WINet sources data from the casino, hotel, and event systems
The patron database (PDB) serves as an operational data store
The marketing workbench (MWB) serves as the data warehouse
Sample Applications
Operational personnel use PDB to check the preferences, history, and value of customers
Analysts use PDB and MWB to create offers to visit a Harrah’s casino
Analysts use MWB to support predictive modeling efforts
Predict the value of a customer
Market based on that expected value
Track transactions that are linked to marketing initiatives
Evaluate the effectiveness
Track profitability
Refine marketing approaches
[Figure: marketing cycle with the stages Define (objectives, tests, control cells), Execute, Track (customer action/non-action), Measure (profit and loss, behavior change, new test report), and Learn, with customer treatment (right offer, right message, right time)]
Customer Relationship Lifecycle
[Figure: annual revenue plotted against length of relationship, with the stages Establish, Strengthen, and Reinvigorate]
Articles
Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue, “Data Warehousing Supports Corporate Strategy at First American Corporation,” MIS Quarterly, (December 2000), pp. 547-567. Provides a case study of how First American Corporation turned its strategy and fortunes around through the use of data warehousing.
Stoller, Wixom, and Watson, “WISDOM Provides Competitive Advantage at Owens & Minor,” (http://terry.uga.edu/~watson/owens&minor.doc). Provides a case study of how data warehousing can support supply chain integration.
Watson, Wixom, Buonamici, and Revak, “Sherwin-Williams' Data Mart Strategy: Creating Intelligence Across the Supply Chain,” Communications of the AIS, April 2001. Provides a textbook example of how to implement a data mart strategy.
Watson, H.J., D.A. Annino, B.H. Wixom, K.L. Avery, and M. Rutherford, “Current Practices in Data Warehousing,” Information Systems Management, (Winter, 2001), pp. 47-55. Provides data on companies’ data warehousing experiences, with an emphasis on the benefits being realized.
Watson, H.J. and L. Volonino, “Harrah’s High Payoff from Customer Information,” (http://www.terry.uga.edu/~hwatson/harrahs.doc). Provides a case study of how Harrah’s Entertainment has implemented a CRM strategy facilitated by data warehousing.
Books
Devlin, Data Warehouse -- Architecture to Implementation, Addison-Wesley, 1997.
Gray and Watson, Decision Support in the Data Warehouse, Prentice-Hall, 1998.
Kimball, The Data Warehouse Toolkit, Wiley, 1996.
Kimball and Merz, The Data Webhouse Toolkit, Wiley, 2000.
Inmon, Building the Operational Data Store, second edition, Wiley, 1999.
Inmon, Imhoff, and Sousa, Corporate Information Factory, Wiley, 1999.
Websites
http://www.olapreport.com (provides detailed information about the OLAP market, products, and applications)
http://www.firstlogic.com (includes an interactive demo of their data cleansing tool)
http://www.billinmon.com (a wealth of current information from “the father of data warehousing”)
http://www.metagenix.com (illustrates recent advances in ETL tools)
http://www.microstrategy.com (excellent materials from one of the leading DSS vendors)