Course: M0584 - Data Warehouse
Year: Sep - 2009
The Data Warehouse and Design
Sessions 3-4
Summary
• The design of the data warehouse begins with the data
model
• The primary concern of the data warehouse developer is
managing volume
• The data warehouse is fed data as it passes from the
legacy operational environment. Data goes through a
complex process of conversion, reformatting, and
integration as it passes from the legacy operational
environment into the data warehouse environment
• The data model exists at three levels: high level, mid
level, and low level
Bina Nusantara University
Summary (cont’d)
• The creation of a data warehouse record is
triggered by an activity or an event that has
occurred in the operational environment
• A profile record is a composite record made up
of many different historical activities.
• The star join is a database design technique that
is sometimes mistakenly applied to the data
warehouse environment
Beginning with Operational Data
• Three types of loads are made into the data warehouse
from the operational environment:
– Archival data
– Data currently contained in the operational
environment
– Ongoing changes to the data warehouse environment
from the changes (updates) that have occurred in the
operational environment since the last refresh
Beginning with Operational Data (cont’d)
• Five common techniques are used to limit the amount
of operational data scanned:
1. Scan data that has been timestamped
2. Scan a ‘delta’ file
3. Scan a log file or an audit file
4. Modify application code
5. Rub a ‘before’ and an ‘after’ image of the operational file
together
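Two of the techniques above can be sketched in a few lines. This is a minimal illustration, not a production extract routine: the row layout, the field names (`id`, `balance`, `updated_at`), and the two helper functions are assumptions made for the example.

```python
from datetime import datetime


def changed_since(rows, last_refresh):
    """Technique 1: keep only rows timestamped after the last refresh."""
    return [row for row in rows if row["updated_at"] > last_refresh]


def diff_images(before, after):
    """Technique 5: compare a 'before' and an 'after' image, both keyed by record key."""
    inserted = {k: after[k] for k in after.keys() - before.keys()}
    deleted = {k: before[k] for k in before.keys() - after.keys()}
    updated = {
        k: after[k]
        for k in after.keys() & before.keys()
        if after[k] != before[k]
    }
    return inserted, updated, deleted


# Hypothetical operational rows carrying a timestamp.
rows = [
    {"id": 1, "balance": 100, "updated_at": datetime(2009, 9, 1)},
    {"id": 2, "balance": 250, "updated_at": datetime(2009, 9, 15)},
]
recent = changed_since(rows, datetime(2009, 9, 10))

# Hypothetical file images: record key -> balance.
before = {1: 100, 2: 250, 3: 75}
after = {1: 100, 2: 300, 4: 40}
inserted, updated, deleted = diff_images(before, after)
```

The timestamp scan touches only rows changed since the last refresh, while the image comparison must read both files in full, which is why it is usually the technique of last resort.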
Data/Process Model and the Architected
Environment
• The process model applies only to the operational environment
• The data model applies to both the operational environment and the
data warehouse environment
• A process model typically consists of the following (in whole or in
part)
– Functional decomposition
– Context-level zero diagram
– Data Flow Diagram
– Structure Chart
– State Transition Diagram
– HIPO chart
– Pseudocode
The Data Warehouse and Data Models
The Data Warehouse data model
• There are three levels of data modeling
– High-level modeling (ERD)
– Mid-level modeling (DIS = data item set)
– Low-level modeling (physical model)
Snapshots in the Data Warehouse
• Snapshots are created as a result of some event
occurring.
• The snapshot triggered by an event has four basic
components:
– A key
– A unit of time
– Primary data that relates only to the key
– Secondary data captured as part of the snapshot
process that has no direct relationship to the primary
data or key
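The four components can be pictured as the fields of a single record. The sketch below is only an illustration: the `Snapshot` class, the sale event, and every field name in it (including the incidental `exchange_rate`) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Snapshot:
    key: str          # identifies the subject of the snapshot
    moment: date      # the unit of time at which the event occurred
    primary: dict     # nonkey data that relates only to the key
    secondary: dict = field(default_factory=dict)  # incidental data captured alongside


def snapshot_from_sale(event):
    """Build a snapshot when a (hypothetical) sale event occurs."""
    return Snapshot(
        key=event["customer_id"],
        moment=event["sale_date"],
        primary={"item": event["item"], "amount": event["amount"]},
        # Captured at the same moment, but not directly related to the customer key.
        secondary={"exchange_rate": event.get("exchange_rate")},
    )


sale = {
    "customer_id": "C-17",
    "sale_date": date(2009, 9, 3),
    "item": "modem",
    "amount": 45.0,
    "exchange_rate": 9950.0,
}
snap = snapshot_from_sale(sale)
```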
Complexity of Transformation and
Integration
• At first glance, when data is moved from the legacy
environment to the data warehouse environment, it
appears that nothing more is going on than simple
extraction of data from one place to the next
Complexity of Transformation and Integration
(cont’d)
• Some of the functionality required as data passes from the
operational, legacy environment to the data warehouse
environment:
– The extraction of data from the operational environment to the
data warehouse environment requires a change in technology
(DBMS technology)
– The selection of data may be very complex
– Operational input keys need to be restructured and
converted
– Nonkey data is reformatted
– Data is cleansed
– Multiple input sources of data exist and must be merged
– Key resolution must be done
– Input files need to be resequenced
– Default values must be supplied
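A small sketch can make several of these steps concrete. It assumes two hypothetical source systems, `billing` and `orders`, that hold the same customer under different keys and date formats; the key map, the field names, and the `UNKNOWN` default are all invented for illustration.

```python
from datetime import datetime

# Hypothetical key resolution table: (source system, source key) -> warehouse key.
KEY_MAP = {("billing", "0017"): "CUST-17", ("orders", "A17"): "CUST-17"}
# Each source stores dates in its own format.
DATE_FORMATS = {"billing": "%d/%m/%Y", "orders": "%Y%m%d"}


def transform(source, record):
    """Convert one operational record into the warehouse layout."""
    opened = datetime.strptime(record["opened"], DATE_FORMATS[source]).date()
    return {
        "customer_key": KEY_MAP[(source, record["id"])],  # key restructured and resolved
        "name": record["name"].strip().title(),           # data cleansed
        "opened": opened.isoformat(),                     # nonkey data reformatted
        "country": record.get("country") or "UNKNOWN",    # default value supplied
    }


def merge(rows):
    """Merge rows from several sources by warehouse key, then resequence by key."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row["customer_key"], dict(row))
        for name, value in row.items():
            if target[name] == "UNKNOWN":  # let a later source fill in a default
                target[name] = value
    return [merged[key] for key in sorted(merged)]


billing = {"id": "0017", "name": "  ada LOVELACE ", "opened": "03/09/2009"}
orders = {"id": "A17", "name": "Ada Lovelace", "opened": "20090903", "country": "ID"}
warehouse_rows = merge([transform("billing", billing), transform("orders", orders)])
```

Even in this toy case, the "simple extraction" involves key resolution, cleansing, reformatting, defaulting, merging, and resequencing.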
Profile records
• Profile records represent snapshots of data, just like
individual activity records. The difference between the
two is that individual activity records in the data
warehouse represent a single event, while profile
records in the data warehouse represent multiple events.
• A profile record is created from the grouping of many
detailed records
• See figure 3.43 for details
Managing Volume
• In many cases, the volume of data to be managed in the
data warehouse is a significant issue. Creating profile
records is an effective technique for managing the
volume of data. The reduction in volume that is possible
when detailed records in the operational environment are
moved into a profile record is remarkable
• It is possible (indeed, normal) to achieve a reduction of
two to three orders of magnitude by creating profile
records in a data warehouse.
• Because of this benefit, the ability to create profile
records is a powerful one that should be in the portfolio
of every data architect
Creating Multiple Profile Records
• Multiple profile records can be created
from the same detail. In the case of a
phone company, individual call records
can be used to create a customer profile
record, a district traffic profile record, a line
analysis profile record, and so forth.
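The phone company case can be sketched as one grouping routine applied to the same detail along different dimensions. The call records, the field names, and the statistics kept in each profile are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical detailed call records from the operational environment.
calls = [
    {"customer": "C1", "district": "North", "minutes": 5},
    {"customer": "C1", "district": "North", "minutes": 12},
    {"customer": "C2", "district": "South", "minutes": 3},
    {"customer": "C1", "district": "South", "minutes": 7},
]


def build_profiles(records, group_by):
    """Group many detailed records into one profile record per group value."""
    profiles = defaultdict(
        lambda: {"calls": 0, "total_minutes": 0, "longest_call": 0}
    )
    for rec in records:
        profile = profiles[rec[group_by]]
        profile["calls"] += 1
        profile["total_minutes"] += rec["minutes"]
        profile["longest_call"] = max(profile["longest_call"], rec["minutes"])
    return dict(profiles)


# Two different profile records built from the same detail.
customer_profiles = build_profiles(calls, "customer")
district_profiles = build_profiles(calls, "district")
```

Four detailed records collapse into two customer profiles and two district profiles here; with realistic call volumes the same grouping yields far larger reductions.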
Creating Multiple Profile Records (cont’d)
• The profile records can go into the data warehouse or
into a data mart that is fed by the data warehouse. When
the profile records go into the data warehouse, they are
for general-purpose use. When the profile records go into
the data mart, they are customized for the department
that will use the data mart.
• The aggregation of the operational records into
a profile record is almost always done on the
operational server.
Direct Access of Data Warehouse Data
• See figure 3.46
Indirect Access of Data Warehouse Data
• See figure 3.47
Star Joins
• Data warehouse design is decidedly a world in which a
normalized approach is the proper one. There are
several very good reasons why normalization produces
the optimal design for a data warehouse:
– It produces flexibility
– It fits well with very granular data
– It is not optimized for any given set of processing requirements
– It fits very nicely with the data model
Star Joins (cont’d)
• A different approach to database design sometimes
mentioned in the context of data warehousing is the
multidimensional approach. This approach entails star
joins, fact tables, and dimensions. The multidimensional
approach applies exclusively to data marts, not data
warehouses.
• Unlike data warehouses, data marts are very much
shaped by requirements. To build a data mart, you have
to know a lot about the processing requirements that
surround the data mart.
• Once those requirements are known, the data mart can
be shaped into an optimal star join structure.
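For readers who have not seen one, a star join in a data mart can be sketched with an in-memory SQLite database. The schema below (one fact table, two dimension tables) and all table names, column names, and rows are hypothetical; it shows the shape of the structure, not a recommended design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the 'who/what/where' of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    -- The fact table sits at the centre of the star and references each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'modem'), (2, 'router');
    INSERT INTO dim_store VALUES (10, 'West'), (20, 'East');
    INSERT INTO fact_sales VALUES (1, 10, 45.0), (2, 10, 80.0), (1, 20, 45.0);
    """
)

# The star join: the fact table joined to each of its dimensions.
rows = conn.execute(
    """
    SELECT s.region, p.name, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.region, p.name
    ORDER BY s.region, p.name
    """
).fetchall()
conn.close()
```

The structure answers "sales by region and product" very efficiently, which is exactly why it presupposes knowing that requirement in advance.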
Star Joins (cont’d)
• Data Warehouses are essentially different because they
serve a very large community, and as such, they are not
optimized for the convenience or performance of any
one set of requirements.
• Data Warehouses are shaped around the corporate
requirements for information, not the departmental
requirements for information.
• Therefore, creating a star join for the data warehouse is
a mistake because the end result will be a data
warehouse optimized for one community at the expense
of all other communities.
Star Joins (cont’d)
• See Figure 3.51
Star Joins (cont’d)
• See Figure 3.52
Star Joins (cont’d)
• See Figure 3.53
Star Joins (cont’d)
• See Figure 3.54
Star Joins (cont’d)
• See Figure 3.55
Star Joins (cont’d)
• See Figure 3.56
Supporting the ODS
• In general, there are three classes of ODS
– class I, class II, and class III.
• In a class I ODS, updates of data from the
operational environment to the ODS are
synchronous.
• In a class II ODS, the updates between the
operational environment and the ODS
occur within a two-to-three-hour time
frame.
Supporting the ODS (cont’d)
• In a class III ODS, the synchronization
of updates between the operational
environment and the ODS occurs
overnight.
• But there is another type of ODS structure
– a class IV ODS, in which updates into
the ODS from the data warehouse are
unscheduled.
• See Figure 3.57
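The four classes can be summarized as a small lookup table. The `OdsClass` structure and its field names are assumptions made for this sketch; the feeding source and refresh timing for each class are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OdsClass:
    name: str
    fed_from: str  # which environment feeds the ODS
    refresh: str   # how quickly updates reach the ODS


ODS_CLASSES = {
    "I": OdsClass("class I", "operational environment", "synchronous"),
    "II": OdsClass("class II", "operational environment", "within two to three hours"),
    "III": OdsClass("class III", "operational environment", "overnight"),
    "IV": OdsClass("class IV", "data warehouse", "unscheduled"),
}


def classes_fed_from(source):
    """Return the ODS classes whose updates come from the given environment."""
    return [key for key, ods in ODS_CLASSES.items() if ods.fed_from == source]


from_warehouse = classes_fed_from("data warehouse")
from_operational = classes_fed_from("operational environment")
```

Only the class IV ODS depends on the data warehouse as a source, so it is the class that the warehouse design has to support.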