Step 1 : Identify and choose business processes

A data warehouse is a relational database just like any other relational database.
However, the database model is designed in a different way. It should facilitate
business users querying large amounts of data, but also let them explore the data
from different points-of-view by means of BI applications. Six steps will guide you
through the design process and address some key fundamentals in data warehouse
modeling. Finally, five additional tips will help you save time and money with future
enhancements of your data warehouse solution.
Step 1 : Identify and choose business processes
Identify candidate business processes that you want to see covered by your data warehouse.
These are the business processes that your key users want to analyze and run reports on.
There may be several processes put forward by different stakeholders, e.g. product sales,
production quality control, consumption of energy, helpdesk issue management. It is
important to identify all of the processes, but you should not try to have each one of them
implemented at once. You can use a prioritization matrix to help you choose which project
to model first. In the ideal situation you start with the process that has a high business value
and is feasible both to realize for the IT department and to introduce in your organization.
Step 2 : Describe the business process.
Once you have chosen your business process, you describe it in terms of measures. The
most trivial measure is the counting measure. Each process cycle can be counted.
Depending on the size and type of business some of the process cycles can be counted on a
daily basis, such as sales processes, others on a yearly basis, like marketing budget
allocations. Besides this counting measure, most processes can be further described by
other measures such as its cost, profit, duration, spatial volume, et cetera. How much did
that sale cost ? How many liters of milk were produced ? All the measures that describe the
same process, can be grouped into a so-called fact table.
Step 3 : Describe the context of the measures
A measure has no meaning on its own. The number 1000 does not tell you much. You need
to interprete numbers in their proper context. This is where dimension attributes come into
play. Sales amounts are relevant for products, and for customers, on a given date, and
maybe for sales representatives. These contextual nouns are called dimension attributes.
The values of these attributes can be used to filter your measures, to drill-down/up and to
categorize measures in columns or rows. In the grid below, you’ll see an example of a sales
quantity measure described in the context of three dimension attributes: product, date and
customer.
Likely, you can provide more interesting information for your dimension attributes. A
customer also has a name and resides at a location. A product may have a color, and
belongs to a product category. A date is part of a month and a year. These related attributes
can be grouped into so-called dimension tables, and are in turn very handy when you filter,
drill-down/up, or categorize your data.
Step 4 : Define dimension tables in the model
Now that you have identified dimension attributes, you can design the proper dimension
tables. Each dimension table consist of its business key attribute, the one that uniquely
identifies a row in the table. The business key is the attribute that is known in the
operational system by the business users. Along with the business key, the related attributes
are put in place. The ETL, the process that uploads new data from the operational system,
e.g. all product information, will look up the business key in the dimension table.
If the dimension member is already in the table, some related attributes may be updated,
such as a change in color for a product, or if not, then the new member should be inserted
into your dimension table.
Further, we will introduce a surrogate key in the dimension table, an identity column with
automatically generated incremental values for each row that is inserted. The surrogate key
will become the primary key of the table. The reason why, is explained further below. Your
database administrator can help you with the implementation.
Step 5 : Design a calendar (time) dimension table
The Calendar dimension table contains all the dates from some point in the past up to some
point in the future, e.g. 2000 Jan 1 – 2049 Dec 31, a period relevant to your business users.
The business key here, can be defined as any data type, even a datetime data type. But for
performance reasons, and for easy query development, an integer is provided taking the
format YYYYMMD. An example: the date id 20140820 represents 2014 Aug 20. The
calendar table is populated with records only once at the start of your data warehouse
project.
Step 6 : Define the fact table for your measures
Finally, we’ll define the fact table that will hold the measures of our sales process. Provide
a column for every measure. Provide a column for every foreign key that links the measures
to the corresponding dimension rows . Also here, a surrogate key is introduced in the fact
table that will become the primary key of the table.
Why using surrogate keys?
Well you’ll need them for a number of reasons:
• It protects you from the unexpected. It may occur that you have to change a business key
for an existing member. With surrogate keys introduced, a business key change only
requires an update of the business key column in the dimension table. No need to update all
the fact rows which are related to that surrogate key.
• Using integer keys improves better join performance for the database; also it yields
smaller tables (smaller indexes) which ends up also into better query performance.
• We’ll introduce also a surrogate key in the fact table. Sooner or later a fact table might in
turn become a dimension table for another fact table. E.g. a fact table that holds measures
for projects might act as a dimension for a fact table holding measures for individual project
steps.
• Note: the Calendar dimension, as an exception, has no surrogate key as told in step 5
earlier.
Congratulations, you’ve just designed your first dimensional model. For the implementation
of the model, please keep in mind the following five tips. Your users will stay happy.
Tip 1 : Use atomic detailed data in your dimension model
Most often, users don’t need to see a single record at a time. Nevertheless you can’t predict
how they want to display and drill down into the details. When you introduce summarized
data into the data warehouse, e.g. sales per region instead of sales per shop, then you make
assumptions about data usage patterns. Sooner or later your users will not appreciate that.
So model the facts on the lowest level.
Tip 2 : All facts in a single fact table are modeled at the same grain
Some might be tempted to fill the same fact table with measures from different types of
processes, e.g. car production and automobile services, or maybe some may want to mix
marketing budget allocations for single key products with budgets for entire product
categories. When multiple levels of granularity are present in the same fact table you
introduce not only user confusion but also erroneous results into the BI applications.
Tip 3 : Strive for conformed dimensions to integrate data across the organization
Conformed dimensions are dimensions that can be reused across multiple fact tables. They
are managed once in the ETL system and support the ability to drill across and integrate
data from multiple business processes. A product dimension table can be used with a sales
fact table, but also in a production order fact table.
Tip 4 : Avoid NULLs, instead use special dimension members
NULLs, “empty values”, may lead to wrong result in calculations. Also, a NULL may be
interpreted by business users in different ways :
• as Unknown, e.g. a sale for an non existing product, which is an erroneous situation
• as Not Specified, e.g. an empty phone number in an online customer survey, which can be
considered as common situation
• or as Not Applicable, e.g. an optional feature in a business process such as a travel
insurance when booking a trip
To deal with these ‘null’ situations, we introduce special members in the dimension tables.
The ETL process should map facts to the corresponding special members. These rows are
defined once at the start of the project.
Below is an example of a dimension table with special members: Unknown, Not Applicable,
Not Specified and Archived. The usage of the former three special members has just been
explained. The latter one, Archived, comes in handy when you sooner or later need to
upload facts from old historical data sets with missing dimension values and there is no way
to retrieve those original values. E.g. old sales figures with date and customer codes but
without product codes. Uploads of such historical datasets only occur once, and by using
Archived members, you can easily distinguish these members from the others.
Tip 5 : Keep the data warehouse well performing and consistent.
You should consider your data warehouse as a library. You don’t want go through all the
library’s bookshelves just to find one book. You expect all books to be neatly ordered by
theme and stored in the corresponding bookshelf. Accordingly, have the data warehouse
model implemented with indexes and foreign keys :
• Implement the tables as clustered index structures with the primary key (our surrogate
key) as the clustered index column. In some cases larger tables, sometimes implemented as
a partitioned table, may require the cluster index key to be defined on another column, e.g.
the date column.
• Define a unique non-clustered index on the business key.
• Define foreign key constraints.
• Add non-clustered indexes to the foreign key columns.
If you are not familiar with these database terms, please ask us, or your database
administrator, for further explanation.
Ready for your first data warehouse model ? I hope the six steps guide above and the
additional tips may encourage you to further explore the path of dimensional modeling!