UNIT-1 Compelling Need for Data Warehousing: Introduction

MCA 204, Data Warehousing & Data Mining
UNIT-1
Compelling Need for Data
Warehousing
© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63, by Dr. Deepali Kamthania
Learning Objective
• Escalating need for strategic information
• Building blocks of data warehouse
• Data warehouse components
• Defining the business requirements
Introduction
DBMS and Data Warehouse
• Databases and data warehouses are methods for
organizing and managing information and business
intelligence.
• Database management systems and data mining
tools are IT tools used to work with information and
business intelligence.
Business Intelligence
Business intelligence is knowledge about:
• Customers
• Competitors
• Partners
• Competitive environment
• Internal operations
Data Warehousing
What Is A Data Warehouse?
• Data warehousing is a rapidly expanding area of technology
and one that still has a number of different definitions.
• Stephen R. Gardner claims:
  - It is “a process, not a product, for assembling and managing data from various sources for the purpose of gaining a single, detailed view of part or all of a business.”
  - A data warehouse is a place to store detailed data and a way to combine data to get a detailed picture of the business.
Cont....
• Another definition by Lawrence Fischer states:
  - “A data warehouse is just another database. What sets it apart is that the information it contains is not used for operational purposes, but rather for analytical tasks.”
Data Warehousing (Definition)
“A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process” [Inmon, 1993].
• SUBJECT-ORIENTED:
The warehouse is organized around the major subjects of an enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and order processing).
• INTEGRATED DATA:
• The data warehouse integrates corporate application-oriented
data from different source systems, which often includes data
that is inconsistent.
• Such data must be made consistent to present a unified view of the data to the users.
Cont....
• TIME-VARIANT:
• Data in the warehouse is only accurate and valid at some point in time or over some time interval.
• Time-variance is also shown in the extended time that the data is held, the association of time with all data, and the fact that the data represents a series of historical snapshots.
• NON-VOLATILE:
• Data in the warehouse is not updated in real time but is refreshed from operational systems on a regular basis.
• New data is always added as a supplement to the database, rather than as a replacement.
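The time-variant and non-volatile properties can be illustrated with a small sketch. The class name `SnapshotStore`, its methods, and the sample customer data are hypothetical and chosen only for illustration; the point is that each refresh appends dated snapshots and never overwrites earlier ones.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: each refresh adds dated snapshots, nothing is overwritten."""

    def __init__(self):
        self._rows = []  # list of (snapshot_date, key, value)

    def load(self, snapshot_date, records):
        # Non-volatile: new data supplements what is already stored.
        for key, value in records.items():
            self._rows.append((snapshot_date, key, value))

    def as_of(self, key, when):
        """Time-variant: return the latest value recorded for key on or before when."""
        candidates = [(d, v) for d, k, v in self._rows if k == key and d <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]

    def history(self, key):
        """Return every snapshot held for key, oldest first."""
        return sorted((d, v) for d, k, v in self._rows if k == key)


store = SnapshotStore()
store.load(date(2024, 1, 31), {"cust-1": "Delhi"})   # first periodic refresh
store.load(date(2024, 2, 29), {"cust-1": "Mumbai"})  # later refresh, old row is kept
```

A query such as `store.as_of("cust-1", date(2024, 2, 1))` still sees the January value, because the February refresh was added alongside it rather than replacing it.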
Cont…
• Data warehouses are not transaction-oriented.
• Data warehouses support online analytical processing (OLAP).
Data Warehouses
What Is A Data Warehouse?
Why Data Warehousing?
• Collect information centrally
• Organize information consistently
• Deliver information conveniently
• Hence, significant cost benefits, time savings and productivity gains are associated with using a data warehouse for information processing.
• Conclusion:
• A data warehouse enables information processing to be done in a credible, efficient manner.
Scope
• The ability to use information to make insightful
decisions depends on having appropriate tools to
extract specific data, convert it into business
information and monitor changes.
• A data warehouse delivers not only summary information but also the ability to drill down, develop forecasts and export the information to other decision-support tools.
Practical Implications
Methodologies and design principles are needed for building a production data warehouse. The data warehouse positions the enterprise to satisfy four interrelated demands on corporations to:
• Prepare their systems and their users for constant evolution.
• Improve the productivity and revenue contribution of every employee.
• Maximize profits by performing core business processes better than their competitors and by eliminating as many resource-draining practices as possible.
• Apply science to information.
Creating a Data Warehouse
Data Collection
There need to be extraction routines to gather data from
the various operational data sources that interface with
the Data Warehouse.
Data Cleaning & Transformation
Data must be checked for validity and accuracy, and differences in syntax and semantics must be resolved.
Data Loading
Data must be loaded into the Data Warehouse after
carrying out appropriate summarisation and aggregation.
Often this will be done using parallelism (as it could take
weeks to serially load a terabyte of data!).
Data Refresh
Updates to base data (operational data) must periodically
be propagated to the Data Warehouse.
Data Storage
Appropriate storage structures must exist to allow the
Data Warehouse to support fast access for search and
analysis of differing data types (text, graphic, picture, …).
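The collection, cleaning and loading steps above can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the two in-memory sources, their field names, the region code mapping, and the use of an in-memory SQLite database as a stand-in for the warehouse store.

```python
import sqlite3

# Hypothetical extracts from two operational sources with different conventions.
SOURCE_A = [{"cust": "C1", "amount": "120.50", "region": "north"},
            {"cust": "C2", "amount": "80", "region": "South"}]
SOURCE_B = [{"customer_id": "C1", "amount_cents": 5000, "region": "N"},
            {"customer_id": "C3", "amount_cents": None, "region": "S"}]

REGION_CODES = {"north": "N", "south": "S", "n": "N", "s": "S"}


def extract():
    """Data collection: gather rows from each source, tagged with their origin."""
    yield from (("A", row) for row in SOURCE_A)
    yield from (("B", row) for row in SOURCE_B)


def transform(tagged_rows):
    """Cleaning and transformation: validate rows and resolve format differences."""
    for origin, row in tagged_rows:
        if origin == "A":
            cust, amount = row["cust"], float(row["amount"])
        else:
            if row["amount_cents"] is None:  # invalid record, skipped
                continue
            cust, amount = row["customer_id"], row["amount_cents"] / 100
        yield cust, REGION_CODES[row["region"].lower()], amount


def load(conn, rows):
    """Loading: store the detail rows, then build a summary by region."""
    conn.execute("CREATE TABLE sales (cust TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE TABLE sales_by_region AS "
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
totals = dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

A data refresh would rerun the same extract and transform steps on new operational rows and append the results; that step is left out to keep the sketch short.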
Structures of Data Warehouse
The data warehouse holds data at different levels of summarization and detail:
• Current data
• Older data
• Summarized data
• Meta data
Current Data
Such data:
• Reflects the most recent happenings, which are always of great interest.
• Is voluminous because it is stored at the lowest level of granularity.
• Is almost always stored on disk storage, which is fast to access, but expensive and complex to manage.
Older Data
• Older data is the data that is infrequently accessed and
stored at a level of detail consistent with current detailed
data.
• Summarized data falls into two categories, according to processing need and storage:
  - Lightly summarized data
  - Highly summarized data (compact and easily accessible)
Data Marts
• A feature of data warehouses that sets them apart from databases is the data mart, which holds a subset of the information in the data warehouse.
• The size of a data warehouse typically ranges from 1-10
GB.
• The data mart is typically populated using the data
warehouse, but occasionally the information will come
directly from the source; it is safer to populate the data
mart using data directly from the warehouse because it is
already cleaned and checked for consistency.
Applications of Data Warehouses
Some of the many ways in which a data warehouse gets used by businesses include:
• Create reports for analysis.
• Build information about important customers in order to strengthen customer relations.
• Maintain information about inventory and supply.
• Measure success of promotions.
• Predict the effects of price changes.
• Improve the effectiveness of the business by implementing new market strategies.
Considerations and Issues
Cost to Business
• A typical warehouse costs, overall, more than $1 million.
• It is a big risk to take on a project that has an initial failure rate as high as 50%.
• The high cost can be attributed to the amount of time and money it takes to collect, clean and integrate the data from different sources.
Ease of Use
• When designing the data warehouse, ease of use should be at the top of the list: “A data warehouse by itself does not create value; value comes from the use of data in the warehouse.”
• The most successful data warehouses are ones that provide
users with information they need without a lot of training.
How a Warehouse Works
Data warehouses are based largely on four main processes:
• Extracting and loading the data
• Cleaning and transforming the data
• Query management
• Backup and archiving of the data
Aggregations
• Aggregations are a way of dividing the information so
queries can be run on the aggregated part and not the
whole set of data.
• The warehouse manager is responsible for creating aggregations.
• Most aggregations can be created in a single complex query, which saves time.
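As a rough illustration of building an aggregation in a single query, the sketch below uses an in-memory SQLite database. The table names `sales` and `monthly_sales`, the columns, and the helper `monthly_qty` are hypothetical; later queries read the small aggregate instead of scanning the whole detail table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-05", "pen", 10), ("2024-01-20", "pen", 5),
    ("2024-01-11", "ink", 2), ("2024-02-02", "pen", 7),
])

# One query builds the aggregation: monthly quantity per product.
conn.execute("""
    CREATE TABLE monthly_sales AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(qty) AS qty
    FROM sales
    GROUP BY substr(sale_date, 1, 7), product
""")


def monthly_qty(month, product):
    """Answer a summary question from the aggregate, not from the detail rows."""
    row = conn.execute(
        "SELECT qty FROM monthly_sales WHERE month = ? AND product = ?",
        (month, product),
    ).fetchone()
    return row[0] if row else 0
```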
Access: Operational and External Data
• The access mechanism required to retrieve data from heterogeneous operational databases
• i.e. retrieved from DB2, SYBASE, ORACLE etc.
Transform
• Clean
• Reconcile
• Enhance
• Summarize
• Aggregate
Distribute
• Stage
• Join Multiple Sources
• Populate on demand
Store
• Relational Data
• Specialized caches
• Multiple Platforms & H/W
Data Warehouse Functions
• It depicts the flow of data from the original source to the user,
and includes management and implementation capabilities.
• Access mechanisms are required to retrieve data from heterogeneous operational databases.
• Data is then transformed and delivered to the “data warehouse
store” based on a selected model.
• The data transformation and movement processes are
executed whenever an update to the warehouse data is
desired.
• The information that describes the model and definition of the
source data elements is called “metadata”.
Data Flow Within the Data Warehouse
• There is a normal and predictable flow of data within the
data warehouse.
• Most data enters the data warehouse from the
operational environment.
• As data enters the data warehouse from the operational
environment, it is transformed.
• Upon entering the data warehouse, data goes into the
current level of detail. It resides there and is used there
until one of three events occurs:
  - It is purged
  - It is summarized, and/or
  - It is archived
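The flow out of the current level of detail can be sketched as an aging routine. The age thresholds (`keep_days`, `archive_days`) and the monthly roll-up are assumptions made for this example, not a policy stated in the slides: recent rows stay current, older rows are summarized and archived, and the oldest are purged.

```python
from collections import defaultdict
from datetime import date


def age_current_detail(detail, today, keep_days=90, archive_days=365):
    """Split detail rows (day, key, amount) into current, summarized and archived data."""
    current, archive = [], []
    summary = defaultdict(float)
    for day, key, amount in detail:
        age = (today - day).days
        if age <= keep_days:
            current.append((day, key, amount))              # stays at current detail
        elif age <= archive_days:
            summary[(day.year, day.month, key)] += amount   # summarized by month
            archive.append((day, key, amount))              # and archived
        # Rows older than archive_days are purged: they go nowhere.
    return current, dict(summary), archive


rows = [
    (date(2024, 12, 1), "C1", 10.0),
    (date(2024, 6, 10), "C1", 20.0),
    (date(2024, 6, 20), "C1", 10.0),
    (date(2022, 1, 1), "C1", 99.0),
]
current, summary, archive = age_current_detail(rows, today=date(2024, 12, 31))
```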
Usage
• The different levels of data within the data warehouse receive different levels of usage.
• The more summarized the data, the quicker and more efficient the response.
• Summarized data is also good from a security point of view.
Building Considerations
Building & Administration of DW requires the following:
Indexing
• Data at the higher levels of summarization can be indexed and constructed relatively easily compared with data at the lower levels.
• The data model and formal design activities do not apply to the levels of summarization in almost every case.
Partitioning :
Partitioning can be done at either of the following two levels
• DBMS level : DBMS is aware of the partitions and manages
them accordingly.
• Application level : Only the application programmer is aware of the partitions and is responsible for managing them.
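Application-level partitioning can be sketched as follows. The class `MonthlyPartitionedTable` and the choice of month as the partition key are hypothetical; the point is that the application code, not the DBMS, decides which partition a row belongs to and manages each partition itself.

```python
from collections import defaultdict
from datetime import date


class MonthlyPartitionedTable:
    """The application routes every row to a (year, month) partition."""

    def __init__(self):
        self._partitions = defaultdict(list)  # (year, month) -> list of rows

    def insert(self, day, row):
        self._partitions[(day.year, day.month)].append((day, row))

    def scan(self, year, month):
        """Read one partition without touching the others."""
        return list(self._partitions.get((year, month), []))

    def drop(self, year, month):
        """Remove a whole partition and return the number of rows dropped."""
        return len(self._partitions.pop((year, month), []))


table = MonthlyPartitionedTable()
table.insert(date(2024, 1, 15), {"order": 1})
table.insert(date(2024, 1, 20), {"order": 2})
table.insert(date(2024, 2, 3), {"order": 3})
```

With DBMS-level partitioning the same routing would be declared once in the database and handled by the DBMS.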
Other Considerations
• Public summary data is stored and managed in the data warehouse, even though its calculation is well outside the data warehouse scope.
• Here, the data is stored for ethical and legal reasons as required by the corporation.
• In summary, a data warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection of data in
support of management’s decision needs. Each of the
salient aspects of a data warehouse carries its own
implications.
Differences
• Classic SDLC
(requirement driven)
• Requirement gathering
• Analysis
• Design
• Programming
• Testing
• Integrating
• Implementation
• Data warehouse SDLC
(data driven)
• Implement warehouse
• Integrate data
• Test for bias
• Program against data
• Design DSS
• Analyze result
• Understand requirements
Data Mining
• Data mining is the process of extracting previously unknown but significant information from large databases and using it to make crucial business decisions.
• Data mining has major implications across the enterprise: for productivity, profitability, customer satisfaction, and overall competitiveness.
• Data mining is about discovering facts.
Data Mining Process
• There are two stages in the process of data mining to be used when searching for information:
• Initial searches should be carried out on summary information.
• The search should then focus on the detailed data in order to provide a clearer view.
• The concept of data mining provides organizations with the
ability to analyze and monitor trends and variations
within their business that provide information to aid the
decision-making process.
• The data mining process requires the following steps:
Data warehouse → Extracted data → Data mined → Extracted information → Select → Transform → Mine → Assimilate
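The two-stage search described above (summary first, detail second) can be sketched as below. The store data, the function names, and the rule that flags a store whose total falls below half the average are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical detail rows: (store, month, sales).
DETAIL = [
    ("store-1", "2024-01", 100), ("store-1", "2024-02", 110),
    ("store-2", "2024-01", 105), ("store-2", "2024-02", 95),
    ("store-3", "2024-01", 20), ("store-3", "2024-02", 25),
]


def summarize(rows):
    """Stage 1 input: total sales per store."""
    totals = defaultdict(int)
    for store, _month, sales in rows:
        totals[store] += sales
    return dict(totals)


def weak_stores(summary, factor=0.5):
    """Stage 1: search the summary for stores far below the average total."""
    threshold = factor * mean(list(summary.values()))
    return sorted(s for s, total in summary.items() if total < threshold)


def drill_down(rows, store):
    """Stage 2: look at the detailed data only for the stores of interest."""
    return [(month, sales) for s, month, sales in rows if s == store]


summary = summarize(DETAIL)
suspects = weak_stores(summary)
details = {s: drill_down(DETAIL, s) for s in suspects}
```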
Enabling Components
Middleware
• The emergence of middleware is the single most significant
development that enables data mining.
• Without this software connecting heterogeneous data
sources, the resulting information would not provide a
complete picture and could not reap the same reward.
Network
• The advances in networking are a key factor in providing
increased bandwidth across heterogeneous protocols and
therefore the necessary performance to provide train of
thought processing.
Cont...
Data Source
• Many of the DBMS vendors now provide parallel support to
enable rapid query against large volumes of data.
• This enables gigabytes of data to be queried in seconds
where previously it would have taken minutes.
Operating System
• Multiple processor architectures enable high-performance computers to provide the train-of-thought response times required for successful data mining analysis.
Related Technologies and Rules
• To make data mining feasible, the appropriate data has to be collected and stored in a data warehouse, and adequate system resources have to be available:
• Many statistical analysis systems such as SAS have been used to detect unusual patterns and explain patterns using linear statistical models.
• Ad-hoc querying and report generation are commonly used by many
businesses to provide input to their decision making.
• Multidimensional spreadsheets and databases are becoming
popular for data analyses that require summary views of the data along
multiple dimensions.
Cont....
• Neural networks have been applied successfully in a few
applications that involve classification.
• Data mining, when complemented by the techniques described above, adds significant value beyond the use of the traditional techniques.
Data Mining Platform
• Data mining technologies are characterized by:
  - Intensive computations on large volumes of data.
  - Significant processing power and parallelism, which are key to enabling significant data mining.
• The system can be upgraded to provide the necessary analysis in a timely and cost-effective fashion.
• A balanced system architecture that supports I/O, computation, and scaling in a cost-effective fashion is desirable.
• Hence, the highest capacity and performance systems are
of interest in this area.
Data Mining Tools
• Data mining has been around for some years but has only
recently come of age because of the following:
• Variety of tools and technological trends.
• Improved hardware cost/performance ratio.
• Improved performance in parallel technology.
• More flexible and intuitive query software.
• Greatly advanced middleware connectivity.
• Data mining tools provide access to the data warehouse
(which is the logical view of the organization’s data ) and
enable query, analysis, and presentation of data.
Example
To visualize where data mining techniques can be used most effectively, consider an example from the retail trade:
Fashions change frequently in the retail trade, and timely analysis of information can be used to predict the latest trends on a store-by-store basis.
This analysis can be used to reduce stock levels, reduce capital outlay, and ensure stock is placed where it provides competitive advantage.
An increase of 1% in profit margin can make the difference between success and failure in the highly competitive retail trade.
Data Mining Tool Characteristics
• Many tools are an integrated part of a total data warehouse solution.
• Mining tools require creative analysis to detect trends, although some have an element of intelligence to detect patterns.
• Providing coherent information from unstructured data requires sophisticated tools.
• Getting the desired results from the data requires manipulation and synchronization into a format usable by the tool.
Operational Warehouse
Two fundamental types of data exist within any enterprise:
• Operational data is the data that directly supports the business functions and for which the majority of applications have been written.
• Informational data is the data that supports the decision-making process of an organization.
Cont....
The data warehouse overcomes the problems of the operational environment for decision-support analysis, such as the following:
• Lack of Integration: Operational systems are built on diverse types of databases and run on heterogeneous mainframes, so they are difficult to integrate for decision support.
• Lack of History: The operational environment provides no historical perspective, due to space limitations and the need to maintain performance levels.
• Lack of Credibility: It is difficult to assess the accuracy or timeliness of the data.
• Performance Considerations: Data is stored in a format designed to optimize transaction performance rather than to support business analysis.
• Difficulty in Gaining Enterprise-Wide Perspective: Cross-functional analysis of information contained in separate databases is difficult.
The data warehouse addresses these problems by providing the architecture to model, map, filter, integrate, condense and transform operational data, held in a separate database, into meaningful information that can be accessed and analyzed.
Types of Data Warehouses
• A majority of the enterprises prefer to build and implement a
single centralized data warehouse environment for the
following reasons :
• A single repository makes sense if the volume of data
can be managed easily.
• The data is integrated across the enterprise and only
that view is used at the headquarters.
• However, it may be impractical to integrate and access the
data at a single site if it is dispersed over multiple locations.
Cont....
Hence, the type of DW depends on a number of business factors, such as the following:
• Business objective
• Location of the Current Data
• Need to move the data
Cont....
• Business objectives :
• The enterprise should know why it needs a data warehouse and what its priorities are, such as DW size, location, frequency of use and maintenance.
• A properly scoped and executed DW can prove extremely cost effective.
• Location of current data :
• It is extremely important to know where the data is and what its characteristics and attributes are in order to select the proper tools.
Cont...
Need to move data :
The data movement can only be decided by considering a
combination of
• Quality of existing data
• Size of usable data
• Data design
• Performance impact of a direct query
• Performance impact on the current production systems
• Availability and ease of use of the tool
Different Configuration of Data
Different configurations of data to satisfy DW implementation are:
• Real time data (operational data):
Operational data used by operational applications. It contains all individual detailed data records, where each update overlays the previous entry.
• Reconciled data :
Contains detailed records from the real time level which have been cleaned, adjusted or enhanced so that the data can be used for informational applications.
• Derived Data:
Data that is summarized or averaged from multiple sources of the real time data or reconciled data for improved processing capability.
• Changed Data :
It contains a record of all the changes to the selected real time data.
• Meta Data :
The information that describes the model and definition of the source data elements.
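The difference between these configurations can be shown with a toy example. The account identifiers, balances and function names below are hypothetical, and the cleaning rule (trim and upper-case the identifier, round the balance) is only an assumed stand-in for real reconciliation logic.

```python
from statistics import mean

real_time = {}  # real time data: account -> latest balance only
changed = []    # changed data: record of every change as (account, old, new)


def apply_update(account, balance):
    """Each update overlays the previous entry; the change itself is logged."""
    changed.append((account, real_time.get(account), balance))
    real_time[account] = balance


def reconcile(snapshot):
    """Reconciled data: detail records cleaned for informational use."""
    return {acct.strip().upper(): round(bal, 2) for acct, bal in snapshot.items()}


def derive(reconciled_data):
    """Derived data: a summary computed from the reconciled records."""
    balances = list(reconciled_data.values())
    return {"accounts": len(balances), "average_balance": mean(balances)}


apply_update(" acc-1", 100.0)
apply_update(" acc-1", 150.0)  # overlays the earlier balance
apply_update("ACC-2 ", 50.0)

reconciled = reconcile(real_time)
derived = derive(reconciled)
```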
Why Do Enterprises Really Need Data Warehouses?
• Operational computer systems
  - Information to run day-to-day business
  - Event driven
  - Not directly suitable for review from different points of view
• Executives
  - Different kinds of information for strategic decisions
  - e.g. which product line to expand, which market should be strengthened
  - Trend over time
  - Review
    – Sales quantities by product, salesperson, region etc.
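Reviewing sales quantities along different dimensions can be sketched with a small roll-up function. The sample rows, the names in them, and the helper `quantities_by` are hypothetical; the same detail rows are simply regrouped by whichever dimensions the reviewer chooses.

```python
from collections import defaultdict

# Hypothetical sales detail rows.
SALES = [
    {"product": "pen", "salesperson": "Asha", "region": "North", "qty": 10},
    {"product": "pen", "salesperson": "Ravi", "region": "South", "qty": 5},
    {"product": "ink", "salesperson": "Asha", "region": "North", "qty": 2},
    {"product": "pen", "salesperson": "Asha", "region": "South", "qty": 3},
]


def quantities_by(rows, *dimensions):
    """Total quantity for every combination of the requested dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row["qty"]
    return dict(totals)


by_product = quantities_by(SALES, "product")
by_person_region = quantities_by(SALES, "salesperson", "region")
```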
Organizations’ Use of Data Warehousing
• Retail
  - Customer loyalty
  - Market planning
• Financial
  - Risk management
  - Fraud detection
• Airlines
  - Route profitability
  - Yield management
• Manufacturing
  - Cost reduction
  - Logistics management
• Utilities
  - Asset management
  - Resource management
• Government
  - Manpower planning
  - Cost control
Escalating Need for Strategic Information
• Failures of Past decision-support systems
• Operational versus decision-support systems
• Data warehousing – the only viable solution
Need for Strategic Information
• After the 1990s, business grew more complex.
• Corporations spread globally.
• Competition increased.
  - Operational systems did provide information to run day-to-day operations, but managers and executives needed different kinds of information that could be used to make strategic decisions.
• DW is a new paradigm specifically intended to provide vital strategic information.
Need for Strategic Information
• Why do enterprises really need data warehouses?
• Escalating need for strategic information.
The executives and managers who are responsible for keeping the enterprise competitive need information to make proper decisions.
They need information to formulate business strategies, establish goals, set objectives and monitor results.
Escalating Need for Strategic Information
• Who needs strategic information in an enterprise?
  - Executives and managers
  - To make proper decisions
    – For keeping the enterprise competitive
    – To formulate and execute business strategies
    – Establish goals
    – Set objectives
    – Monitor results
• What exactly do we mean by strategic information?
Some Business Objectives
• Retain the present customer base
• Increase the customer base by 15% over the next 5 years.
• Bring a new product to market in 2 years.
• Improve product quality levels in the top 5 product groups.
• Gain market share by 10% in the next 3 years.
• Increase sales by 10% in the East division.
Cont...
• To set business objectives, managers need information for the following purposes:
  - To gain in-depth knowledge of the company’s operations.
  - To monitor how the business factors change over time.
  - To compare the company’s performance relative to the competition and to industry benchmarks.
Strategic Information
• Executives and managers need to focus their attention on:
  - customers’ needs and preferences,
  - emerging technologies,
  - sales and marketing results,
  - quality levels of products and services.
• This type of information is needed to make decisions in the formulation and execution of business strategies and objectives:
  - All this essential information, taken as one group, is called strategic information.
  - Strategic information is not for running the day-to-day operations of the business.
  - It is important for the continued growth and survival of the corporation.
© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63, by Dr. Deepali Kamthania
U1.59
Characteristics of Strategic Information
Integrated
• Must have a single, enterprise-wide view.
Data Integrity
• Information must be accurate and must conform to business rules.
Accessible
• Easily accessible, with intuitive access paths, and responsive for analysis.
Credible
• Every business factor must have one and only one value.
Timely
• Information must be available within the stipulated time frame.
Escalating Need For Strategic Information
• Information Crisis
• Technology trends
• Opportunities and risks
• Failure of past decision support systems
Information Crisis
• Consider the IT department of any organization, big or small:
  - the various computer applications in the company;
  - the databases and the quantities of data that support the operations of the company.
• How many years’ worth of customer data is saved and available?
• How many years’ worth of financial data is kept in storage?
  - 10 years or 15 years?
• Where is all this data?
  - On one platform?
  - In legacy systems?
  - In client/server applications?
Cont…
• Facts faced by organizations
  - Organizations have lots of data.
  - IT systems are NOT effective at turning all the data into useful strategic information.
• If organizations have so much data, why can’t executives and managers use it for making strategic decisions?
  - Information crisis
    - The available data is not readily accessible.
    - It sits on old technology and on different platforms.
  - Proper decision making on overall corporate strategies and objectives requires:
    - information integrated from all systems;
    - data in a format suitable for analyzing trends.
Technology Trends
[Figure: Growth of Information Technology, 1950 to 2000.
 Computing technology: mainframe, mini, PC/networking, client/server.
 Human/machine interface: punch card, video display, GUI, voice.
 Processing options: batch, online, networked.]
Opportunities and Risks
• Example of the opportunities made available to companies through the use of strategic information:
• A community-based pharmacy that competes on a national scale with more than 800 franchised pharmacies coast to coast gains:
  - an in-depth understanding of what customers buy,
  - reduced inventory levels,
  - improved effectiveness of promotions and marketing campaigns,
  - improved profitability for the company.
Cont...
• Consider the cases where risks and threats of failure existed before strategic information was made available for analysis and decision making.
Example:
• A world-leading supplier of systems and components to automobile and light truck equipment manufacturers, with nearly 100 plants, faced:
  - an inability to benchmark quality metrics,
  - time-consuming manual collection of data,
  - reports needed to support decision making that took weeks,
  - difficulty in getting company-wide integrated information.
Failures of Past Decision Support System
• A marketing department is concerned about the performance of the West Coast region.
  - The marketing Vice President wants reports from the IT department to analyze performance over the past two years, product by product, compared to monthly targets.
  - The CEO wants the reports delivered as soon as possible, so the manager immediately passes the request to a subordinate to produce the marketing report.
  - No such report is available:
    - the data must be gathered from multiple applications (on different platforms), starting from scratch;
    - the resulting reports lack a clear agenda, which leads to inconsistencies among the data obtained from different applications.
  - It is also possible that the person from the IT department creates a report from a single application for his or her convenience; such information may not be helpful in strategic decision making.
  - The scenario shows that when information is scattered across different places and in different forms, it is difficult to use it for strategic decisions.
Operational Vs Decision Support Systems
• The fundamental reason for the inability to provide strategic information is:
  - trying to provide strategic information from the operational systems.
• These operational systems, such as order processing, inventory control, claims processing, outpatient billing, and so on, are not designed or intended to provide strategic information.
Cont...
• Operational systems: making the wheels of business turn
  - Get data in:
    - Take an order
    - Process a claim
    - Make a shipment
    - Generate an invoice
    - Receive cash
    - Reserve an airline seat
  - Support the basic business processes of the company
  - Run the core, day-to-day business processes
• Decision support systems (DSS): watching the wheels of business turn
  - Get information out:
    - Show the top-selling products
    - Show the problem regions
    - Show the highest margins
    - Alert whenever a district sells below target
  - No immediate payout
  - DSS are developed to get strategic information out of the database, whereas OLTP systems are designed to put the data into the database.
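The "get information out" side of this comparison can be illustrated with a small sketch of the last item, an alert whenever a district sells below target. The district names, sales figures, and targets are invented for illustration, and `districts_below_target` is a hypothetical helper, not part of any product.

```python
# Hypothetical monthly sales and targets per district (illustrative numbers only).
sales = {"North": 120_000, "South": 95_000, "East": 80_000, "West": 101_000}
targets = {"North": 110_000, "South": 100_000, "East": 90_000, "West": 100_000}


def districts_below_target(sales, targets):
    """Return (district, shortfall) pairs for districts selling below target."""
    alerts = []
    for district, target in targets.items():
        actual = sales.get(district, 0)
        if actual < target:
            alerts.append((district, target - actual))
    # Largest shortfall first, so the worst problem region is listed on top.
    return sorted(alerts, key=lambda pair: pair[1], reverse=True)


alerts = districts_below_target(sales, targets)
for district, shortfall in alerts:
    print(f"ALERT: {district} is {shortfall} below target")
```

An operational system would only record each sale; a question like this one reads across many recorded sales at once, which is the kind of work a decision support system is meant for.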
Differences
Primitive data / Operational data | Derived data / DSS data
Application oriented | Subject oriented
Detailed | Summarized, otherwise refined
Accurate as of the moment of processing | Represents values over time, snapshots
Serves the clerical community | Serves the managerial community
Can be updated | Is not updated
Run repetitively | Run heuristically
Compatible with SDLC | Completely different life cycle
Accessed a unit at a time | Accessed a set at a time
Transaction driven | Analysis driven
Control of updates a major concern in terms of ownership | Control of updates no issue
- | Managed by subsets
Small amount of data used in a process | Large amount of data used for managerial support
Supports day-to-day operations | Supports managerial needs
High probability of access | Low, modest probability of access
History of Decision Support Systems
# Ad-Hoc Reports
• This was the earliest stage.
• Users would send requests to the IT department for special reports.
• IT would write special programs, typically one for each request, and produce the ad hoc reports.
History of Decision Support Systems
# Special Extract Programs
• This stage was an attempt by IT to anticipate the reports that would be requested from time to time.
• IT would write a suite of programs and run them periodically to extract the data from various applications.
• IT would create and keep the extract files to fulfill any requests for special reports.
Cont...
# Small Applications
• In this stage, IT formalized the extract process.
• IT would create simple applications based on the extracted files.
• Users could specify the parameters for each special report.
• The report-printing programs would print the reports based on the user-specified parameters.
Cont...
# Information Center
• In the early 1970s, major corporations created information centers.
• The information center was a place where users could go to request ad hoc reports or view special reports on screen.
• These were predetermined reports or screens.
• IT personnel were there to help the users obtain the desired information.
Cont...
# Decision Support Systems
• In this stage, companies began to build more sophisticated systems to provide strategic information.
• Systems were menu-driven and provided online information.
• Systems were supported by extracted files.
• Users could specify the parameters for each special report.
• Reports could be printed.
Cont...
# Executive Information Systems
• This was the first attempt to bring strategic information to the executive desktop.
• Systems were designed to display key information every day.
• Reports were straightforward.
• Only preprogrammed screens and reports were available.
• It was not possible to see analysis by region, by product, or by any other dimension unless such breakdowns were already programmed.
• These limitations caused frustration, and executive information systems did not last long in many companies.
Failure Reasons
• What is the basic reason for the failure of all previous attempts by IT to provide strategic information?
• The fundamental reason is that operational systems were used to provide strategic information.
• Operational systems such as order processing, inventory control, and claims processing are not designed to provide strategic information.
• Only specially designed decision support systems can provide strategic information.
Typical OLAP Operations
Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  - from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
Slice and dice:
  - project and select
Pivot (rotate):
  - reorient the cube for visualization, e.g. from 3D to a series of 2D planes
Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
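The OLAP operations listed above can be sketched on a tiny in-memory "cube". This is only an illustration in plain Python: the fact rows, the dimension names, and the `aggregate` helper are assumptions made for the example, not the interface of any OLAP product.

```python
from collections import defaultdict

# Tiny hypothetical sales cube: one row per (year, quarter, region, product).
facts = [
    {"year": 2023, "quarter": "Q1", "region": "East", "product": "A", "sales": 10},
    {"year": 2023, "quarter": "Q1", "region": "West", "product": "A", "sales": 20},
    {"year": 2023, "quarter": "Q2", "region": "East", "product": "B", "sales": 30},
    {"year": 2023, "quarter": "Q2", "region": "West", "product": "B", "sales": 40},
    {"year": 2024, "quarter": "Q1", "region": "East", "product": "A", "sales": 50},
]


def aggregate(rows, dims):
    """Sum sales grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


# Roll-up: climb the time hierarchy from quarter to year.
by_quarter = aggregate(facts, ["year", "quarter"])
by_year = aggregate(facts, ["year"])

# Drill-down is the reverse: go from by_year back to by_quarter,
# or introduce a new dimension such as region.
by_year_region = aggregate(facts, ["year", "region"])

# Slice: select a single value on one dimension (region = East).
east_slice = [r for r in facts if r["region"] == "East"]

# Dice: select on two or more dimensions at once.
diced = [r for r in facts if r["region"] == "East" and r["year"] == 2023]

# Pivot: reorient the view so regions become rows and years become columns.
pivot = defaultdict(dict)
for (year, region), total in by_year_region.items():
    pivot[region][year] = total
```

Each operation is a different grouping or filter over the same facts; real OLAP servers precompute or index these aggregates rather than rescanning the rows each time.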
Decision Support Systems
• A decision support system (DSS) is a set of expandable, interactive IT techniques and tools designed for processing and analyzing data and for supporting managers in decision making.
[Figure: levels of information, plotted against value and quantity: primary data source, selected information, reports, strategic information.]
Classification of Decision Support Systems
System | Description
Passive DSS | Supports the decision-making process but does not offer explicit suggestions on decisions or solutions
Active DSS | Offers suggestions and solutions
Collaborative DSS | Operates interactively and allows decision makers to modify, integrate, or refine suggestions given by the system
Model-driven DSS | Enhances management of statistical, financial, optimization, and simulation models
Communication-driven DSS | Supports a group of people working on a common task
Data-driven DSS | Enhances the access and management of time series of corporate and external data
Document-driven DSS | Manages and processes unstructured data in many formats
Knowledge-driven DSS | Provides problem-solving features in the form of facts, rules, and procedures
Data Warehousing: The Only Viable Solution
• Different types of DSS are needed to provide strategic information:
  - for analysis,
  - for discerning trends,
  - for monitoring performance.
• Escalating need for strategic information:
  - data warehousing is the only viable solution for providing strategic information.
• Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses that help in the decision-making process and in improving information resources.
New System Environment
• Desirable features and processing requirements of the new type of system environment:
  - Database designed for analytical tasks
  - Data from multiple applications
  - Easy to use and conducive to long interactive sessions by users
  - Content updated periodically and stable
  - Content to include current and historical data
  - Ability for users to run queries and get results online
  - Ability for users to initiate reports
Processing Requirements in the New Environment
• Processing in the new environment for strategic information is analytical.
• There are four levels of analytical processing requirements:
  - Running simple queries and reports against current and historical data.
  - Ability to perform “what if” analysis in many different ways.
  - Ability to query, step back, analyze, and then continue the process to any desired length.
  - Spotting historical trends and applying them to future results.
Business Intelligence at the Data Ware House
[Figure: operational systems (basic business processes); extraction, cleansing, aggregation; data transformation; key measurements and business dimensions.]
Definition
• A data warehouse is an information environment that:
  - provides an integrated and total view of the enterprise;
  - makes the enterprise’s current and historical information easily available for decision making;
  - makes decision-support transactions possible without hindering operational systems;
  - renders the organization’s information consistent;
  - presents a flexible and interactive source of strategic information.
Conclusion
• Operational systems are not designed for strategic information.
• A data warehouse is a computing environment, not a product, that provides strategic information:
  - Data analysis and decision support
  - Flexible and interactive
  - User driven
Let’s Discuss
1. How can readily available strategic information improve quality and help realize opportunities for:
  - an insurance company?
  - an airline company?
2. You are a Senior Analyst in the IT department of a company manufacturing automobile parts. The marketing VP complains about poor IT response in providing strategic information.
  - Draft a proposal that explains the problems and their reasons.
  - Explain why a data warehouse is a viable solution.
Data Warehouse: Building Block
• Defining features
• Data warehouses and data marts
• Overview of the components
• Metadata in the data warehouse
Defining Features
• Key defining features of the data warehouse, based on these definitions:
• What is the nature of the data in the data warehouse?
• How is this data different from the data in any operational system?
• Why does it have to be different?
• How is the data content in the data warehouse used?
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization’s operational database.
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing:
  - The process of constructing and using data warehouses.
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Subject-Oriented
• Operational systems
  - Data is stored by individual applications.
  - Example: the data sets for an order processing application provide the data for all the functions for entering orders, checking stock, verifying the customer’s credit, and assigning the order for shipment.
• Subject-oriented data
  - In the data warehouse, data is stored by subjects.
  - Business subjects differ from organization to organization.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources:
  - relational databases, flat files, on-line transaction records.
• Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
    - E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.
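As a rough sketch of the hotel price example, the snippet below converts records from two sources with different field names, currencies, and tax conventions into one warehouse convention. The source layouts, the exchange rate, the tax rate, and the `to_warehouse_row` helper are all assumptions made for illustration.

```python
# Two hypothetical source systems describing hotel prices in different ways.
source_a = [{"hotel": "Plaza", "price": 100.0, "cur": "USD", "tax_incl": True}]
source_b = [{"hotel_name": "plaza ", "rate": 90.0, "currency": "EUR", "tax_included": "N"}]

# Assumed conversion settings for illustration only (not real market rates).
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.1}
TAX_RATE = 0.10


def to_warehouse_row(hotel, price, currency, tax_included):
    """Convert one source record to the warehouse convention:
    trimmed title-case name, price in USD, tax always included."""
    usd = price * USD_PER_UNIT[currency]
    if not tax_included:
        usd *= 1 + TAX_RATE
    return {"hotel": hotel.strip().title(), "price_usd_incl_tax": round(usd, 2)}


warehouse = [
    to_warehouse_row(r["hotel"], r["price"], r["cur"], r["tax_incl"]) for r in source_a
] + [
    to_warehouse_row(r["hotel_name"], r["rate"], r["currency"], r["tax_included"] == "Y")
    for r in source_b
]
```

After conversion both rows use the same name, currency, and tax convention, so the two prices can be compared directly.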
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current-value data.
  - Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse
  - contains an element of time, explicitly or implicitly;
  - the key of operational data, by contrast, may or may not contain a time element.
• The time-variant nature of the data in a data warehouse:
  - allows for analysis of the past;
  - relates information to the present;
  - enables forecasts for the future.
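A minimal sketch of a key that carries a time element: each load adds a dated snapshot instead of overwriting the current value, so past states stay available for analysis. The account identifier, dates, balances, and function names are hypothetical.

```python
from datetime import date

# Hypothetical warehouse table: the key includes a time element (snapshot date),
# so each load adds a new row instead of overwriting the old value.
balance_history = {}


def load_snapshot(account_id, snapshot_date, balance):
    """Store one dated snapshot under the composite key (account, date)."""
    balance_history[(account_id, snapshot_date)] = balance


def balance_as_of(account_id, as_of):
    """Return the most recent balance on or before the given date, or None."""
    dated = [
        (snap, bal)
        for (acct, snap), bal in balance_history.items()
        if acct == account_id and snap <= as_of
    ]
    return max(dated)[1] if dated else None


load_snapshot("ACC-1", date(2023, 1, 31), 500)
load_snapshot("ACC-1", date(2023, 2, 28), 650)
load_snapshot("ACC-1", date(2023, 3, 31), 600)
```

An operational system would typically hold only the latest balance; here a question such as `balance_as_of("ACC-1", date(2023, 3, 15))` can still be answered from history.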
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse—Non-Volatile
• Operational databases: volatile data
  - Data in an operational system is added or deleted as each transaction happens.
  - Data updates are commonplace in an operational database.
• Data warehouse: nonvolatile data
  - No update once the data is captured in the data warehouse.
  - Individual transactions are not run to change the data there.
Data Granularity
• Operational system
  - Data is kept at the lowest level of detail.
    - A lot of data
    - Daily details
• Data warehouse
  - Data granularity in a data warehouse refers to the level of detail.
  - Data is summarized at different levels.
    - Monthly/quarterly summaries
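The difference in grain can be sketched by summarizing the same daily detail at monthly and quarterly levels. The dates and amounts are invented, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales at the lowest level of detail (operational grain).
daily_sales = [
    (date(2024, 1, 5), 100),
    (date(2024, 1, 20), 150),
    (date(2024, 2, 10), 200),
    (date(2024, 4, 1), 50),
]


def summarize(rows, level):
    """Summarize daily rows at a coarser grain: 'month' or 'quarter'."""
    totals = defaultdict(int)
    for day, amount in rows:
        if level == "month":
            key = (day.year, day.month)
        elif level == "quarter":
            key = (day.year, (day.month - 1) // 3 + 1)
        else:
            raise ValueError(f"unknown level: {level}")
        totals[key] += amount
    return dict(totals)


monthly = summarize(daily_sales, "month")
quarterly = summarize(daily_sales, "quarter")
```

The coarser the grain, the fewer rows are stored and scanned, at the cost of losing the ability to answer questions about individual days.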
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration
  - Build wrappers/mediators on top of heterogeneous databases.
  - Query-driven approach:
    - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
    - Involves complex information filtering and competes for resources.
Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write; index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
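The contrast in unit of work and access pattern can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are invented for illustration, and a single in-memory database is used only for brevity; in practice the two workloads would run against separate stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2023, 100.0),
        (2, "West", 2023, 250.0),
        (3, "East", 2024, 300.0),
        (4, "West", 2024, 50.0),
    ],
)

# OLTP-style work: a short read/write transaction that touches one record by key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.0, 1))
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (1,)
).fetchone()

# OLAP-style work: a read-only query that scans many records and summarizes them.
totals = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders GROUP BY region, year ORDER BY region, year"
).fetchall()
```

The first access uses the primary key and touches a single row; the second reads every row to build a consolidated view by region and year.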
Why Separate Data Warehouse?
High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery).
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation).
Different functions and different data:
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
  - Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
Data Warehouses and Data Marts
Data Warehouse | Data Mart
Enterprise-wide | Departmental
Union of all data marts | A single business process
Data received from staging area | Facts and dimensions
Structure for corporate view of data | Technology optimal for data access and analysis
Organized on E-R model | Structure to suit the departmental view of data
Data Warehouses and Data Marts Cont...
Data warehouse
• A collection of data that supports the decision-making process.
• It has the following features: it is subject-oriented, integrated and consistent, shows evolution over time, and is not volatile.
Data marts
• A data mart is a subset of the data stored in a primary data warehouse.
• It includes a set of information pieces relevant to a specific business area, corporate department, or category of users.
Data Warehousing and OLAP Technology for
Data Mining
What is a data warehouse?
A multi-dimensional data model
Data warehouse building blocks
Data Warehouse Components
Overview of Components
[Figure: data warehouse components: source data, data staging, data storage, metadata, management and control, information delivery.]
Data Ware house Components Cont...
1. Source Data Component: source data is grouped into four broad categories.
Production Data:
• This category of data comes from the various operational systems of the enterprise.
Internal Data:
• In every organization, users keep their “private” spreadsheets, documents, customer profiles, and sometimes even departmental databases.
Cont...
Archived Data:
• Operational systems periodically take the old data and store it in archived files. The data in these archived files is referred to as archived data.
External Data:
• This category includes data from external sources.
• For example: market share data of competitors.
Cont...
2. Data Staging Component:
• Data is extracted from the various operational systems and external sources.
• The data is prepared for storing in the data warehouse.
• The data extracted from several disparate sources needs to be:
  - changed;
  - converted;
  - made ready to be stored in a format suitable for querying and analysis.
Cont...
• Three major functions need to be performed to get the data ready: extraction, transformation, and loading.
• Data Extraction:
  - Extract the data for the data warehouse, using appropriate techniques, from the large amount of data received from the operational systems.
• Data Transformation:
  - Involves many forms of combining pieces of data from the different sources.
  - Merging and sorting take place on a large scale in the staging area.
• When the data transformation function ends, the collection of integrated data is cleaned, standardized, and summarized, and is ready to be loaded into the data warehouse.
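The staging functions can be sketched as three small steps. Everything below is illustrative: the two source systems, their field names and date formats, and the `extract`, `transform`, and `load` helpers are assumptions, and a plain dictionary stands in for the warehouse.

```python
# Hypothetical records as they might arrive from two operational systems.
orders_system = [
    {"cust": "C001", "amt": "100.50", "date": "2024-01-05"},
    {"cust": "c002", "amt": "80.00", "date": "2024-01-06"},
]
billing_system = [
    {"customer_id": "C001", "amount": 20.0, "billed_on": "05/01/2024"},  # DD/MM/YYYY
]


def extract():
    """Extraction: pull the raw records out of each source."""
    return orders_system, billing_system


def transform(orders, bills):
    """Transformation: standardize codes and formats, merge, sort, summarize."""
    rows = []
    for o in orders:
        rows.append((o["cust"].upper(), o["date"], float(o["amt"])))
    for b in bills:
        day, month, year = b["billed_on"].split("/")
        rows.append((b["customer_id"].upper(), f"{year}-{month}-{day}", b["amount"]))
    rows.sort()
    summary = {}
    for cust, _, amount in rows:
        summary[cust] = round(summary.get(cust, 0.0) + amount, 2)
    return summary


def load(summary, warehouse):
    """Loading: store the prepared data in the warehouse structure."""
    warehouse.update(summary)
    return warehouse


warehouse = load(transform(*extract()), {})
```

Most of the work sits in the transformation step, where customer codes, amounts, and dates from the two sources are brought to one convention before they are merged.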
Cont...
• Data Loading: In this phase, the initial load moves large volumes of data and uses up a substantial amount of time.
• As the data warehouse starts functioning:
  - continue to extract the changes to the source data;
  - transform the data revisions;
  - feed the incremental data revisions on an ongoing basis.
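One simple way to picture the difference between the initial load and later incremental loads is a "last loaded" watermark. This is a sketch under assumptions: the source rows carry a hypothetical `changed` date, and changes stamped with the same date as the watermark would be missed, which a real process would need to handle.

```python
# Hypothetical source rows, each stamped with the day it last changed
# (ISO dates compare correctly as strings).
source_rows = [
    {"id": 1, "value": "a", "changed": "2024-01-01"},
    {"id": 2, "value": "b", "changed": "2024-01-01"},
]
warehouse = []    # append-only store of loaded revisions
last_loaded = ""  # watermark: latest change date already loaded


def refresh():
    """Load only the revisions made since the previous load."""
    global last_loaded
    changes = [r for r in source_rows if r["changed"] > last_loaded]
    warehouse.extend(dict(r) for r in changes)
    if changes:
        last_loaded = max(r["changed"] for r in changes)
    return len(changes)


initial = refresh()  # initial load moves everything

# Later, one source row changes and a new one arrives.
source_rows[1] = {"id": 2, "value": "b2", "changed": "2024-01-02"}
source_rows.append({"id": 3, "value": "c", "changed": "2024-01-02"})
incremental = refresh()  # only the two revisions are moved
```

The warehouse keeps the earlier revision of row 2 alongside the new one, in line with its nonvolatile, time-variant nature.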
Data Movement in Data Warehouse
[Figure: data moves from the data sources to the data warehouse through a base data load and daily, monthly, quarterly, and yearly refreshes.]
• Time consuming
• Initial load moves large volumes of data
• Business conditions determine the refresh cycle
Cont.
3) Data Storage Component:
• The data storage for the data warehouse is a separate
repository.
• The operational systems of the enterprise support the
day-to-day operations.
• The data repositories of the operational systems
typically contain only current data, while the data
repository for a data warehouse needs to keep large
volumes of historical data for analysis.
• So the data in the data warehouse needs to be kept in
structures suitable for analysis, not for quick retrieval
of individual pieces of information.
Cont...
4) Information Delivery Component:
• Identify the users who need information from the data warehouse.
• Provides information to the wide community of data warehouse
users.
• Novice user
 No training
 Needs prefabricated reports and preset queries
• Casual user
 Needs information once in a while
 Needs prepackaged information
 Navigates through the data warehouse, creates custom reports and
ad hoc queries
• The information delivery component includes a variety of information
delivery mechanisms, for example online queries and reports.
Information Delivery Component
[Figure: the information delivery component sits between the data warehouse / data marts and the users. Delivery channels: online, intranet, Internet, e-mail. Outputs: ad hoc reports, complex queries, multidimensional (MD) analysis, statistical analysis, Executive Information System (EIS) feed, data mining. Users: novice, casual user, business analyst, senior manager, high-level managers.]
Cont...
5) Metadata Component:
• Metadata in a data warehouse is similar to the data
dictionary or the data catalog in a database
management system.
• The data dictionary keeps
 information about the logical data structures,
 information about the files and addresses,
 information about the indexes.
Cont...
6) Management and Control Component:
• This component of the data warehouse architecture sits on top of all
the other components.
• It coordinates the services and activities within the data warehouse.
• It moderates the information delivery to the users.
• It works with the database management systems and enables data to be
properly stored in the repositories.
• It monitors the movement of data into the staging area and from there
into the data warehouse storage.
• It interacts with the metadata component to perform the management
and control functions.
• Metadata is the source of information for the management module.
Metadata in the Data Warehouse
• The metadata component serves as a directory of the contents of
the data warehouse.
• Metadata in a data warehouse falls into three major categories.
1) Operational Metadata:
• Operational metadata gets its data from the operational data
sources.
• These sources contain different data structures for storing
data from the various operational systems.
Metadata in the Data Warehouse Cont...
2) Extraction and Transformation Metadata:
 Contains data about the extraction of data from the source
systems, such as extraction frequencies and extraction methods.
 Also contains information about all the data
transformations that take place in the data staging area.
3) End-User Metadata:
 The end-user metadata is the navigational map of the data
warehouse.
 It enables the end users to find information in the data
warehouse.
Data Warehouse Architecture
• Architecture properties essential for a data warehouse system
(Kelly, 1997):
• Separation
 Analytical and transactional processing should be kept apart
• Scalability
 Hardware and software architectures should be easily upgradeable
as the volume of data increases
• Extensibility
 The architecture should be able to host new applications and
technologies without redesigning the whole system
• Security
 Monitoring access is essential because of the strategic data stored in
the data warehouse
• Administrability
 Data warehouse management should not be overly difficult
Classification of Data Warehouse Architecture
Two different classifications are commonly adopted for data
warehouse architectures
• Structure oriented
 Single layer architecture
 Two layer architecture
 Three layer architecture
• Dependent on how the different layers are employed to create
enterprise-oriented or department-oriented views of the data warehouse
 Independent data marts
Single Layer Architecture
[Figure: source layer (operational data) -> middleware -> data warehouse layer -> analysis (reporting tools, OLAP tools).]
• Only one layer is available
 Source layer
• Goal
 Reduce the amount of data by removing redundancies
• Not frequently used in practice
• In this architecture the data warehouse is virtual
 The data warehouse is implemented as a multidimensional view of
operational data created by specific middleware, or internal
processing layer (Devlin, 1997)
Single Layer Architecture
• The weakness of this architecture lies in its failure to meet the
requirement for separation of analytical and transactional
processing.
• Analytical queries are submitted to operational data after the
middleware interprets them. In this way the queries affect the regular
transactional workload.
• Although this architecture can meet the requirement for integration
and correctness of data, it cannot log more data than the sources do.
• For these reasons, a virtual approach to data warehouses can be
successful only if analysis needs are particularly restricted and
the data volume to analyze is huge.
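A minimal sketch of the virtual approach, assuming a hypothetical operational table named orders and using an SQLite view to stand in for the middleware. It illustrates that no separate copy of the data exists, so every analytical query runs against the operational data.

```python
import sqlite3

# Hypothetical operational table; all names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, "
    "product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (city, product, amount) VALUES (?, ?, ?)",
    [("Delhi", "TV", 300.0), ("Delhi", "Radio", 50.0), ("Mumbai", "TV", 320.0)],
)

# The "virtual warehouse" is only a view: no data is copied or stored twice.
conn.execute(
    "CREATE VIEW sales_by_city AS "
    "SELECT city, SUM(amount) AS total_amount FROM orders GROUP BY city"
)

# Each analytical query on the view is evaluated on the operational table,
# which is why it competes with the transactional workload.
sales_by_city = dict(conn.execute("SELECT city, total_amount FROM sales_by_city"))
print(sales_by_city)
```

Because the view stores nothing, it also cannot hold more history than the orders table itself keeps.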
Two Layer Architecture
[Figure: source layer (operational data, external data) -> data staging (ETL tools) -> data warehouse layer (data warehouse, metadata, data marts) -> analysis (reporting tools, OLAP tools, data mining tools, what-if analysis tools).]
Two Layer Architecture
Consists of four successive stages
• Source layer
 Uses heterogeneous sources of data that is originally
stored in corporate relational databases or legacy databases
(applications running on mainframes and minicomputers that are
still used for operational tasks but do not meet modern
architectural standards), or that may come from information
systems outside the corporate walls.
• Data staging
 The data stored in the sources should be extracted, cleansed to
remove inconsistencies and fill gaps, and integrated to merge
heterogeneous sources into one common schema (ETL tools).
Two Layer Architecture
• Data warehouse layer
 Information is stored in one logically centralized repository: a
data warehouse.
 The data warehouse can be accessed directly, and it can also be
used as a source for creating data marts, which partially
replicate data warehouse contents and are designed for
specific enterprise departments.
 Metadata stores information on sources, access procedures,
data staging, users, data marts, etc.
• Analysis
 In this layer, integrated data is efficiently and flexibly
accessed to issue reports, dynamically analyze
information, and simulate hypothetical business scenarios.
 Technologically, it features aggregated data navigators,
complex query optimizers, and user-friendly GUIs.
Two Layer Architecture
Benefits of the two layer architecture, in which the data warehouse
separates the sources from the analysis applications
• In a data warehouse system, good quality information is always
available, even when access to the sources is denied for technical or
organizational reasons.
• Data warehouse analysis queries do not affect the management of
transactions, the reliability of which is vital for enterprises to work at
an operational level.
• Data warehouses are logically structured according to the
multidimensional model, while operational sources are generally
based on relational or semi-structured models.
• Data warehouses can use specific solutions aimed at performance
optimization of analysis and report applications.
Three Layer Architecture
[Figure: source layer (operational data, external data) -> data staging (ETL tools) -> reconciled layer (reconciled data, metadata) -> loading (ETL tools) -> data warehouse layer (data warehouse, data marts) -> analysis (reporting tools, OLAP tools, data mining tools, what-if analysis tools).]
Three Layer Architecture
• In this architecture, the third layer is the reconciled layer, or
operational data store.
• This layer materializes the operational data obtained after
integrating and cleansing the source data.
• The figure shows that the data warehouse is not populated from its
sources directly but from the reconciled data.
• Advantage of reconciled data
 Creates a common reference for the whole enterprise.
Additional Architecture Classification
• Independent data marts
 Different data marts are separately designed and built in
a nonintegrated fashion.
 This approach can be initially adopted when the
organizational divisions in a company are loosely coupled.
 It tends to be soon replaced by other architectures that
better achieve data integration and cross-reporting.
Independent Data Marts Architecture
[Figure: operational data sources feed, through separate ETL tools, separate data marts, each with its own metadata; reporting tools, OLAP tools, data mining tools, and what-if analysis tools access the data marts.]
Additional Architecture Classification
• Bus architecture
 Similar to independent data marts, with the difference that a
basic set of conformed dimensions (that is, analysis
dimensions that preserve the same meaning throughout
all the facts they belong to), derived by a careful analysis of
the main enterprise processes, is adopted and shared as
a common design guideline.
 It ensures logical integration of the data marts and an
enterprise-wide view of information.
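A small sketch of why conformed dimensions matter, assuming two hypothetical data marts (sales and returns) that share one product dimension. Because both marts use the same product keys with the same meaning, their results can be combined product by product.

```python
# Conformed dimension shared by both data marts (hypothetical data).
product_dim = {1: "TV", 2: "Radio"}

# Two separately built data marts, both keyed by the same product key.
sales_mart = [(1, 10), (2, 4), (1, 5)]   # (product_key, units_sold)
returns_mart = [(1, 2), (2, 1)]          # (product_key, units_returned)


def total_by_product(rows):
    """Sum a measure per product name using the shared dimension."""
    totals = {}
    for product_key, value in rows:
        name = product_dim[product_key]
        totals[name] = totals.get(name, 0) + value
    return totals


sold = total_by_product(sales_mart)
returned = total_by_product(returns_mart)

# Combining the two marts is only meaningful because the dimension is shared.
drill_across = {
    name: (sold.get(name, 0), returned.get(name, 0))
    for name in product_dim.values()
}
print(drill_across)
```

With independent data marts, each mart could code products differently and this combination would need an extra mapping step.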
Additional Architecture Classification
• Hub and spoke
 The most used architecture in medium to large contexts;
there is much attention to scalability and extensibility,
and to achieving an enterprise-wide view of information.
 Atomic, normalized data is stored in a reconciled layer
that feeds a set of data marts containing summarized
data in multidimensional form.
 Users mainly access the data marts, but they may
occasionally query the reconciled data.
• Centralized architecture
 A particular implementation of the hub and spoke
architecture, where the reconciled layer and the data marts are
collapsed into a single physical repository.
Hub and Spoke Architecture
[Figure: operational data and external data -> ETL tools -> reconciled data (with metadata) -> loading -> data marts -> reporting tools, OLAP tools, data mining tools, what-if analysis tools.]
Additional Architecture Classification
• Federated architecture
 Sometimes adopted in dynamic contexts where
preexisting data warehouses/data marts are to be
noninvasively integrated to provide a single,
cross-organization decision support environment (for
instance, in the case of mergers and acquisitions).
 Each data warehouse/data mart is either virtually or
physically integrated with the others, leaning on a variety of
advanced techniques such as distributed querying,
ontologies, and metadata interoperability.
Federated Architecture
[Figure: several operational data sources each feed, through their own ETL tools, separate data marts; a logical/physical integration layer connects the data marts to reporting tools, OLAP tools, data mining tools, and what-if analysis tools.]
ETL
[Figure: ETL flow. Operational and external data -> extraction, validation, cleansing, filtering, transformation -> reconciled data -> loading -> data warehouse.]
ETL
• ETL consists of four separate phases: extraction (or
capture), cleansing (or cleaning or scrubbing), transformation, and
loading.
• Extraction
 Relevant data is obtained from the sources in the extraction phase.
 Static extraction
 Used when the data warehouse needs populating for the first time
 Incremental extraction
 Used to update the data warehouse regularly; seizes the changes
applied to the source data since the last extraction
• Cleansing
 The main cleansing features in ETL tools are rectification and homogenization
 Supposed to improve data quality by dealing with
 Duplicate data
 Missing data
 Inconsistent values that are logically associated
 Impossible or wrong data
 Unexpected use of fields
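The cleansing checks listed above can be sketched as follows. This is only an illustration on hypothetical customer records; the field names, the city-to-country reference table, and the rule for an impossible birth date are all assumptions, not the behaviour of any particular ETL tool.

```python
from datetime import date

# Hypothetical source records with one problem each (after the first).
records = [
    {"id": 1, "name": "John White", "city": "London", "country": "UK", "birth": date(1980, 5, 1)},
    {"id": 1, "name": "John White", "city": "London", "country": "UK", "birth": date(1980, 5, 1)},
    {"id": 2, "name": "", "city": "Paris", "country": "France", "birth": date(1975, 1, 9)},
    {"id": 3, "name": "Ann Lee", "city": "London", "country": "France", "birth": date(1990, 3, 3)},
    {"id": 4, "name": "Bo Chan", "city": "Paris", "country": "France", "birth": date(2999, 1, 1)},
]

# Assumed reference data for checking logically associated values.
CITY_COUNTRY = {"London": "UK", "Paris": "France"}


def find_problems(rows, today):
    """Return (id, problem) pairs; a row may have several problems."""
    problems = []
    seen = set()
    for row in rows:
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            problems.append((row["id"], "duplicate"))
        seen.add(fingerprint)
        if not row["name"]:
            problems.append((row["id"], "missing name"))
        if CITY_COUNTRY.get(row["city"]) != row["country"]:
            problems.append((row["id"], "city/country mismatch"))
        if row["birth"] > today:
            problems.append((row["id"], "impossible birth date"))
    return problems


problems = find_problems(records, today=date(2024, 1, 1))
for record_id, problem in problems:
    print(record_id, problem)
```

The sketch only detects problems; rectification (fixing or rejecting the rows) would be a separate step.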
Cont…
• Transformation
 The reconciliation phase changes operational data into a specific data
warehouse format.
 Conversion and normalization to make data uniform
 Matching that associates equivalent fields in different sources
 Selection that reduces the number of source fields and records
 When populating a data warehouse, normalization is replaced by
denormalization because data warehouse data is typically denormalized, and
aggregation is required to sum up data properly
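A minimal sketch of conversion, matching, selection, and aggregation, assuming two hypothetical sources that describe the same kind of payment with different field names, units, and date formats.

```python
# Two hypothetical sources; field names and formats are illustrative only.
source_a = [{"cust": "C1", "amount_usd": "12.50", "day": "2024-01-05", "note": "x"}]
source_b = [
    {"customer_id": "C1", "amt_cents": 750, "date": "05/01/2024"},
    {"customer_id": "C2", "amt_cents": 300, "date": "06/01/2024"},
]


def from_a(row):
    # Matching: "cust" maps to "customer". Conversion: text to float.
    # Selection: the "note" field is not carried over.
    return {"customer": row["cust"], "amount": float(row["amount_usd"]), "date": row["day"]}


def from_b(row):
    # Conversion: cents to dollars, and DD/MM/YYYY to YYYY-MM-DD.
    day, month, year = row["date"].split("/")
    return {
        "customer": row["customer_id"],
        "amount": row["amt_cents"] / 100,
        "date": f"{year}-{month}-{day}",
    }


# One common schema for both sources.
unified = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]

# Aggregation: total amount per customer.
totals = {}
for row in unified:
    totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]

print(unified)
print(totals)
```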
ETL
• Loading
 The last step, carried out in two ways
 Refresh
 The data warehouse is completely rewritten: older data is replaced.
Refresh is normally used in combination with static extraction to
initially populate a data warehouse.
 Update
 Only the changes applied to the source data are added to the
data warehouse.
 Carried out without deleting or modifying preexisting data
 Used in combination with incremental extraction to update the
data warehouse regularly.
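The two loading modes can be sketched as below. The sketch assumes, purely for illustration, that source rows carry an increasing id so that changes since the last extraction can be recognized; real systems may rely on timestamps, logs, or triggers instead.

```python
def refresh(source_rows):
    """Refresh: rebuild the warehouse from a full (static) extraction."""
    return list(source_rows)


def update(warehouse, source_rows, last_extracted_id):
    """Update: append only rows added to the source since the last extraction."""
    new_rows = [r for r in source_rows if r["id"] > last_extracted_id]
    return warehouse + new_rows


# Initial population: static extraction followed by a refresh.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
warehouse = refresh(source)

# Later: the source changes, and only the new row is loaded.
source.append({"id": 3, "amount": 30})
warehouse = update(warehouse, source, last_extracted_id=2)

print([row["id"] for row in warehouse])
```

Note that update leaves the preexisting warehouse rows untouched, which matches the description above.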
Example of Cleansing and Transforming Customer Data
Input
John White
Downing St. 10
TW1A 2AA London (UK)
Normalization (split the text into fields)
First name: John
Last name: White
Address: Downing St. 10
Zipcode: TW1A 2AA
City: London
Country: UK
Standardization (apply standard formats)
First name: John
Last name: White
Address: 10, Downing St.
Zipcode: TW1A 2AA
City: London
Country: United Kingdom
Correction (fix wrong values)
First name: John
Last name: White
Address: 10, Downing St.
Zipcode: SW1A 2AA
City: London
Country: United Kingdom
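The customer example can be traced in code as three steps: normalization, standardization, and correction. This is a sketch only; the parsing rules and the two lookup tables are assumptions made for this one record, where a real ETL tool would use proper reference data.

```python
import re

# Hypothetical lookup tables standing in for real reference data.
COUNTRY_NAMES = {"UK": "United Kingdom"}
POSTCODES = {("10, Downing St.", "London"): "SW1A 2AA"}

raw = "John White\nDowning St. 10\nTW1A 2AA London (UK)"


def normalize(text):
    """Split the free-form text into named fields."""
    name_line, address_line, place_line = text.split("\n")
    first_name, last_name = name_line.split(" ", 1)
    zipcode, city, country = re.fullmatch(
        r"(\w+ \w+) (.+) \((\w+)\)", place_line
    ).groups()
    return {
        "first_name": first_name,
        "last_name": last_name,
        "address": address_line,
        "zipcode": zipcode,
        "city": city,
        "country": country,
    }


def standardize(record):
    """Apply standard formats: house number first, full country name."""
    out = dict(record)
    street, number = record["address"].rsplit(" ", 1)
    out["address"] = f"{number}, {street}"
    out["country"] = COUNTRY_NAMES.get(record["country"], record["country"])
    return out


def correct(record):
    """Replace a wrong zipcode using the reference table, if one is known."""
    out = dict(record)
    key = (record["address"], record["city"])
    out["zipcode"] = POSTCODES.get(key, record["zipcode"])
    return out


normalized = normalize(raw)
standardized = standardize(normalized)
corrected = correct(standardized)
print(corrected)
```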
Conclusion
The data warehouse is an informational environment that
• Provides an integrated and total view of the enterprise.
• Makes the enterprise’s current and historical information
easily available for decision making.
• Makes decision-support transactions possible without
hindering operational systems.
• Renders the organization’s information consistent.
• Presents a flexible and interactive source of strategic
information.
Let’s Discuss
1. You are the data analyst on a project building a data warehouse
for an insurance company.
 List all possible data sources from which data will be
brought into the data warehouse (state your assumptions).
2. For an airline company,
 Identify three operational applications that would feed
into the data warehouse.
 What would be the data load and refresh cycles?
3. Identify potential users and information delivery
methods for a data warehouse supporting a large
national grocery chain.
Defining the Business Requirements
• Dimensional analysis
• Information packages
• Requirements gathering methods
• Requirements definition
Dimensional Analysis
• A data warehouse is an information delivery system.
• It is not about technology, but about solving users’
problems and providing strategic information to the
user.
 Requirements definition phase
 Focus on what information the users need, not how the
information will be provided
• Building a data warehouse is different from building an
operational system.
 Users cannot fully describe what they want in a data
warehouse, but they can provide important insights into how
they think about the business.
 Analysis required
 Business dimensions
 Measurement units
Manager Think in Business Dimension (Number)
Marketing VP
• How much did the new product generate?
• Month by month, in the southern division, by user demographic, by sales
office, relative to the previous version, compared to plan
Marketing Manager
• Sales statistics
• By product, summarized by product categories, daily, weekly, monthly, by
sales district, by distribution channel
Financial Controller
• Show expenses
• Listing actual vs. budget, by month, quarter, and year, by budget line item,
by district, by division, summarized for the whole company
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or time (day, week,
month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The
topmost 0-D cuboid, which holds the highest level of summarization, is called
the apex cuboid. The lattice of cuboids forms a data cube.
Multidimensional Data
[Figure: a data cube of sales volume as a function of time, city, and product. Product axis: Juice, Cola, Milk, Cream; date axis: 3/1, 3/2, 3/3, 3/4; sample cell values shown: 10, 47, 30, 12.]
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, and supplier.]
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
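The lattice can be enumerated with a short sketch: every subset of the dimensions is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids (16 for the four dimensions here). The function name cuboids is illustrative.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")


def cuboids(dims):
    """Yield every cuboid (subset of dimensions), from the apex to the base."""
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            yield subset


lattice = list(cuboids(dimensions))

# Group the cuboids by their number of dimensions, as in the lattice figure.
for k in range(len(dimensions) + 1):
    level = [c for c in lattice if len(c) == k]
    print(f"{k}-D cuboids ({len(level)}):", level)
```

The empty tuple stands for the apex cuboid ("all"), and the full tuple is the base cuboid.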
Dimensional Nature of Business Data
[Figure: a cube with axes labelled Product and Time; the highlighted cell is the slice of product sale information (units sold) for TV sets, Delhi, Jan.]
• Can be extended to multiple dimensions
• Multidimensional cubes: hypercubes
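Examples of Business Dimensions
[Figure: business dimensions surrounding a central measure for three businesses: an airlines company (frequent flights), an insurance business (claims), and a supermarket chain (sales units). Dimension labels shown in the figure include Time, Customer, Agent, Flight, Fare class, Airport, Status, Type, Policy, Insured Party, Promotion, Product, and Store.]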
OLAP for Decision Support
• Goal of OLAP is to support ad-hoc querying for the business
analyst
• Business analysts are familiar with spreadsheets
• Extend spreadsheet analysis model to work with warehouse
data
 Large data set
 Semantically enriched to understand business terms (e.g., time,
geography)
 Combined with reporting features
• Multidimensional view of data is the foundation of OLAP
OLAP for Decision Support
• Pivot table - a multidimensional spreadsheet
What is Dimensional Modeling?
• Dimensional modeling gets its name from the business
dimensions we need to incorporate into the logical data model. It
is a logical design technique to structure the business dimensions
and the metrics that are analyzed along these dimensions.
• Using dimensional modeling, measurements and relevant
dimensions must be captured and kept in the data warehouse.
For this, an information package diagram can be drawn for the
specific subject.
• It enables packaging the data in a symmetric format, which
helps in:
 High performance for queries and analysis
 Capturing critical measures
 Views along dimensions
 Being intuitive to business users
Dimensional Modeling
• In dimensional modeling, there are two types of
tables: dimension tables and fact tables
• Facts are stored in fact tables
• Dimensions are stored in dimension tables
• Dimension tables contain textual descriptors of the
business
• Fact and dimension tables form a star schema
• A “big” fact table in the center is surrounded by
“small” dimension tables
Multidimensional Data Model
• A database is a set of facts (points) in a multidimensional
space
• A fact has a measure dimension
 The quantity that is analyzed, e.g., sale, budget
• A set of dimensions on which data is analyzed
 e.g., store, product, date associated with a sale amount
• Dimensions form a sparsely populated coordinate
system
• Each dimension has a set of attributes
 e.g., owner, city, and county of a store
• Attributes of a dimension may be related by a partial
order
 Hierarchy: e.g., street > county > city
 Lattice: e.g., date > month > year, date > week > year
Fact Table
• The metrics or facts from the information package diagram form the
fact table. They are the facts for analysis.
• For example, for automaker sales, actual sale price is a fact about what
the actual price was for the sale. Similarly, the other facts are as follows:
 MSRP sale price
 Options price
 Full price
 Dealer add-ons
 Dealer credits
 Dealer invoice
 Amount of down payment
 Manufacturer proceeds
 Amount financed
• All the facts can be grouped into a single data structure, called the fact
table. Together, the facts above form the automaker sales fact table.
Properties of Fact Table
Concatenated key
• A row in the fact table relates to a combination of rows from all the
dimension tables.
• For example, with product, time, customer, and sales representative
dimensions, a single row in the fact table must relate to a particular
product, a specific calendar date, a specific customer, and an
individual sales representative.
• This means the row in the fact table must be identified by the
primary keys of these four dimension tables. Thus, the primary
key of the fact table must be the concatenation of the primary
keys of all the dimension tables.
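A small star schema sketch in SQLite showing the concatenated key. All table and column names are hypothetical. The fact table's primary key is the combination of the four dimension keys, so a second row for the same product, date, customer, and sales representative is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE salesrep_dim (salesrep_key INTEGER PRIMARY KEY, salesrep_name TEXT);

    -- Fact table: measures plus one foreign key per dimension.
    CREATE TABLE order_fact (
        product_key      INTEGER REFERENCES product_dim,
        date_key         INTEGER REFERENCES date_dim,
        customer_key     INTEGER REFERENCES customer_dim,
        salesrep_key     INTEGER REFERENCES salesrep_dim,
        quantity_ordered INTEGER,
        order_dollars    REAL,
        PRIMARY KEY (product_key, date_key, customer_key, salesrep_key)
    );
    """
)

# Dimension rows are omitted for brevity; SQLite does not enforce foreign
# keys unless PRAGMA foreign_keys is switched on.
conn.execute("INSERT INTO order_fact VALUES (1, 20240105, 7, 3, 2, 600.0)")

try:
    # Same combination of dimension keys: violates the concatenated key.
    conn.execute("INSERT INTO order_fact VALUES (1, 20240105, 7, 3, 5, 1500.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM order_fact").fetchone()[0]
print(duplicate_rejected, row_count)
```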
Cont..
Data Grain:
• Data grain is the level of detail for the measurements or
metrics.
• In this example, the metrics are at the detailed level.
• The quantity ordered relates to the quantity of a particular
product on a single order, on a certain date, for a specific
customer, and procured by a specific sales representative. If
we keep the quantity ordered as the quantity of a specific
product for each month, then the data grain is different and
is at a higher level.
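To make the two grains concrete, the sketch below rolls hypothetical order-level rows up to one row per product per month. The detailed rows keep customer and sales representative; the monthly rows no longer do.

```python
from collections import defaultdict

# Detailed grain: one row per product, order date, customer, and sales rep.
# (product, order_date, customer, sales_rep, quantity_ordered) - sample data.
detail = [
    ("TV", "2024-01-05", "C1", "R1", 2),
    ("TV", "2024-01-20", "C2", "R1", 1),
    ("TV", "2024-02-02", "C1", "R2", 4),
]

# Higher-level grain: quantity of each product per month.
monthly = defaultdict(int)
for product, order_date, _customer, _rep, quantity in detail:
    month = order_date[:7]          # "YYYY-MM"
    monthly[(product, month)] += quantity
monthly = dict(monthly)

print(monthly)
```

Once data is stored only at the monthly grain, questions about individual customers or orders can no longer be answered from it.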
Cont..
• Fully additive measures: Some attributes, such as order_dollars and
quantity_sold, may be summed up by simple addition. These
measures are known as fully additive measures.
• Semi-additive measures: Some attributes are not fully
additive; they are derived metrics calculated from other attributes in
the fact table. For example, margin percentage can be calculated using
order_dollars and extended_cost, but it cannot simply be summed across rows.
• Table deep, not wide: The fact table contains fewer attributes but
a larger number of rows.
• Sparse data: For some combinations of dimension values there are no
measurements. Rows with null measures are not stored, so the fact
table can have gaps.
• Degenerate dimensions: The fact table sometimes also contains
degenerate dimensions, that is, reference numbers such as order
numbers, which are neither facts nor dimension attributes. They are
useful for analyses such as the average number of products per order.
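A short worked example of the difference between fully additive and semi-additive measures, using two hypothetical fact rows. The dollar and cost columns can be summed directly; the margin percentage has to be recomputed from the summed columns.

```python
# Hypothetical fact rows with two fully additive measures.
rows = [
    {"order_dollars": 100.0, "extended_cost": 50.0},
    {"order_dollars": 300.0, "extended_cost": 240.0},
]


def margin_pct(dollars, cost):
    """Margin percentage derived from the two additive measures."""
    return (dollars - cost) / dollars * 100


# Fully additive: simple sums are meaningful.
total_dollars = sum(r["order_dollars"] for r in rows)
total_cost = sum(r["extended_cost"] for r in rows)

# Correct overall margin: recompute from the summed measures.
overall_margin = margin_pct(total_dollars, total_cost)

# Incorrect: adding the per-row percentages gives a meaningless number.
sum_of_row_margins = sum(
    margin_pct(r["order_dollars"], r["extended_cost"]) for r in rows
)

print(total_dollars, total_cost, overall_margin, sum_of_row_margins)
```

Here the row margins are 50% and 20%, while the overall margin is 110 / 400 = 27.5%, not their sum.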
Dimension Table
• The product business dimension is used when the facts are to be
analyzed by product.
• Sometimes the analysis could be a breakdown by individual models.
Another analysis could be at a higher level, by product lines.
• Yet another analysis could be at an even higher level, by product
categories.
• The data items relating to the product dimension are as follows:
 Model name, model year, package styling
 Product line, product category
 Exterior color, interior color
 First model year
• All of these are related to the product in some way.
• All of these data items can be grouped in one data structure or one
relational table. This table is called the product dimension table. The
data items in the above list would all be attributes in this table.
Properties of Dimension Table:
• Dimension table key: Primary key of the dimension table uniquely
identifies each row in the table.
• Large number of attributes (wide): Typically, a dimension table has
many columns or attributes. Thus, the dimension table is wide.
• Textual attributes: In the dimension table you will seldom find any
numerical values used for calculations. The attributes in a
dimension table are of textual format.
• Attributes not directly related: some of the attributes in a dimension
table are not directly related to the other attributes in the table.
• Flattened out, not normalized: The attributes in a dimension table
are used over and over again in queries. For efficient query
performance, it is best if the query picks up an attribute from the
dimension table and goes directly to the fact table and not through
other intermediary tables. Therefore, a dimension table is flattened
out, not normalized.
Cont..
• Ability to drill down / roll up: The attributes in a dimension
table provide the ability to get to the details from higher
levels of aggregation to lower levels of details.
• Multiple hierarchies: dimension tables often provide for
multiple hierarchies, so that drilling down may be performed
along any of the multiple hierarchies.
• Fewer records: A dimension table typically has far fewer
records or rows than the fact table.
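A minimal Python sketch of these properties, assuming a tiny made-up product dimension (the model names, keys, and prices are hypothetical and not taken from the course material). Each fact row carries only a key into the wide, flattened dimension table, and the same facts can be rolled up or drilled down simply by grouping on a different dimension attribute.

```python
from collections import defaultdict

# Hypothetical product dimension table: one wide, denormalized row per key.
product_dim = {
    1: {"model_name": "Alpha", "product_line": "Sedan", "product_category": "Car"},
    2: {"model_name": "Beta", "product_line": "Sedan", "product_category": "Car"},
    3: {"model_name": "Gamma", "product_line": "Pickup", "product_category": "Truck"},
}

# Hypothetical fact rows: (product_key, actual_sale_price).
sales_facts = [(1, 20000), (2, 25000), (1, 21000), (3, 30000)]


def total_by(attribute):
    """Sum the sale price grouped by one attribute of the product dimension."""
    totals = defaultdict(int)
    for product_key, price in sales_facts:
        # One lookup from the fact row to the dimension row; because the
        # dimension is flattened, no intermediary tables are involved.
        totals[product_dim[product_key][attribute]] += price
    return dict(totals)


by_model = total_by("model_name")           # lowest level of detail
by_line = total_by("product_line")          # rolled up one level
by_category = total_by("product_category")  # rolled up two levels
```

Moving from `by_category` to `by_line` to `by_model` is a drill-down; the reverse direction is a roll-up.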
Sample Data Cube
[Figure: a three-dimensional data cube. One axis is Term (1st, 2nd, 3rd, 4th, ∑), one is Country (Germany, Switzerland, U.S.A., ∑), and one is the degree pursued (Diploma, M.Sc., B.Sc., ∑). The highlighted cell represents German students in the 4th term pursuing a diploma.]
Operations in Multidimensional Data Model
• Aggregation (roll-up)
 dimension reduction: e.g., total sales by city
 summarization over aggregate hierarchy: e.g., total sales by
city and year -> total sales by region and by year
• Navigation to detailed data (drill-down)
 e.g., (sales - expense) by city, top 3% of cities by average
income
• Selection (slice) defines a subcube
 e.g., sales where city = Palo Alto and date = 1/15/96
• Visualization Operations (e.g., Pivot)
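The operations listed above can be sketched in plain Python over a small list of fact rows. The rows, the region assignments, and the function names are assumptions made for illustration; a real OLAP tool would run these operations against the warehouse.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, region, year, sales).
facts = [
    ("Palo Alto", "West", 1995, 100),
    ("Palo Alto", "West", 1996, 150),
    ("San Jose", "West", 1996, 200),
    ("Boston", "East", 1996, 120),
]


def roll_up(rows, key):
    """Aggregate sales by the grouping key computed from each row."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


def slice_rows(rows, city=None, year=None):
    """Select the subcube that matches the given dimension values."""
    return [r for r in rows
            if (city is None or r[0] == city) and (year is None or r[2] == year)]


def pivot(rows):
    """Rearrange sales into a {year: {region: total}} table."""
    table = defaultdict(lambda: defaultdict(int))
    for _, region, year, sales in rows:
        table[year][region] += sales
    return {year: dict(cols) for year, cols in table.items()}


by_city = roll_up(facts, lambda r: r[0])                 # dimension reduction
by_city_year = roll_up(facts, lambda r: (r[0], r[2]))    # detailed level
by_region_year = roll_up(facts, lambda r: (r[1], r[2]))  # up the hierarchy
palo_alto_1996 = slice_rows(facts, city="Palo Alto", year=1996)
year_by_region = pivot(facts)
```

Going from `by_region_year` back to `by_city_year` corresponds to a drill-down.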
Information Packages-A New Concept
• Information packages: a methodology for determining the
requirements for a data warehouse based on business
dimensions
 It supports analysis along business dimensions.
 It incorporates basic measurements and business
dimensions.
• An information package enables you to:
 Define the common subject areas.
 Design key business metrics.
 Decide how data must be presented.
 Determine how users will aggregate or roll up.
 Decide the data quantity for user analysis or query.
 Decide how data will be accessed.
 Establish data granularity.
 Estimate data warehouse size.
 Determine the frequency for data refreshing.
Information Subject: Sales Analysis
Dimensions:   Time Period | Locations | Products | Age Groups
Hierarchies:  Year        | Country   | Class    | Group 1
Measured Facts: Forecast Sales, Budget Sales, Actual Sales
Figure: An Information Package
Cont...
• Business dimensions form the basis of an information package (IP)
• Hierarchical levels for further processing
 Drilling down and rolling up for analysis
• Categories:
 Data elements within business dimensions
 e.g. sales on holiday
• Key business metrics or facts
 number
Business Dimension for Auto Sales Analysis
• Hierarchies and categories for each dimension
• Product : Model name, Model year, package styling,
product line, product category, exterior color, interior color,
first model year
• Dealer: Dealer name, city, state, single brand flag, date of first
operation
• Customer demographics: Age, gender, income, marital
status, household size, vehicles owned, home value, own or
rent
• Payment method: Financial type, term in months, interest
rate, agent
• Time: Date, month, quarter, year, day of week, day of
month, season, holiday flag
Cont...
• Metrics for analyzing automobile sales
 Actual sale price
 Option price
 Full price
 Dealer add-ons
 Dealer credits
 Dealer invoice
 Amount of down payment
 Amount financed
Information Subject: Automaker Sales
Dimensions and their hierarchies:
• Time: Year, Quarter, Month, Date, Day of Week, Day of Month, Season, Holiday Flag
• Product: Model Name, Model Year, Package Styling
• Payment Method: Financial Type
• Customer Demographics: Age, Gender
• Dealer: Dealer Name, City, State, Single Brand Flag
Measured Facts: Actual sale price, Option price, Full price, Dealer add-ons, etc.
Figure: An Information Package
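One way to see how an information package feeds the later design is to hold it in a small data structure. The sketch below is an illustration only: the class name `InformationPackage`, the helper `fact_table_columns`, and the `_key` naming convention are assumptions, not part of the methodology. It derives a candidate fact table layout with one key per business dimension followed by the measured facts.

```python
from dataclasses import dataclass


@dataclass
class InformationPackage:
    subject: str
    dimensions: dict      # dimension name -> list of hierarchy levels
    measured_facts: list  # names of the metrics to be analyzed


def fact_table_columns(package):
    """One key column per dimension, followed by the measured facts."""
    keys = [name.lower().replace(" ", "_") + "_key"
            for name in package.dimensions]
    measures = [fact.lower().replace(" ", "_").replace("-", "_")
                for fact in package.measured_facts]
    return keys + measures


# Values copied from the automaker sales example; only a subset is listed.
auto_sales = InformationPackage(
    subject="Automaker Sales",
    dimensions={
        "Time": ["Year", "Quarter", "Month", "Date"],
        "Product": ["Model Name", "Model Year"],
        "Payment Method": ["Financial Type"],
        "Customer Demographics": ["Age", "Gender"],
        "Dealer": ["Dealer Name", "City", "State"],
    },
    measured_facts=["Actual sale price", "Option price",
                    "Full price", "Dealer add-ons"],
)

columns = fact_table_columns(auto_sales)
```

Each dimension would then become its own dimension table whose attributes are the hierarchy levels listed for it.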
Classification of Users of Data Warehouse
• Senior executives (including sponsors)
 Have a sense of direction; involved in the focused area
• Key departmental managers
 Report to executives in the area of focus
• Business analysts
 Prepare reports and analyses for executives and managers
• Operational system DBAs
 Only give information
• Others nominated by the above
What Requirements to Gather?
Broad list:
• Data elements: fact classes, dimensions
• Recording of data in terms of time
• Data extracts from source systems
• Business rules: attributes, ranges, domains,
operational records
Interviews
• Interviewing is an important method for collecting data on human
and system information requirements.
• Kimball et al. (1998) stated that two basic procedures can be used
to conduct user requirement analysis: interviews and facilitated
sessions.
• Interviews are conducted with individuals or small, homogeneous
groups.
• Everyone can participate, which results in a very detailed list of
specifications.
• Facilitated sessions involve large, heterogeneous groups.
• They encourage creative brainstorming.
• Sessions aimed at setting general priorities typically follow
interviews focusing on detailed specifications.
Requirements Gathering Methods
• Interviews
 one to one sessions
 Group Sessions
   Not good at the initial stage
   Useful for confirming requirements
• JAD (Joint Application Development) sessions
 Joint approach
 concerned group for a well defined purpose
• Review the existing documents
 Documentation from user department
 Documentation from IT
Interview Process Task Before Project Launches
• Select and train the team members conducting interviews
• Assign roles to team members
• Prepare questionnaire
 Current information sources
 Subject areas
 Key performance metrics
 Information frequency
• Pre-interview research
 History and current structure of the business unit
 Number of employees, with roles and responsibilities
 Location of users
 Primary purpose of the business unit
 Company market
 Competitors in the market
• List of users to be interviewed
• List expectations
Initial Document for Requirement Definition
• Interview write-ups
• User profile
• Background and objective
• Information requirement
• Analysis requirement
• Current tools used
• Success criteria
• Useful business metrics
• Relevant business dimensions
Types of Interview Question
• Open ended
• What do you think of data quality? What are the key
objectives your unit has to face?
• Closed
• Are you interested in sorting out purchases by hour? Do you
want to receive a sales report every week?
• Evidential
• Could you please give me an example of how you
calculate your business unit budget?, Could you please
describe the issues with poor data quality that your
business unit is experiencing?
Expectations From Interviews
• Senior executives
 Organization objectives
 Criteria for measuring success
 Key business issues, current and future
 Problem identification
 Vision and direction of the organization
 Anticipated usage of DW
• Dept. managers / analysts
 Departmental objectives
 Success metrics
 Factors limiting success
 Key business issues
 Products and services
 Useful business dimensions for analysis
 Anticipated usage of DW
• IT Dept. professionals
 Key operational source systems
 Current information delivery processes
 Types of routine analysis
 Known quality issues
 Current IT support for information requests
 Concerns about the proposed DW
Information Gathering: Interactive Methods
Objectives
• Recognize the value of interactive methods for information
gathering.
• Construct interview questions to elicit human information
requirements.
• Structure interviews in a way that is meaningful to users.
• Understand the concept of JAD and when to use it.
• Write effective questions to survey users about their work.
• Design and administer effective questionnaires.
Kendall & Kendall, Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Major Topics
• Interviewing
 Interview preparation
 Question types
 Arranging questions
 The interview report
• Joint Application Design (JAD)
 Involvement
 Location
• Questionnaires
 Writing questions
 Using scales
 Design
 Administering
Interviewing
• Interviewing is an important method for collecting
data on human and system information
requirements.
• Interviews reveal information about:
 Interviewee opinions
 Interviewee feelings
 Goals
 Key HCI concerns
Interview Preparation
• Reading background material.
• Establishing interview objectives.
• Deciding whom to interview.
• Preparing the interviewee.
• Deciding on question types and structure.
Question Types
• Open-ended
• Closed
Open-Ended Questions
• Open-ended interview questions allow interviewees
to respond how they wish, and to what length they
wish.
• Open-ended interview questions are appropriate
when the analyst is interested in breadth and depth
of reply.
Advantages of Open-Ended Questions
• Puts the interviewee at ease.
• Allows the interviewer to pick up on the
interviewee’s vocabulary.
• Provides richness of detail.
• Reveals avenues of further questioning that may
have gone untapped.
• Provides more interest for the interviewee.
• Allows more spontaneity.
• Makes phrasing easier for the interviewer.
• Useful if the interviewer is unprepared.
Disadvantages of Open-Ended Questions
• May result in too much irrelevant detail
• Possibly losing control of the interview.
• May take too much time for the amount of useful
information gained.
• Potentially seeming that the interviewer is
unprepared.
• Possibly giving the impression that the interviewer
is on a “fishing expedition”.
Closed Interview Questions
• Closed interview questions limit the number of
possible responses.
• Closed interview questions are appropriate for
generating precise, reliable data that is easy to
analyze.
• The methodology is efficient, and it requires little
skill for interviewers to administer.
Benefits of Closed Interview Questions
• Saving interview time.
• Easily comparing interviews.
• Getting to the point.
• Keeping control of the interview.
• Covering a large area quickly.
• Getting to relevant data.
Disadvantages of Closed Interview Questions
• Boring for the interviewee.
• Failure to obtain rich details.
• Missing main ideas.
• Failing to build rapport between interviewer and
interviewee.
Bipolar Questions
• Bipolar questions are those that may be answered
with a “yes” or “no” or “agree” or “disagree.”
• Bipolar questions should be used sparingly.
• A special kind of closed question.
Probes
• Probing questions elicit more detail about previous
questions.
• The purpose of probing questions is:
 To get more meaning.
 To clarify.
 To draw out and expand on the interviewee’s point.
• May be either open-ended or closed.
Arranging Questions
• Pyramid
 Starting with closed questions and working toward
open-ended questions.
• Funnel
 Starting with open-ended questions and working toward
closed questions.
• Diamond
 Starting with closed, moving toward open-ended, and
ending with closed questions.
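The three arrangements can be expressed as a simple ordering rule. The sketch below is illustrative only: the function name `arrange` and the representation of each question as a (text, kind) pair are assumptions, and the sample questions are adapted from the interview examples earlier in this unit.

```python
def arrange(questions, structure):
    """Order (text, kind) pairs, where kind is "open" or "closed"."""
    closed = [q for q in questions if q[1] == "closed"]
    opened = [q for q in questions if q[1] == "open"]
    if structure == "pyramid":   # closed first, then open-ended
        return closed + opened
    if structure == "funnel":    # open-ended first, then closed
        return opened + closed
    if structure == "diamond":   # closed, then open-ended, then closed again
        half = (len(closed) + 1) // 2
        return closed[:half] + opened + closed[half:]
    raise ValueError(f"unknown structure: {structure}")


questions = [
    ("Do you want to receive a sales report every week?", "closed"),
    ("What do you think of data quality?", "open"),
    ("Are you interested in sorting out purchases by hour?", "closed"),
]

pyramid_kinds = [kind for _, kind in arrange(questions, "pyramid")]
funnel_kinds = [kind for _, kind in arrange(questions, "funnel")]
diamond_kinds = [kind for _, kind in arrange(questions, "diamond")]
```

In practice the interviewer would also order questions by how specific they are within each group; the sketch captures only the open versus closed sequencing.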
Pyramid Structure
• Begins with very detailed, often closed questions.
• Expands by allowing open-ended questions and
more generalized responses.
• Is useful if interviewees need to be warmed up to
the topic or seem reluctant to address the topic.
Pyramid Structure
Pyramid Structure for Interviewing Goes from Specific to General
Questions
Funnel Structure
• Begins with generalized, open-ended questions.
• Concludes by narrowing the possible responses
using closed questions.
• Provides an easy, nonthreatening way to begin an
interview.
• Is useful when the interviewee feels emotionally
about the topic.
Funnel Structure
Funnel Structure for Interviewing Begins with Broad Questions then
Funnels to Specific Questions
Diamond Structure
• A diamond-shaped structure begins in a very
specific way.
• Then more general issues are examined
• Concludes with specific questions
• Combines the strength of both the pyramid and
funnel structures
• Takes longer than the other structures
Diamond-Shaped Structure
Diamond-Shaped Structure for Interviewing Combines the
Pyramid and Funnel Structures
Closing the Interview
• Always ask “Is there anything else that you would
like to add?”
• Summarize and provide feedback on your
impressions.
• Ask whom you should talk with next.
• Set up any future appointments.
• Thank them for their time and shake hands.
Interview Report
• Write as soon as possible after the interview.
• Provide an initial summary, then more detail.
• Review the report with the respondent.
Joint Application Design (JAD)
• Joint Application Design (JAD) can replace a series
of interviews with the user community.
• JAD is a technique that allows the analyst to
accomplish requirements analysis and design the
user interface with the users in a group setting.
Conditions that Support the Use of JAD
• Users are restless and want something new.
• The organizational culture supports joint problem-solving
behaviors.
• Analysts forecast an increase in the number of
ideas using JAD.
• Personnel may be absent from their jobs for the
length of time required.
JAD Five Phased Approach
• Project definition
 Complete high level interviews
 Conduct management interviews
 Prepare management definition guide
• Research
 Become familiar with the business area
and systems
 Document user information
requirements
 Document business process
 Gather preliminary information
 Prepare agenda for the session
• Preparation
 Create working documents from
previous phase
 Train the scribes
 Prepare visual aids
 Conduct pre-session meetings
 Set up a venue for session
 Prepare checklist for objective
Cont...
• JAD sessions
 Open with review of agenda and purpose
 Review assumptions
 Review data requirements
 Review business metrics and dimensions
 Discuss dimension hierarchies and roll-ups
 Resolve open issues
 Close sessions with the list of action items
• Final document
 Convert the working document
 Map the gathered information
 List all data sources
 Identify all business dimensions and hierarchies
 Assemble and edit the document
 Conduct review sessions
 Get final approvals
 Establish procedure to change requirements
• The success of a project using JAD depends on the JAD team
JAD Involves
• All project team members must be committed to the JAD approach and
become involved.
• Executive sponsor – a senior person who will introduce and conclude the
JAD session.
• Analyst – gives an expert opinion about any disproportionate costs of
solutions proposed
• Users – try to select users that can articulate what information they need
to perform their jobs as well as what they desire in a new or improved
computer system.
• Session leader – someone who has excellent communication skills to
facilitate appropriate interactions.
• Observers – analysts or technical experts from other functional areas to
offer technical explanations and advice.
• Scribe – formally write down everything that is done.
JAD Team
• Executive sponsor
 Person controlling the funding, providing direction, and empowering team
members
• Facilitator
 Person guiding the team through the JAD process
• Scribe
 Person designated to record all decisions
• Full-time participants
 Involved in decision making for the data warehouse
• On-call participants
 Persons affected by the project, but only in specific areas
• Observers
 Persons attending specific sessions without participating in decisions
Where to Hold JAD Meetings
• Offsite
 Comfortable surroundings
 Minimize distractions
• Attendance
 Schedule when participants can attend
 Agenda
 Orientation meeting
Benefits of JAD
• Time is saved, compared with traditional
interviewing
• Rapid development of systems
• Improved user ownership of the system
• Creative idea production is improved
Drawbacks of Using JAD
• JAD requires a large block of time to be available
for all session participants.
• If preparation or the follow-up report is incomplete,
the session may not be successful.
• The organizational skills and culture may not be
conducive to a JAD session.
Requirements Definition
Scope And Content:
• Formal documentation is often neglected.
• Requirements definition phase:
 Conduct interviews and GD.
 Review the existing documentation.
• The requirements definition document is the basis for
the next phases in the system development life
cycle.
 Yet the detailed documentation of the requirements
definition is often skipped.
Questionnaires
Questionnaires are useful in gathering information
from key organization members about:
 Attitudes
 Beliefs
 Behaviors
 Characteristics
When to Use Questionnaires
• People to be questioned are widely dispersed.
• Many people are involved with the project, and need to
know the approval level of a proposed system.
• Exploratory work is needed to gauge opinion.
• Need to identify and address problems with the current
system.
Question Types
Questions are designed as either:
 Open-ended
Try to anticipate the response you will get.
Well suited for getting opinions.
 Closed
Use when all the options may be listed.
When the options are mutually exclusive.
Tradeoffs between the Use of Open-Ended and Closed
Questions on Questionnaires
Questionnaire Language
• Simple
• Specific
• Short
• Not patronizing
• Free of bias
• Addressed to those who are knowledgeable
• Technically accurate
• Appropriate for the reading level of the respondent
Measurement Scales
• The two different forms of measurement scales are:
 Nominal
 Interval
Nominal Scales
• Nominal scales are used to classify things.
• It is the weakest form of measurement.
• Used to get totals for each category.
What type of software do you use the most?
1 = Word Processor
2 = Spreadsheet
3 = Database
4 = An Email Program
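Because nominal codes only label categories, the meaningful summary is a count per category; averaging the codes would have no meaning. A minimal sketch, assuming a made-up list of coded responses (the eight responses below are hypothetical):

```python
from collections import Counter

# Coding taken from the example question above.
CATEGORIES = {
    1: "Word Processor",
    2: "Spreadsheet",
    3: "Database",
    4: "An Email Program",
}

# Hypothetical coded responses from eight respondents.
responses = [1, 2, 2, 4, 1, 2, 3, 4]


def category_totals(coded):
    """Count how many respondents chose each category."""
    counts = Counter(coded)
    return {label: counts.get(code, 0) for code, label in CATEGORIES.items()}


totals = category_totals(responses)
```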
Interval Scales
• An interval scale is used when the intervals are equal.
• There is no absolute zero.
How useful is the support given by the Technical Support Group?
NOT USEFUL AT ALL    1    2    3    4    5    EXTREMELY USEFUL
Validity and Reliability
• Validity is the degree to which the question
measures what the analyst intends to measure.
• Reliability of scales refers to consistency in
response, or the likelihood of getting the same
results if the same questionnaire was
administered again under the same conditions.
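One rough way to look at reliability is to administer the same questionnaire twice and check how often a respondent's answers agree. The sketch below computes that simple agreement rate; it is only one possible proxy chosen for illustration, and the function name and the two answer lists are hypothetical.

```python
def agreement_rate(first, second):
    """Fraction of questions answered identically in two administrations."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty answer lists of equal length")
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / len(first)


# Hypothetical answers of one respondent on a 1-5 interval scale.
first_round = [4, 3, 5, 2, 4]
second_round = [4, 3, 4, 2, 4]

rate = agreement_rate(first_round, second_round)
```

A rate close to 1 suggests consistent responses. It says nothing about validity, since a question can be answered consistently and still fail to measure what the analyst intended.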
Problems with Scales
• Leniency
 Caused by easy raters
 Solution: move the “average” category to the left or right of center
• Central tendency
 Central tendency occurs when respondents rate everything as
average.
 Improve by making the differences smaller at the two ends.
 Adjust the strength of the descriptors.
 Create a scale with more points.
• Halo effect
 When the impression about an item in one question carries into the
next question.
 Solution: change the focus from items to traits, by placing one
trait and several items on each page.
Designing the Questionnaire
• Allow ample white space.
• Allow ample space to write or type in responses.
• Make it easy for respondents to clearly mark their
answers.
• Be consistent in style.
Order of Questions
• Place most important questions first.
• Cluster items of similar content together.
• Introduce less controversial questions first.
Different Ways to Capture Responses
When Designing a Web Survey, Keep in Mind that There Are Different
Ways to Capture Responses
Methods of Administering the Questionnaire
• Convening all concerned respondents together at
one time
• Personally administering the questionnaire
• Allowing respondents to self-administer the questionnaire
• Mailing questionnaires
• Administering over the Web or via email
Electronically Submitting Questionnaires
• Reduced costs.
• Collecting and storing the results electronically.
Summary
• Interviewing
 Interview preparation
 Question types
 Arranging questions
 The interview report
• Joint Application Design (JAD)
 Involvement and location
• Questionnaires
 Writing questions
 Using scales and overcoming problems
 Design and order
 Administering and submitting
Data Sources
• The requirement definition document should include the following
information:
 Available data sources
 Data structures within the data sources
 Location of the data sources
 Data extraction procedures
 Availability of historical data
Cont...
• Data Transformation
 Data transformation necessarily involves mapping source data to
the data in the data warehouse.
• Data Storage
 The requirement definition document must include sufficient
details about storage requirements.
• Information Delivery:
 Drill-Down Analysis.
 Roll-Up Analysis
 Slicing
 Ad hoc reports
• Information Package Diagram
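The drill-down, roll-up and slicing operations listed under information delivery can be sketched on a tiny in-memory fact table. This is a minimal illustration only. The sales rows, column names and helper functions are hypothetical and are not part of the course material.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, amount).
SALES = [
    (2023, "Q1", "North", "Washer", 120.0),
    (2023, "Q1", "South", "Dryer", 80.0),
    (2023, "Q2", "North", "Dryer", 60.0),
    (2023, "Q2", "South", "Washer", 90.0),
    (2024, "Q1", "North", "Washer", 150.0),
]
COLUMNS = ("year", "quarter", "region", "product")

def aggregate(rows, group_by):
    """Sum the amount for each distinct combination of the group_by columns."""
    idx = [COLUMNS.index(c) for c in group_by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)

def slice_rows(rows, column, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = COLUMNS.index(column)
    return [row for row in rows if row[i] == value]

# Roll-up: summarize at the coarser year level.
by_year = aggregate(SALES, ["year"])
# Drill-down: break the same totals out by quarter.
by_quarter = aggregate(SALES, ["year", "quarter"])
# Slice: fix the region dimension to "North", then summarize by year.
north_by_year = aggregate(slice_rows(SALES, "region", "North"), ["year"])
```

Roll-up and drill-down are the same aggregation run at different levels of a dimension hierarchy, which is why the requirements need to record those hierarchies.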
Cont…
• Information Package Diagram
 The information package diagrams crystallize the information
requirements for the data warehouse.
 They contain the critical metrics measuring the performance of the
business units, the business dimensions along which the metrics are
analyzed, and the details of how drill-down and roll-up analyses are done.
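The contents of an information package can also be written down as a small data structure. The sketch below assumes a hypothetical sales-analysis subject. The dimension names, hierarchy levels and metrics are illustrative only.

```python
# Hypothetical information package for sales analysis.
INFORMATION_PACKAGE = {
    "subject": "Sales analysis",
    "dimensions": {
        # Each hierarchy is ordered from the highest level to the lowest.
        "Time": ["Year", "Quarter", "Month"],
        "Product": ["Category", "Product"],
        "Store": ["Region", "City", "Store"],
    },
    "metrics": ["Units sold", "Revenue"],
}

def drill_down(package, dimension, level):
    """Return the next more detailed level, or None at the lowest level."""
    hierarchy = package["dimensions"][dimension]
    i = hierarchy.index(level)
    return hierarchy[i + 1] if i + 1 < len(hierarchy) else None

def roll_up(package, dimension, level):
    """Return the next more summarized level, or None at the highest level."""
    hierarchy = package["dimensions"][dimension]
    i = hierarchy.index(level)
    return hierarchy[i - 1] if i > 0 else None
```

For example, `drill_down(INFORMATION_PACKAGE, "Time", "Year")` gives `"Quarter"`, and `roll_up(INFORMATION_PACKAGE, "Store", "City")` gives `"Region"`.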
Requirements Definition Document Outline
1. Introduction (purpose and scope of the project)
2. General requirements description (source system review, e.g.
interview summary). State what types of information are required in
the data warehouse.
3. Specific requirements (data transformation and storage
requirements)
4. Information packages (in the form of information package diagrams)
5. Other requirements (data extract frequency, data loading methods,
locations for information delivery, etc.)
6. User expectations (how the users expect to use the data
warehouse)
7. User participation (list of tasks in which users are expected to
participate throughout the development life cycle)
8. General implementation plan (a high-level plan for
implementation)
Let’s Discuss
1. As the VP of marketing for a nationwide appliance manufacturer with
three production plants, describe three ways to analyze sales. What are
the business dimensions for the analysis?
2. BigBook Inc. is a large book distributor with domestic and
international distribution to all leading booksellers. The initial
requirement is to build a data warehouse to analyze shipments made from
the company's many warehouses. Determine the metrics and business
dimensions. Prepare an information package diagram.
3. For a data warehouse at AuctionsPlus.com, an Internet auction company
selling upscale works of art, gather requirements for sales analysis.
Find the key metrics, business dimensions, hierarchies and categories.
Draw the information package diagram.
4. Create a detailed outline of a formal requirements definition document
for a data warehouse to analyze the profitability of a large department
store chain.
Business Requirements as the Driving Force
[Figure: business requirements at the center, driving the phases
Planning & Management, Design, Construction, Deployment and Maintenance.
The Design and Construction phases each cover Architecture,
Infrastructure, Data Acquisition, Data Storage and Information Delivery.]
Data Design
• In the design phase, data models are required for:
 Staging area: transform, cleanse and integrate data from source systems
 Data warehouse repository
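The transform, cleanse and integrate steps of the staging area can be sketched as three small functions. This is a minimal illustration under stated assumptions. The two source systems, their field names and the cleansing rules are hypothetical.

```python
# Hypothetical extracts from two source systems with different layouts.
BILLING_ROWS = [
    {"cust_no": "C001", "cust_name": " alice smith ", "ctry": "IN"},
    {"cust_no": "C002", "cust_name": "Bob Rao", "ctry": "in"},
]
CRM_ROWS = [
    {"id": "C002", "full_name": "Bob Rao", "country_code": "IN"},
    {"id": "C003", "full_name": "CARA LEE", "country_code": "US"},
]

# Mapping from each source's field names to the staging-area field names.
FIELD_MAPS = {
    "billing": {"cust_no": "customer_id", "cust_name": "name", "ctry": "country"},
    "crm": {"id": "customer_id", "full_name": "name", "country_code": "country"},
}

def transform(row, source):
    """Rename source fields to the staging layout."""
    return {target: row[src] for src, target in FIELD_MAPS[source].items()}

def cleanse(row):
    """Trim whitespace and standardize letter case."""
    return {
        "customer_id": row["customer_id"].strip(),
        "name": row["name"].strip().title(),
        "country": row["country"].strip().upper(),
    }

def integrate(rows):
    """Merge rows from all sources, keeping one record per customer_id."""
    merged = {}
    for row in rows:
        merged.setdefault(row["customer_id"], row)
    return [merged[key] for key in sorted(merged)]

staged = integrate(
    [cleanse(transform(r, "billing")) for r in BILLING_ROWS]
    + [cleanse(transform(r, "crm")) for r in CRM_ROWS]
)
```

Customer C002 appears in both sources but ends up as a single staged record, which is the point of the integration step. A real staging area would also need rules for resolving conflicting values, which this sketch leaves out.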
Requirements Driving the Data Model
[Figure: the information package diagram drives the dimensional model
used for the data marts (conformed/dependent). The enterprise data model
drives the relational model used for the enterprise data warehouse.]
Composition of the Components
• Source data
 Operational source systems
 Computing platforms, O/S, database files
 Departmental data such as files, documents & spreadsheets
 External data sources
• Data staging
 Data mapping between data sources and staging area data structures
 Data transformation
 Data cleansing
 Data integration
• Data storage
 Size of extracted and integrated data
 DBMS features
 Growth potential
 Centralized or distributed
Cont…
• Information delivery
 Types and number of users
 Types of queries and reports
 Classes of analysis
 Front-end DSS applications
• Metadata
 Operational metadata
 ETL (data extraction/transformation/loading) metadata
 End-user metadata
 Metadata storage
• Management & control
 Data loading
 External sources
 Alert systems
 End-user information delivery
Impact of Requirement on Architecture
[Figure: business requirements shape every architectural component:
Source Data, Data Staging, Data Storage, Information Delivery, Metadata,
and Management & Control.]
Data Quality
Bad data leads to bad decisions
• Data pollution sources
 System conversions & migrations
 Heterogeneous system integration
 Inadequate database design of source systems
 Data aging
 Incomplete information from customers
 Input errors
 Internationalization/localization of systems
 Lack of data management policies/procedures
• Types of data quality problems
 Dummy values in source system fields
 Absence of data in source system fields
 Multipurpose fields
 Cryptic data
 Contradicting data
 Improper use of name
 Violation of rules
 Reused primary keys
 Non-unique identifiers
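Three of the problem types above (dummy values, absent data and non-unique identifiers) are easy to detect mechanically. The following is a minimal sketch. The sample rows, field names and the set of dummy values are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical source rows; field names and dummy values are assumptions.
ROWS = [
    {"id": 1, "name": "Asha", "phone": "9999999999"},
    {"id": 2, "name": "", "phone": "011-5550101"},
    {"id": 2, "name": "Ravi", "phone": "011-5550102"},
    {"id": 3, "name": "Meena", "phone": None},
]
DUMMY_VALUES = {"9999999999", "N/A", "XXX"}

def find_quality_problems(rows):
    """Report dummy values, missing values and repeated identifiers.

    dummy_value and missing_value hold (id, field) pairs;
    non_unique_id holds the identifiers that occur more than once.
    """
    problems = {"dummy_value": [], "missing_value": [], "non_unique_id": []}
    id_counts = Counter(row["id"] for row in rows)
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                problems["missing_value"].append((row["id"], field))
            elif value in DUMMY_VALUES:
                problems["dummy_value"].append((row["id"], field))
    problems["non_unique_id"] = sorted(i for i, n in id_counts.items() if n > 1)
    return problems

report = find_quality_problems(ROWS)
```

Problems such as cryptic or contradicting data usually need business rules and cannot be found by simple checks of this kind.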
Impact of Requirement on Metadata
[Figure: business requirements determine the three kinds of data
warehouse metadata:
Operational: source system data structures, external data formats
Extraction/Transformation: data cleansing, conversion, integration
End-user: querying, reporting, analysis, OLAP, special apps]
Data Storage Specifications
• The DBMS should be compatible with the back end and front end
• Business elements that affect the choice of DBMS
 Level of experience
 Type of queries
 Need for openness
 Data loads
 Metadata management
 Data repository location
 Data warehouse growth
• Size estimation
 Data staging area
 Overall corporate data warehouse
 Data marts, dependent or conformed
 Multidimensional database
Impact of Business Requirement on Information Delivery
[Figure: business requirements, through the requirements definition on
users, locations, queries, reports and analysis, shape the information
delivery component.
Delivery types: ad hoc reports, complex queries, MD (multidimensional)
analysis, statistical analysis, Executive Information System (EIS) feed,
data mining
User classes: novice, casual user, business analyst, senior manager,
high-level managers
Delivery channels: online, intranet, Internet, e-mail]
Conclusion
• Gathering requirements for a data warehouse is not the same as
for an operational system.
• Requirements definition guides the whole process of system
design and development.
• A data warehouse environment is an information delivery
system in which users themselves access the data repository
and create their own output, whereas in an operational system
the user is provided with predefined outputs.
• It is essential to have the right elements of information in the
most optimal format.
Review Questions
Objective Questions:
1) A data warehouse is which of the following?
a) Can be updated by end users.
b) Contains numerous naming conventions and formats.
c) Organized around important subject areas.
d) Contains only current data.
2) An operational system is which of the following?
a) A system that is used to run the business in real time and is based on
historical data.
b) A system that is used to run the business in real time and is based on
current data.
c) A system that is used to support decision making and is based on current
data.
d) A system that is used to support decision making and is based on
historical data.
Review Questions Cont...
3) The generic two-level data warehouse architecture
includes which of the following?
a) At least one data mart
b) Data that can be extracted from numerous internal and
external sources
c) Near real-time updates
d) All of the above.
4) The active data warehouse architecture includes which
of the following?
a) At least one data mart
b) Data that can be extracted from numerous internal and external
sources
c) Near real-time updates
d) All of the above.
Review Questions Cont...
5) Reconciled data is which of the following?
a) Data stored in the various operational systems throughout the
organization.
b) Current data intended to be the single source for all decision support
systems.
c) Data stored in one operational system in the organization.
d) Data that has been selected and formatted for end-user support
applications.
6) Transient data is which of the following?
a) Data in which changes to existing records cause the previous version of
the records to be eliminated
b) Data in which changes to existing records do not cause the previous
version of the records to be eliminated
c) Data that are never altered or deleted once they have been added
d) Data that are never deleted once they have been added
Review Questions Cont...
7) The extract process is which of the following?
a) Capturing all of the data contained in various operational systems
b) Capturing a subset of the data contained in various operational systems
c) Capturing all of the data contained in various decision support systems
d) Capturing a subset of the data contained in various decision support
systems
8) Data scrubbing is which of the following?
a) A process to reject data from the data warehouse and to create the
necessary indexes
b) A process to load the data in the data warehouse and to create the
necessary indexes
c) A process to upgrade the quality of data after it is moved into a data
warehouse
d) A process to upgrade the quality of data before it is moved into a data
warehouse
Review Questions Cont...
9) The load and index is which of the following?
a) A process to reject data from the data warehouse and to create the
necessary indexes
b) A process to load the data in the data warehouse and to create the
necessary indexes
c) A process to upgrade the quality of data after it is moved into a data
warehouse
d) A process to upgrade the quality of data before it is moved into a data
warehouse
10) Data transformation includes which of the following?
a) A process to change data from a detailed level to a summary level
b) A process to change data from a summary level to a detailed level
c) Joining data from one source into various sources of data
d) Separating data from one source into various sources of data
Review Questions Cont...
Short answer type Questions
Q1. Explain the need for metadata in a data warehouse.
Q2. What do you mean by strategic information?
Q3. Differentiate between a data warehouse and a data mart.
Q4. What do you mean by a Web-enabled data warehouse?
Q5. Define OLTP.
Q6. What type of processing takes place in a data warehouse?
Q7. Define an ETL routine.
Q8. What data does an information package contain?
Q9. In which situations can the JAD methodology be successful for
collecting requirements?
Q10. List the various data sources that feed the data warehouse.
Review Questions Cont...
Long answer type Questions
Q1. Explain data warehouse architecture in detail.
Q2. Explain business dimensions. Why and how can business
dimensions be useful for defining requirements for the data
warehouse?
Q3. State any three factors that indicate the continued growth
in data warehousing. Can you think of some examples?
Q4. Discuss the top-down and bottom-up approaches to creating
a data warehouse.
Q5. For a commercial bank, name five types of strategic
objectives and explain each objective in detail.
Q6. What do you mean by information packages? Explain the need
for information packages.
Review Questions Cont...
Q7. A data warehouse is an environment, not a product.
Discuss.
Q8. Explain the various types of data warehouse metadata in
detail.
Q9. For an airline company, how can strategic information
increase the number of frequent flyers? Discuss, giving
specific details.
Q10. Examine the opportunities that can be provided by
strategic information for a medical center. Can you explain
five such opportunities?
Suggested Reading/References
1. Paulraj Ponniah, “Fundamentals of Data Warehousing”,
John Wiley & Sons, 2003.
2. Sam Anahory, “Data Warehousing in the Real World: A
Practical Guide for Building Decision Support Systems”,
John Wiley, 2004.
3. W. H. Inmon, “Building the Operational Data Store”, 2nd Ed.,
John Wiley, 1999.
4. Kamber and Han, “Data Mining: Concepts and Techniques”,
Harcourt India P. Ltd., 2001.