Data Warehousing and
Mining:
Concepts, Methodologies,
Tools, and Applications
John Wang
Montclair State University, USA
Information Science reference
Hershey • New York
Acquisitions Editor: Kristin Klinger
Development Editor: Kristin Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Typesetter: Michael Brehm, Jeff Ash, Carole Coulson, Elizabeth Duke, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference
and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanonline.com
Library of Congress Cataloging-in-Publication Data
Data warehousing and mining : concepts, methodologies, tools and applications / John Wang, editor.
p. cm.
Summary: "This collection offers tools, designs, and outcomes of the utilization of data mining and warehousing technologies, such as
algorithms, concept lattices, multidimensional data, and online analytical processing. With more than 300 chapters contributed by over 575
experts from around the globe, this authoritative collection will provide libraries with the essential reference on data mining and warehousing"--Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-59904-951-9 (hbk.) -- ISBN 978-1-59904-952-6 (e-book)
1. Data mining. 2. Data warehousing. I. Wang, John, 1955-
QA76.9.D343D398 2008
005.74--dc22
2008001934
Copyright © 2008 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not
indicate a claim of ownership by IGI Global of the trademark or registered trademark.
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
Chapter 1.1
Administering and Managing
a Data Warehouse
James E. Yao
Montclair State University, USA
Chang Liu
Northern Illinois University, USA
Qiyang Chen
Montclair State University, USA
June Lu
University of Houston - Victoria, USA
INTRODUCTION
As internal and external demands on information
from managers are increasing rapidly, especially
the information that is processed to serve managers’ specific needs, regular databases and decision support systems (DSS) cannot provide the
information needed. Data warehouses came into
existence to meet these needs, consolidating and
integrating information from many internal and
external sources and arranging it in a meaningful
format for making accurate business decisions
(Martin, 1997). In the past five years, there has been
significant growth in data warehousing (Hoffer,
Prescott, & McFadden, 2005). Correspondingly,
this occurrence has brought up the issue of data
warehouse administration and management. Data
warehousing has been increasingly recognized as
an effective tool for organizations to transform
data into useful information for strategic decision-making. To achieve competitive advantages via
data warehousing, data warehouse management
is crucial (Ma, Chou, & Yen, 2000).
BACKGROUND
Since the advent of computer storage technology
and higher level programming languages (Inmon,
2002), organizations, especially larger organizations,
have put an enormous amount of investment
in their information system infrastructures. In
a 2003 IT spending survey, 45% of American
company participants indicated that their 2003
IT purchasing budgets had increased compared
with their budgets in 2002. Among the respondents, database applications ranked top among the areas
of technology being implemented or already
implemented, with 42% indicating a recent implementation (Information, 2004). The fast growth of
databases enables companies to capture and store
a great deal of business operation data and other
business-related data. The data that are stored
in the databases, either historical or operational,
have been considered corporate resources and an
asset that must be managed and used effectively
to serve the corporate business for competitive
advantages.
A database is a computer structure that
houses a self-describing collection of related data
(Kroenke, 2004; Rob & Coronel, 2004). This
type of data is primitive, detailed, and used for
day-to-day operation. The data in a warehouse is
derived, meaning it is integrated, subject-oriented,
time-variant, and nonvolatile (Inmon, 2002). A
data warehouse is defined as an integrated decision
support database whose content is derived from
various operational databases (Hoffer, Prescott, &
McFadden, 2005; Sen & Jacob, 1998). Often a data
warehouse can be referred to as a multidimensional
database because each occurrence of the subject
is referenced by an occurrence of each of several
dimensions or characteristics of the subject (Gillenson, 2005). Some multidimensional databases
operate on a technological foundation optimal for
“slicing and dicing” the data, where data can be
thought of as existing in multidimensional cubes
(Inmon, 2002). Regular databases load data in
two-dimensional tables. A data warehouse can use
OLAP (online analytical processing) to provide
users with multidimensional views of their data,
which can be visually represented as a cube for
three dimensions (Senn, 2004).
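To make the cube idea concrete, the following is a minimal sketch in plain Python. The sales figures, the three dimension names (product, region, quarter), and the helper functions `slice_cube` and `roll_up` are hypothetical and chosen only for illustration; real OLAP servers implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical fact data: each cell of the cube is addressed by one value
# from each of three dimensions (product, region, quarter).
DIMENSIONS = ("product", "region", "quarter")
cube = {
    ("laptop", "east", "Q1"): 120,
    ("laptop", "west", "Q1"): 80,
    ("laptop", "east", "Q2"): 150,
    ("phone", "east", "Q1"): 200,
    ("phone", "west", "Q2"): 170,
}


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a single value and keep the matching cells."""
    axis = DIMENSIONS.index(dimension)
    return {key: amount for key, amount in cube.items() if key[axis] == value}


def roll_up(cube, dimension):
    """Sum all cells, grouped by the values of the one named dimension."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[key[axis]] += amount
    return dict(totals)


q1_sales = slice_cube(cube, "quarter", "Q1")   # one "slice" of the cube
sales_by_region = roll_up(cube, "region")      # totals per region
q1_by_product = roll_up(q1_sales, "product")   # slice first, then aggregate
```

Each key of `cube` plays the role of one occurrence of the subject referenced by an occurrence of each dimension, which is the property the paragraph above attributes to multidimensional databases.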
Given the many differences between a database
for day-to-day operation and a data warehouse
that supports the management decision-making
process, the administration and management of
a data warehouse differ considerably from those of an operational database.
For instance, a data warehouse team requires
someone who does routine data extraction, transformation, and loading (ETL) from operational
databases into data warehouse databases. Thus
the team requires a technical role called ETL
Specialist. On the other hand, a data warehouse
is intended to support the business decision-making process. Someone like a business analyst is
also needed to ensure that business information
requirements are carried over into the data warehouse
development. Data in the data warehouse can be
very sensitive and can cross functional areas, such as
personal medical records and salary information.
Therefore, a higher level of security on the data
is needed. Encrypting the sensitive data in the data
warehouse is a potential solution. Issues such as these in
data warehouse administration and management
need to be defined and discussed.
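The routine ETL work mentioned above can be sketched in a few lines of Python using the standard `sqlite3` module. This is only an illustration under stated assumptions: the `orders` and `daily_sales` tables, their columns, and the sample rows are hypothetical, and two in-memory databases stand in for the operational database and the warehouse.

```python
import sqlite3

# Hypothetical operational source: one row per order transaction.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER, day TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " alice ", 1250, "2004-01-05"),
        (2, "BOB", 3000, "2004-01-05"),
        (3, "Alice", 750, "2004-01-06"),
    ],
)

# Hypothetical warehouse target: one derived row per customer per day.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (customer TEXT, day TEXT, total_amount REAL)"
)


def extract(conn):
    """Extract: read the detailed operational rows."""
    return conn.execute("SELECT customer, amount_cents, day FROM orders").fetchall()


def transform(rows):
    """Transform: clean customer names and aggregate amounts per customer and day."""
    totals = {}
    for customer, amount_cents, day in rows:
        key = (customer.strip().lower(), day)
        totals[key] = totals.get(key, 0) + amount_cents
    return [(c, d, cents / 100) for (c, d), cents in sorted(totals.items())]


def load(conn, rows):
    """Load: write the derived rows into the warehouse table."""
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT customer, day, total_amount FROM daily_sales ORDER BY customer, day"
).fetchall()
```

The transform step is where primitive, detailed operational data becomes the derived, integrated data that the warehouse holds.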
MAIN THRUST
Data warehouse administration and management
covers a wide range of fields. This article focuses
only on data warehouse and business strategy,
data warehouse development life cycle, data
warehouse team, process management, and security management to present the current concerns
and issues in data warehouse administration and
management.
Data Warehouse and Business
Strategy
“Data is the blood of an organization. Without
data, the corporation has no idea where it stands
and where it will go” (Ferdinandi, 1999, p. xi).
With data warehousing, today’s corporations
can collect and house large volumes of data.
Does sheer data volume guarantee
success in your business? Does it mean
that the more data you have, the more strategic
advantages you have over your competitors? Not
necessarily. There is no predetermined formula
that can turn your information into competitive
advantages (Inmon, Terdeman, & Imhoff, 2000).
Thus, top management and the data administration
team are confronted with the question of how to
convert corporate information into competitive
advantages.
A well-managed data warehouse can assist
a corporation in its strategy to gain competitive
advantages. This can be achieved by using an
exploration warehouse, which is a direct product
of the data warehouse, to identify environmental
factors, formulate strategic plans, and determine
business-specific objectives:
• Identifying Environmental Factors: Quantified analysis can be used for identifying a corporation’s products and services, market share of specific products and services, and financial management.
• Formulating Strategic Plans: Environmental factors can be matched up against the strategic plan by identifying current market positioning, financial goals, and opportunities.
• Determining Specific Objectives: An exploration warehouse can be used to find patterns; if found, these patterns are then compared with patterns discovered previously to optimize corporate objectives (Inmon, Terdeman, & Imhoff, 2000).
While managing a data warehouse for business
strategy, the differences between companies need to be taken into consideration. No
one formula fits every organization. Avoid using
so-called “templates” from other companies.
The data warehouse is used for your company’s
competitive advantages. You need to follow your
company’s user information requirements for
strategic advantages.
Data Warehouse Development Cycle
Data warehouse system development phases are
similar to the phases in the systems development
life cycle (SDLC) (Adelman & Rehm, 2003).
However, Barker (1998) thinks that there are some
differences between the two due to the unique
functional and operational features of a data warehouse. As business and information requirements
change, new corporate information models evolve
and are synthesized into the data warehouse in the
Synthesis of Model phase. These models are then
used to exploit the data warehouse in the Exploit
phase. The data warehouse is updated with new
data using appropriate updating strategies and
linked to various data sources.
Inmon (2002) sees system development for
the data warehouse environment as almost exactly the
opposite of the traditional SDLC. He thinks that
the traditional SDLC is concerned with and supports
primarily the operational environment. The data
warehouse operates under a very different life
cycle called “CLDS” (the reverse of the SDLC).
The CLDS is a classic data-driven development
life cycle, whereas the SDLC is a classic requirements-driven development life cycle.
The Data Warehouse Team
Building a data warehouse is a large system development process. Participants of data warehouse
development can range from a data warehouse
administrator (DWA) (Hoffer, Prescott, & McFadden, 2005) to a business analyst (Ferdinandi,
1999). The data warehouse team is supposed to
lead the organization into assuming their roles
and thereby bringing about a partnership with the
business (McKnight, 2000). A data warehouse
team may have the following roles (Barker, 1998;
Ferdinandi, 1999; Inmon, 2000, 2003; McKnight,
2000):
• Data Warehouse Administrator (DWA): responsible for integrating and coordinating metadata and data across many different data sources as well as data source management, physical database design, operation, backup and recovery, security, and performance and tuning.
• Manager/Director: responsible for the overall management of the entire team to ensure that the team follows the guiding principles, business requirements, and corporate strategic plans.
• Project Manager: responsible for data warehouse project development, including matching each team member’s skills and aspirations to tasks on the project plan.
• Executive Sponsor: responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.
• Business Analyst: responsible for determining what information is required from a data warehouse to manage the business competitively.
• System Architect: responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware and software to the client desktop configurations.
• ETL Specialist: responsible for routine work on data extraction, transformation, and loading for the warehouse databases.
• Front End Developer: responsible for developing the front-end, whether it is client-server or over the Web.
• OLAP Specialist: responsible for the development of data cubes, a multidimensional view of data in OLAP.
• Data Modeler: responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.
• Trainer: responsible for training the end-users to use the system so that they can benefit from the data warehouse system.
• End User: responsible for providing feedback to the data warehouse team.
In terms of the size of the data warehouse
administrator team, Inmon (2003) has several
recommendations:
• a large warehouse requires more analysts;
• every 100 GB of data in a data warehouse requires another data warehouse administrator;
• a new data warehouse administrator is required for each year a data warehouse is up and running and is being used successfully;
• if an ETL tool is being written manually, many data warehouse administrators are needed; if an automation tool is used, far fewer staff are required;
• an automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise more administrators are needed;
• fewer supporting staff are required if the corporate information factory (CIF) architecture is followed more closely; conversely,
more staff is needed.
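Two of the recommendations above are quantitative enough to turn into a rough calculation. The sketch below is only one possible reading: the function name `estimate_administrators`, the starting headcount of one, and the choice to simply add the volume-based and age-based increments are assumptions made here for illustration, not part of Inmon's recommendations.

```python
def estimate_administrators(data_gb, years_in_use, base_staff=1):
    """Rough headcount estimate from two of the rules of thumb listed above.

    Assumptions (not from the source): the team starts with `base_staff`
    administrators, and the two increments are simply added together.
    """
    by_volume = int(data_gb // 100)  # one more administrator per full 100 GB
    by_age = int(years_in_use)       # one more administrator per year in successful use
    return base_staff + by_volume + by_age


# Hypothetical warehouse: 350 GB of data, in successful use for 2 years.
estimate = estimate_administrators(350, 2)  # 1 + 3 + 2
```

The remaining recommendations (manual versus automated ETL, DBMS automation, adherence to the CIF architecture) are qualitative and would adjust such an estimate up or down rather than add a fixed number.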
McKnight (2000) suggests that all the technical roles be performed full-time by dedicated
personnel and that each responsible person receive
specific data warehouse training.
Data warehousing is growing rapidly. As the
scope and data storage size of the data warehouse
change, the roles and size of a data warehouse
team should be adjusted accordingly. In general,
the extremes should be avoided. Without sufficient
professionals, the job may not be done satisfactorily.
On the other hand, too many people will certainly
leave the team overstaffed.
Process Management
Developing a data warehouse has become a popular
but exceedingly demanding and costly activity in
information systems development and management. Data warehouse vendors are competing
intensively for their customers because so much
of their money and prestige is at stake. Consulting vendors have redirected their attention toward
this rapidly expanding market segment. User
companies face a serious question about
which product they should buy. Sen & Jacob’s
(1998) advice is to first understand the process of
data warehouse development before selecting the
tools for its implementation. A data warehouse
development process refers to the activities required to build a data warehouse (Barquin, 1997).
Sen & Jacob (1998) and Ma, Chou, & Yen (2000)
have identified some of these activities, which
need to be managed during the data warehouse
development cycle: initializing the project, establishing the technical environment, tool integration,
determining scalability, developing an enterprise
information architecture, designing the data warehouse database, data extraction/transformation,
managing metadata, developing the end-user
interface, managing the production environment,
managing decision support tools and applications,
and developing warehouse roll-out.
As mentioned before, data warehouse development is a large system development process.
Process management is not required in every
step of the development process. Devlin (1997)
states that process management is required in the
following areas: process schedule, which consists
of a network of tasks and decision points; process
map definition, which defines and maintains the
network of tasks and decision points that make
up a process; task initiation, which supports
initiating tasks on all of the hardware/software
platforms in the entire data warehouse environment; and status information enquiry, which enquires
about the status of components that are running
on all platforms.
Security Management
In recent years, information technology (IT)
security has become one of the hottest and most
important topics facing both users and providers
(Senn, 2005). The goal of database security is the
protection of data from accidental or intentional
threats to its integrity and access (Hoffer, Prescott,
& McFadden, 2005). The same is true for a data
warehouse. However, stronger security methods,
in addition to common practices such as
view-based control, integrity control, processing
rights, and DBMS security, need to be used for the
data warehouse because of the differences between a
database and a data warehouse. One of the differences that demands a higher level of security for
a data warehouse is the scope and level of detail
of the data in the data warehouse, such as financial
transactions, personal medical records, and salary information. One way to
protect data that requires a high level of security
in a data warehouse is encryption and
decryption.
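View-based control, one of the common practices named above, can be illustrated with a short Python sketch using the standard `sqlite3` module. The `employee` table, the `employee_public` view, and the sample rows are hypothetical. SQLite is used here only because it ships with Python; it has no user accounts or GRANT statement, so the step of giving users rights on the view but not on the base table is described in a comment rather than executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical base table that mixes ordinary and sensitive columns.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Finance", 91000), (2, "Raj", "Sales", 64000)],
)

# The view exposes only the non-sensitive columns. In a multi-user DBMS,
# ordinary users would be given read rights on the view and none on the
# base table; SQLite itself has no such permission system.
conn.execute(
    "CREATE VIEW employee_public AS SELECT id, name, department FROM employee"
)

cursor = conn.execute("SELECT * FROM employee_public ORDER BY id")
public_columns = [column[0] for column in cursor.description]
public_rows = cursor.fetchall()
```

A query against the view never returns the `salary` column, which is the point of view-based control; the stronger methods discussed in this section add protection for the stored values themselves.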
Confidential and sensitive data can be stored in
a separate set of tables where only authorized users
can have access. These data can be encrypted while
they are being written into the data warehouse. In
this way, the data captured and stored in the data
warehouse are secure and can only be accessed on
an authorized basis. Three levels of security can
be offered by using encryption and decryption.
The first level is that only authorized users can
have access to the data in the data warehouse.
Each group of users, internal or external, ranging
from executives to information consumers, should
be granted different rights for security reasons.
Unauthorized users are totally prevented from
seeing the data in the data warehouse. The second
level is the protection from unauthorized dumping and interpretation of data. Without the right
key, an unauthorized user cannot
write anything into the tables, nor can
the existing data in the tables be
decrypted. The third level is the protection from
unauthorized access during the transmission process. Even if unauthorized access occurs during
transmission, there is no harm to the encrypted
data unless the user has the decryption code (Ma,
Chou, & Yen, 2000).
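The idea of encrypting sensitive values as they are written into a separate table can be sketched as follows. This is a teaching illustration only: the `sensitive_salary` table and the helper names are hypothetical, and the cipher is a toy construction (a SHA-256 based keystream combined with XOR) chosen because it needs nothing beyond the Python standard library. It is not a vetted algorithm; a real warehouse would rely on an established cipher from a reviewed cryptographic library or on the encryption features of the DBMS.

```python
import hashlib
import os
import sqlite3


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` bytes from key and nonce (toy construction, not vetted)."""
    blocks = bytearray()
    counter = 0
    while len(blocks) < length:
        blocks += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(blocks[:length])


def xor_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it again with the same key decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))


# Hypothetical separate table that holds only encrypted sensitive values.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sensitive_salary (employee_id INTEGER, nonce BLOB, ciphertext BLOB)"
)

key = os.urandom(32)  # held only by authorized users in this sketch


def write_salary(employee_id: int, salary: str) -> None:
    """Encrypt the value while it is being written into the warehouse."""
    nonce = os.urandom(16)  # fresh per row so equal salaries do not look equal
    ciphertext = xor_cipher(key, nonce, salary.encode("utf-8"))
    warehouse.execute(
        "INSERT INTO sensitive_salary VALUES (?, ?, ?)",
        (employee_id, nonce, ciphertext),
    )


def read_salary(employee_id: int, user_key: bytes) -> bytes:
    """Decrypt a stored value; only the right key recovers the plaintext."""
    nonce, ciphertext = warehouse.execute(
        "SELECT nonce, ciphertext FROM sensitive_salary WHERE employee_id = ?",
        (employee_id,),
    ).fetchone()
    return xor_cipher(user_key, nonce, ciphertext)


write_salary(7, "91000")
stored = warehouse.execute("SELECT ciphertext FROM sensitive_salary").fetchone()[0]
```

Someone who dumps the table or intercepts a row in transit sees only `stored`, which corresponds to the second and third levels described above; reading it back requires the key.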
FUTURE TRENDS
Data warehousing administration and management is facing several challenges, as data
warehousing becomes a mature part of the infrastructure of organizations. More legislative
work is necessary to protect individual privacy
from abuse by government or commercial entities
that have large volumes of data concerning those
individuals. The protection also calls for tightened
security through technology as well as user efforts for workable rules and regulations while at
the same time still granting a data warehouse the
ability to process large datasets for meaningful
analyses (Marakas, 2003).
Today’s data warehouse is limited to storage of
structured data in the form of records, fields, and
databases. Unstructured data, such as multimedia,
maps, graphs, pictures, sound, and video files are
increasingly in demand in organizations. How to
manage the storage and retrieval of unstructured
data and how to search for specific data items pose
a real challenge for data warehouse administration
and management. Alternative storage, especially
near-line storage, which is one of the two
forms of alternative storage, is considered to be
one of the best future solutions for managing the
storage and retrieval of unstructured data in data
warehouses (Marakas, 2003).
The past decade has seen a fast rise of the Internet and World Wide Web. Today, Web-enabled
versions of all leading vendors’ warehouse tools
are becoming available (Moeller, 2001). This
recent growth in Web use and advances in e-business applications have pushed the data warehouse
from the back office, where it is accessed by only
a few business analysts, to the front lines of the
organization, where all employees and every
customer can use it.
To accommodate this move to the frontline of
the organization, the data warehouse demands
massive scalability for data volume as well as
for performance. As the number and types of
users increase rapidly, enterprise data volume is
doubling in size every 9 to 12 months. Around-the-clock access to the data warehouse is becoming
the norm. The data warehouse will require fast
implementation, continuous scalability, and ease
of management (Marakas, 2003).
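The scalability pressure implied by a doubling time of 9 to 12 months is easy to underestimate, so a small worked calculation may help. The starting volume of 1 TB and the three-year horizon below are hypothetical; only the doubling interval comes from the text.

```python
def projected_volume(initial_tb: float, months: float, doubling_months: float) -> float:
    """Volume after `months` if it doubles once every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


# Hypothetical warehouse holding 1 TB today, projected 36 months ahead.
fast_growth = projected_volume(1.0, 36, 9)    # doubling every 9 months: 2**4
slow_growth = projected_volume(1.0, 36, 12)   # doubling every 12 months: 2**3
```

Under these assumptions the same warehouse would hold between 8 and 16 times its current volume after three years, which is why continuous scalability appears alongside fast implementation and ease of management.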
Additionally, building distributed warehouses,
which are normally called data marts, will be
on the rise. Other technical advances in data
warehousing will include an increasing ability to
exploit parallel processing, automated information
delivery, greater support of object extensions, very
large database support, and user-friendly Web-enabled analysis applications. These capabilities
should make data warehouses of the future more
powerful and easier to use, which will further
increase the importance of data warehouse technology for business strategic decision making and
competitive advantages (Ma, Chou, & Yen, 2000;
Marakas, 2003; Pace University, 2004).
CONCLUSION
The data that organizations have captured and
stored are considered organizational assets. Yet
the data themselves cannot do anything until they
are put into intelligent use. One way to accomplish
this goal is to use data warehouse and data mining
technology to transform corporate information
into business competitive advantages.
What impacts data warehouses the most is
Internet and Web technology. The Web browser
will become the universal interface for corporations, allowing employees to browse their data
warehouse worldwide on public and private
networks, eliminating the need to replicate data
across diverse geographic locations. Thus strong
data warehouse management sponsorship and
an effective administration team may become a
crucial factor to provide an organization with the
information service needed.
REFERENCES
Adelman, S., & Rehm, C. (2003, November 5).
What are the various phases in implementing a
data warehouse solution? DMReview. Retrieved
from http://www.dmreview.com/article_sub.cfm?articleId=7660
Barker, R. (1998, February). Managing a data
warehouse. Chertsey, UK: Veritas Software
Corporation.
Barquin, F. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ:
Prentice Hall.
Devlin, B. (1997). Data warehouse: From architecture to implementation. Reading, MA:
Addison-Wesley.
Ferdinandi, P.L. (1999). Data warehouse advice
for managers. New York: AMACOM American
Management Association.
Gillenson, M.L. (2005). Fundamentals of database management systems. New York: John
Wiley & Sons Inc.
Hoffer, J.A., Prescott, M.B., & McFadden, F.R.
(2005). Modern database management (7th ed.).
Upper Saddle River, NJ: Prentice Hall.
Information Technology Toolbox. (2004). 2003
IToolbox spending survey. Retrieved from
http://datawarehouse.ittoolbox.com/research/survey.asp
Inmon, W.H. (2002). Building the data warehouse
(3rd ed.). New York: John Wiley & Sons Inc.
Inmon, W.H. (2000). Building the data warehouse:
Getting started. Retrieved from http://www.billinmon.com/library/whiteprs/earlywp/ttbuild.pdf
Inmon, W.H. (2003). Data warehouse administration. Retrieved from
http://www.billinmon.com/library/other/dwadmin.asp
Inmon, W.H., Terdeman, R.H., & Imhoff, C.
(2000). Exploration warehousing. New York:
John Wiley & Sons Inc.
Kroenke, D.M. (2004). Database processing:
Fundamentals, design, and implementation (9th
ed.). Upper Saddle River, NJ: Prentice Hall.
Ma, C., Chou, D.V., & Yen, D.C. (2000). Data
warehousing, technology assessment and management. Industrial Management + Data Systems,
100(3), 125-137.
Marakas, G.M. (2003). Modern data warehousing,
mining, and visualization: Core concepts. Upper
Saddle River, NJ: Prentice Hall.
Martin, J. (1997, September). New tools for decision making. DM Review, 7, 80.
McKnight Associates, Inc. (2000). Effective data
warehouse organizational roles and responsibilities. Sunnyvale, CA.
Moeller, R.A. (2001). Distributed data warehousing using web technology: How to build
a more cost-effective and flexible warehouse.
New York: AMACOM American Management
Association.
Pace University. (2004). Emerging technology.
Retrieved from http://webcomposer.pace.edu/ea10931w/Tappert/Assignment2.htm
Post, G.V. (2005). Database management systems:
Designing & building business applications (3rd
ed.). New York: McGraw-Hill/Irwin.
Rob, P., & Coronel, C. (2004). Database systems:
Design, implementation, and management (6th
ed.). Boston, MA: Course Technology.
Sen, A., & Jacob, V.S. (1998). Industrial strength
data warehousing: Why process is so important
and so often ignored. Communications of the ACM,
41(9), 29-31.
Senn, J.A. (2004). Information technology: Principles, practices, opportunities (3rd ed.). Upper
Saddle River, NJ: Prentice Hall.
KEY TERMS
Alternative Storage: An array of storage
media that consists of two forms of storage: near-line storage and/or secondary storage.
“CLDS”: The facetiously named system development life cycle (SDLC) for analytical, DSS
systems. CLDS is so named because in fact it is
the reverse of the classical SDLC.
Corporate Information Factory (CIF):
The corporate information factory is a logical
architecture with a purpose of delivering business intelligence and business management capabilities driven by data provided from business
operations.
Data Mart: A data warehouse that is limited in
scope and facility and serves a restricted domain.
Database Management System (DBMS): A
set of programs used to define, administer, and
process the database and its applications.
Metadata: Data about data; data concerning
the structure of data in a database stored in the
data dictionary.
Near-line Storage: Near-line storage is siloed
tape storage where siloed cartridges of tape are
archived, accessed, and managed robotically.
Online Analytical Processing (OLAP): Decision
Support System (DSS) tools that use multidimensional data analysis techniques to provide users
with multidimensional views of their data.
System Development Life Cycle (SDLC):
The methodology used by most organizations for
developing large information systems.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 17-22, copyright
2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).