Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications

John Wang
Montclair State University, USA

Information Science Reference
Hershey • New York

Acquisitions Editor: Kristin Klinger
Development Editor: Kristin Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Typesetter: Michael Brehm, Jeff Ash, Carole Coulson, Elizabeth Duke, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
Web site: http://www.igi-global.com/reference

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanonline.com

Library of Congress Cataloging-in-Publication Data

Data warehousing and mining : concepts, methodologies, tools and applications / John Wang, editor.
p. cm.
Summary: "This collection offers tools, designs, and outcomes of the utilization of data mining and warehousing technologies, such as algorithms, concept lattices, multidimensional data, and online analytical processing. With more than 300 chapters contributed by over 575 experts from around the globe, this authoritative collection will provide libraries with the essential reference on data mining and warehousing"--Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-59904-951-9 (hbk.) -- ISBN 978-1-59904-952-6 (e-book)
1. Data mining. 2. Data warehousing. I. Wang, John, 1955-
QA76.9.D343D398 2008
005.74--dc22
2008001934

Copyright © 2008 by IGI Global. All rights reserved.
No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

Chapter 1.1
Administering and Managing a Data Warehouse

James E. Yao
Montclair State University, USA

Chang Liu
Northern Illinois University, USA

Qiyang Chen
Montclair State University, USA

June Lu
University of Houston - Victoria, USA

INTRODUCTION

Managers' internal and external demands for information are increasing rapidly, especially for information processed to serve their specific needs, and regular databases and decision support systems (DSS) cannot provide the information needed. Data warehouses came into existence to meet these needs, consolidating and integrating information from many internal and external sources and arranging it in a meaningful format for making accurate business decisions (Martin, 1997). In the past five years, there has been significant growth in data warehousing (Hoffer, Prescott, & McFadden, 2005). This growth has, in turn, raised the issue of data warehouse administration and management. Data warehousing has been increasingly recognized as an effective tool for organizations to transform data into useful information for strategic decision-making. To achieve competitive advantages via data warehousing, data warehouse management is crucial (Ma, Chou, & Yen, 2000).

BACKGROUND

Since the advent of computer storage technology and higher level programming languages (Inmon, 2002), organizations, especially larger organizations, have invested enormous amounts in their information system infrastructures. In a 2003 IT spending survey, 45% of American company participants indicated that their 2003 IT purchasing budgets had increased compared with their budgets in 2002. Among the respondents, database applications ranked first among the technology areas being implemented or already implemented, with 42% indicating a recent implementation (Information, 2004). The fast growth of databases enables companies to capture and store a great deal of business operation data and other business-related data. The data stored in the databases, whether historical or operational, are considered corporate resources and an asset that must be managed and used effectively to serve the business and gain competitive advantages.

A database is a computer structure that houses a self-describing collection of related data (Kroenke, 2004; Rob & Coronel, 2004). This type of data is primitive, detailed, and used for day-to-day operation. The data in a warehouse is derived, meaning it is integrated, subject-oriented, time-variant, and nonvolatile (Inmon, 2002). A data warehouse is defined as an integrated decision support database whose content is derived from various operational databases (Hoffer, Prescott, & McFadden, 2005; Sen & Jacob, 1998). A data warehouse is often referred to as a multidimensional database because each occurrence of the subject is referenced by an occurrence of each of several dimensions or characteristics of the subject (Gillenson, 2005).
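
As a minimal illustration of what it means for each occurrence of a subject to be referenced by one occurrence of each dimension, the following Python sketch stores a few sales figures keyed by three dimensions. The data, the dimension names (product, region, quarter), and the helper functions slice_cube and roll_up are all hypothetical and chosen only for illustration; they do not come from any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical sales facts: each measure is keyed by one value from each
# of three dimensions (product, region, quarter). All names are illustrative.
facts = {
    ("laptop", "east", "Q1"): 120,
    ("laptop", "west", "Q1"): 80,
    ("laptop", "east", "Q2"): 150,
    ("phone", "east", "Q1"): 200,
    ("phone", "west", "Q2"): 170,
}

DIMENSIONS = ("product", "region", "quarter")


def slice_cube(cube, **fixed):
    """Keep only the cells whose coordinates match the fixed dimension values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[pos] == value for pos, value in positions.items())
    }


def roll_up(cube, dimension):
    """Sum the measure over every dimension except the one named."""
    pos = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, measure in cube.items():
        totals[key[pos]] += measure
    return dict(totals)


east_sales = slice_cube(facts, region="east")    # fix one dimension
sales_by_product = roll_up(facts, "product")     # aggregate the others away
```

Fixing one dimension value and aggregating over the remaining dimensions are the two basic operations such a structure supports; a production warehouse would typically delegate them to the database or an OLAP engine rather than to in-memory dictionaries.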
Some multidimensional databases operate on a technological foundation optimal for "slicing and dicing" the data, where data can be thought of as existing in multidimensional cubes (Inmon, 2002). Regular databases load data in two-dimensional tables. A data warehouse can use OLAP (online analytical processing) to provide users with multidimensional views of their data, which can be visually represented as a cube for three dimensions (Senn, 2004).

Given the many differences between a database for day-to-day operation and a data warehouse that supports management decision-making, the administration and management of a data warehouse differ considerably from those of an operational database. For instance, a data warehouse team requires someone who does routine data extraction, transformation, and loading (ETL) from operational databases into data warehouse databases. Thus the team requires a technical role called ETL Specialist. A data warehouse is also intended to support the business decision-making process, so someone like a business analyst is needed to ensure that business information requirements are carried over into data warehouse development. Data in the data warehouse can be very sensitive and can cross functional areas, such as personal medical records and salary information. Therefore, a higher level of security on the data is needed. Encrypting the sensitive data in the data warehouse is a potential solution. Such issues in data warehouse administration and management need to be defined and discussed.

MAIN THRUST

Data warehouse administration and management covers a wide range of fields. This article focuses only on data warehouse and business strategy, the data warehouse development life cycle, the data warehouse team, process management, and security management to present the current concerns and issues in data warehouse administration and management.

Data Warehouse and Business Strategy

“Data is the blood of an organization.
Without data, the corporation has no idea where it stands and where it will go” (Ferdinandi, 1999, p. xi). With data warehousing, today’s corporations can collect and house large volumes of data. Does a large volume of data by itself guarantee business success? Does it mean that the more data you have, the more strategic advantages you have over your competitors? Not necessarily. There is no predetermined formula that can turn your information into competitive advantages (Inmon, Terdeman, & Imhoff, 2000). Thus, top management and the data administration team are confronted with the question of how to convert corporate information into competitive advantages.

A well-managed data warehouse can assist a corporation in its strategy to gain competitive advantages. This can be achieved by using an exploration warehouse, which is a direct product of the data warehouse, to identify environmental factors, formulate strategic plans, and determine business-specific objectives:

• Identifying Environmental Factors: Quantified analysis can be used for identifying a corporation’s products and services, the market share of specific products and services, and its financial management.
• Formulating Strategic Plans: Environmental factors can be matched up against the strategic plan by identifying current market positioning, financial goals, and opportunities.
• Determining Specific Objectives: The exploration warehouse can be used to find patterns; if found, these patterns are then compared with patterns discovered previously to optimize corporate objectives (Inmon, Terdeman, & Imhoff, 2000).

When managing a data warehouse for business strategy, the differences between companies must be taken into consideration. No one formula fits every organization. Avoid using so-called “templates” from other companies. The data warehouse is used for your company’s competitive advantages.
You need to follow your company’s user information requirements for strategic advantages.

Data Warehouse Development Cycle

Data warehouse system development phases are similar to the phases in the systems development life cycle (SDLC) (Adelman & Rehm, 2003). However, Barker (1998) thinks that there are some differences between the two due to the unique functional and operational features of a data warehouse. As business and information requirements change, new corporate information models evolve and are synthesized into the data warehouse in the Synthesis of Model phase. These models are then used to exploit the data warehouse in the Exploit phase. The data warehouse is updated with new data using appropriate updating strategies and linked to various data sources.

Inmon (2002) sees system development for the data warehouse environment as almost exactly the opposite of the traditional SDLC. He thinks that the traditional SDLC is concerned with and supports primarily the operational environment. The data warehouse operates under a very different life cycle called “CLDS” (the reverse of the SDLC). The CLDS is a classic data-driven development life cycle, whereas the SDLC is a classic requirements-driven development life cycle.

The Data Warehouse Team

Building a data warehouse is a large system development process. Participants in data warehouse development can range from a data warehouse administrator (DWA) (Hoffer, Prescott, & McFadden, 2005) to a business analyst (Ferdinandi, 1999). The data warehouse team is supposed to lead the organization into assuming their roles and thereby bringing about a partnership with the business (McKnight, 2000).
A data warehouse team may have the following roles (Barker, 1998; Ferdinandi, 1999; Inmon, 2000, 2003; McKnight, 2000):

• Data Warehouse Administrator (DWA): responsible for integrating and coordinating metadata and data across many different data sources, as well as data source management, physical database design, operation, backup and recovery, security, and performance and tuning.
• Manager/Director: responsible for the overall management of the entire team to ensure that the team follows the guiding principles, business requirements, and corporate strategic plans.
• Project Manager: responsible for data warehouse project development, including matching each team member’s skills and aspirations to tasks on the project plan.
• Executive Sponsor: responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.
• Business Analyst: responsible for determining what information is required from a data warehouse to manage the business competitively.
• System Architect: responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware and software to the client desktop configurations.
• ETL Specialist: responsible for routine work on data extraction, transformation, and loading for the warehouse databases.
• Front End Developer: responsible for developing the front end, whether it is client-server or over the Web.
• OLAP Specialist: responsible for the development of data cubes, a multidimensional view of data in OLAP.
• Data Modeler: responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.
• Trainer: responsible for training the end users to use the system so that they can benefit from the data warehouse system.
• End User: responsible for providing feedback to the data warehouse team.

In terms of the size of the data warehouse administrator team, Inmon (2003) has several recommendations:

• a large warehouse requires more analysts;
• every 100 GB of data in a data warehouse requires another data warehouse administrator;
• a new data warehouse administrator is required for each year a data warehouse is up and running and is being used successfully;
• if an ETL tool is being written manually, many data warehouse administrators are needed; if an automated tool is used, far fewer staff are required;
• an automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise, more administrators are needed;
• fewer supporting staff are required if the corporate information factory (CIF) architecture is followed more closely; conversely, more staff are needed.

McKnight (2000) suggests that all the technical roles be performed full-time by dedicated personnel and that each responsible person receive specific data warehouse training.

Data warehousing is growing rapidly. As the scope and data storage size of the data warehouse change, the roles and size of a data warehouse team should be adjusted accordingly. In general, the extremes should be avoided. Without sufficient professionals, the job may not be done satisfactorily. On the other hand, too many people will leave the team overstaffed.

Process Management

Developing a data warehouse has become a popular but exceedingly demanding and costly activity in information systems development and management. Data warehouse vendors are competing intensively for customers because so much of their money and prestige is at stake. Consulting vendors have redirected their attention toward this rapidly expanding market segment. User companies are faced with the serious question of which product they should buy. Sen & Jacob’s (1998) advice is to first understand the process of data warehouse development before selecting the tools for its implementation. A data warehouse development process refers to the activities required to build a data warehouse (Barquin, 1997). Sen & Jacob (1998) and Ma, Chou, & Yen (2000) have identified some of these activities, which need to be managed during the data warehouse development cycle: initializing the project, establishing the technical environment, tool integration, determining scalability, developing an enterprise information architecture, designing the data warehouse database, data extraction/transformation, managing metadata, developing the end-user interface, managing the production environment, managing decision support tools and applications, and developing the warehouse roll-out.

As mentioned before, data warehouse development is a large system development process. Process management is not required in every step of the development process. Devlin (1997) states that process management is required in the following areas: process schedule, which consists of a network of tasks and decision points; process map definition, which defines and maintains the network of tasks and decision points that make up a process; task initiation, which supports initiating tasks on all of the hardware/software platforms in the entire data warehouse environment; and status information enquiry, which enquires about the status of components that are running on all platforms.

Security Management

In recent years, information technology (IT) security has become one of the hottest and most important topics facing both users and providers (Senn, 2005). The goal of database security is the protection of data from accidental or intentional threats to its integrity and access (Hoffer, Prescott, & McFadden, 2005). The same is true for a data warehouse.
However, because of the differences between a database and a data warehouse, stronger security methods need to be used for the data warehouse in addition to common practices such as view-based control, integrity control, processing rights, and DBMS security. One of the differences that demand a higher level of security for a data warehouse is the scope and level of detail of the data it holds, such as financial transactions, personal medical records, and salary information. One method that can be used to protect data requiring a high level of security in a data warehouse is encryption and decryption. Confidential and sensitive data can be stored in a separate set of tables to which only authorized users have access. These data can be encrypted while they are being written into the data warehouse. In this way, the data captured and stored in the data warehouse are secure and can only be accessed on an authorized basis.

Three levels of security can be offered by using encryption and decryption. The first level is that only authorized users can have access to the data in the data warehouse. Each group of users, internal or external, ranging from executives to information consumers, should be granted different rights for security reasons. Unauthorized users are totally prevented from seeing the data in the data warehouse. The second level is the protection from unauthorized dumping and interpretation of data. Without the right key, an unauthorized user cannot write anything into the tables, nor can the existing data in the tables be decrypted. The third level is the protection from unauthorized access during the transmission process. Even if unauthorized access occurs during transmission, there is no harm to the encrypted data unless the user has the decryption code (Ma, Chou, & Yen, 2000).
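
The first of the three levels above, granting each group of users different rights, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the group names, the table names, and the functions can_read and sensitive_access are hypothetical, and a real warehouse would normally enforce such rights through the DBMS's own grant mechanism and combine them with encryption of the sensitive tables.

```python
# Hypothetical user groups and the warehouse tables each may read.
# Group and table names are illustrative assumptions, not from the source.
GROUP_RIGHTS = {
    "executive": {"sales_summary", "salary", "medical_claims"},
    "analyst": {"sales_summary"},
    "information_consumer": {"sales_summary"},
}

# Tables assumed to hold confidential data kept apart from the rest.
SENSITIVE_TABLES = {"salary", "medical_claims"}


def can_read(group, table):
    """Return True only if the group has been granted access to the table."""
    return table in GROUP_RIGHTS.get(group, set())


def sensitive_access(group):
    """List the sensitive tables a group may read, sorted for stable output."""
    return sorted(GROUP_RIGHTS.get(group, set()) & SENSITIVE_TABLES)
```

An unknown group falls through to an empty set of rights, so access is denied by default, which matches the requirement that unauthorized users be prevented from seeing any data.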

FUTURE TRENDS

Data warehousing administration and management is facing several challenges as data warehousing becomes a mature part of the infrastructure of organizations. More legislative work is necessary to protect individual privacy from abuse by government or commercial entities that hold large volumes of data concerning those individuals. The protection also calls for tightened security through technology, as well as user efforts toward workable rules and regulations, while still granting a data warehouse the ability to process large datasets for meaningful analyses (Marakas, 2003).

Today’s data warehouse is limited to storage of structured data in the form of records, fields, and databases. Unstructured data, such as multimedia, maps, graphs, pictures, sound, and video files, are increasingly in demand in organizations. How to manage the storage and retrieval of unstructured data and how to search for specific data items pose a real challenge for data warehouse administration and management. Alternative storage, especially near-line storage, which is one of the two forms of alternative storage, is considered to be one of the best future solutions for managing the storage and retrieval of unstructured data in data warehouses (Marakas, 2003).

The past decade has seen a fast rise of the Internet and World Wide Web. Today, Web-enabled versions of all leading vendors’ warehouse tools are becoming available (Moeller, 2001). This recent growth in Web use and advances in e-business applications have pushed the data warehouse from the back office, where it is accessed by only a few business analysts, to the front lines of the organization, where all employees and every customer can use it. To accommodate this move to the front line of the organization, the data warehouse demands massive scalability for data volume as well as for performance.
As the number and types of users increase rapidly, enterprise data volume is doubling in size every 9 to 12 months. Around-the-clock access to the data warehouse is becoming the norm. The data warehouse will require fast implementation, continuous scalability, and ease of management (Marakas, 2003).

Additionally, building distributed warehouses, which are normally called data marts, will be on the rise. Other technical advances in data warehousing will include an increasing ability to exploit parallel processing, automated information delivery, greater support of object extensions, very large database support, and user-friendly Web-enabled analysis applications. These capabilities should make data warehouses of the future more powerful and easier to use, which will further increase the importance of data warehouse technology for business strategic decision making and competitive advantages (Ma, Chou, & Yen, 2000; Marakas, 2003; Pace University, 2004).

CONCLUSION

The data that organizations have captured and stored are considered organizational assets. Yet the data themselves cannot do anything until they are put to intelligent use. One way to accomplish this goal is to use data warehouse and data mining technology to transform corporate information into business competitive advantages. What impact data warehouses the most are the Internet and Web technology. The Web browser will become the universal interface for corporations, allowing employees to browse their data warehouse worldwide on public and private networks and eliminating the need to replicate data across diverse geographic locations. Thus, strong data warehouse management sponsorship and an effective administration team may become a crucial factor in providing an organization with the information service it needs.

REFERENCES

Adelman, S., & Rehm, C. (2003, November 5). What are the various phases in implementing a data warehouse solution? DMReview.
Retrieved from http://www.dmreview.com/article_sub.cfm?articleId=7660

Barker, R. (1998, February). Managing a data warehouse. Chertsey, UK: Veritas Software Corporation.

Barquin, F. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall.

Devlin, B. (1997). Data warehouse: From architecture to implementation. Reading, MA: Addison-Wesley.

Ferdinandi, P.L. (1999). Data warehouse advice for managers. New York: AMACOM American Management Association.

Gillenson, M.L. (2005). Fundamentals of database management systems. New York: John Wiley & Sons Inc.

Hoffer, J.A., Prescott, M.B., & McFadden, F.R. (2005). Modern database management (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Information Technology Toolbox. (2004). 2003 IToolbox spending survey. Retrieved from http://datawarehouse.ittoolbox.com/research/survey.asp

Inmon, W.H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons Inc.

Inmon, W.H. (2000). Building the data warehouse: Getting started. Retrieved from http://www.billinmon.com/library/whiteprs/earlywp/ttbuild.pdf

Inmon, W.H. (2003). Data warehouse administration. Retrieved from http://www.billinmon.com/library/other/dwadmin.asp

Inmon, W.H., Terdeman, R.H., & Imhoff, C. (2000). Exploration warehousing. New York: John Wiley & Sons Inc.

Kroenke, D.M. (2004). Database processing: Fundamentals, design, and implementation (9th ed.). Upper Saddle River, NJ: Prentice Hall.

Ma, C., Chou, D.V., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management & Data Systems, 100(3), 125-137.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice Hall.

Martin, J. (1997, September). New tools for decision making. DM Review, 7, 80.

McKnight Associates, Inc. (2000). Effective data warehouse organizational roles and responsibilities. Sunnyvale, CA.

Moeller, R.A. (2001).
Distributed data warehousing using web technology: How to build a more cost-effective and flexible warehouse. New York: AMACOM American Management Association.

Pace University. (2004). Emerging technology. Retrieved from http://webcomposer.pace.edu/ea10931w/Tappert/Assignment2.htm

Post, G.V. (2005). Database management systems: Designing & building business applications (3rd ed.). New York: McGraw-Hill/Irwin.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Sen, A., & Jacob, V.S. (1998). Industrial strength data warehousing: Why process is so important and so often ignored. Communications of the ACM, 41(9), 29-31.

Senn, J.A. (2004). Information technology: Principles, practices, opportunities (3rd ed.). Upper Saddle River, NJ: Prentice Hall.

KEY TERMS

Alternative Storage: An array of storage media that consists of two forms of storage: near-line storage and/or secondary storage.

“CLDS”: The facetiously named system development life cycle (SDLC) for analytical, DSS systems. CLDS is so named because in fact it is the reverse of the classical SDLC.

Corporate Information Factory (CIF): The corporate information factory is a logical architecture with a purpose of delivering business intelligence and business management capabilities driven by data provided from business operations.

Data Mart: A data warehouse that is limited in scope and facility, but for a restricted domain.

Database Management System (DBMS): A set of programs used to define, administer, and process the database and its applications.

Metadata: Data about data; data concerning the structure of data in a database stored in the data dictionary.

Near-line Storage: Near-line storage is siloed tape storage where siloed cartridges of tape are archived, accessed, and managed robotically.

Online Analytical Processing (OLAP): Decision Support System (DSS) tools that use multidimensional data analysis techniques to provide users with multidimensional views of their data.

System Development Life Cycle (SDLC): The methodology used by most organizations for developing large information systems.

This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 17-22, copyright 2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).